Resources for Turkish NLP: A List of Turkish NLP Resources

The following is an unorganized list of resources for Turkish natural language processing. The survey paper that resulted in this list can be found here. If you want a particular resource to be included in this list, please fill out this form. Alternatively, you can create an issue in the GitHub repository to suggest new resources or report inaccuracies (pull requests are also welcome).

Multiple resources from Yıldız NLP group (Kemik)
A collection of resources by Deniz Yüret
Ottoman (local) newspapers
ODIN: A multi-lingual corpus of linguistic examples with glosses
A named-entity data set (from Bilkent)
A Twitter NER data set
An NER data set from METU
An NER data set from BOUN
An NER data set from İTÜ
An NER data set from Huawei research
TurkIE: Another NER data set (broken link)
WikiAnn: A multilingual name tagging/linking data based on Wikipedia
Another Twitter NER data set from Sabancı University
TR-SEQ: A named-entity data set of search engine queries
A data set for NER and stance detection
XNLI: Cross-lingual natural language inference data including Turkish
NLI-TR: Turkish (automatic) translations of SNLI and MultiNLI
Zemberek: A Java NLP toolkit
Multilingual ATIS
xSID: A multi-lingual intent and slot classification data
A Turkish PropBank
TrPropBank: Another Turkish PropBank
A word-sense disambiguation data set
WSD data set from SemEval-2007 Task 12
Another WSD data set
SUTAV: An audio-visual database
Gender identification on Twitter
CHILDES/Aksu: Turkish data in CHILDES
CHILDES/Altınkamış: Turkish data in CHILDES
A Turkish-Dutch code-switching data set
A Turkish-German code-switching data set (Twitter)
A Turkish-German code-switching data set (spoken)
A Turkish-English code-switching data set (Twitter)
MULTILIT: A corpus of Turkish-German and Turkish-French bilinguals
RUEG: Another bilingual corpus project including Turkish speakers in Germany
A POS tagged data set of Turkish-German code switching
A code-switching corpus of Turkish-Dutch
A small code-switching corpus of Turkish-English
XCOPA: A multilingual dataset for causal commonsense reasoning
Marmara Turkish Coreference Resolution Corpus
Mega-COV: A large multi-lingual collection of tweets (IDs) about COVID-19
GeoCoV19: A multi-lingual collection of tweets (IDs) about COVID-19 with geolocation
A corpus of Turkish cyberbullying
A corpus of referring expressions
TELL: Turkish Electronic Living Lexicon
LC-STAR: Multi-lingual speech lexica including Turkish
WikiPron: A multilingual pronunciation dictionary from Wiktionary
TDB: Turkish Discourse Bank
TCL: A lexicon of discourse connectives
TED Multilingual Discourse Bank
TREMO: Emotion Dataset collected through a survey
TURED: Twitter Emotion Dataset (automatically labeled)
TEL: Turkish Emotion Lexicon
NRC-VAD: The NRC Valence, Arousal, and Dominance (English emotion lexicon translated to 100+ languages, including Turkish)
NRC-EmoLex: NRC Word-Emotion Association Lexicon (English emotion lexicon translated to 100+ languages, including Turkish)
NRC-EIL: NRC Emotion Intensity Lexicon (English emotion lexicon translated to 100+ languages, including Turkish)
A corpus for emotion analysis
A corpus for emotion analysis on Twitter
A list of "emotion words"
TrClaim-19: A data set for check-worthy claim detection
X-FACT: A cross-lingual fact-checking data set including Turkish
Turkish FrameNet: A FrameNet for Turkish
METU Corpus: a general-purpose, balanced corpus
TNC: a general-purpose, balanced corpus
BOUN newspaper corpus
TurCo: A corpus from Dokuz Eylül
TS Corpus: A general-purpose corpus
A project on a historical corpus
A collection of historical newspapers (1928-1942) by Istanbul University Library
METU Turkish Psycholinguistic Database
A spoken _English_ corpus of Turkish learners
TrLex: A lexicon resource including (derivational) morphology
UDer: A (preliminary) multi-lingual resource for derivational morphology
UniMorph: A multi-lingual inflectional lexicon (scraped from wiktionary)
BERTurk: Monolingual BERT model for Turkish
mBERT: Multi-lingual BERT model (incl. Turkish)
XLM-R: Multi-lingual language model
Turkish radiology reports
A dataset for checking gender bias
IronyTR: A dataset for irony detection
Another, earlier, irony dataset (precursor of IronyTR)
TRmorph: A Turkish morphological analyzer
A morphological analyzer from Google
TRMOR: Another morphological analyzer (SFST)
The first practical morphological analyzer for Turkish
A corpus manually annotated for morphology
A 1M corpus with morphology (disambiguated "semi automatically")
Kaggle old newspapers data for language identification
Leipzig corpora: A multilingual written corpora collection
OSCAR: A multi-lingual collection of web-crawled data
CoNLL-2017: Supplementary data for CoNLL-2017 UD parsing shared task
TrMWELexicon: A multiword expressions lexicon
troff: A corpus of Turkish offensive language
A Twitter offensive language data set with context
Hurtlex: A multilingual lexicon of "words that hurt"
A hate-speech data set from Sabancı University
A hate-speech data set from Aselsan
SETimes: A parallel news corpus of Balkan languages
Turkish/Turkic MT resources in Apertium
Human-translated Arabic-Turkish parallel data
JW300: A parallel corpus of 300 languages (texts from Jehovah’s Witnesses)
MADAR-Turk: A parallel corpus of (dialectal) Arabic - Turkish
The BiaNet corpus (parallel TR-KU-EN)
A morpheme-aligned parallel corpus (Turkish-Uzbek-English)
IWSLT 2013: Parallel spoken language corpus
OPUS: Multi-lingual open, parallel corpora
BTEC: Basic Traveling Expression Corpus (multilingual)
A corpus of aligned paraphrases
ParlaMint: Multi-lingual parliamentary corpora
A Corpus of Grand National Assembly of Turkish Parliament’s Transcripts
PanLex: A multi-lingual lexical resource
BabelNet: A multi-lingual lexical resource
TuPC-2016: A Turkish paraphrase corpus
TQuAD: A QA data set on Turkish & Islamic Science and History
A QA data set from BOUN
XQuAD: A (small) multi-lingual corpus of question-answer pairs
MKQA: A multi-lingual QA corpus from Apple
A data set of question-answer pairs
A question answering data set (includes translation from SQuAD)
ConceptNet: A multi-lingual semantic network
A semantically-annotated (based on UCCA) data set of 50 sentences from the METU-Sabancı treebank
SemEval 2017 Task 1: Sentence similarity data set
STSb-TR: A text similarity data set
A translation of SentiStrength
A sentiment analysis data set from İTÜ
A sentiment analysis data set (from Eindhoven)
A sentiment analysis data set scraped from an e-commerce site
TRSAv1: A Twitter sentiment analysis data set
A sentiment analysis data set (from Başkent Uni.)
SemEval-2016 Task 5
A sentiment analysis corpus from METU
An automatically labeled large sentiment analysis corpus
A corpus for political SA
Another Twitter corpus for sentiment analysis
A (targeted) Twitter sentiment analysis data set from BOUN
SentiTurkNet: A Turkish sentiment lexicon
A Turkish Movie Reviews corpus
A multi-lingual sentiment lexicon
BosphorusSign: A Turkish Sign Language Recognition Corpus in Health and Finance Domains
Another Turkish Sign Language corpus
AUTSL: Another Turkish Sign Language corpus
İTÜ normalization corpus
A Twitter normalization data set
Another Twitter normalization data set
Normalization data for Turkish-German code switching
A normalization lexicon / corpus
Multilingual LUNA: A multi-lingual, parallel speech corpus
Broadcast news data from BOUN
Data containing audio from movies and read-speech
MuST-C: A parallel spoken corpus of TED talks
Common Voice: A multi-lingual spoken corpus including Turkish
MediaSpeech: A multi-lingual speech corpus including Turkish
CoVoST 2: A Large-Scale Multilingual Speech-To-Text Translation Corpus
Finite state pronunciation lexicon
Turkish SAMPA encoding standard
VoxLingua107: A multilingual speech dataset extracted from YouTube
ISSAI Turkic speech corpus
TurEV-DB: An emotion-voice corpus (METU)
An emotion-voice corpus (Boğaziçi)
A broadcast speech collection
Orientel: Recordings of Telephone conversations of Turkish Speakers in Germany
OrienTel-Turkish: Turkish Telephone Speech
A call center speech data set
Hunspell dictionary for Turkish by the TDD group
METU microphone speech
CoTY: The Corpus of Turkish Youth Language
GlobalPhone: A Multilingual Text & Speech Database in 20 Languages
Global COE: A corpus of spoken Turkish from the Global COE Program
STC: Spoken Turkish corpus
A corpus of student essays
A summarization corpus
WikiLingua: A multi-lingual corpus of abstractive summarization
MLSUM: A multi-lingual news corpus for abstractive summarization
TTC-4900: A text categorization data set
TTC-3600: A text categorization data set
UD-IMST: the first Turkish treebank (UD version)
UD-GB: A treebank of grammar-book examples
TWT: Turkish Web Treebank by Google
UD-BOUN treebank: Another treebank from BOUN
UD-PUD: Turkish part of Google's parallel treebank
Turkish-Penn-15: A constituency treebank of translations of sentences from the Penn TB
UD-Turkish_Penn: A treebank of translations of short sentences from the Penn TB (UD version)
UD-Turkish_Tourism: A domain-specific treebank
UD-Turkish_KeNet: Syntactically annotated examples from the WordNet KeNet
UD-Turkish_FrameNet: Syntactically annotated examples from the Turkish FrameNet
UD-SAGT: A spoken Turkish-German code-switching treebank
IWT: İTÜ web treebank
An EN-SV-TR parallel treebank (mostly automatically annotated)
A small UD treebank of old Turkish/Turkic
METU-Sabancı treebank: the first Turkish treebank (original version)
A Turkic web-crawl corpus
trTenTen: A web corpus of Turkish (in Sketch Engine)
Turkic word embeddings
Turkish word vectors and analogical reasoning task
Another set of word2vec vectors
ConceptNet Numberbatch: Word vectors based on ConceptNet
MUSE: Multi-lingual embeddings
Turkish WordNet from the BalkaNet project
KeNet: Another Turkish WordNet
KelimetriK: Turkish "wug-word" generator