The following is an unorganized list of resources for Turkish natural language processing. The survey paper that resulted in this list can be found here. If you want a particular resource to be included in this list, please fill out this form. Alternatively, you can create an issue in the GitHub repository to suggest new resources or report inaccuracies (pull requests are also welcome).
- Multiple resources from Yıldız NLP group (Kemik)
- A collection of resources by Deniz Yüret
- Ottoman (local) newspapers
- ODIN: A multi-lingual corpus of linguistic examples with glosses
- A named-entity data set (from Bilkent)
- A Twitter NER data set
- A NER data set from METU
- A NER data set from BOUN
- A NER data set from İTÜ
- A NER data set from Huawei research
- TurkIE: Another NER data set (broken link)
- WikiAnn: A multilingual name tagging/linking data set based on Wikipedia
- Another Twitter NER data set from Sabancı University
- TR-SEQ: A named-entity data set of search engine queries
- A data set for NER and stance detection
- XNLI: Cross-lingual natural language inference data including Turkish
- NLI-TR: Turkish (automatic) translations of SNLI and MultiNLI
- Zemberek: A Java NLP toolkit
- Multilingual ATIS
- xSID: A multi-lingual intent and slot classification data set
- A Turkish PropBank
- TrPropBank: Another Turkish PropBank
- A word-sense disambiguation data set
- WSD data set from SemEval-2007 Task 12
- Another WSD data set
- SUTAV: An audio-visual database
- Gender identification on Twitter
- CHILDES/Aksu: Turkish data in CHILDES
- CHILDES/Altınkamış: Turkish data in CHILDES
- A Turkish-Dutch code-switching data set
- A Turkish-German code-switching data set (Twitter)
- A Turkish-German code-switching data set (spoken)
- A Turkish-English code-switching data set (Twitter)
- MULTILIT: A corpus of Turkish-German and Turkish-French bilinguals
- RUEG: Another bilingual corpus project including Turkish speakers in Germany
- A POS tagged data set of Turkish-German code switching
- A code-switching corpus of Turkish-Dutch
- A small code-switching corpus of Turkish-English
- XCOPA: A multilingual dataset for causal commonsense reasoning
- Marmara Turkish Coreference Resolution Corpus
- Mega-COV: A large multi-lingual collection of tweets (IDs) about COVID-19
- GeoCoV19: A multi-lingual collection of tweets (IDs) about COVID-19 with geolocation
- A corpus of Turkish cyberbullying
- A corpus of referring expressions
- TELL: Turkish Electronic Living Lexicon
- LC-STAR: Multi-lingual speech lexica including Turkish
- WikiPron: A multilingual pronunciation dictionary from Wiktionary
- TDB: Turkish Discourse Bank
- TCL: A lexicon of discourse connectives
- TED Multilingual Discourse Bank
- TREMO: Emotion Dataset collected through a survey
- TURED: Twitter Emotion Dataset (automatically labeled)
- TEL: Turkish Emotion Lexicon
- NRC-VAD: The NRC Valence, Arousal, and Dominance (English emotion lexicon translated to 100+ languages, including Turkish)
- NRC-EmoLex: NRC Word-Emotion Association Lexicon (English emotion lexicon translated to 100+ languages, including Turkish)
- NRC-EIL: NRC Emotion Intensity Lexicon (English emotion lexicon translated to 100+ languages, including Turkish)
- A corpus for Emotion analysis
- A corpus for emotion analysis on Twitter
- A list of "emotion words"
- TrClaim-19: A data set for check-worthy claim detection
- X-FACT: A cross-lingual fact-checking data set including Turkish
- Turkish FrameNet: A FrameNet for Turkish
- METU Corpus: a general-purpose, balanced corpus
- TNC: a general-purpose, balanced corpus
- BOUN newspaper corpus
- TurCo: A corpus from Dokuz Eylül
- TS Corpus: A general-purpose corpus
- A project on a historical corpus
- A collection of historical newspapers (1928-1942) by Istanbul University Library
- METU Turkish Psycholinguistic Database
- A spoken _English_ corpus of Turkish learners
- TrLex: A lexicon resource including (derivational) morphology
- UDer: A (preliminary) multi-lingual resource for derivational morphology
- UniMorph: A multi-lingual inflectional lexicon (scraped from Wiktionary)
- BERTurk: Monolingual BERT model for Turkish
- mBERT: Multi-lingual BERT model (incl. Turkish)
- XLM-R: Multi-lingual language model
- Turkish radiology reports
- A dataset for checking gender bias
- IronyTR: A dataset for irony detection
- Another, earlier, irony dataset (precursor of IronyTR)
- TRmorph: A Turkish morphological analyzer
- A morphological analyzer from Google
- TRMOR: Another morphological analyzer (SFST)
- The first practical morphological analyzer for Turkish
- A corpus manually annotated for morphology
- A 1M corpus with morphology (disambiguated "semi automatically")
- Kaggle old newspapers data for language identification
- Leipzig corpora: A multilingual written corpora collection
- OSCAR: A multi-lingual collection of web-crawled data
- CoNLL-2017: Supplementary data for CoNLL-2017 UD parsing shared task
- TrMWELexicon: A multiword expressions lexicon
- troff: A corpus of Turkish offensive language
- A Twitter offensive language data set with context
- Hurtlex: A multilingual lexicon of "words that hurt"
- A hate-speech data set from Sabancı University
- A hate-speech data set from Aselsan
- SETimes: A parallel news corpus of Balkan languages
- Turkish/Turkic MT resources in Apertium
- Human-translated Arabic-Turkish parallel data
- JW300: A parallel corpus of 300 languages (texts from Jehovah’s Witnesses)
- MADAR-Turk: A parallel corpus of (dialectal) Arabic - Turkish
- The BiaNet corpus (parallel TR-KU-EN)
- A morpheme-aligned parallel corpus (Turkish-Uzbek-English)
- IWSLT 2013: Parallel spoken language corpus
- OPUS: Multi-lingual open, parallel corpora
- BTEC: Basic Traveling Expression Corpus (multilingual)
- A corpus of aligned paraphrases
- ParlaMint: Multi-lingual parliamentary corpora
- A Corpus of Grand National Assembly of Turkish Parliament’s Transcripts
- PanLex: A multi-lingual lexical resource
- BabelNet: A multi-lingual lexical resource
- TuPC-2016: A Turkish paraphrase corpus
- TQuAD: A QA data set on Turkish & Islamic Science and History
- A QA data set from BOUN
- XQuAD: A (small) multi-lingual corpus of question-answer pairs
- MKQA: A multi-lingual QA corpus from Apple
- A data set of question-answer pairs
- A question answering data set (includes translation from SQuAD)
- ConceptNet: A multi-lingual semantic network
- A semantically-annotated (based on UCCA) data set of 50 sentences from the METU-Sabancı treebank
- SemEval 2017 Task 1: Sentence similarity data set
- STSb-TR: A text similarity data set
- A translation of SentiStrength
- A sentiment analysis data set from İTÜ
- A sentiment analysis data set (from Eindhoven)
- A sentiment analysis data set scraped from an e-commerce site
- TRSAv1: A Twitter sentiment analysis data set
- A sentiment analysis data set (from Başkent Uni.)
- SemEval2016task5
- A sentiment analysis corpus from METU
- An automatically labeled large sentiment analysis corpus
- A corpus for political SA
- Another Twitter corpus for sentiment analysis
- A (targeted) Twitter sentiment analysis data set from BOUN
- SentiTurkNet: A Turkish sentiment lexicon
- A Turkish Movie Reviews corpus
- A multi-lingual sentiment lexicon
- BosphorusSign: A Turkish Sign Language Recognition Corpus in Health and Finance Domains
- Another Turkish Sign Language corpus
- AUTSL: Another Turkish Sign Language corpus
- İTÜ normalization corpus
- A Twitter normalization data set
- Another Twitter normalization data set
- Normalization data for Turkish-German code switching
- A normalization lexicon / corpus
- Multilingual LUNA: A multi-lingual, parallel speech corpus
- Broadcast news data from BOUN
- Data containing audio from movies and read-speech
- MuST-C: A parallel spoken corpus of TED talks
- Common Voice: A multi-lingual spoken corpus including Turkish
- MediaSpeech: A multi-lingual speech corpus including Turkish
- CoVoST 2: A Large-Scale Multilingual Speech-To-Text Translation Corpus
- Finite state pronunciation lexicon
- Turkish SAMPA encoding standard
- VoxLingua107: A multilingual speech dataset extracted from YouTube
- ISSAI Turkic speech corpus
- TurEV-DB: An emotion-voice corpus (METU)
- An emotion-voice corpus (Boğaziçi)
- A broadcast speech collection
- Orientel: Recordings of Telephone conversations of Turkish Speakers in Germany
- OrienTel-Turkish: Turkish Telephone Speech
- A call center speech data set
- Hunspell dictionary for Turkish by the TDD group.
- METU microphone speech
- CoTY: The Corpus of Turkish Youth Language
- GlobalPhone: A Multilingual Text & Speech Database in 20 Languages
- Global COE: A corpus of spoken Turkish from the Global COE Program
- STC: Spoken Turkish corpus
- A corpus of student essays
- A summarization corpus
- WikiLingua: A multi-lingual corpus of abstractive summarization
- MLSUM: A multi-lingual news corpus for abstractive summarization
- TTC-4900: A text categorization data set
- TTC-3600: A text categorization data set
- UD-IMST: the first Turkish treebank (UD version)
- UD-GB: A treebank of grammar-book examples
- TWT: Turkish Web Treebank by Google
- UD-BOUN treebank: Another treebank from BOUN
- UD-PUD: Turkish part of Google's parallel treebank
- Turkish-Penn-15: A constituency treebank of translation of sentences from the Penn TB
- UD-Turkish_Penn: A treebank of translation of short sentences from the Penn TB (UD version)
- UD-Turkish_Tourism: A domain-specific treebank
- UD-Turkish_KeNet: Syntactically annotated examples from the WordNet KeNet
- UD-Turkish_FrameNet: Syntactically annotated examples from the Turkish FrameNet
- UD-SAGT: A spoken Turkish-German code-switching treebank
- IWT: İTÜ web treebank
- An EN-SV-TR parallel treebank (mostly automatically annotated)
- A small UD treebank of old Turkish/Turkic
- METU-Sabancı treebank: the first Turkish treebank (original version)
- A Turkic web-crawl corpus
- trTenTen: A web corpus of Turkish (in Sketch Engine)
- Turkic word embeddings
- Turkish word vectors and analogical reasoning task
- Another set of word2vec vectors
- ConceptNet Numberbatch: Word vectors based on ConceptNet
- MUSE: Multilingual embeddings
- Turkish WordNet from the BalkaNet project
- KeNet: Another Turkish WordNet
- KelimetriK: Turkish "wug-word" generator