The following is an unorganized list of resources for Turkish natural language processing. The survey paper that resulted in this list can be found here. If you want a particular resource to be included in this list, please fill out this form. Alternatively, you can create an issue in the GitHub repository to suggest new resources or report inaccuracies (pull requests are also welcome).

  1. Multiple resources from Yıldız NLP group (Kemik)
  2. A collection of resources by Deniz Yüret
  3. Ottoman (local) newspapers
  4. ODIN: A multi-lingual corpus of linguistic examples with glosses
  5. A named-entity data set (from Bilkent)
  6. A Twitter NER data set
  7. A NER data set from METU
  8. A NER data set from BOUN
  9. A NER data set from İTÜ
  10. A NER data set from Huawei research
  11. TurkIE: Another NER data set (broken link)
  12. WikiAnn: A multilingual name tagging/linking data set based on Wikipedia
  13. Another Twitter NER data set from Sabancı University
  14. TR-SEQ: A named-entity data set of search engine queries
  15. A data set for NER and Stance detection
  16. XNLI: Cross-lingual natural language inference data including Turkish
  17. NLI-TR: Turkish (automatic) translations of SNLI and MultiNLI
  18. Zemberek: A Java NLP toolkit
  19. Multilingual ATIS
  20. xSID: A multi-lingual intent and slot classification data set
  21. A Turkish PropBank
  22. TrPropBank: Another Turkish PropBank
  23. A word-sense disambiguation data set
  24. WSD data set from SemEval-2007 Task 12
  25. Another WSD data set
  26. SUTAV: An audio-visual database
  27. Gender identification on Twitter
  28. CHILDES/Aksu: Turkish data in CHILDES
  29. CHILDES/Altınkamış: Turkish data in CHILDES
  30. A Turkish-Dutch code-switching data set
  31. A Turkish-German code-switching data set (Twitter)
  32. A Turkish-German code-switching data set (spoken)
  33. A Turkish-English code-switching data set (Twitter)
  34. MULTILIT: A corpus of Turkish-German and Turkish-French bilinguals
  35. RUEG: Another bilingual corpus project including Turkish speakers in Germany
  36. A POS tagged data set of Turkish-German code switching
  37. A code-switching corpus of Turkish-Dutch
  38. A small code-switching corpus of Turkish-English
  39. XCOPA: A multilingual dataset for causal commonsense reasoning
  40. Marmara Turkish Coreference Resolution Corpus
  41. Mega-COV: A large multi-lingual collection of tweets (IDs) about COVID-19
  42. GeoCoV19: A multi-lingual collection of tweets (IDs) about COVID-19 with geolocation
  43. A corpus of Turkish cyberbullying
  44. A corpus of referring expressions
  45. TELL: Turkish Electronic Living Lexicon
  46. LC-STAR: Multi-lingual speech lexica including Turkish
  47. WikiPron: A multilingual pronunciation dictionary from Wiktionary
  48. TDB: Turkish Discourse Bank
  49. TCL: A lexicon of discourse connectives
  50. TED Multilingual Discourse Bank
  51. TREMO: Emotion Dataset collected through a survey
  52. TURED: Twitter Emotion Dataset (automatically labeled)
  53. TEL: Turkish Emotion Lexicon
  54. NRC-VAD: The NRC Valence, Arousal, and Dominance Lexicon (English emotion lexicon translated to 100+ languages, including Turkish)
  55. NRC-EmoLex: NRC Word-Emotion Association Lexicon (English emotion lexicon translated to 100+ languages, including Turkish)
  56. NRC-EIL: NRC Emotion Intensity Lexicon (English emotion lexicon translated to 100+ languages, including Turkish)
  57. A corpus for emotion analysis
  58. A corpus for emotion analysis on Twitter
  59. A list of "emotion words"
  60. TrClaim-19: A data set for check-worthy claim detection
  61. X-FACT: A cross-lingual fact-checking data set including Turkish
  62. Turkish FrameNet: A FrameNet for Turkish
  63. METU Corpus: a general-purpose, balanced corpus
  64. TNC: a general-purpose, balanced corpus
  65. BOUN newspaper corpus
  66. TurCo: A corpus from Dokuz Eylül
  67. TS Corpus: A general-purpose corpus
  68. A project on a historical corpus
  69. A collection of historical newspapers (1928-1942) by Istanbul University Library
  70. METU Turkish Psycholinguistic Database
  71. A spoken _English_ corpus of Turkish learners
  72. TrLex: A lexicon resource including (derivational) morphology
  73. UDer: A (preliminary) multi-lingual resource for derivational morphology
  74. UniMorph: A multi-lingual inflectional lexicon (scraped from Wiktionary)
  75. BERTurk: Monolingual BERT model for Turkish
  76. mBERT: Multi-lingual BERT model (incl. Turkish)
  77. XLM-R: Multi-lingual language model
  78. Turkish radiology reports
  79. A dataset for checking gender bias
  80. IronyTR: A dataset for irony detection
  81. Another, earlier, irony dataset (precursor of IronyTR)
  82. TRmorph: A Turkish morphological analyzer
  83. A morphological analyzer from Google
  84. TRMOR: Another morphological analyzer (SFST)
  85. The first practical morphological analyzer for Turkish
  86. A corpus manually annotated for morphology
  87. A 1M corpus with morphology (disambiguated "semi automatically")
  88. Kaggle old newspapers data for language identification
  89. Leipzig corpora: A multilingual written corpora collection
  90. OSCAR: A multi-lingual collection of web-crawled data
  91. CoNLL-2017: Supplementary data for CoNLL-2017 UD parsing shared task
  92. TrMWELexicon: A multiword expressions lexicon
  93. troff: A corpus of Turkish offensive language
  94. A Twitter offensive language data set with context
  95. Hurtlex: A multilingual lexicon of "words that hurt"
  96. A hate-speech data set from Sabancı University
  97. A hate-speech data set from Aselsan
  98. SETimes: A parallel news corpus of Balkan languages
  99. Turkish/Turkic MT resources in Apertium
  100. Human-translated Arabic-Turkish parallel data
  101. JW300: A parallel corpus of 300 languages (texts from Jehovah’s Witnesses)
  102. MADAR-Turk: A parallel corpus of (dialectal) Arabic - Turkish
  103. The BiaNet corpus (parallel TR-KU-EN)
  104. A morpheme-aligned parallel corpus (Turkish-Uzbek-English)
  105. IWSLT 2013: Parallel spoken language corpus
  106. OPUS: Multi-lingual open, parallel corpora
  107. BTEC: Basic Traveling Expression Corpus (multilingual)
  108. A corpus of aligned paraphrases
  109. ParlaMint: Multi-lingual parliamentary corpora
  110. A Corpus of Grand National Assembly of Turkish Parliament’s Transcripts
  111. PanLex: A multi-lingual lexical resource
  112. BabelNet: A multi-lingual lexical resource
  113. TuPC-2016: A Turkish paraphrase corpus
  114. TQuAD: A QA data set on Turkish & Islamic Science and History
  115. A QA data set from BOUN
  116. XQuAD: A (small) multi-lingual corpus of question-answer pairs
  117. MKQA: A multi-lingual QA corpus from Apple
  118. A data set of question-answer pairs
  119. A question answering data set (includes translation from SQuAD)
  120. ConceptNet: A multi-lingual semantic network
  121. A semantically-annotated (based on UCCA) data set of 50 sentences from the METU-Sabancı treebank
  122. SemEval 2017 Task 1: Sentence similarity data set
  123. STSb-TR: A text similarity data set
  124. A translation of SentiStrength
  125. A sentiment analysis data set from İTÜ
  126. A sentiment analysis data set (from Eindhoven)
  127. A sentiment analysis data set scraped from an e-commerce site
  128. TRSAv1: A Twitter sentiment analysis data set
  129. A sentiment analysis data set (from Başkent Uni.)
  130. SemEval-2016 Task 5: Aspect-based sentiment analysis data set
  131. A sentiment analysis corpus from METU
  132. An automatically labeled large sentiment analysis corpus
  133. A corpus for political SA
  134. Another Twitter corpus for sentiment analysis
  135. A (targeted) Twitter sentiment analysis data set from BOUN
  136. SentiTurkNet: A Turkish sentiment lexicon
  137. A Turkish Movie Reviews corpus
  138. A multi-lingual sentiment lexicon
  139. BosphorusSign: A Turkish Sign Language Recognition Corpus in Health and Finance Domains
  140. Another Turkish Sign Language corpus
  141. AUTSL: Another Turkish Sign Language corpus
  142. İTÜ normalization corpus
  143. A Twitter normalization data set
  144. Another Twitter normalization data set
  145. Normalization data for Turkish-German code switching
  146. A normalization lexicon / corpus
  147. Multilingual LUNA: A multi-lingual, parallel speech corpus
  148. Broadcast news data from BOUN
  149. Data containing audio from movies and read-speech
  150. MuST-C: A parallel spoken corpus of TED talks
  151. Common Voice: A multi-lingual spoken corpus including Turkish
  152. MediaSpeech: A multi-lingual speech corpus including Turkish
  153. CoVoST 2: A Large-Scale Multilingual Speech-To-Text Translation Corpus
  154. Finite state pronunciation lexicon
  155. Turkish SAMPA encoding standard
  156. VoxLingua107: A multilingual speech dataset extracted from YouTube
  157. ISSAI Turkic speech corpus
  158. TurEV-DB: An emotion-voice corpus (METU)
  159. An emotion-voice corpus (Boğaziçi)
  160. A broadcast speech collection
  161. OrienTel: Recordings of telephone conversations of Turkish speakers in Germany
  162. OrienTel-Turkish: Turkish Telephone Speech
  163. A call center speech data set
  164. Hunspell dictionary for Turkish by the TDD group
  165. METU microphone speech
  166. CoTY: The Corpus of Turkish Youth Language
  167. GlobalPhone: A Multilingual Text & Speech Database in 20 Languages
  168. Global COE: A corpus of spoken Turkish from the Global COE Program
  169. STC: Spoken Turkish corpus
  170. A corpus of student essays
  171. A summarization corpus
  172. WikiLingua: A multi-lingual corpus of abstractive summarization
  173. MLSUM: A multi-lingual news corpus for abstractive summarization
  174. TTC-4900: A text categorization data set
  175. TTC-3600: A text categorization data set
  176. UD-IMST: the first Turkish treebank (UD version)
  177. UD-GB: A treebank of grammar-book examples
  178. TWT: Turkish Web Treebank by Google
  179. UD-BOUN treebank: Another treebank from BOUN
  180. UD-PUD: Turkish part of Google's parallel treebank
  181. Turkish-Penn-15: A constituency treebank of translation of sentences from the Penn TB
  182. UD-Turkish_Penn: A treebank of translation of short sentences from the Penn TB (UD version)
  183. UD-Turkish_Tourism: A domain-specific treebank
  184. UD-Turkish_KeNet: Syntactically annotated examples from the WordNet KeNet
  185. UD-Turkish_FrameNet: Syntactically annotated examples from the Turkish FrameNet
  186. UD-SAGT: A spoken Turkish-German code-switching treebank
  187. IWT: İTÜ web treebank
  188. An EN-SV-TR parallel treebank (mostly automatically annotated)
  189. A small UD treebank of old Turkish/Turkic
  190. METU-Sabancı treebank: the first Turkish treebank (original version)
  191. A Turkic web-crawl corpus
  192. trTenTen: A web corpus of Turkish (in Sketch Engine)
  193. Turkic word embeddings
  194. Turkish word vectors and analogical reasoning task
  195. Another set of word2vec vectors
  196. ConceptNet Numberbatch: Word vectors based on ConceptNet
  197. MUSE: Multi-lingual embeddings
  198. Turkish WordNet from the BalkaNet project
  199. KeNet: Another Turkish WordNet
  200. KelimetriK: Turkish "wug-word" generator