GlobalPhone Language Models
by Ngoc Thang Vu, Tanja Schultz, 2012
Introduction
GlobalPhone is an ongoing database collection that provides transcribed speech data for the development and evaluation of large speech processing systems in the most widespread languages of the world. GlobalPhone is designed to be uniform across languages with respect to the amount of text and audio data per language, the audio data quality (microphone, noise, channel), the collection scenario (task, setup, speaking style etc.), and the transcription conventions. The GlobalPhone corpus provides an excellent basis for research in the areas of (1) multilingual speech recognition, (2) rapid deployment of speech processing systems to new languages, (3) language and speaker identification tasks, (4) multilingual speech synthesis, (5) monolingual speech recognition in a large variety of languages, as well as (6) comparisons across major languages based on text and speech data.
Download 3-gram Language Models
Languages | Perplexity (PPL) | OOV [%] | Vocabulary size | Download |
---|---|---|---|---|
Bulgarian | 454 | 1.0 | 274k | BG.lm |
Czech | 1421 | 4.0 | 267k | CZ.lm |
French | 324 | 2.4 | 65k | FR.lm |
German | 672 | 0.3 | 38k | GE.lm |
Hausa | 97 | 0.5 | 41k | HAU.lm |
Croatian | 721 | 3.6 | 362k | HR.lm |
Japanese | 89 | 1.0 | 67k | JP.lm |
Korean (char) | 25 | 0 | 1.3k | KO.lm |
Mandarin | 262 | 0.8 | 13k | MAN.lm |
Portuguese | 58 | 9.8 | 62k | PT.lm |
Polish | 951 | 0.8 | 243k | PL.lm |
Russian | 1310 | 3.9 | 293k | RU.lm |
Spanish | 154 | 0.1 | 19k | SP.lm |
Swedish | 423 | 5.3 | 73k | SWE.lm |
Tamil | 730 | 1.0 | 288k | TA.lm |
Thai | 70 | 0.1 | 22k | TH.lm |
Turkish | XXX | 13.2 | 29k | TU.lm |
Vietnamese | 218 | 0 | 30k | VN.lm |
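The two quality columns in the table can be read using the standard definitions of out-of-vocabulary (OOV) rate and perplexity. The following is a minimal sketch of those textbook definitions, not the evaluation procedure behind the numbers above: the page does not state how OOV tokens or sentence-boundary tokens were handled when the values were computed. The function names `oov_rate` and `perplexity` are hypothetical, and the per-token log probabilities are assumed to be base-10 values supplied by whatever language model toolkit is in use.

```python
def oov_rate(tokens, vocabulary):
    """Return the percentage of running tokens missing from the vocabulary."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    misses = sum(1 for token in tokens if token not in vocabulary)
    return 100.0 * misses / len(tokens)


def perplexity(log10_probs):
    """Return the perplexity given one base-10 log probability per token.

    PPL = 10 ** (-(1/N) * sum(log10 p_i)), where N is the number of
    scored tokens. Which tokens count toward N (OOVs, sentence ends)
    is an assumption left to the caller.
    """
    if not log10_probs:
        raise ValueError("log10_probs must not be empty")
    average = sum(log10_probs) / len(log10_probs)
    return 10.0 ** (-average)


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    vocab = {"the", "cat", "sat"}
    text = ["the", "cat", "sat", "down"]
    print(f"OOV rate: {oov_rate(text, vocab):.1f} %")
    print(f"Perplexity: {perplexity([-1.0, -2.0, -3.0]):.1f}")
```

A lower perplexity means the model assigns higher average probability to the test text, and a lower OOV rate means fewer test words fall outside the model's vocabulary. The two are best read together, since a small vocabulary can lower perplexity while raising the OOV rate.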
Contact
Acknowledgements
We would like to thank everyone who helped us collect the data corpus.
References
GlobalPhone: A Multilingual Text and Speech Database in 20 Languages
Tanja Schultz, Ngoc Thang Vu, and Tim Schlippe. In Proc. of ICASSP, Canada, 2013
Language Independent and Language Adaptive Acoustic Modeling for Speech Recognition
Tanja Schultz and Alex Waibel, Speech Communication, Volume 35, Issue 1-2, pp 31-51
GlobalPhone: A Multilingual Speech and Text Database developed at Karlsruhe University
Tanja Schultz. In Proc. of the International Conference of Spoken Language Processing, ICSLP, Denver, CO, 2002
Rapid Bootstrapping of five Eastern European Languages using the Rapid Language Adaptation Toolkit
Ngoc Thang Vu, Tim Schlippe, Franziska Kraus, and Tanja Schultz. In Proc. of Interspeech, Japan, 2010
Cross-language bootstrapping based on completely unsupervised training using multilingual A-stabil
Ngoc Thang Vu, Franziska Kraus, and Tanja Schultz. In Proc. of ICASSP, Czech Republic, 2011
For further publications on language-specific issues, please refer to the CSL publication page.