GlobalPhone Language Models

by Ngoc Thang Vu, Tanja Schultz, 2012

GlobalPhone is an ongoing database collection that provides transcribed speech data for the development and evaluation of large speech processing systems in the most widespread languages of the world. GlobalPhone is designed to be uniform across languages with respect to the amount of text and audio data per language, the audio data quality (microphone, noise, channel), the collection scenario (task, setup, speaking style, etc.), and the transcription conventions. The GlobalPhone corpus provides an excellent basis for research in the areas of (1) multilingual speech recognition, (2) rapid deployment of speech processing systems to new languages, (3) language and speaker identification tasks, (4) multilingual speech synthesis, (5) monolingual speech recognition in a large variety of languages, and (6) comparisons across major languages based on text and speech data.

Language	Perplexity (PPL)	OOV rate [%]	Vocabulary size	Download
Bulgarian	454	1.0	274k	BG.lm
Czech	1421	4.0	267k	CZ.lm
French	324	2.4	65k	FR.lm
German	672	0.3	38k	GE.lm
Hausa	97	0.5	41k	HAU.lm
Croatian	721	3.6	362k	HR.lm
Japanese	89	1.0	67k	JP.lm
Korean (char)	25	0	1.3k	KO.lm
Mandarin	262	0.8	13k	MAN.lm
Portuguese	58	9.8	62k	PT.lm
Polish	951	0.8	243k	PL.lm
Russian	1310	3.9	293k	RU.lm
Spanish	154	0.1	19k	SP.lm
Swedish	423	5.3	73k	SWE.lm
Tamil	730	1.0	288k	TA.lm
Thai	70	0.1	22k	TH.lm
Turkish	XXX	13.2	29k	TU.lm
Vietnamese	218	0	30k	VN.lm
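
The page does not state how the perplexity and OOV figures in the table were computed. As a rough illustration only, the sketch below uses the standard textbook definitions: the OOV rate is the percentage of running test-set tokens missing from the vocabulary, and perplexity is the inverse geometric mean of the per-token probabilities. The function names, the toy data, and the choice of base-10 log probabilities are assumptions made for this example and are not taken from the GlobalPhone tooling.

```python
def oov_rate(tokens, vocabulary):
    """Percentage of running tokens that are not in the vocabulary."""
    if not tokens:
        raise ValueError("tokens must be non-empty")
    oov_count = sum(1 for token in tokens if token not in vocabulary)
    return 100.0 * oov_count / len(tokens)


def perplexity(log10_probs):
    """Perplexity from per-token base-10 log probabilities.

    PPL = 10 ** (-(1/N) * sum(log10 p_i))
    """
    if not log10_probs:
        raise ValueError("log10_probs must be non-empty")
    average = sum(log10_probs) / len(log10_probs)
    return 10.0 ** (-average)


# Toy data (hypothetical, for illustration only).
vocabulary = {"the", "cat", "sat"}
test_tokens = ["the", "cat", "sat", "down"]
token_log10_probs = [-1.0, -2.0, -3.0]

print(oov_rate(test_tokens, vocabulary))   # one of four tokens is OOV
print(perplexity(token_log10_probs))       # average log10 prob is -2
```

Under these definitions a larger vocabulary tends to lower the OOV rate while raising perplexity, which is one reason the figures in the table are not directly comparable across languages.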

Contact

Tanja Schultz

Acknowledgements

We would like to thank all who have helped us collect the data corpus.

GlobalPhone: A Multilingual Text and Speech Database in 20 Languages
Tanja Schultz, Ngoc Thang Vu, and Tim Schlippe. In Proc. of ICASSP, Canada, 2013
Language Independent and Language Adaptive Acoustic Modeling for Speech Recognition
Tanja Schultz and Alex Waibel. Speech Communication, Volume 35, Issue 1-2, pp. 31-51, 2001
GlobalPhone: A Multilingual Speech and Text Database developed at Karlsruhe University
Tanja Schultz. In Proc. of the International Conference on Spoken Language Processing, ICSLP, Denver, CO, 2002
Rapid Bootstrapping of five Eastern European Languages using the Rapid Language Adaptation Toolkit
Ngoc Thang Vu, Tim Schlippe, Franziska Kraus, and Tanja Schultz. In Proc. of Interspeech, Japan, 2010
Cross-language bootstrapping based on completely unsupervised training using multilingual A-stabil
Ngoc Thang Vu, Franziska Kraus, and Tanja Schultz. In Proc. of ICASSP, Czech Republic, 2011