What Do Forced Alignment Likelihood Scores Tell Us About the Aligned Speech?
Forced alignment systems are widely used in phonetic research. They use Hidden Markov Models to align a sequence of phones to a sound recording and, as a byproduct, compute likelihood scores for every aligned phone and word. This study investigates the extent to which these likelihood scores can be (1) put to use in speech research and (2) interpreted as a measure of acoustic distance (of some variety) to the modeled phones, placing individual aligned segments within their distribution of phonetic variation. The present study is a first step toward these goals. To this end, vowels in hold-out portions of the TIMIT (Zue & Seneff 1988) and Buckeye (Pitt et al. 2005) corpora were cross-aligned with phone models trained on the remaining portions of those corpora (tokens of [i] were aligned with the [?] phone, the [e] phone, the [?] phone, etc.), and the resulting likelihood scores were compared to acoustic measures such as duration and formant frequencies to determine which acoustic properties are encapsulated in the scores. The scores were also compared with those provided by the freely available Penn Forced Aligner (Yuan & Liberman 2008). Preliminary analyses find a strong correlation between the cross-alignment scores and geometric distance in F1 × F2 space, as well as phone duration. This establishes that the likelihood scores are related to at least some acoustic characteristics of the segments. The results of this initial analysis are promising, but further evaluation is needed to explore the full scope and limitations of these measures.
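As an illustration of the comparison described above, the following is a minimal sketch of how a per-token F1 × F2 distance could be computed and correlated with alignment scores. It is not the study's actual procedure: the function names, the use of raw Hz, the choice of a single reference point per phone, and all numeric values are hypothetical and serve only to make the idea concrete.

```python
import math

def f1f2_distance(token, reference):
    """Euclidean distance between two (F1, F2) points, here in Hz."""
    return math.hypot(token[0] - reference[0], token[1] - reference[1])

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy values, NOT data from the study:
# a reference (F1, F2) point for one phone model, some vowel tokens,
# and the log-likelihood each token received when aligned to that model.
reference = (300.0, 2300.0)
tokens = [(310.0, 2250.0), (350.0, 2100.0), (420.0, 1900.0), (500.0, 1700.0)]
log_likelihoods = [-55.0, -61.0, -70.0, -82.0]

distances = [f1f2_distance(t, reference) for t in tokens]
r = pearson_r(distances, log_likelihoods)
print(f"Pearson r between F1 x F2 distance and log-likelihood: {r:.3f}")
```

If likelihood scores do behave like an acoustic distance, tokens farther from the reference point should receive lower scores, which would show up as a negative correlation in a comparison of this kind.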
 Pitt, M. A., Johnson, K., Hume, E., Kiesling, S., & Raymond, W. 2005. The Buckeye corpus of conversational speech: labeling conventions and a test of transcriber reliability. Speech Communication 45, 89-95.
 Yuan, J. & Liberman, M. 2008. Speaker identification on the SCOTUS corpus. Proceedings of Acoustics 2008.
 Zue, V. & Seneff, S. 1988. Transcription and alignment of the TIMIT database. Proceedings of the 2nd Meeting on Advanced Man-Machine Interface through Spoken Language, 11.1-11.10.