Forced-alignment of the sung acoustic signal using deep neural nets
Abstract
Sung speech shows significant acoustic differences from normal speech, whether careful or spontaneous. To analyse and better understand why sung speech presents a unique challenge for tools such as forced aligners and automatic transcribers, we trained a deep neural network to extract phone-level information from a sung acoustic signal. The current best network takes raw audio from a singer as input and outputs time-aligned phoneme labels, predicting the phoneme the singer is producing at ten-millisecond increments. We use audio data from the Folkways collection, as maintained by the University of Alberta Sound Studies Institute. The data consist of several folk songs, mostly sung a cappella by a few individual singers. Before being used as training or testing data, each song was aligned by hand, sectioning off each individual phoneme that appears and marking its start and end points. The audio is then cut into twenty-five-millisecond frames spaced ten milliseconds apart. To further increase the amount of training data, the entire dataset was duplicated and noise was added to the copies. The network's performance is evaluated automatically during training by comparing the label it assigns to each frame with the label given to that frame by the human transcriber; once all frames have been evaluated, the network receives an accuracy score reflecting how many labels it assigned correctly. By this method, we found that speech and sung speech differ acoustically enough that the two tasks require separate acoustic models. However, using training data from both genres increased the accuracy of the overall model.
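As a rough illustration of the framing, noise augmentation, and frame-level evaluation described in the abstract, the sketch below shows one way these steps are commonly implemented. It is not the authors' code: the 16 kHz sample rate, the 20 dB SNR, the "sil" fallback label, and all function names are assumptions made for illustration.

```python
import numpy as np

# Parameters taken from the abstract: 25 ms frames, 10 ms hop.
SAMPLE_RATE = 16000                    # assumed; the abstract does not state a rate
FRAME_LEN = int(0.025 * SAMPLE_RATE)   # 25 ms -> 400 samples
HOP_LEN = int(0.010 * SAMPLE_RATE)     # 10 ms -> 160 samples

def frame_audio(audio: np.ndarray) -> np.ndarray:
    """Cut raw audio into 25 ms frames spaced 10 ms apart."""
    n_frames = 1 + (len(audio) - FRAME_LEN) // HOP_LEN
    return np.stack([audio[i * HOP_LEN : i * HOP_LEN + FRAME_LEN]
                     for i in range(n_frames)])

def add_noise(audio: np.ndarray, snr_db: float = 20.0) -> np.ndarray:
    """Duplicate-and-add-noise augmentation: a noisy copy at a chosen SNR."""
    signal_power = np.mean(audio ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = np.random.normal(0.0, np.sqrt(noise_power), size=audio.shape)
    return audio + noise

def labels_from_alignment(intervals, n_frames: int) -> np.ndarray:
    """Assign each frame the phoneme whose hand-set (start, end) interval,
    in seconds, contains the frame's centre time; 'sil' if none does."""
    labels = []
    for f in range(n_frames):
        centre = (f * HOP_LEN + FRAME_LEN / 2) / SAMPLE_RATE
        labels.append(next((ph for start, end, ph in intervals
                            if start <= centre < end), "sil"))
    return np.array(labels)

def frame_accuracy(predicted: np.ndarray, reference: np.ndarray) -> float:
    """Fraction of frames whose predicted phoneme label matches the
    label assigned by the human transcriber."""
    return float(np.mean(predicted == reference))
```

Under these assumptions, a song's accuracy score would be computed by framing its audio, collecting the network's per-frame predictions, building reference labels from the hand alignment, and calling `frame_accuracy` on the two label sequences.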