Musical Speech

A Transformer-based Composition Tool


F.A.Q. (Frequently Asked Questions)



This system is meant to be a tool to assist in musical composition by analysing speech and generating potential musical material based on that speech. While we find this material interesting to listen to on its own, and it often draws our attention to elements of the speech that we might not otherwise have noticed, this material is in no way intended to be a complete (i.e. "stand-alone") composition in itself. Rather, it is material which can then be taken by a composer or producer and used for further processing! See the answer to "Has anything like this been done before?" to better understand the underlying motivation, and see the answer to "How were the samples created?" to better understand how this material can be processed. A simplified version of the potential workflow is illustrated in this sample demo, which shows the progression, beginning with (1) raw speech, followed by (2) close pitch tracking, (3) transformer output with slight adjustment, (4) accompaniment created for the output of the previous step, and finally (5) mixing it all together.

Fun to have fun




Absolutely! This is very much inspired by some of the authors' favourite techniques! (Though it has almost always involved an extensive manual element, with few supporting technologies other than rewind-and-listen-again.) Just a few examples include:

Despite all these examples, the process of going from a raw speech recording to a produced musical excerpt requires extensive fine-tuning. To some extent, this will always be the case, but our system aims to support this process, using generative musical models to provide the composer and producer with controllable options that allow for more transparent and easier iteration.


To be clear, we assume this question is referring not to the saved examples on the interactive page, but to samples such as this one:


Moo cow


and others on this page. This question is answered here in some detail by one of the co-authors, composer Dani Oore.


The fundamental frequency (F0) and the loudness envelope of the speech signal are computed. The features are extracted at the frame level, using a frame size of 50 msec and a frame shift of 20 msec.
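
As a rough illustration of this step (not the project's actual extraction code), the sketch below computes frame-level F0 and an RMS-based loudness envelope with the librosa library, using the 50 msec frame size and 20 msec frame shift mentioned above; the function name and the pitch-range parameters are our own assumptions.

    # Sketch of frame-level F0 and loudness extraction (assumes the librosa
    # library; not the project's actual implementation).
    import librosa

    def extract_f0_and_loudness(path, frame_ms=50, hop_ms=20):
        y, sr = librosa.load(path, sr=None, mono=True)
        frame_length = int(sr * frame_ms / 1000)  # 50 msec frame size
        hop_length = int(sr * hop_ms / 1000)      # 20 msec frame shift

        # F0 per frame via probabilistic YIN; unvoiced frames come back as NaN.
        f0, voiced_flag, voiced_probs = librosa.pyin(
            y,
            fmin=librosa.note_to_hz("C2"),
            fmax=librosa.note_to_hz("C6"),
            sr=sr,
            frame_length=frame_length,
            hop_length=hop_length,
        )

        # Simple loudness stand-in: RMS energy per frame.
        loudness = librosa.feature.rms(
            y=y, frame_length=frame_length, hop_length=hop_length
        )[0]

        n = min(len(f0), len(loudness))
        return f0[:n], loudness[:n]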


The loudness envelope extracted from the speech signal is processed to obtain regions of interest in the speech. The regions of interest are selected using two different approaches:

  1. Approach 1 - Pre-defined threshold: In this approach, only frames with a loudness value above a pre-defined threshold are treated as potential regions for further processing. The pre-defined threshold is set based on empirical analysis.
  2. Approach 2 - Peak picking: In this approach, peaks in the loudness envelope are considered as the potential regions for further processing. The peaks in the loudness envelope generally fall in the vicinity of syllable nuclei.
The selected regions are further refined by keeping only regions where the F0 values are greater than 60 Hz. This step is performed to ensure that the regions of interest fall within voiced regions, where F0 estimation is more reliable. The F0 values obtained in the retained frames are then used for further processing. In our interactive demo, under the sparsification type, v1 (v1:Low, v1:Medium and v1:High) follows Approach 1 and v2 (v2:Low, v2:Medium and v2:High) follows Approach 2 to obtain the sparsified representation of the speech.
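
For concreteness, here is a minimal sketch of the two selection approaches and the F0-based refinement, assuming frame-level loudness and F0 arrays like those produced above; the threshold value and the peak-picking settings are illustrative placeholders, not the values used in the system.

    # Sketch of the two region-of-interest selection approaches (illustrative
    # parameters only; not the system's actual settings).
    import numpy as np
    from scipy.signal import find_peaks

    F0_FLOOR_HZ = 60.0  # frames with F0 below this are treated as unreliable

    def select_regions_threshold(loudness, threshold):
        # Approach 1: keep frames whose loudness exceeds a pre-defined threshold.
        return loudness > threshold

    def select_regions_peaks(loudness, context=2):
        # Approach 2: keep frames around loudness peaks (roughly syllable nuclei).
        mask = np.zeros(len(loudness), dtype=bool)
        peaks, _ = find_peaks(loudness)
        for p in peaks:
            mask[max(0, p - context):p + context + 1] = True
        return mask

    def refine_with_f0(mask, f0):
        # Keep only selected frames where the F0 estimate is above 60 Hz.
        voiced = np.nan_to_num(f0, nan=0.0) > F0_FLOOR_HZ
        return mask & voiced

The F0 values at the frames where the refined mask is True are the ones passed on for further processing.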


In our interactive demo, under sparsification type, you can see 6 options (v1:Low, v2:Low, v1:Medium, v2:Medium, v1:High and v2:High). The options with v1 (v1:Low, v1:Medium and v1:High) and v2 (v2:Low, v2:Medium and v2:High) refer to different levels of sparsification obtained using Approach 1 and Approach 2 (explained in the previous question), respectively. The labels Low, Medium and High refer to the level of sparsification: "Low" refers to a lower level of sparsification (most of the F0 values extracted from the speech are retained) and "High" refers to a higher level of sparsification (very few of the F0 values extracted from the speech are retained).

For Approach 1, the threshold is varied depending on the level of sparsification. For Approach 2, the context (number of frames selected) around each peak location is varied depending on the level of sparsification. For instance, in Approach 2, for the option v2:Low (the lowest level of sparsification), two frames on each side of the peak location (along with the peak frame itself) are selected for further processing.
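
The mapping below sketches how these options might translate into parameters for the two approaches; apart from the two-frame context for v2:Low described above, the specific numbers are hypothetical placeholders, not the values used in the demo. It reuses the helper functions from the sketch in the previous answer.

    # Hypothetical mapping from sparsification option to selection parameters.
    import numpy as np

    SPARSIFICATION_PRESETS = {
        # Approach 1 (v1): the loudness threshold rises with sparsification,
        # so fewer frames survive at "High".  Threshold values are placeholders.
        "v1:Low":    {"approach": 1, "threshold": 0.02},
        "v1:Medium": {"approach": 1, "threshold": 0.05},
        "v1:High":   {"approach": 1, "threshold": 0.10},
        # Approach 2 (v2): fewer context frames are kept around each loudness
        # peak as sparsification increases.  Only the v2:Low value (two frames
        # on each side of the peak) comes from the description above.
        "v2:Low":    {"approach": 2, "context": 2},
        "v2:Medium": {"approach": 2, "context": 1},
        "v2:High":   {"approach": 2, "context": 0},
    }

    def sparsify(f0, loudness, option):
        # Return the frame-level F0 contour with unselected frames set to NaN.
        preset = SPARSIFICATION_PRESETS[option]
        if preset["approach"] == 1:
            mask = select_regions_threshold(loudness, preset["threshold"])
        else:
            mask = select_regions_peaks(loudness, preset["context"])
        mask = refine_with_f0(mask, f0)
        return np.where(mask, f0, np.nan)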


Please feel free to get in touch with us; we'd love to hear from you!