Musical Speech

A Transformer-based Composition Tool


F.A.Q. (Frequently Asked Questions)



This system is meant to be a tool to assist in musical composition by analysing speech and generating potential musical material based on that speech. While we find this material interesting to listen to on its own, and it often draws our attention to elements of the speech that we might not otherwise have noticed, this material is in no way intended to be a complete (i.e. "stand-alone") composition in itself. Rather, it is material which can then be taken by a composer or producer and used for further processing! See the answer to "Has anything like this been done before?" to better understand the underlying motivation, and see the answer to "How were the samples created?" to better understand how this material can be processed. A simplified version of the potential workflow is illustrated in this sample demo, which shows the progression, beginning with (1) raw speech, followed by (2) close pitch tracking, (3) transformer output with slight adjustment, (4) accompaniment created for the output of the previous step, and finally (5) mixing it all together.

Fun to have fun




Absolutely! This is very much inspired by some of the authors' favourite techniques! (Though it has almost always involved an extensive manual element, with few supporting technologies other than rewind-and-listen-again.) Just a few examples include:

Despite all these examples, the process of going from a raw speech recording to a produced musical excerpt requires extensive fine-tuning. To some extent, this will always be the case, but our system aims to support this process, using generative musical models to provide the composer and producer with controllable options that allow for more transparent and easier iteration.


To be clear, we assume this question is referring not to the saved examples on the interactive page, but to samples such as this one:


Moo cow


and others on this page. This question is answered here in some detail by one of the co-authors, composer Dani Oore.


The fundamental frequency (F0) and the loudness envelope of the speech signal are computed. The features are extracted at the frame level, using a frame size of 50 msec and a frame shift of 20 msec.
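
As a rough illustration of this step (not the project's actual extraction code), the sketch below computes frame-level F0 and an RMS-based loudness envelope with the librosa library, using the 50 msec frame size and 20 msec frame shift mentioned above; the function name and the pitch-range parameters are our own assumptions.

    # Sketch of frame-level F0 and loudness extraction (assumes the librosa
    # library; not the project's actual implementation).
    import librosa

    def extract_f0_and_loudness(path, frame_ms=50, hop_ms=20):
        y, sr = librosa.load(path, sr=None, mono=True)
        frame_length = int(sr * frame_ms / 1000)  # 50 msec frame size
        hop_length = int(sr * hop_ms / 1000)      # 20 msec frame shift

        # F0 per frame via probabilistic YIN; unvoiced frames come back as NaN.
        f0, voiced_flag, voiced_probs = librosa.pyin(
            y,
            fmin=librosa.note_to_hz("C2"),
            fmax=librosa.note_to_hz("C6"),
            sr=sr,
            frame_length=frame_length,
            hop_length=hop_length,
        )

        # Simple loudness stand-in: RMS energy per frame.
        loudness = librosa.feature.rms(
            y=y, frame_length=frame_length, hop_length=hop_length
        )[0]

        n = min(len(f0), len(loudness))
        return f0[:n], loudness[:n]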


The loudness envelope extracted from the speech signal is processed to obtain regions of interest in the speech. The regions of interest are selected using two different approaches:

  1. Approach 1 - Pre-defined threshold: In this approach, only frames with a loudness value above a pre-defined threshold are treated as potential regions for further processing. The pre-defined threshold is set based on empirical analysis.
  2. Approach 2 - Peak picking: In this approach, peaks in the loudness envelope are considered as the potential regions for further processing. The peaks in the loudness envelope generally fall in the vicinity of syllable nuclei.
The selected regions are further refined by keeping only regions where the F0 values are greater than 60 Hz. This step is performed to ensure that the regions of interest fall within voiced regions, where F0 estimation is more reliable. The F0 values obtained in the retained frames are then used for further processing. In our interactive demo, under the sparsification type, v1 (v1:Low, v1:Medium and v1:High) follows Approach 1 and v2 (v2:Low, v2:Medium and v2:High) follows Approach 2 to obtain the sparsified representation of the speech.
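
For concreteness, here is a minimal sketch of the two selection approaches and the F0-based refinement, assuming frame-level loudness and F0 arrays like those produced above; the threshold value and the peak-picking settings are illustrative placeholders, not the values used in the system.

    # Sketch of the two region-of-interest selection approaches (illustrative
    # parameters only; not the system's actual settings).
    import numpy as np
    from scipy.signal import find_peaks

    F0_FLOOR_HZ = 60.0  # frames with F0 below this are treated as unreliable

    def select_regions_threshold(loudness, threshold):
        # Approach 1: keep frames whose loudness exceeds a pre-defined threshold.
        return loudness > threshold

    def select_regions_peaks(loudness, context=2):
        # Approach 2: keep frames around loudness peaks (roughly syllable nuclei).
        mask = np.zeros(len(loudness), dtype=bool)
        peaks, _ = find_peaks(loudness)
        for p in peaks:
            mask[max(0, p - context):p + context + 1] = True
        return mask

    def refine_with_f0(mask, f0):
        # Keep only selected frames where the F0 estimate is above 60 Hz.
        voiced = np.nan_to_num(f0, nan=0.0) > F0_FLOOR_HZ
        return mask & voiced

The F0 values at the frames where the refined mask is True are the ones passed on for further processing.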


In our interactive demo, under sparsification type, you can see 6 options (v1:Low, v2:Low, v1:Medium, v2:Medium, v1:High and v2:High). The options with v1 (v1:Low, v1:Medium and v1:High) and v2 (v2:Low, v2:Medium and v2:High) refer to different levels of sparsification obtained using Approach 1 and Approach 2 (explained in the previous question), respectively. The labels Low, Medium and High refer to the level of sparsification: "Low" refers to a lower level of sparsification (most of the F0 values extracted from the speech are retained) and "High" refers to a higher level of sparsification (very few of the F0 values extracted from the speech are retained).

For Approach 1, the threshold is varied depending on the level of sparsification. For Approach 2, the context (number of frames selected) around each peak location is varied depending on the level of sparsification. For instance, in Approach 2, for the option v2:Low (the lowest level of sparsification), two frames on each side of the peak location (along with the peak frame itself) are selected for further processing.
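
The mapping below sketches how these options might translate into parameters for the two approaches; apart from the two-frame context for v2:Low described above, the specific numbers are hypothetical placeholders, not the values used in the demo. It reuses the helper functions from the sketch in the previous answer.

    # Hypothetical mapping from sparsification option to selection parameters.
    import numpy as np

    SPARSIFICATION_PRESETS = {
        # Approach 1 (v1): the loudness threshold rises with sparsification,
        # so fewer frames survive at "High".  Threshold values are placeholders.
        "v1:Low":    {"approach": 1, "threshold": 0.02},
        "v1:Medium": {"approach": 1, "threshold": 0.05},
        "v1:High":   {"approach": 1, "threshold": 0.10},
        # Approach 2 (v2): fewer context frames are kept around each loudness
        # peak as sparsification increases.  Only the v2:Low value (two frames
        # on each side of the peak) comes from the description above.
        "v2:Low":    {"approach": 2, "context": 2},
        "v2:Medium": {"approach": 2, "context": 1},
        "v2:High":   {"approach": 2, "context": 0},
    }

    def sparsify(f0, loudness, option):
        # Return the frame-level F0 contour with unselected frames set to NaN.
        preset = SPARSIFICATION_PRESETS[option]
        if preset["approach"] == 1:
            mask = select_regions_threshold(loudness, preset["threshold"])
        else:
            mask = select_regions_peaks(loudness, preset["context"])
        mask = refine_with_f0(mask, f0)
        return np.where(mask, f0, np.nan)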


Please feel free to get in touch with us; we'd love to hear from you!