I am experimenting with “In-Depth”, a long-form edition of our blog that covers topics related to AI and music in greater detail. The aim is to make much of this knowledge accessible to a wider audience. This article might take you 8 minutes to read.
AI Generative Models
When ChatGPT was introduced last fall, it created ripples through the technology industry and the world at large. Though machine learning researchers had been experimenting with Large Language Models (LLMs) for several years, the general public hadn’t realized how important they had become until then.
Now, we’re seeing a similar trend with music generation models, which are growing rapidly more capable and appear close to a breakthrough of their own. Just two months ago, Meta AI (the research team at Facebook) released a music generation model, which it made open source last week for everyone to use. I believe music generation AI will soon be widely used, and its applications will far outstrip the capabilities of voice cloning AI.
Music generation models share a similar architecture to ChatGPT. For those familiar with the subject, you’ll know that LLMs are trained to “predict the next word,” requiring enormous amounts of text to do so. However, explanations often end here, treating the details of prediction as a mystery. This is in part due to the unique way these systems were developed.
Unlike conventional software, which is created by human programmers writing step-by-step instructions for computers, LLMs and music generation models operate differently. They are based on a machine learning method called a neural network, trained on billions of data points of text or audio.
As a result, the intricacies of their inner workings are not fully understood by anyone yet. While researchers in the AI community aim for a deeper comprehension, it’s a slow process that may take years, or even decades, to complete.
Despite these complexities, there’s much that we understand about how these systems work. I’ll attempt to explain the inner workings of music generation models, like Meta’s AI, without resorting to technical jargon or math.
We’ll begin by explaining audio tokens, the innovative way music generation models represent and interpret music. After that, we’ll delve into the transformer, the fundamental building block for systems like ChatGPT. Finally, we’ll discuss how these models are trained and why successful performance requires such extraordinarily large quantities of data.
First, music generation models need a way to represent audio that a computer can understand. After all, computers deal with numbers and rules, not melodies and harmonies. Models use an audio encoder to transform recordings of real music into data they can learn from: numbers that computers can understand.
You can imagine the audio encoder as a musician listening closely to a song and writing down everything they hear in great detail. Instead of writing chords and notes, the encoder breaks the music into tiny slices of sound called audio tokens.
The audio encoder creates a token to capture a fraction of a second of audio – a violin note, a drum beat, a piano chord, and so on. By stringing thousands of these tokens together, the encoder can accurately capture every nuance of a music performance. It creates a “vocabulary” of tokens of all musical styles in the world.
So a Jazz song might be encoded as a long sequence of tokens like “38, 92, 456, 23…” for the duration of the audio file. There are thousands of unique tokens the encoder can use to represent all the different sounds and notes in music.
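To make the encoding step concrete, here is a toy sketch of how audio frames could be turned into tokens by matching each frame against a small “codebook” of known sounds. This is purely illustrative and not Meta’s actual encoder: the codebook here is random, whereas real encoders learn theirs from huge audio datasets, and real tokens capture far richer structure than eight numbers.

```python
import numpy as np

# A toy "audio encoder": each short frame of audio is matched to the
# closest entry in a small codebook, and that entry's index becomes a token.
# The codebook is random here; real encoders learn it from data.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 8))  # 16 possible tokens, each an 8-number "sound"

def encode(frames):
    """Map each audio frame to the index of its nearest codebook vector."""
    tokens = []
    for frame in frames:
        distances = np.linalg.norm(codebook - frame, axis=1)
        tokens.append(int(np.argmin(distances)))
    return tokens

# A fake clip chopped into 4 frames of 8 samples each.
clip = rng.normal(size=(4, 8))
print(encode(clip))  # a short token sequence, like the "38, 92, 456, 23" above
```

A real vocabulary has thousands of tokens rather than 16, and each token covers a fraction of a second of audio rather than 8 raw numbers, but the nearest-match idea is the same.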
The model trains on these tokens, learning patterns and relationships between them. At first, it has no idea how tokens relate to each other. It starts out making completely random guesses at what token might come next in a sequence.
But by analyzing millions and millions of tokens from training data, the model starts to recognize patterns. For example, it learns that in Jazz, token “38” is often followed by token “92”, which is often followed by token “456”. A sequence like “92, 456, 23” is common in rock music but rare in classical.
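The simplest version of “recognizing patterns” is just counting which token follows which. The sketch below does exactly that on a made-up training sequence built from the example token numbers in the text; real models learn far subtler statistics with neural networks rather than a counting table.

```python
from collections import Counter

# Toy "pattern learning": count how often each pair of consecutive
# tokens appears in some invented training data.
training_sequence = [38, 92, 456, 23, 38, 92, 456, 23, 38, 92, 7]

pair_counts = Counter(zip(training_sequence, training_sequence[1:]))

# After "seeing" the data, the most common follower of token 92 is 456.
follows_92 = {pair: n for pair, n in pair_counts.items() if pair[0] == 92}
print(follows_92)  # {(92, 456): 2, (92, 7): 1}
```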
The model then relies on a neural network architecture called a “transformer” to detect these patterns. The transformer allows each token to focus only on relevant context to inform the next token prediction.
The transformer uses a technique called attention that allows the token “456” to pay closer attention to nearby tokens like “92”, because those are most relevant for predicting what should come next.
The token “456” will likely pay less attention to “38” which is farther away, or even completely ignore tokens from much earlier in the sequence. The closer tokens provide more relevant context.
So if the model has learned that in Jazz music, “92, 456” is often followed by “23”, it can use that pattern and focus its attention appropriately, making “23” a highly probable prediction for the next token.
Whereas if it had just tried to predict the next token based on “38” which is too far away, it wouldn’t have the right relevant context to realize that “23” commonly follows “456” in Jazz specifically. This ability to focus on nearby relevant context makes the transformer architecture effective.
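A stripped-down way to picture attention is as a set of weights over the earlier tokens that sum to one, with more weight on the relevant context. In the sketch below, recency alone decides the score, which is an invented simplification: real transformers learn their attention scores from the token contents, not just their positions.

```python
import math

# Toy attention: assign every earlier token a weight, with nearby
# tokens scored higher simply because they are more recent here.
def attention_weights(num_tokens):
    scores = [-(num_tokens - 1 - i) for i in range(num_tokens)]  # nearer = higher
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]  # weights sum to 1 (a softmax)

# For the sequence "38, 92, 456", the prediction attends most to "456",
# less to "92", and least to "38".
weights = attention_weights(3)
print([round(w, 3) for w in weights])  # weights rise toward the most recent token
```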
Feedforward Networks
After the transformer pays attention to the relevant context, the next step is actually predicting the next token in the sequence. This prediction is done by what are called feedforward networks and their job is to recognize familiar sequences.
For instance, through training the feedforward network might learn that the sequence “38, 92, 456” in Jazz music tends to be followed by token “23”. So when it sees those input tokens, the trained network will output a prediction that “23” should be the next token.
During training, the network compares its predictions to the actual next tokens, then automatically tweaks its internal settings (its weights) slightly to make future predictions more accurate. This is what training an AI model entails: it repeats this process over and over until its predictions match the actual next tokens. The process can take weeks or months and is computationally expensive.
Over many weeks and training examples, the feedforward networks get really good at predicting the next tokens by learning all the musical patterns. When generating music, the predictions get strung together into full melodies and compositions.
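The predict-compare-tweak loop described above can be sketched in a few lines. Everything here is invented for illustration: real models adjust millions of neural-network weights rather than a three-entry score table, and the “observed” tokens are just the example numbers from the text.

```python
# Toy training loop: the model keeps a score per candidate next token,
# guesses, compares with the real next token, and nudges its scores.
vocab = [23, 92, 456]
scores = {t: 0.0 for t in vocab}   # the model's "opinion" of what follows "38, 92, 456"
observed = [23, 23, 92, 23]        # actual next tokens seen in training examples

correct = 0
for actual in observed:
    predicted = max(scores, key=scores.get)  # the model's current best guess
    correct += predicted == actual
    for t in vocab:
        # Nudge: raise the score of the true token, lower the others a little.
        scores[t] += 0.1 if t == actual else -0.05

print(max(scores, key=scores.get))  # 23 — the most common follower wins
```

Repeat this millions of times over millions of sequences, and the nudges accumulate into the musical patterns the model relies on when generating.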
The final piece of the model is the ability to follow instructions given as written descriptions like “a calm piano piece”. A clever trick called conditioning takes the instruction text and translates it into numbers.
Similar to the audio tokens, imagine a sequence of numbers, a vector, that captures attributes of the desired music – like [0.2, 0.7, 0.1, …] for “calm piano” or [0.7, 0.2, 0.5, …] for “energetic rock”. Vectors provide a way for music generation models to understand text.
Inside the model, attention layers allow it to incorporate the text vector as it generates the sequence of music tokens. So when we enter the text “calm piano”, that text is transformed into a vector, and the model becomes more likely to produce gentle, sustained tokens corresponding to piano sounds. The text acts like a guide steering the music generation process.
So far we’ve discussed how models learn musical patterns and leverage text descriptions, but how do discrete tokens actually turn into the continuous experience of listening to music? This is where the audio decoder comes in.
The decoder takes the sequence of predicted audio tokens and transforms them back into an audio waveform – the raw musical signal that can be played through speakers. This step reconstructs the smooth, flowing audio from the discrete tokens.
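In the simplest possible terms, decoding is the encoder’s lookup run in reverse: each token retrieves its snippet of sound, and the snippets are joined back into one signal. As before, the codebook here is random and purely illustrative; real decoders like the one in MusicGen reconstruct the waveform far more smoothly than naive concatenation.

```python
import numpy as np

# A toy "audio decoder": each token is looked up in the same codebook
# the encoder used, and the snippets are concatenated into one waveform.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 8))  # 16 tokens, each an 8-sample snippet

def decode(tokens):
    return np.concatenate([codebook[t] for t in tokens])

waveform = decode([7, 3, 12, 5])
print(waveform.shape)  # (32,) — 4 tokens × 8 samples each
```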
Today’s audio decoders, like Meta’s MusicGen, are capable of generating music at sampling rates of 32kHz, capturing a significant portion of the full frequency range of music. However, decoding these audio tokens into high-fidelity, studio-quality waveforms is a highly memory-intensive task.
While the average listener may not distinguish between sample rates of 44.1kHz and 48kHz, at 32kHz the difference becomes noticeable, implying that we have yet to reach the peak of audio fidelity in this area.
Generating music at 44.1kHz comes with many challenges. It involves handling longer sequences of audio tokens, which means more sophisticated audio modeling, longer contexts, and slower training. MusicGen has shown promising results at 32kHz, but further work is needed to scale up to studio-production fidelity.
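The arithmetic behind “higher fidelity means longer sequences” is straightforward: more audio samples per second means the encoder emits more tokens for the same clip. The samples-per-token figure below is an illustrative assumption, not MusicGen’s actual configuration.

```python
# Why higher sample rates strain the model: token counts grow with the rate.
samples_per_token = 640  # assumed: each token covers 640 raw audio samples

for rate in (32_000, 44_100):
    tokens_per_minute = rate * 60 / samples_per_token
    print(rate, "Hz ->", int(tokens_per_minute), "tokens per minute")
```

Under this assumption, one minute of 44.1kHz audio needs roughly a third more tokens than the same minute at 32kHz, and every extra token adds to the context the transformer must attend over.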
As of now, Meta’s MusicGen generates instrumental music without vocals. It is not yet available as an app or as a plug-in for a Digital Audio Workstation (DAW). Instead, you can access it as a demo model on Hugging Face, where you can experiment with it using different text descriptions.
On the positive side, MusicGen appears capable of producing musical excerpts that sound pleasant and stylistically consistent when conditioned on genres like rock or piano music. The samples I’ve heard sound more musically plausible and emotionally evocative than those of Google’s music generation AI, MusicLM.
For both models, the fact that they learned human-like music by merely predicting sequences during training, without needing explicit rules, is remarkable. This demonstrates the power of machine learning from big datasets.
However, both MusicGen and MusicLM generations tend to be somewhat repetitive and lack longer-term structure compared to what a skilled human composer can produce. The music often meanders without a clear overarching progression or development.
The range of instruments and sounds they can generate also seems more limited than that of a versatile human musician. And of course, the AI has no understanding of the structure or emotion behind the music it creates; it simply predicts the next audio token.
I would assess MusicGen’s and MusicLM’s current musical proficiency as impressive given that it is fully AI-generated, but still far from human-level mastery of creativity, structure, and emotional expression.
This field is moving fast, it is challenging to keep up, and it raises crucial questions. How will such technology influence creative industries and their communities? Should we acknowledge the human artists who provided the training data for these models? Is it possible for AI to truly comprehend art, or is it merely mimicking it? Aren’t we all borrowing ideas and remixing each other?
In the United States, the Recording Academy has made it clear that music entirely created by AI will not be considered for awards. Personally, I agree with this decision. However, within a few months or a year, distinguishing an AI-created song from a human-composed one may become impossible.
My takeaway is that musicians, songwriters, and music producers will leverage music generation AI in ways that supplement their creative processes, augmenting or enhancing specific aspects. For instance, a vocalist might use AI to generate backing tracks, or an experienced beat producer might employ AI to create a top-line melody and modern chord structures.
I envision a future where companies behind these AI technologies license their systems, sharing a portion of recording royalties for the music created in collaboration with their AI. This would effectively make each song a joint venture between the artist and the AI, or the company that developed the AI. Of course, this introduces a new set of challenges for the collection and management of royalties.
The artists who stand to gain most from music generation models are those who have mastered their craft in songwriting or music production. They will use AI as an assistant music producer, conditioning it on any genre, any artist, or any era. This application could add an extra dimension to their creativity, adding jet fuel to their artistic flexibility. It’s artists like these who will prove challenging, if not impossible, to replace with AI.
Unbias is building an AI product with marketing capabilities. Join our waitlist here to get early access.