Meta on Wednesday released AudioCraft, a set of three AI models capable of automatically creating sound from text descriptions.
As generative AI models that turn written prompts into images or more text continue to mature, computer scientists are looking into using machine learning to produce other forms of media.
Audio is difficult for AI systems, especially music, since the software has to learn to produce coherent patterns over several minutes and be creative enough to generate something catchy or pleasant to listen to.
“A typical music track of a few minutes sampled at 44.1 kHz (which is the standard quality of music recordings) consists of millions of timesteps,” Team Meta explained. That is to say, an audio-generating model has to output a lot of data to build a human-friendly track.
“In comparison, text-based generative models like Llama and Llama 2 are fed with text processed as sub-words that represent just a few thousands of timesteps per sample.”
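To put numbers on that: a three-minute track at 44.1 kHz works out to nearly eight million samples per channel, roughly a couple of thousand times more steps than a typical text sample. The back-of-the-envelope sums below are ours, not Meta's.

```python
# Rough arithmetic behind Meta's point: raw audio means millions of
# timesteps per track, versus a few thousand sub-word tokens for a text sample.
sample_rate_hz = 44_100      # standard CD-quality sampling rate
track_seconds = 3 * 60       # a typical three-minute track

audio_timesteps = sample_rate_hz * track_seconds
print(f"Audio samples per channel: {audio_timesteps:,}")  # 7,938,000

# Assumed figure for comparison: a text sample of a few thousand sub-word tokens.
approx_text_tokens = 4_000
print(f"Roughly {audio_timesteps // approx_text_tokens:,}x more timesteps than text")
```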
The Facebook giant envisions people using AudioCraft to experiment with making computer-generated sounds without having to learn to play an instrument. The toolkit is made up of three models: MusicGen, AudioGen, and EnCodec.
MusicGen was trained on 20,000 hours of recordings, owned or licensed by Meta, alongside their corresponding text descriptions. AudioGen is more focused on generating sound effects rather than music, and was trained on public data. Finally, EnCodec is described as a lossy neural codec that can compress and decompress audio signals with high fidelity.
Meta said it was “open sourcing” AudioCraft, and it is to a degree. The software needed to create and train the models, and run inference, is available under an open-source MIT license. The code can be used in free (as in freedom and free beer) and commercial applications as well as research projects.
That said, the model weights are not open source. They are shared under a Creative Commons license that specifically forbids commercial use. As we saw with Llama 2, whenever Meta talks about open sourcing stuff, check the fine print.
MusicGen and AudioGen generate sounds given an input text prompt. You can hear short clips created from the descriptions “whistling with wind blowing” and “pop dance track with catchy melodies, tropical percussion, and upbeat rhythms, perfect for the beach” on Meta’s AudioCraft landing page, here.
The short sound effects are realistic, though the music-like clips aren't great, in our opinion. They sound like repetitive, generic jingles fit for bad hold music or an elevator rather than hit singles.
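Those who would rather roll their own clips can drive MusicGen from Python. The snippet below is a minimal sketch based on the usage documented in the AudioCraft repository; the checkpoint name, clip length, and exact keyword arguments are assumptions and may differ between releases.

```python
# Minimal MusicGen sketch, following the AudioCraft repo's documented usage.
# Checkpoint name and parameters below are illustrative assumptions.
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained("facebook/musicgen-small")  # assumed checkpoint
model.set_generation_params(duration=10)  # clip length in seconds

descriptions = [
    "pop dance track with catchy melodies, tropical percussion, and upbeat rhythms",
]
wavs = model.generate(descriptions)  # one waveform tensor per prompt

for idx, one_wav in enumerate(wavs):
    # Saves {idx}.wav with loudness normalisation applied
    audio_write(f"{idx}", one_wav.cpu(), model.sample_rate, strategy="loudness")
```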
Researchers at Meta said AudioGen – described in depth here – was trained by converting raw audio into a sequence of tokens, and reconstructing the input by transforming these back into audio at high fidelity. A language model maps snippets of the input text prompt to the audio tokens to learn the correlation between words and sounds. MusicGen was trained using a similar process on music samples rather than sound effects.
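In other words, training is built around an EnCodec round trip: compress a waveform into discrete tokens, then decode those tokens back into audio. The sketch below follows the standalone encodec package's published usage; the compression model bundled with AudioCraft may expose this differently, so treat the names and shapes as assumptions.

```python
# Round trip through EnCodec: waveform -> discrete tokens -> waveform.
# Based on the standalone `encodec` package's documented API; AudioCraft's
# bundled compression model may differ.
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)  # kbps; higher bandwidth preserves more detail

wav, sr = torchaudio.load("input.wav")  # hypothetical input file
wav = convert_audio(wav, sr, model.sample_rate, model.channels).unsqueeze(0)

with torch.no_grad():
    frames = model.encode(wav)                          # list of (codes, scale) pairs
    codes = torch.cat([c for c, _ in frames], dim=-1)   # [batch, codebooks, timesteps]
    audio_out = model.decode(frames)                    # reconstruct a waveform

print(codes.shape, audio_out.shape)
```

It is this grid of codebook tokens, rather than the raw waveform, that the language model learns to predict from the text prompt.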
“Rather than keeping the work as an impenetrable black box, being open about how we develop these models and ensuring that they’re easy for people to use — whether it’s researchers or the music community as a whole — helps people understand what these models can do, understand what they can’t do, and be empowered to actually use them,” Team Meta argued.
“In the future, generative AI could help people vastly improve iteration time by allowing them to get feedback faster during the early prototyping and grayboxing stages — whether they’re a large developer building worlds for the metaverse, a musician (amateur, professional, or otherwise) working on their next composition, or a small or medium-sized business owner looking to up-level their creative assets.”
You can fetch the AudioCraft code here, and try out MusicGen for yourself here. ®