Introducing Stable Audio 2.0

Key Takeaways

  • Stable Audio 2.0 sets a new standard in AI-generated audio, producing high-quality, full tracks with coherent musical structure up to three minutes in length at 44.1 kHz stereo.

  • The new model introduces audio-to-audio generation by allowing users to upload and transform samples using natural language prompts.

  • Stable Audio 2.0 was exclusively trained on a licensed dataset from the AudioSparx music library, honoring opt-out requests and ensuring fair compensation for creators.

  • Explore the model and start creating for free on the Stable Audio website now.

Today, we are pleased to introduce Stable Audio 2.0. This model generates high-quality, full tracks with coherent musical structure, up to three minutes long at 44.1 kHz stereo, from a single natural language prompt.

The new model goes beyond text-to-audio to include audio-to-audio capabilities. Users can now upload audio samples and, through natural language prompts, transform them into a wide array of sounds. This update also expands sound effect generation and style transfer, providing artists and musicians with more flexibility, greater control, and an elevated creative process.

Stable Audio 2.0 builds upon Stable Audio 1.0, which debuted in September 2023 as the first commercially viable AI music generation tool capable of producing high-quality 44.1 kHz music, leveraging latent diffusion technology. It has since been named one of TIME’s Best Inventions of 2023.


This new model is available to use today for free on the Stable Audio website and will soon be available on the Stable Audio API.


New Features

Our most advanced audio model yet expands the creative toolkit for artists and musicians with its new functionalities. With both text-to-audio and audio-to-audio prompting, users can produce melodies, backing tracks, stems, and sound effects, thus enhancing the creative process.

Full-Length Tracks

Stable Audio 2.0 sets itself apart from other state-of-the-art models as it can generate songs up to three minutes in length, complete with structured compositions that include an intro, development, and outro, as well as stereo sound effects.

Audio-to-Audio Generation 

Stable Audio 2.0 now supports audio file uploads to transform ideas into fully produced samples. Our Terms of Service require that uploads be free of copyrighted material, and we use advanced content recognition to maintain compliance and prevent infringement.

Variations and Sound Effects Creation

This model amplifies the production of sound and audio effects, from the tapping of a keyboard to the roar of a crowd or the hum of city streets, offering new ways to elevate audio projects.

Style Transfer

This new feature seamlessly modifies newly generated or uploaded audio within the generation process. This capability allows for the customization of the output's theme to align with a project's specific style and tone.

Research

The architecture of the Stable Audio 2.0 latent diffusion model is specifically designed to enable the generation of full tracks with coherent structures. To achieve this, we have adapted all components of the system for improved performance over long time scales. A new, highly compressed autoencoder compresses raw audio waveforms into much shorter representations. For the diffusion model, we employ a diffusion transformer (DiT), akin to that used in Stable Diffusion 3, in place of the previous U-Net, as it is more adept at manipulating data over long sequences. The combination of these two elements results in a model capable of recognizing and reproducing the large-scale structures that are essential for high-quality musical compositions.

Stay tuned for the release of the research paper with additional technical details.

The autoencoder compresses audio into a compact representation and reconstructs it, capturing and reproducing the essential features while filtering out less important details for more coherent generations.
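To make the compression idea concrete, here is a toy sketch of what an audio autoencoder does in principle: collapse a long waveform into a much shorter latent sequence, then expand it back lossily. The real model learns this mapping with neural networks; the block-averaging scheme and the `COMPRESSION` ratio below are purely illustrative assumptions.

```python
# Toy stand-in for a learned audio autoencoder (illustrative only).
COMPRESSION = 1024  # hypothetical samples-per-latent ratio

def encode(waveform: list[float]) -> list[float]:
    """Collapse each block of COMPRESSION samples into one latent value."""
    return [
        sum(waveform[i:i + COMPRESSION]) / COMPRESSION
        for i in range(0, len(waveform), COMPRESSION)
    ]

def decode(latents: list[float]) -> list[float]:
    """Expand each latent back into a block of samples (lossy reconstruction)."""
    return [value for value in latents for _ in range(COMPRESSION)]

# One second of audio at 44.1 kHz; a full three-minute track would be
# ~7.9 million samples per channel, so shortening the sequence this way
# is what makes modeling long-range structure tractable.
excerpt = [0.1] * 44_100
latents = encode(excerpt)
print(len(excerpt), "->", len(latents))  # 44100 -> 44
```

The diffusion model then operates on the short latent sequence rather than on raw samples, which is why the compression ratio matters so much for long tracks.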

A Diffusion Transformer (DiT) incrementally refines random noise into structured data, identifying intricate patterns and relationships. Combined with the autoencoder, it gains the capability to process longer sequences and form a deeper, more accurate interpretation of its inputs.
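The iterative refinement loop at the heart of diffusion sampling can be sketched in a few lines. This is a conceptual toy, not the DiT: the real model predicts the denoising direction with a transformer conditioned on the text prompt, whereas the hypothetical `denoise_step` below simply pulls each latent value a fraction of the way toward a known target.

```python
# Conceptual sketch of iterative diffusion sampling (not the actual DiT).
import random

def denoise_step(latent: list[float], target: list[float],
                 strength: float) -> list[float]:
    """One refinement step: move the noisy latent a fraction toward the target."""
    return [x + strength * (t - x) for x, t in zip(latent, target)]

random.seed(0)
target = [0.5] * 8                              # stand-in for a "clean" latent
latent = [random.gauss(0, 1) for _ in target]   # pure noise at step 0

for _ in range(50):                             # repeated refinement steps
    latent = denoise_step(latent, target, strength=0.2)

max_error = max(abs(x - t) for x, t in zip(latent, target))
print(max_error < 0.01)  # after enough steps the noise has converged
```

Each pass removes a little more noise, which is why diffusion models generate by running many small refinement steps rather than one large jump.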

Safeguards

Like the 1.0 model, 2.0 is trained on data from AudioSparx consisting of over 800,000 audio files containing music, sound effects, and single-instrument stems, as well as corresponding text metadata. All of AudioSparx’s artists were given the option to 'opt out' of the Stable Audio model training.

To protect creator copyrights, we partner with Audible Magic, whose automatic content recognition (ACR) technology performs real-time content matching on audio uploads to prevent copyright infringement.

Stable Radio

Stable Radio, a 24/7 live stream that features tracks exclusively generated by Stable Audio, is now streaming on the Stable Audio YouTube channel.


Explore the model and start creating for free on the Stable Audio website now.

To stay updated on Stable Audio, follow us on Twitter and Instagram.
