Stable Audio Open: Research Paper

Key Takeaways:

  • The Stable Audio Open research paper describes the architecture and training process of Stability AI’s new open-weights text-to-audio model, trained on Creative Commons data. 

  • Stable Audio Open weights are available on Hugging Face. The model is released under the Stability AI Community License, which allows non-commercial use, as well as commercial use by individuals or organizations with up to $1M in annual revenue. Contact us for Enterprise Licenses.

  • The model can generate high-quality stereo audio at 44.1kHz from text prompts, and can be used to synthesize realistic sounds and field recordings.

  • Stable Audio Open runs on consumer-grade GPUs, making it accessible for academic purposes and artistic use cases.

Following the open-source release of Stable Audio Open, we are excited to share the research paper outlining the technical details behind the model. The paper is accessible on arXiv and the model weights are available on Hugging Face.

Architecture

Stable Audio Open introduces a text-to-audio model with three key components:

  • An autoencoder that compresses waveforms into a manageable sequence length

  • A T5-based text embedding for text conditioning

  • A transformer-based diffusion model (DiT) operating in the latent space of the autoencoder

The model generates variable-length stereo audio at 44.1kHz, up to 47 seconds. The autoencoder achieves a low latent rate of 21.5Hz, which keeps sequences short enough for efficient diffusion while preserving quality for both music and general audio. Stable Audio Open is a variant of Stable Audio 2.0 with a similar architecture, but it is trained on a different dataset (Creative Commons data) and uses T5-based text conditioning instead of CLAP.
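As a back-of-the-envelope check on these numbers: a ~21.5Hz latent rate at a 44.1kHz sample rate implies a temporal downsampling factor of roughly 2048. That factor is our inference from the stated rates, not a figure quoted in this post:

```python
# Back-of-the-envelope check of the autoencoder's latent rate and the
# resulting sequence length. The 2048x downsampling factor is an
# assumption inferred from the stated 44.1 kHz / ~21.5 Hz rates.

SAMPLE_RATE_HZ = 44_100
DOWNSAMPLING_FACTOR = 2048  # assumed, not stated in the post

latent_rate_hz = SAMPLE_RATE_HZ / DOWNSAMPLING_FACTOR
print(f"latent rate: {latent_rate_hz:.1f} Hz")  # ≈ 21.5 Hz

# A maximum-length 47-second clip becomes a short latent sequence for
# the diffusion transformer to operate on.
max_seconds = 47
latent_frames = round(max_seconds * latent_rate_hz)
print(f"latent frames for {max_seconds}s clip: {latent_frames}")
```

The short latent sequence (around a thousand frames for the longest clip) is what makes attention over the whole clip tractable for the DiT.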

Training Data

Stable Audio Open was trained on nearly 500,000 recordings licensed under CC-0, CC-BY, or CC-Sampling+. The dataset consists of 472,618 recordings from Freesound and 13,874 from the Free Music Archive (FMA).

To ensure no copyrighted material was included, the content was carefully curated: music samples in Freesound were first identified using the PANNs audio tagger, and the identified samples were then sent to Audible Magic, a content-detection company, to ensure the removal of potentially copyrighted music from the dataset.

Use Cases

Stable Audio Open can be fine-tuned to customize audio generation, such as adapting the length of generated content or meeting the precise needs of various industries and creative projects. Users can train the model locally on A6000 GPUs. For help with prompting, check out our tips for Stable Audio 2.0.
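As a minimal sketch of what a generation request looks like with the `stable-audio-tools` library, the conditioning keys below follow the example on the Hugging Face model card; the helper function itself is ours, added purely for illustration:

```python
# Sketch of the text/timing conditioning payload used by
# stable-audio-tools' generate_diffusion_cond (per the Hugging Face
# model card). The helper function is illustrative, not library code.

MAX_SECONDS = 47  # Stable Audio Open's maximum clip length

def make_conditioning(prompt: str,
                      seconds_total: float,
                      seconds_start: float = 0.0) -> dict:
    """Build the conditioning dict for one generated clip."""
    if not 0 < seconds_total <= MAX_SECONDS:
        raise ValueError(f"seconds_total must be in (0, {MAX_SECONDS}]")
    return {
        "prompt": prompt,
        "seconds_start": seconds_start,
        "seconds_total": seconds_total,
    }

cond = make_conditioning("128 BPM tech house drum loop", seconds_total=30)
# This dict would be passed in a list, e.g. conditioning=[cond], to
# generate_diffusion_cond along with the loaded model.
```

The timing keys are what let the model produce variable-length output: the requested duration is part of the conditioning rather than fixed by the architecture.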

Here are some examples of applications, both for off-the-shelf use of the model and for fine-tuning or integration into workflows:

Sound Design

  • Sound Effects and Foley: Generate sound effects such as footsteps, door creaks, or environmental sounds for use in film, television, and game development.

  • Ambient Sounds: Create soundscapes or background textures that fit the mood and atmosphere of a scene.

  • Sample Creation: Generate drum loops and music samples for producing music tracks.

Commercial and Marketing Applications

  • Audio Branding: Create sound effects for advertisements, or develop audio logos and brand sounds that strengthen brand recognition and identity.

Education and Research

  • Academic Projects: Use the model for research in audio synthesis, machine learning, and musicology to experiment with and analyze generated audio.

In this demo you can find more examples and see how Stable Audio Open's performance compares to that of other models.

Conclusions

The release of Stable Audio Open marks a significant milestone in open-source audio AI. It offers high-quality stereo sound generation at 44.1kHz and runs on consumer-grade GPUs, with a focus on data transparency. While acknowledging limitations in areas such as speech and music generation, the model's accessibility and performance make it a valuable tool for both researchers and artists, pushing the boundaries of what's possible with open audio AI.

The Stable Audio Open model weights are available on Hugging Face. We encourage sound designers, musicians, developers and audio enthusiasts to download the model, explore its capabilities and share examples of how they use Stable Audio Open.

To stay updated on our progress, follow us on Twitter, Instagram, LinkedIn, and join our Discord Community.
