MuseNet

We’ve created MuseNet, a deep neural network that can generate 4-minute musical compositions with 10 different instruments, and can combine styles from country to Mozart to the Beatles. MuseNet was not explicitly programmed with our understanding of music, but instead discovered patterns of harmony, rhythm, and style by learning to predict the next token in hundreds of thousands of MIDI files. MuseNet uses the same general-purpose unsupervised technology as GPT-2, a large-scale transformer model trained to predict the next token in a sequence, whether audio or text.

Since MuseNet knows many different styles, we can blend generations in novel ways. Here the model is given the first 6 notes of a Chopin Nocturne, but is asked to generate a piece in a pop style with piano, drums, bass, and guitar. The model manages to blend the two styles convincingly, with the full band joining in at around the 30-second mark.

We collected training data for MuseNet from many different sources. ClassicalArchives and BitMidi donated their large collections of MIDI files for this project, and we also found several collections online, including jazz, pop, African, Indian, and Arabic styles. Additionally, we used the MAESTRO dataset.

The transformer is trained on sequential data: given a set of notes, we ask it to predict the upcoming note. We experimented with several different ways to encode the MIDI files into tokens suitable for this task. First, we tried a chordwise approach that considered every combination of notes sounding at one time as an individual “chord”, and assigned a token to each chord. Second, we tried condensing the musical patterns by focusing only on the starts of notes, and tried compressing that further using a byte pair encoding scheme.
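To make the chordwise idea concrete, here is a minimal sketch in Python. It is not MuseNet's actual code: the `Note` type, the fixed time grid, and the on-demand vocabulary are all assumptions chosen for illustration. Each grid step is mapped to the set of pitches sounding at that moment, and each distinct set gets its own token id.

```python
from typing import NamedTuple


class Note(NamedTuple):
    """Hypothetical note record; times are measured in grid steps."""
    pitch: int  # MIDI pitch number, 0-127
    start: int  # first step at which the note sounds
    end: int    # first step at which the note no longer sounds


def chordwise_tokens(notes, num_steps):
    """Return one token id per time step, plus the chord vocabulary.

    A "chord" here is simply the set of pitches sounding at a step
    (the empty set stands for silence).
    """
    vocab = {}   # frozenset of pitches -> token id, grown on demand
    tokens = []
    for step in range(num_steps):
        chord = frozenset(n.pitch for n in notes if n.start <= step < n.end)
        # Assign the next free id the first time a chord is seen.
        token = vocab.setdefault(chord, len(vocab))
        tokens.append(token)
    return tokens, vocab


# C and E held for two steps, then G alone, then C and E again.
example_notes = [
    Note(60, 0, 2), Note(64, 0, 2),
    Note(67, 2, 3),
    Note(60, 3, 4), Note(64, 3, 4),
]
example_tokens, example_vocab = chordwise_tokens(example_notes, num_steps=4)
```

One property this sketch makes visible is that the vocabulary grows with every distinct combination of simultaneous pitches, so a scheme like this can become large for dense, multi-instrument music.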

We also tried two different methods of marking the passage of time: either tokens that were scaled according to the piece’s tempo (so that the tokens represented a musical beat or fraction of a beat), or tokens that marked absolute time in seconds. We landed on an encoding that balances expressivity with conciseness by packing the pitch, volume, and instrument information into a single token.
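As a rough illustration of what packing pitch, volume, and instrument into one token can look like, the sketch below encodes the three fields as a single integer. The instrument list, the number of volume bins, and the integer layout are assumptions made for this example, not MuseNet's published vocabulary. The `beats_to_seconds` helper shows how the two ways of marking time relate: a tempo-relative duration converts to absolute seconds once the tempo is known.

```python
# Hypothetical subset of instruments; the real model supports 10.
INSTRUMENTS = ["piano", "drums", "bass", "guitar"]
NUM_PITCHES = 128      # MIDI pitch numbers run from 0 to 127
NUM_VOLUME_BINS = 8    # assumed quantization of MIDI velocity (0-127)


def encode_note(instrument, velocity, pitch):
    """Pack instrument, quantized volume, and pitch into one integer token."""
    inst_id = INSTRUMENTS.index(instrument)
    vol_bin = velocity * NUM_VOLUME_BINS // 128
    return (inst_id * NUM_VOLUME_BINS + vol_bin) * NUM_PITCHES + pitch


def decode_note(token):
    """Invert encode_note, returning (instrument, volume bin, pitch)."""
    rest, pitch = divmod(token, NUM_PITCHES)
    inst_id, vol_bin = divmod(rest, NUM_VOLUME_BINS)
    return INSTRUMENTS[inst_id], vol_bin, pitch


def beats_to_seconds(beats, tempo_bpm):
    """Convert a tempo-relative duration to absolute time in seconds."""
    return beats * 60.0 / tempo_bpm


token = encode_note("piano", velocity=72, pitch=55)
decoded = decode_note(token)
```

With these assumed sizes the note vocabulary has 4 × 8 × 128 = 4096 entries, so every note event costs one token instead of three.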

Footnotes

If you’re interested in other projects for creating AI-generated music using transformers, we recommend checking out Magenta’s piano generation work.

References

For use of outputs created by MuseNet, please cite this blog post as:

Payne, Christine. "MuseNet." OpenAI, 25 Apr. 2019, openai.com/blog/musenet

Please note: We do not own the music output, but kindly ask that you not charge for it. While unlikely, we make no guarantee that the music is free from external copyright claims.