Stems on the roadmap: what v1 ships, what v1.5 adds

Stems are not in v1 of SONICHAOS AI Studio. v1 renders a stereo bounce. v1.5 adds post-render Demucs separation into drums, bass, vocals, and other, plus optional re-rendering of one stem from a re-prompted persona.

May 9, 2026 · 4 min read · SONICHAOS editorial

Stems are a v1.5 feature. The honest version of a roadmap names the version, not the quarter.

A question that lands in the inbox most weeks: do SONICHAOS AI Studio renders come with stems? The honest answer is no, not yet. v1 prints a stereo bounce: 24-bit WAV, 44.1 kHz, written to object storage with a signed URL on the job row. There is no per-instrument multitrack on the other end of Generate. This post is a scope statement, not a tease. What follows is what v1 does, what v1.5 will do, and how the file contract holds across both.

What v1 actually delivers

The v1 render path ends at a single file. The latent-diffusion stack denoises a 96-channel mel-spectrogram, the vocoder lifts mel back to waveform, and the result is written out as a stereo 24-bit WAV. There is no intermediate tensor for drums, bass, or vocals because the model does not generate the song that way. It generates the mix as one signal conditioned on the persona embedding and the prompt.

We ship the bounce, the prompt, the persona id, the seed, the model version, and the integrated LUFS reading. The audit row is what makes the take licensable under the Studio tier. The Studio licence covers the bounce. No per-stem licence exists in v1, because there are no stems to license.

If you need stems today, the workflow is to render the bounce, drop it into your DAW, and run a separator yourself. That is fine. v1.5 just moves that step inside the product so the file contract matches the licence.

The v1.5 plan: post-render Demucs

v1.5 adds a separation stage after the render finishes. The chosen model is Demucs v4 with the htdemucs_ft weights, the four-stem fine-tuned variant, which separates a stereo mix into:

drums — kit and percussion.
bass — low-end, including 808s.
vocals — lead and stacked vocals when present.
other — everything that is not the first three.
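The same four-stem split can be reproduced by hand today with the open-source Demucs command-line tool. The sketch below only builds and runs that command. It assumes the `demucs` package is installed and on the PATH; the helper names and the output directory are illustrative and not part of SONICHAOS AI Studio.

```python
import subprocess
from pathlib import Path

# The four stems produced by the htdemucs_ft model.
STEMS = ("drums", "bass", "vocals", "other")


def build_demucs_command(bounce: Path, out_dir: Path) -> list[str]:
    """Build a Demucs CLI call for the fine-tuned four-stem model."""
    return [
        "demucs",
        "-n", "htdemucs_ft",  # model named in this post
        "--int24",            # write 24-bit WAV stems, matching the bounce
        "-o", str(out_dir),   # Demucs writes to <out_dir>/<model>/<track>/<stem>.wav
        str(bounce),
    ]


def separate(bounce: Path, out_dir: Path) -> None:
    """Run the separator; raises CalledProcessError if Demucs fails."""
    subprocess.run(build_demucs_command(bounce, out_dir), check=True)
```

Calling `separate(Path("bounce.wav"), Path("stems"))` should leave one WAV per name in `STEMS` under the output directory, with run time depending heavily on the hardware.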

The separation runs on the same H100 that handled the render, while the GPU is still warm. We measured separation latency at 18 to 26 seconds for a 90-second bounce on a single H100, well under the 30-second ceiling we set for the post-render stage. The composer shows a second progress bar only when the user opted into stems on the brief. Otherwise the page goes quiet at the end of the render, the same as v1.
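To put those figures in proportion, the arithmetic can be written out. This is a small sketch using only the numbers quoted above; the function names are illustrative, and nothing here implies the same ratio holds for longer or shorter bounces.

```python
BOUNCE_SECONDS = 90.0
MEASURED_RANGE = (18.0, 26.0)  # seconds on a single H100, as quoted in this post
CEILING_SECONDS = 30.0         # ceiling for the post-render stage


def realtime_factor(latency_s: float, audio_s: float) -> float:
    """Processing time as a fraction of the audio duration."""
    return latency_s / audio_s


def headroom(latency_s: float, ceiling_s: float = CEILING_SECONDS) -> float:
    """Seconds left under the ceiling."""
    return ceiling_s - latency_s


best, worst = MEASURED_RANGE
print(f"real-time factor: {realtime_factor(best, BOUNCE_SECONDS):.2f} "
      f"to {realtime_factor(worst, BOUNCE_SECONDS):.2f}")
print(f"headroom: {headroom(worst):.0f} to {headroom(best):.0f} s")
```

For the 90-second case that works out to roughly 0.20 to 0.29 of real time, with 4 to 12 seconds of headroom under the ceiling.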

File contract and delivery

Stems ship as a ZIP next to the bounce. One folder per job id, one WAV per stem, plus the original bounce, plus a manifest.json with the job metadata. The naming is positional and boring on purpose:

sonichaos-{jobId}/
  bounce.wav            # 24-bit, 44.1 kHz, stereo
  drums.wav             # 24-bit, 44.1 kHz, stereo
  bass.wav              # 24-bit, 44.1 kHz, stereo
  vocals.wav            # 24-bit, 44.1 kHz, stereo
  other.wav             # 24-bit, 44.1 kHz, stereo
  manifest.json         # prompt, persona, seed, model, lufs, stems
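A minimal sketch of how such a ZIP could be assembled, using only the Python standard library. The folder and file names follow the layout above; the function name and the shape of the manifest values are assumptions for illustration, not the shipped implementation.

```python
import io
import json
import zipfile

STEMS = ("drums", "bass", "vocals", "other")


def build_stem_zip(job_id: str, audio: dict[str, bytes], manifest: dict) -> bytes:
    """Pack bounce, four stems, and manifest.json under sonichaos-{jobId}/."""
    required = ("bounce",) + STEMS
    missing = [name for name in required if name not in audio]
    if missing:
        raise ValueError(f"missing audio for: {missing}")

    root = f"sonichaos-{job_id}"
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # Fixed order: bounce first, then stems, then the manifest.
        for name in required:
            zf.writestr(f"{root}/{name}.wav", audio[name])
        zf.writestr(f"{root}/manifest.json", json.dumps(manifest, indent=2))
    return buf.getvalue()


# Placeholder values only; the keys mirror the manifest comment above.
example_manifest = {
    "prompt": "placeholder prompt",
    "persona": "placeholder-persona-id",
    "seed": 0,
    "model": "placeholder-model-version",
    "lufs": -14.0,
    "stems": list(STEMS),
}
```

Failing fast on a missing stem keeps the archive from ever shipping with fewer files than the contract promises.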

Sample rate stays at 44.1 kHz across stems and bounce. We do not upsample to 48 kHz on export because every conversion costs transient definition, and the v1 render is native 44.1 kHz. If your post chain needs 48 kHz, do the conversion once on your end with a known SRC.
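For readers planning that one-time conversion, the ratio between the two rates is not an integer, which is why the choice of SRC matters. The sketch below only works through the arithmetic; the ceiling-based length rule is one common convention among polyphase resamplers and is an assumption here, not a statement about any particular tool.

```python
from fractions import Fraction

SRC_RATE = 44_100
DST_RATE = 48_000

# 48000 / 44100 reduces to 160 / 147: upsample by 160, then decimate by 147.
ratio = Fraction(DST_RATE, SRC_RATE)


def resampled_length(n_samples: int, src: int = SRC_RATE, dst: int = DST_RATE) -> int:
    """Output length as ceil(n_samples * dst / src), using integer math."""
    return -(-n_samples * dst // src)


ninety_seconds = 90 * SRC_RATE
print(ratio, resampled_length(ninety_seconds))
```

A 90-second bounce holds 3,969,000 samples per channel at 44.1 kHz and 4,320,000 at 48 kHz.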

Re-render one stem from a re-prompted persona

The other half of the v1.5 release is a one-stem re-render. After the separator returns four stems, you can pick one (say vocals), open a small re-prompt panel, swap the persona, and request a new vocal stem that the gateway then aligns to the original drum and bass tracks.

The alignment is mechanical, not generative. We re-render the chosen stem with the original tempo map and key as conditioning, then run a short phase-aligned crossfade against the timing grid of the original bounce. It works cleanly on vocals and other. It is a closer call on drums, where micro-timing carries the groove, so the v1.5 release will gate drums re-rendering behind a flag while we measure drift.

A short spec for the re-render call:

One stem per call. No multi-stem re-renders in v1.5.
The new stem inherits the original key and tempoMap.
The new stem gets its own job id and audit row.
The ZIP for the job becomes a versioned ZIP — v2.zip next to v1.zip — so the original take stays intact.
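The four rules above can be sketched as a small validation step. Every type, field, and function name below is hypothetical, including the `allow_drums` flag that stands in for the drums gate and the layout of the tempo map; the real request shape has not been published.

```python
from dataclasses import dataclass

STEMS = ("drums", "bass", "vocals", "other")


@dataclass(frozen=True)
class Take:
    job_id: str
    key: str
    tempo_map: tuple  # assumed layout: ((seconds, bpm), ...)
    version: int = 1


@dataclass(frozen=True)
class ReRenderRequest:
    source: Take
    stem: str          # a single stem name, so one stem per call by construction
    persona_id: str
    new_job_id: str
    allow_drums: bool = False  # stand-in for the drums feature flag


def plan_rerender(req: ReRenderRequest) -> dict:
    """Validate a one-stem re-render and describe the resulting job."""
    if req.stem not in STEMS:
        raise ValueError(f"unknown stem: {req.stem!r}")
    if req.stem == "drums" and not req.allow_drums:
        raise PermissionError("drums re-rendering is gated behind a flag")
    if req.new_job_id == req.source.job_id:
        raise ValueError("the new stem needs its own job id")
    return {
        "jobId": req.new_job_id,
        "stem": req.stem,
        "persona": req.persona_id,
        "key": req.source.key,             # inherited from the original take
        "tempoMap": req.source.tempo_map,  # inherited from the original take
        "zip": f"v{req.source.version + 1}.zip",  # original vN.zip stays intact
    }
```

Modelling the stem as a single string rather than a list makes the one-stem-per-call rule a property of the request type instead of a runtime check.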

Why not generate stems natively

The obvious question, why not have the model output four stems in the first place, has a real answer. The diffusion stack is trained on full mixes. Training a stem-aware model would change the data contract, the loss, and the persona embedding. We would rather ship separation as a v1.5 feature on a known-good separator than push a stem-aware retrain into v1 and slip the launch.

A stem-aware model is on the longer roadmap, past v2. Until then, Demucs v4 on the warm GPU is the right answer.
