How it works
Doppel matches the vibe of a seed track by combining cultural retrieval with an audio-embedding rerank. Here’s the reasoning behind that design, the pipeline itself, and the diagnostic evidence that the audio leg earns its place.
Two designs died first
The architecture is the residue of killing two reasonable-looking approaches — one that broke on external reality, one that broke on real users — and keeping what each failure taught.
Let an LLM read the audio
The first design had an LLM analyse BPM, key, and Spotify audio features. Two problems killed it: Spotify closed those audio endpoints to new apps in 2024, and more fundamentally, asking a language model to judge instrumentation asks it to do something it can’t — it has never heard the song. That’s the seed of the rule that survived to today: the LLM explains, it never ranks.
Pre-embed a royalty-free corpus
The next design pre-embedded a free-to-use catalogue (FMA/Jamendo) and matched against it. It was algorithmically sound and a product failure: ask for something like a chart hit and you get thirty tracks by artists nobody has heard of. It satisfied the math and failed the user.
Hybrid retrieve-then-rerank
The answer was to make two weak signals cover each other. Cultural sources (Last.fm, ListenBrainz) give cheap recall — what listeners treat as similar — and a CLAP audio model reranks for what actually sounds similar. Cultural recall keeps the results recognisable; the audio rerank keeps them perceptually honest. Neither leg is trustworthy alone.
Lazy, self-growing corpus
Rather than a weeks-long ETL, the engine embeds only the candidates a query actually surfaces (capped at 75) and caches the vectors in pgvector, so the corpus grows itself. That single cache-first decision is what lets a warm ~12s request and a cold ~12min one run on the exact same code path — the difference is just the cache-miss count.
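A minimal sketch of that cache-first step, under stated assumptions: a plain dict stands in for the pgvector table, and `embed` is a hypothetical callable standing in for the CLAP model. Neither name comes from the actual codebase. It shows why warm and cold requests can share one code path: the only thing that varies is the miss count.

```python
from typing import Callable, Dict, List, Sequence, Tuple

MAX_CANDIDATES = 75  # the cap stated in the text


def embed_cache_first(
    candidate_ids: Sequence[str],
    cache: Dict[str, List[float]],         # assumption: stand-in for the pgvector cache
    embed: Callable[[str], List[float]],   # assumption: stand-in for the CLAP embedder
    cap: int = MAX_CANDIDATES,
) -> Tuple[Dict[str, List[float]], int]:
    """Return vectors for up to `cap` candidates plus the number of cache misses."""
    vectors: Dict[str, List[float]] = {}
    misses = 0
    for track_id in candidate_ids[:cap]:
        vec = cache.get(track_id)
        if vec is None:
            # Cache miss: embed once and store, so the corpus grows with each query.
            vec = embed(track_id)
            cache[track_id] = vec
            misses += 1
        vectors[track_id] = vec
    return vectors, misses
```

A second call with the same candidates returns a miss count of zero, which is the warm case; a fresh cache makes every candidate a miss, which is the cold case.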
The four-way combination
The wedge is doing four things at once that no single tool does together: cultural recall, perceptual audio scoring, controllable text vibe-steering, and a grounded rationale.
Where it doesn’t win: Doppel won’t beat Spotify for casual “play me something similar.” The wedge is deliberate discovery: “I love this specific song, what makes it feel this way, and what else shares that exact quality.” Naming where a system loses is part of describing what it’s for.
Decisions, with the road not taken
Each of these is a fork where the rejected option was reasonable — the note is why the other branch won.
Last.fm’s 0–1 match and ListenBrainz’s integer score are uncalibrated, so fuse on rank alone — 1/(k+rank), k=60.
Two songs at identical BPM and key can feel nothing alike (deep house vs garage rock). A learned embedding captures texture a feature vector misses.
Collapsing a live or acoustic take into the studio master surfaces matches the user didn’t mean. Only a same-master re-release is suppressed (audio ≥ 0.98 ∧ title token-set ≥ 0.90).
Dedupe keys on both the verified MBID and the Deezer track id, because the same recording showed up twice under one Deezer id with different MBIDs.
The cold/warm split is gated on the uncached-candidate count, deferring a fancier latency estimator until real query-log calibration data exists.
Postgres already holds the metadata, logs, and cache, so the vectors live there too — one datastore, no extra operational surface.
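The rank-only fusion decision above can be sketched in a few lines. This is an illustration of reciprocal-rank fusion with k=60 as stated, not the project's actual code: the function name and the alphabetical tie-break are assumptions. Each source contributes 1/(k+rank) per track, so the uncalibrated raw scores never enter the sum.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence


def rrf_fuse(rankings: Iterable[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists by summing 1 / (k + rank), with rank starting at 1."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id (an assumption, for determinism).
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical track ids, best first, one list per cultural source.
lastfm = ["a", "b", "c"]
listenbrainz = ["b", "d", "a"]
fused = rrf_fuse([lastfm, listenbrainz])
```

A track that both sources rank well ("b" here) rises above one that only a single source ranks first.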
What’s deferred, named not hidden
Most of these fall out of the static-showcase architecture: with no public live backend, a whole class of hardening is scoped as deliberate judgment rather than built. Listing where the system isn’t finished is part of describing it honestly.
- Non-enumerable poll handles: the live job handle is a sequential rec-<int>; opaque tokens are a known follow-up
- API auth + inbound rate limiting: there is no public endpoint today, so neither is built; they're scoped, not shipped
- asyncpg connection-scoping: needed for real request concurrency; the single-worker path doesn't need it yet
- Cultural-only seed-equivalence: same-artist neighbours (DNA. under HUMBLE.) can still appear, legitimately same-vibe by the recording-level design
The pipeline, end to end
One run_pipeline coroutine, cache-first. The same code path serves a warm ~12s request and a cold ~12min one — the only difference is how many candidates miss the pgvector cache and need embedding.
- Cultural recall: Last.fm + ListenBrainz · ~200–300 raw candidates
- Dedupe + RRF fuse: dual-key dedupe · reciprocal-rank fusion (k=60)
- Resolve top 75: MusicBrainz canonicalize + Deezer verify · sequential, ~1 req/s
- Embed cache-misses: CLAP · in-memory decode · bounded concurrency (sem=4)
- Score + fuse: min-max-then-fuse · pure-numpy · α=0.7 / β=0.3
- Top 10: audio-scored first · cultural backfill tail if short
- LLM explains: one batched call · explains, never ranks
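The bounded-concurrency embed step in the list above might look roughly like this. It is a sketch under assumptions: `embed_one` is a hypothetical async callable standing in for the CLAP embedding call, and the helper name is invented for illustration. The only detail taken from the text is the semaphore limit of 4.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Sequence


async def embed_misses(
    track_ids: Sequence[str],
    embed_one: Callable[[str], Awaitable[List[float]]],  # assumption: CLAP stand-in
    max_concurrency: int = 4,                            # sem=4, as stated in the text
) -> Dict[str, List[float]]:
    """Embed every cache miss with at most `max_concurrency` calls in flight."""
    sem = asyncio.Semaphore(max_concurrency)

    async def guarded(track_id: str) -> List[float]:
        # The semaphore caps how many embeddings run at the same time.
        async with sem:
            return await embed_one(track_id)

    # gather preserves input order, so the results zip back onto the ids.
    vectors = await asyncio.gather(*(guarded(t) for t in track_ids))
    return dict(zip(track_ids, vectors))
```

A fixed bound keeps a cold request from launching all of its embeddings at once.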
Does the audio leg earn its keep?
These panels render straight from one frozen diagnostic run (eval-full-20260527-083852) over the full 19-seed benchmark set. It is a coverage-and-behaviour run, not precision/recall — there is no ground-truth “good vibe match” label, so nothing here measures or claims to beat any competitor. It only shows what the engine does.
Audio similarity holds across the whole map
diagnostic · no ground truth
Raw CLAP audio cosine for each genre's top-10 neighbours. Jazz clusters tightest and highest; electronic spreads lowest, but every genre lands well inside the music band, which is the cross-genre coverage claim.
axis: cosine 0.30 → 1.00 · the perceptual music band
The two legs live in different ranges
diagnostic · no ground truth
Audio cosine clusters high; the text encoder is a deliberately weak signal that clusters low. They barely overlap, which is exactly why fusion min-max-normalizes each leg within the batch before weighting (α=0.7 / β=0.3). You can't fuse raw values on different scales.
Same 0.30–1.00 axis. The gap between the bars is the whole argument for normalize-then-fuse.
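Normalize-then-fuse can be sketched as follows. Three things here are assumptions rather than facts from the engine: α is taken to weight the audio leg and β the text leg, a flat batch is mapped to zeros, and plain lists are used to keep the example dependency-free, whereas the pipeline is described as pure-numpy.

```python
from typing import List, Sequence


def min_max(values: Sequence[float]) -> List[float]:
    """Rescale one leg's scores to [0, 1] within the current batch."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a flat batch carries no ranking signal, so map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def fuse(
    audio: Sequence[float],
    text: Sequence[float],
    alpha: float = 0.7,  # assumption: weight on the audio leg
    beta: float = 0.3,   # assumption: weight on the text leg
) -> List[float]:
    """Normalize each leg separately, then take the weighted sum."""
    if len(audio) != len(text):
        raise ValueError("both legs must score the same candidates")
    return [alpha * a + beta * t for a, t in zip(min_max(audio), min_max(text))]


# Hypothetical raw scores: audio sits high, text sits low, as in the panel.
fused = fuse(audio=[0.9, 0.8, 0.7], text=[0.1, 0.3, 0.2])
```

Without the per-leg rescaling, the text leg's low raw values would barely affect the sum, and the 0.3 weight would have almost no effect on the order.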
CLAP reshuffles the cultural shortlist
diagnostic · no ground truth
Comparing the pure cultural (RRF) order to the CLAP-reranked order at k=10: the two share a median of just 0.2 of their top 10 (range 0.0–0.5), with a median rank displacement of 3.4 places (range 2.4–4.4). The audio leg is doing real work; it isn't a pass-through of the cultural ranking.
- 1. Alphanumeric — Lee Konitz
- 2. Red Pepper Blues — Art Pepper
- 3. Three to Get Ready — Dave Brubeck

- 1. DNA. — Kendrick Lamar
- 2. Magnolia — Playboi Carti
- 3. Stir Fry — Migos

- 1. Opus — Eric Prydz
- 2. Create — OVERWERK
- 3. Virus (How About Now) — Martin Garrix
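For readers who want to reproduce the kind of overlap and displacement figures quoted for the reshuffle panel, here is one plausible way to compute them. The exact definitions used in the diagnostic run are not given, so both are assumptions: overlap is the shared fraction of the two top-k lists, and displacement is the mean absolute change in position for the reranked top-k items.

```python
from typing import Sequence


def overlap_at_k(a: Sequence[str], b: Sequence[str], k: int = 10) -> float:
    """Fraction of the top-k that the two orderings have in common."""
    return len(set(a[:k]) & set(b[:k])) / k


def mean_rank_displacement(a: Sequence[str], b: Sequence[str], k: int = 10) -> float:
    """Mean absolute position change, measured over b's top-k items found in a."""
    pos_a = {item: i for i, item in enumerate(a)}
    shifts = [abs(pos_a[item] - i) for i, item in enumerate(b[:k]) if item in pos_a]
    return sum(shifts) / len(shifts) if shifts else 0.0


# Hypothetical shortlist: cultural (RRF) order versus an audio-reranked order.
cultural = ["a", "b", "c", "d"]
reranked = ["b", "a", "d", "c"]
shared = overlap_at_k(cultural, reranked, k=2)
moved = mean_rank_displacement(cultural, reranked, k=2)
```

Per-seed values from functions like these could then be summarised with a median and range across the benchmark seeds.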