How it works

Doppel matches the vibe of a seed track by combining cultural retrieval with an audio-embedding rerank. Here’s the reasoning behind that design, the pipeline itself, and the diagnostic evidence that the audio leg earns its place.

Two designs died first

The architecture is the residue of killing two reasonable-looking approaches — one that broke on external reality, one that broke on real users — and keeping what each failure taught.

Dead end 1

Let an LLM read the audio

The first design had an LLM analyse BPM, key, and Spotify audio features. Two problems killed it: Spotify closed those audio endpoints to new apps in 2024, and more fundamentally, asking a language model to judge instrumentation asks it to do something it can’t — it has never heard the song. That’s the seed of the rule that survived to today: the LLM explains, it never ranks.

Dead end 2

Pre-embed a royalty-free corpus

The next design pre-embedded a free-to-use catalogue (FMA/Jamendo) and matched against it. It was algorithmically sound and a product failure: ask for something like a chart hit and you get thirty tracks by artists nobody has heard of. It satisfied the math and failed the user.

The wedge

Hybrid retrieve-then-rerank

The answer was to make two weak signals cover each other. Cultural sources (Last.fm, ListenBrainz) give cheap recall — what listeners treat as similar — and a CLAP audio model reranks for what actually sounds similar. Cultural recall keeps the results recognisable; the audio rerank keeps them perceptually honest. Neither leg is trustworthy alone.

What makes it shippable

Lazy, self-growing corpus

Rather than a weeks-long ETL, the engine embeds only the candidates a query actually surfaces (capped at 75) and caches the vectors in pgvector, so the corpus grows itself. That single cache-first decision is what lets a warm ~12s request and a cold ~12min one run on the exact same code path — the difference is just the cache-miss count.
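The cache-first idea above can be sketched in a few lines. This is a minimal illustration, not the engine's code: a plain dict stands in for the pgvector table, and `embed_with_cache` and `embed_fn` are hypothetical names. Only the cap of 75 comes from the text.

```python
from typing import Callable, Dict, List, Sequence, Tuple

MAX_CANDIDATES = 75  # cap stated in the text


def embed_with_cache(
    candidate_ids: Sequence[str],
    cache: Dict[str, List[float]],
    embed_fn: Callable[[str], List[float]],
) -> Tuple[Dict[str, List[float]], int]:
    """Return vectors for the capped candidates and the number of cache misses.

    `cache` is a dict standing in for the pgvector table (assumption);
    `embed_fn` is a hypothetical stand-in for the slow audio-embedding step.
    """
    vectors: Dict[str, List[float]] = {}
    misses = 0
    for track_id in candidate_ids[:MAX_CANDIDATES]:
        if track_id not in cache:
            # Slow path: embed once, then keep the vector for later queries.
            cache[track_id] = embed_fn(track_id)
            misses += 1
        vectors[track_id] = cache[track_id]
    return vectors, misses
```

A warm and a cold request run through the same function; only the returned miss count differs, which is the quantity the text says drives latency.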

The four-way combination

The wedge is doing four things at once that no single tool does together: cultural recall, perceptual audio scoring, controllable text vibe-steering, and a grounded rationale.

Spotify / Apple: collaborative filtering, which drifts toward what's already popular

Last.fm: taste-based neighbours, but no audio signal at all

Chosic / Spotalike: thin wrappers over the Spotify graph

Maroofy: audio ML, but opaque, with no rationale and no controllable steering

Where it doesn’t win: Doppel won’t beat Spotify for casual “play me something similar.” The wedge is deliberate discovery: “I love this specific song, what makes it feel this way, and what else shares that exact quality.” Naming where a system loses is part of describing what it’s for.

Decisions, with the road not taken

Each of these is a fork where the rejected option was reasonable — the note is why the other branch won.

RRF rank-fusion over raw-score fusion

Last.fm’s 0–1 match and ListenBrainz’s integer score are uncalibrated, so fuse on rank alone — 1/(k+rank), k=60.
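As a minimal sketch of the rank-only fusion described above: each source contributes 1/(k+rank) per item, with ranks starting at 1 and k=60 as stated. The function names and the alphabetical tie-break are illustrative assumptions, not taken from the project.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def rrf_fuse(rankings: Iterable[List[str]], k: int = 60) -> Dict[str, float]:
    """Sum 1/(k + rank) for every list an item appears in (ranks start at 1)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    return dict(scores)


def rrf_order(rankings: Iterable[List[str]], k: int = 60) -> List[str]:
    """Items sorted by fused score, best first; ties broken alphabetically (assumption)."""
    scores = rrf_fuse(rankings, k)
    return sorted(scores, key=lambda item: (-scores[item], item))
```

For example, fusing `["a", "b", "c"]` with `["b", "d", "a"]` yields the order `b, a, d, c`: the source scores themselves never enter the calculation, only the positions.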

Learned CLAP embeddings over hand-crafted DSP features

Two songs at identical BPM and key can feel nothing alike (deep house vs garage rock). A learned embedding captures texture a feature vector misses.

Recording-level canonicalization over work-level

Collapsing a live or acoustic take into the studio master surfaces matches the user didn’t mean. Only a same-master re-release is suppressed (audio ≥ 0.98 ∧ title token-set ≥ 0.90).
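The suppression rule above is a conjunction of two thresholds, which the sketch below spells out. The thresholds come from the text; the title metric does not. The source does not say which token-set measure is used, so a simple Jaccard overlap on lowercase word tokens is swapped in here as a stand-in, and all function names are hypothetical.

```python
import re

AUDIO_THRESHOLD = 0.98  # from the text
TITLE_THRESHOLD = 0.90  # from the text


def title_tokens(title: str) -> frozenset:
    """Lowercase alphanumeric word tokens of a title."""
    return frozenset(re.findall(r"[a-z0-9]+", title.lower()))


def token_set_similarity(a: str, b: str) -> float:
    """Jaccard overlap of title tokens (assumed stand-in for the real metric)."""
    tokens_a, tokens_b = title_tokens(a), title_tokens(b)
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def is_same_master(audio_cosine: float, seed_title: str, candidate_title: str) -> bool:
    """True only when both the audio and the title thresholds are met."""
    return (
        audio_cosine >= AUDIO_THRESHOLD
        and token_set_similarity(seed_title, candidate_title) >= TITLE_THRESHOLD
    )
```

Under this stand-in metric, "Take Five (Live)" is not suppressed against "Take Five" even at a very high audio cosine, because the extra token drops the title score below 0.90. That matches the recording-level intent of keeping live takes distinct.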

Dual-key dedupe over MBID alone

Verified MBID and the Deezer track id — because the same recording showed up twice under one Deezer id with different MBIDs.
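One way to read the dual-key rule is: a candidate is a duplicate if either its MBID or its Deezer id has already been seen. The sketch below assumes that reading; the `Candidate` type and function name are hypothetical, and first-seen-wins ordering is an assumption.

```python
from typing import Iterable, List, NamedTuple, Optional


class Candidate(NamedTuple):
    title: str
    mbid: Optional[str]       # verified MusicBrainz recording id, if any
    deezer_id: Optional[int]  # Deezer track id, if any


def dedupe_dual_key(candidates: Iterable[Candidate]) -> List[Candidate]:
    """Keep the first candidate per recording, matching on MBID or Deezer id."""
    seen_mbids = set()
    seen_deezer = set()
    kept: List[Candidate] = []
    for cand in candidates:
        is_dup = (
            (cand.mbid is not None and cand.mbid in seen_mbids)
            or (cand.deezer_id is not None and cand.deezer_id in seen_deezer)
        )
        # Record both keys even for duplicates, so a later candidate that
        # shares only the duplicate's other key is still caught.
        if cand.mbid is not None:
            seen_mbids.add(cand.mbid)
        if cand.deezer_id is not None:
            seen_deezer.add(cand.deezer_id)
        if not is_dup:
            kept.append(cand)
    return kept
```

With MBID alone, two entries carrying different MBIDs under one Deezer id would both survive; checking the second key is what collapses them.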

Count-based gates over a work-budget estimator

The cold/warm split is gated on the uncached-candidate count, deferring a fancier latency estimator until real query-log calibration data exists.

pgvector over a dedicated vector DB

Postgres already holds the metadata, logs, and cache, so the vectors live there too — one datastore, no extra operational surface.

What’s deferred, named not hidden

Most of these fall out of the static-showcase architecture: with no public live backend, a whole class of hardening is scoped as deliberate judgment rather than built. Listing where the system isn’t finished is part of describing it honestly.

Non-enumerable poll handles: the live job handle is a sequential rec-<int>; opaque tokens are a known follow-up
API auth + inbound rate limiting: there is no public endpoint today, so neither is built; they're scoped, not shipped
asyncpg connection-scoping: needed for real request concurrency; not yet in place on the single-worker path
Cultural-only seed-equivalence: same-artist neighbours (DNA. under HUMBLE.) can still appear, legitimately same-vibe by the recording-level design

The pipeline, end to end

One run_pipeline coroutine, cache-first. The same code path serves a warm ~12s request and a cold ~12min one — the only difference is how many candidates miss the pgvector cache and need embedding.

1. Cultural recall: Last.fm + ListenBrainz · ~200–300 raw candidates
2. Dedupe + RRF fuse: dual-key dedupe · reciprocal-rank fusion (k=60)
3. Resolve top 75: MusicBrainz canonicalize + Deezer verify · sequential, ~1 req/s
4. Embed cache-misses: CLAP · in-memory decode · bounded concurrency (sem=4)
5. Score + fuse: min-max-then-fuse · pure-numpy · α=0.7 / β=0.3
6. Top 10: audio-scored first · cultural backfill tail if short
7. LLM explains: one batched call · explains, never ranks

Legend: cultural recall · audio rerank · explanation only
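The "bounded concurrency (sem=4)" note in the embed stage can be illustrated with a standard `asyncio.Semaphore`. This is a sketch under assumptions: `embed_misses` and `embed_one` are hypothetical names, and the real stage also decodes audio in memory, which is omitted here.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Sequence


async def embed_misses(
    track_ids: Sequence[str],
    embed_one: Callable[[str], Awaitable[List[float]]],
    limit: int = 4,  # sem=4, as in the pipeline summary
) -> Dict[str, List[float]]:
    """Embed every cache miss, running at most `limit` embeddings at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(track_id: str):
        # Only `limit` coroutines can be inside this block at once.
        async with semaphore:
            return track_id, await embed_one(track_id)

    pairs = await asyncio.gather(*(guarded(t) for t in track_ids))
    return dict(pairs)
```

All tasks are scheduled up front, but the semaphore caps how many are doing work, so a cold request with many misses does not fan out unboundedly.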

Does the audio leg earn its keep?

These panels render straight from one frozen diagnostic run (eval-full-20260527-083852) over the full 19-seed benchmark set. It is a coverage-and-behaviour run, not precision/recall — there is no ground-truth “good vibe match” label, so nothing here measures or claims to beat any competitor. It only shows what the engine does.

19/19 seeds audio-scored across 8 genres · median resolve found-ratio 0.987

Audio similarity holds across the whole map

diagnostic · no ground truth

Raw CLAP audio cosine for each genre's top-10 neighbours. Jazz clusters tightest and highest; electronic spreads lowest — but every genre lands well inside the music band, which is the cross-genre coverage claim.

Pop: 0.835–0.919
R&B: 0.756–0.870
Hip-hop: 0.685–0.930
Indie: 0.759–0.946
Electronic: 0.490–0.883
Jazz: 0.904–0.956
Pre-2000: 0.677–0.851
Non-English: 0.786–0.906

axis: cosine 0.30 → 1.00 · the perceptual music band

The two legs live in different ranges

diagnostic · no ground truth

Audio cosine clusters high; the text encoder is a deliberately weak signal that clusters low. They barely overlap — which is exactly why fusion min-max-normalizes each leg within the batch before weighting (α=0.7 / β=0.3). You can't fuse raw values on different scales.

Audio cosine: 0.490–0.956
Vibe-text cosine: 0.150–0.372

Same 0.30–1.00 axis. The gap between the bars is the whole argument for normalize-then-fuse.
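A minimal sketch of normalize-then-fuse, assuming per-batch min-max scaling and the stated weights α=0.7 / β=0.3. The text describes the real scoring as pure-numpy; plain Python lists are used here only to keep the example dependency-free. The function names and the handling of a constant batch (all zeros) are assumptions.

```python
from typing import List, Sequence


def min_max(values: Sequence[float]) -> List[float]:
    """Rescale a batch to [0, 1]; a constant batch maps to all zeros (assumption)."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


def fuse(
    audio_cos: Sequence[float],
    text_cos: Sequence[float],
    alpha: float = 0.7,
    beta: float = 0.3,
) -> List[float]:
    """Weight the two legs only after each is normalized within the batch."""
    audio_n = min_max(audio_cos)
    text_n = min_max(text_cos)
    return [alpha * a + beta * t for a, t in zip(audio_n, text_n)]
```

Taking the endpoints and midpoints of the two observed ranges as a toy batch, `fuse([0.490, 0.723, 0.956], [0.372, 0.150, 0.261])` comes out at roughly 0.30, 0.35 and 0.85. Without normalization the text leg, which tops out near 0.37, could never contribute as much as its 0.3 weight implies.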

CLAP reshuffles the cultural shortlist

diagnostic · no ground truth

Comparing the pure cultural (RRF) order to the CLAP-reranked order at k=10: the two share a median of just 0.2 of their top 10 (range 0.0–0.5), with a median rank displacement of 3.4 places (range 2.4–4.4). The audio leg is doing real work — it isn't a pass-through of the cultural ranking.

Median top-10 overlap (RRF vs CLAP order): 0.2
Median rank displacement (places moved): 3.4
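The two diagnostics can be sketched as follows. The overlap definition (shared fraction of the two top-k sets) follows directly from the text. The displacement definition is an assumption: the source does not say exactly how it is computed, so the mean absolute change in position over items present in both orders is used here, and both function names are hypothetical.

```python
from typing import List, Sequence


def top_k_overlap(order_a: Sequence[str], order_b: Sequence[str], k: int = 10) -> float:
    """Fraction of the top k that the two orderings share."""
    return len(set(order_a[:k]) & set(order_b[:k])) / k


def mean_rank_displacement(order_a: Sequence[str], order_b: Sequence[str]) -> float:
    """Mean absolute position change for items in both orders (assumed definition)."""
    position_b = {item: i for i, item in enumerate(order_b)}
    shifts: List[int] = [
        abs(i - position_b[item])
        for i, item in enumerate(order_a)
        if item in position_b
    ]
    return sum(shifts) / len(shifts) if shifts else 0.0
```

Computed per seed and then summarized with a median across seeds, numbers like these show how far the reranked order departs from the cultural one; they say nothing about which order is better.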

Take Five · The Dave Brubeck Quartet (overlap 0.4 · disp 4)

1. Alphanumeric · Lee Konitz
2. Red Pepper Blues · Art Pepper
3. Three to Get Ready · Dave Brubeck

HUMBLE. · Kendrick Lamar (overlap 0.3 · disp 2.6)

1. DNA. · Kendrick Lamar
2. Magnolia · Playboi Carti
3. Stir Fry · Migos

Strobe · deadmau5 (overlap 0.3 · disp 3.6)

1. Opus · Eric Prydz
2. Create · OVERWERK
3. Virus (How About Now) · Martin Garrix