Doppel

How it works

Doppel matches the vibe of a seed track by combining cultural retrieval with an audio-embedding rerank. Here’s the reasoning behind that design, the pipeline itself, and the diagnostic evidence that the audio leg earns its place.

Two designs died first

The architecture is the residue of killing two reasonable-looking approaches — one that broke on external reality, one that broke on real users — and keeping what each failure taught.

Dead end 1

Let an LLM read the audio

The first design had an LLM analyse BPM, key, and Spotify audio features. Two problems killed it: Spotify closed those audio endpoints to new apps in 2024, and more fundamentally, asking a language model to judge instrumentation asks it to do something it can’t — it has never heard the song. That’s the seed of the rule that survived to today: the LLM explains, it never ranks.

Dead end 2

Pre-embed a royalty-free corpus

The next design pre-embedded a free-to-use catalogue (FMA/Jamendo) and matched against it. It was algorithmically sound and a product failure: ask for something like a chart hit and you get thirty tracks by artists nobody has heard of. It satisfied the math and failed the user.

The wedge

Hybrid retrieve-then-rerank

The answer was to make two weak signals cover each other. Cultural sources (Last.fm, ListenBrainz) give cheap recall — what listeners treat as similar — and a CLAP audio model reranks for what actually sounds similar. Cultural recall keeps the results recognisable; the audio rerank keeps them perceptually honest. Neither leg is trustworthy alone.

What makes it shippable

Lazy, self-growing corpus

Rather than a weeks-long ETL, the engine embeds only the candidates a query actually surfaces (capped at 75) and caches the vectors in pgvector, so the corpus grows itself. That single cache-first decision is what lets a warm ~12s request and a cold ~12min one run on the exact same code path — the difference is just the cache-miss count.
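The cache-first idea above can be sketched as follows. This is a minimal illustration, not the engine's code: a plain dict stands in for the pgvector table, `embed` stands in for the CLAP embedder, and the names `embed_cache_first` and `MAX_CANDIDATES` are hypothetical. Only the cap of 75 comes from the text.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = List[float]
MAX_CANDIDATES = 75  # cap stated in the text


def embed_cache_first(
    candidate_ids: Sequence[str],
    cache: Dict[str, Vector],        # assumption: stand-in for the pgvector table
    embed: Callable[[str], Vector],  # assumption: stand-in for the CLAP embedder
) -> Tuple[Dict[str, Vector], int]:
    """Return vectors for the capped shortlist plus the cache-miss count."""
    shortlist = list(candidate_ids)[:MAX_CANDIDATES]
    misses = 0
    vectors: Dict[str, Vector] = {}
    for track_id in shortlist:
        if track_id not in cache:
            # The only expensive step: embed once, then reuse forever.
            cache[track_id] = embed(track_id)
            misses += 1
        vectors[track_id] = cache[track_id]
    return vectors, misses
```

Running the same query twice against the same cache returns a miss count of zero the second time, which is the warm case: same function, no embedding work.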

The four-way combination

The wedge is doing four things at once that no single tool does together: cultural recall, perceptual audio scoring, controllable text vibe-steering, and a grounded rationale.

Spotify / Apple: collaborative filtering, which drifts toward what's already popular
Last.fm: taste-based neighbours, but no audio signal at all
Chosic / Spotalike: thin wrappers over the Spotify graph
Maroofy: audio ML, but opaque, with no rationale and no controllable steering

Where it doesn’t win: Doppel won’t beat Spotify for casual “play me something similar.” The wedge is deliberate discovery: “I love this specific song, what makes it feel this way, and what else shares that exact quality.” Naming where a system loses is part of describing what it’s for.

Decisions, with the road not taken

Each of these is a fork where the rejected option was reasonable — the note is why the other branch won.

RRF rank-fusion over raw-score fusion

Last.fm’s 0–1 match and ListenBrainz’s integer score are uncalibrated, so fuse on rank alone — 1/(k+rank), k=60.
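A minimal sketch of reciprocal-rank fusion with the stated k=60. It assumes ranks start at 1 and that each source arrives as an ordered list of track ids; the function name `rrf_fuse` and the alphabetical tie-break are illustrative choices, not taken from the source.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def rrf_fuse(rankings: Iterable[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists by summing 1 / (k + rank) per track."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Assumption: ranks are 1-based, so the top item contributes 1 / (k + 1).
        for rank, track_id in enumerate(ranking, start=1):
            scores[track_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))


lastfm_order = ["a", "b", "c"]        # ordered by Last.fm match
listenbrainz_order = ["b", "d", "a"]  # ordered by ListenBrainz score
print(rrf_fuse([lastfm_order, listenbrainz_order]))  # ['b', 'a', 'd', 'c']
```

Only the positions enter the sum, so the two sources' raw score scales never need to be reconciled.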

Learned CLAP embeddings over hand-crafted DSP features

Two songs at identical BPM and key can feel nothing alike (deep house vs garage rock). A learned embedding captures texture a feature vector misses.

Recording-level canonicalization over work-level

Collapsing a live or acoustic take into the studio master surfaces matches the user didn’t mean. Only a same-master re-release is suppressed (audio ≥ 0.98 ∧ title token-set ≥ 0.90).
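The suppression rule can be read as a two-part predicate. The sketch below uses the two thresholds from the text, but the title measure is an assumption: the source says "token-set" without naming the exact metric, so a Jaccard overlap of lowercase alphanumeric tokens is swapped in here. The function names are hypothetical.

```python
import re
from typing import Set

AUDIO_THRESHOLD = 0.98  # from the text
TITLE_THRESHOLD = 0.90  # from the text


def _tokens(title: str) -> Set[str]:
    """Lowercase alphanumeric tokens of a title."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def title_token_set_similarity(a: str, b: str) -> float:
    """Assumption: Jaccard overlap stands in for the real token-set metric."""
    ta, tb = _tokens(a), _tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def is_same_master(audio_cosine: float, seed_title: str, candidate_title: str) -> bool:
    """True only when both the audio and the title say 'same recording'."""
    return (
        audio_cosine >= AUDIO_THRESHOLD
        and title_token_set_similarity(seed_title, candidate_title) >= TITLE_THRESHOLD
    )
```

Requiring both conditions is what keeps a live or acoustic take visible: either its audio cosine falls under 0.98 or its title differs enough to fail the second test.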

Dual-key dedupe over MBID alone

Candidates are deduped on both the verified MBID and the Deezer track id, because the same recording showed up twice under one Deezer id with different MBIDs.
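One way to express a dual-key dedupe is to treat a candidate as a duplicate when either key has been seen before. This is a sketch under that assumption; the `Candidate` type and `dedupe_dual_key` are hypothetical names, and the real engine may match keys differently.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class Candidate:
    title: str
    mbid: Optional[str]       # verified MusicBrainz id, if any
    deezer_id: Optional[int]  # Deezer track id, if any


def dedupe_dual_key(candidates: Iterable[Candidate]) -> List[Candidate]:
    """Keep the first candidate seen for any MBID or Deezer id."""
    seen_mbids: Set[str] = set()
    seen_deezer: Set[int] = set()
    kept: List[Candidate] = []
    for c in candidates:
        duplicate = (c.mbid is not None and c.mbid in seen_mbids) or (
            c.deezer_id is not None and c.deezer_id in seen_deezer
        )
        # Record both keys even for duplicates, so a later candidate that
        # shares only the "other" key is still recognised.
        if c.mbid is not None:
            seen_mbids.add(c.mbid)
        if c.deezer_id is not None:
            seen_deezer.add(c.deezer_id)
        if not duplicate:
            kept.append(c)
    return kept
```

With MBID alone, two entries carrying different MBIDs but the same Deezer id would both survive, which is the failure the text describes.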

Count-based gates over a work-budget estimator

The cold/warm split is gated on the uncached-candidate count, deferring a fancier latency estimator until real query-log calibration data exists.
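A count-based gate needs nothing more than the number of uncached candidates and a cut-off. The sketch below illustrates that shape; the threshold value and both function names are hypothetical, since the source does not state the actual cut-off.

```python
from typing import Iterable, Set

COLD_MISS_THRESHOLD = 10  # hypothetical value; the source does not state one


def count_uncached(candidate_ids: Iterable[str], cached_ids: Set[str]) -> int:
    """Number of distinct candidates that still need embedding."""
    return len(set(candidate_ids) - cached_ids)


def choose_path(
    candidate_ids: Iterable[str],
    cached_ids: Set[str],
    threshold: int = COLD_MISS_THRESHOLD,
) -> str:
    """Return 'cold' when the miss count exceeds the threshold, else 'warm'."""
    misses = count_uncached(candidate_ids, cached_ids)
    return "cold" if misses > threshold else "warm"
```

The appeal of the simple gate is that it needs no latency model, only a count that is already known before any embedding starts.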

pgvector over a dedicated vector DB

Postgres already holds the metadata, logs, and cache, so the vectors live there too — one datastore, no extra operational surface.

What’s deferred, named not hidden

Most of these fall out of the static-showcase architecture: with no public live backend, a whole class of hardening is scoped as deliberate judgment rather than built. Listing where the system isn’t finished is part of describing it honestly.

  • Non-enumerable poll handles: the live job handle is a sequential rec-<int>; opaque tokens are a known follow-up
  • API auth + inbound rate limiting: there is no public endpoint today, so neither is built; they're scoped, not shipped
  • asyncpg connection-scoping: needed for real request concurrency; the single-worker path doesn't do it yet
  • Cultural-only seed-equivalence: same-artist neighbours (DNA. under HUMBLE.) can still appear, legitimately same-vibe by the recording-level design
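For the first deferred item, one possible shape of the follow-up is to hand out random tokens and keep the sequential id internal. This is a sketch of that idea only, not something the project ships; the `JobHandles` class is hypothetical, and `secrets.token_urlsafe` is the standard-library call for URL-safe random tokens.

```python
import secrets
from typing import Dict, Optional


class JobHandles:
    """Maps unguessable public tokens to internal sequential job ids."""

    def __init__(self) -> None:
        self._by_token: Dict[str, int] = {}

    def issue(self, internal_job_id: int) -> str:
        # 32 random bytes, URL-safe encoded; not derivable from the job id.
        token = secrets.token_urlsafe(32)
        self._by_token[token] = internal_job_id
        return token

    def resolve(self, token: str) -> Optional[int]:
        """Return the internal id, or None for unknown or guessed tokens."""
        return self._by_token.get(token)
```

A client that guesses `rec-1`, `rec-2` and so on gets nothing back, because the sequential id is never accepted as a handle.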

The pipeline, end to end

One run_pipeline coroutine, cache-first. The same code path serves a warm ~12s request and a cold ~12min one — the only difference is how many candidates miss the pgvector cache and need embedding.

  1. Cultural recall
    Last.fm + ListenBrainz · ~200–300 raw candidates
  2. Dedupe + RRF fuse
    dual-key dedupe · reciprocal-rank fusion (k=60)
  3. Resolve top 75
    MusicBrainz canonicalize + Deezer verify · sequential, ~1 req/s
  4. Embed cache-misses
    CLAP · in-memory decode · bounded concurrency (sem=4)
  5. Score + fuse
    min-max-then-fuse · pure-numpy · α=0.7 / β=0.3
  6. Top 10
    audio-scored first · cultural backfill tail if short
  7. LLM explains
    one batched call · explains, never ranks
Stage legend: cultural recall · audio rerank · explanation only
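Step 4's bounded concurrency (sem=4) can be sketched with an `asyncio.Semaphore`. The function `embed_misses` and the `embed_one` callable are hypothetical stand-ins for whatever downloads, decodes, and embeds a single track; only the limit of 4 comes from the pipeline description.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Sequence

Vector = List[float]


async def embed_misses(
    track_ids: Sequence[str],
    embed_one: Callable[[str], Awaitable[Vector]],  # assumption: per-track embedder
    max_concurrency: int = 4,                       # sem=4 from the pipeline
) -> Dict[str, Vector]:
    """Embed all cache-misses with at most `max_concurrency` in flight."""
    sem = asyncio.Semaphore(max_concurrency)

    async def guarded(track_id: str) -> Vector:
        async with sem:  # waits here while 4 embeds are already running
            return await embed_one(track_id)

    # gather preserves input order, so zip pairs ids with their vectors.
    vectors = await asyncio.gather(*(guarded(t) for t in track_ids))
    return dict(zip(track_ids, vectors))
```

The semaphore caps how many tracks are decoded in memory at once, which matters more on a cold run where most of the 75 candidates miss the cache.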

Does the audio leg earn its keep?

These panels render straight from one frozen diagnostic run (eval-full-20260527-083852) over the full 19-seed benchmark set. It is a coverage-and-behaviour run, not precision/recall — there is no ground-truth “good vibe match” label, so nothing here measures or claims to beat any competitor. It only shows what the engine does.

19/19 seeds audio-scored across 8 genres · median resolve found-ratio 0.987

Audio similarity holds across the whole map

diagnostic · no ground truth

Raw CLAP audio cosine for each genre's top-10 neighbours. Jazz clusters tightest and highest; electronic spreads lowest — but every genre lands well inside the music band, which is the cross-genre coverage claim.

Pop: 0.835–0.919
R&B: 0.756–0.870
Hip-hop: 0.685–0.930
Indie: 0.759–0.946
Electronic: 0.490–0.883
Jazz: 0.904–0.956
Pre-2000: 0.677–0.851
Non-English: 0.786–0.906

axis: cosine 0.30 → 1.00 · the perceptual music band

The two legs live in different ranges

diagnostic · no ground truth

Audio cosine clusters high; the text encoder is a deliberately weak signal that clusters low. They barely overlap — which is exactly why fusion min-max-normalizes each leg within the batch before weighting (α=0.7 / β=0.3). You can't fuse raw values on different scales.

Audio cosine: 0.490–0.956
Vibe-text cosine: 0.150–0.372

Same 0.30–1.00 axis. The gap between the bars is the whole argument for normalize-then-fuse.
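Normalize-then-fuse can be sketched in a few lines. Two assumptions are made here: α weights the audio leg and β the vibe-text leg (the text gives the values but not the pairing explicitly), and a batch where every score is equal normalizes to zeros. The engine is described as pure-numpy; plain lists are used below only so the sketch runs without dependencies.

```python
from typing import List, Sequence

ALPHA = 0.7  # assumption: weight on the audio leg
BETA = 0.3   # assumption: weight on the vibe-text leg


def min_max(values: Sequence[float]) -> List[float]:
    """Rescale a batch to [0, 1] using its own min and max."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a flat batch carries no ranking signal.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def fuse(
    audio: Sequence[float],
    text: Sequence[float],
    alpha: float = ALPHA,
    beta: float = BETA,
) -> List[float]:
    """Normalize each leg within the batch, then take the weighted sum."""
    if len(audio) != len(text):
        raise ValueError("audio and text scores must align one-to-one")
    a, t = min_max(audio), min_max(text)
    return [alpha * x + beta * y for x, y in zip(a, t)]
```

Without the per-leg normalization, a raw audio cosine near 0.9 would swamp a raw text cosine near 0.3 regardless of the weights, so the text leg could never steer the order.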

CLAP reshuffles the cultural shortlist

diagnostic · no ground truth

Comparing the pure cultural (RRF) order to the CLAP-reranked order at k=10: the two share a median of just 0.2 of their top 10 (range 0.0–0.5), with a median rank displacement of 3.4 places (range 2.4–4.4). The audio leg is doing real work — it isn't a pass-through of the cultural ranking.

Median top-10 overlap (RRF vs CLAP order): 0.2
Median rank displacement (places moved): 3.4
Take Five · The Dave Brubeck Quartet (overlap 0.4 · disp 4)
  1. Alphanumeric · Lee Konitz
  2. Red Pepper Blues · Art Pepper
  3. Three to Get Ready · Dave Brubeck
HUMBLE. · Kendrick Lamar (overlap 0.3 · disp 2.6)
  1. DNA. · Kendrick Lamar
  2. Magnolia · Playboi Carti
  3. Stir Fry · Migos
Strobe · deadmau5 (overlap 0.3 · disp 3.6)
  1. Opus · Eric Prydz
  2. Create · OVERWERK
  3. Virus (How About Now) · Martin Garrix
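The two reshuffle figures in this panel could be computed along these lines. The overlap definition (shared items divided by k) follows directly from the text. The displacement definition is an assumption, since the source does not spell it out: here it is the mean absolute change in position for the CLAP top-k items relative to their place in the RRF order. Both function names are hypothetical.

```python
from typing import Dict, Sequence


def top_k_overlap(rrf_order: Sequence[str], clap_order: Sequence[str], k: int = 10) -> float:
    """Fraction of the top k shared by the two orderings."""
    return len(set(rrf_order[:k]) & set(clap_order[:k])) / k


def mean_rank_displacement(
    rrf_order: Sequence[str], clap_order: Sequence[str], k: int = 10
) -> float:
    """Assumption: mean |RRF position - CLAP position| over the CLAP top k."""
    rrf_rank: Dict[str, int] = {t: i for i, t in enumerate(rrf_order)}
    moves = [
        abs(rrf_rank[t] - i)
        for i, t in enumerate(clap_order[:k])
        if t in rrf_rank  # skip anything absent from the RRF list
    ]
    return sum(moves) / len(moves) if moves else 0.0
```

An overlap of 1.0 with a displacement of 0.0 would mean the audio leg is a pass-through; the reported medians are far from that.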