I Built a Podcast That Talks in My Own Voice

A text-to-speech engine running on my DGX Spark, narrating a whole essay podcast in a voice cloned from a single recording of me, and the Blackwell GPU fight it took to make it render about three times faster than before, comfortably quicker than real time.

I have a podcast now. I don’t record it.

I write the script, hand it to a machine in my homelab, and a few minutes later there’s an episode — narrated, start to finish, in my own voice. Not a voice that sounds vaguely like me. My voice, cloned from a single recording I made into a microphone one afternoon. The show is a warm, single-host essay format, the kind where someone pulls up a chair and thinks out loud for twenty minutes. The host is me. The narrator is a text-to-speech model running on a GB10.

Here’s how it works, and the GPU fight it took to make it fast.

The model

The engine is VoxCPM2, an Apache-2.0 text-to-speech model that does zero-shot voice cloning — give it a few seconds of reference audio and it renders new text in that voice. It runs fully offline on the DGX Spark in my homelab. Nothing leaves the house, nothing hits a cloud API, and the voice model is built from recordings of exactly one person: me. That last part matters. Voice cloning is a technology with an obvious dark side, so the only voice this thing knows how to make is the one I gave it, for the one show it narrates.

The naive path works on day one. There’s a reference voxcpm package, you make a virtual environment, you point it at a WAV of your voice and a block of text, and out comes audio. The problem is speed. On the host path, the model renders at a real-time factor (RTF) of about 1.3 — meaning a thirty-minute episode takes roughly thirty-nine minutes of compute. Workable, but every script revision means another forty-minute wait. I wanted to iterate faster than that.

The Blackwell problem

The Spark’s GPU is a GB10 — sm_121, Blackwell architecture. It is genuinely fast hardware. It is also new enough that most of the prebuilt ML wheels you’d reach for have never heard of it.

The accelerated inference path I wanted relies on nano-vllm plus flash-attention. And flash-attn is exactly where it falls apart:

  • flash-attn won’t build from source on this box. The host ships CUDA 13.0; the torch build wants cu12.9. Beyond the version mismatch, FlashAttention-2 simply predates Blackwell — there’s no kernel for sm_121 to compile in the first place. You can sit through the build and still get nothing that runs.

So you stop trying to build it and go find someone who already did. NVIDIA’s container, nvcr.io/nvidia/vllm:25.12.post1-py3, ships flash-attn 2.7.4 prebuilt for sm_121. nano-vllm depends on flash-attn without pinning a version, so the prebuilt one satisfies it cleanly, and the image’s torch is recent enough that nothing tries to “helpfully” replace the NVIDIA build with a stock wheel. The whole acceleration story turns on using the right base image instead of fighting the compiler.

That decision creates three smaller fires, each of which cost me an evening:

  • torchaudio’s ABI doesn’t match. NVIDIA’s torch is built against a custom ABI, and the PyPI torchaudio won’t load against it. It turns out VoxCPM2 never actually calls torchaudio — it resamples through librosa — so the fix is to not have the real one at all. I dropped in a tiny shim package backed by soundfile and librosa that satisfies the import and nothing else.
  • spawn re-imports your script. nano-vllm spins up worker processes with spawn, which re-imports the entry module in each worker. Any code that isn’t guarded under if __name__ == "__main__": runs again in every worker — which, for model-loading code, is exactly the catastrophe it sounds like. Move everything under the guard.
  • cuDNN is missing a Blackwell engine. The AudioVAE that turns latents back into a waveform leans on conv_transpose1d, and cuDNN doesn’t yet ship an engine for it on this GPU. Setting torch.backends.cudnn.enabled = False makes the convolutions fall back to native CUDA — which works fine — while flash-attention keeps doing the heavy lifting in the attention layers, so the speedup survives.

The payoff for all of that:

Task               Host venv    Container (nano-vllm)    Speedup
Plain TTS          RTF ~1.25    RTF 0.39                 3.2×
My-voice render    RTF ~1.40    RTF 0.40                 3.5×

RTF below 1.0 means faster than real time. A thirty-minute episode now renders in about twelve minutes instead of thirty-nine — same model, same quality, roughly 3× faster — and that’s the difference between “I’ll re-render after dinner” and “let me just fix that one sentence and run it again.”
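The arithmetic behind those numbers is simple enough to check by hand. This is a small sketch using the RTF values from the table; the function names are made up for illustration.

```python
def render_minutes(audio_minutes, rtf):
    """Compute time needed to render `audio_minutes` of speech at a given RTF."""
    return audio_minutes * rtf


def speedup(host_rtf, container_rtf):
    """How many times faster the container path is than the host path."""
    return host_rtf / container_rtf


# (host RTF, container RTF) pairs taken from the benchmark table
benchmarks = {
    "Plain TTS": (1.25, 0.39),
    "My-voice render": (1.40, 0.40),
}

for task, (host, container) in benchmarks.items():
    print(
        f"{task}: {speedup(host, container):.1f}x faster, "
        f"30-minute episode in about {render_minutes(30, container):.0f} minutes"
    )
```

The thirty-nine-minute baseline quoted earlier is the same calculation with the host RTF rounded to 1.3.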

Script to mp3, in one command

An episode is one command:

./scripts/build_podcast.sh scripts/script_<topic>.txt <slug> "<Episode Title>" my_voice2.wav
# → episodes/<slug>/<slug>.mp3

That wraps the whole pipeline: take the script, render it in my voice on the GPU, run QA, master the audio, and emit a tagged mp3. A couple of things I learned the hard way are baked into the house style now:

  • Spell numbers out as words. “Eighty percent,” not “80%.” The model reads digits and symbols poorly, and worse, the QA step catches the mismatch and fails the render. Writing “twenty twenty-six” instead of “2026” is a small tax that pays for itself.
  • QA is a gate, not a suggestion. Every render gets transcribed and checked two ways: does the transcript match the script, and is the voice consistent across the whole episode (a similarity score that has to clear 0.93). If either fails, the episode doesn’t ship. This is what keeps a finished render from going out with a garbled paragraph in the middle that I’d only notice on playback.
  • Match the loudness of your reference clip. The output comes out at roughly the level of the voice sample you cloned from. Record at a healthy level close to the mic, or every episode lands quiet and needs a normalization pass afterward.
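The first two rules lend themselves to a short sketch. This is not the pipeline's real QA code: the function names, the difflib-based transcript comparison, and the 0.98 transcript threshold are all assumptions for illustration. Only the 0.93 voice-similarity floor comes from the description above. The transcript and the similarity score are assumed to arrive from a separate speech-recognition and speaker-comparison step.

```python
import re
from difflib import SequenceMatcher

SIMILARITY_FLOOR = 0.93   # voice-consistency threshold quoted in the text
TRANSCRIPT_FLOOR = 0.98   # hypothetical threshold for the transcript match


def unspoken_tokens(script):
    """Find digits and symbols that should be spelled out before rendering."""
    return re.findall(r"[$]?\d+(?:[.,]\d+)*%?", script)


def _words(text):
    """Lowercase word list, ignoring punctuation and spacing."""
    return re.findall(r"[a-z']+", text.lower())


def transcript_match(script, transcript):
    """Word-level similarity between the script and the transcribed audio."""
    return SequenceMatcher(None, _words(script), _words(transcript)).ratio()


def qa_gate(script, transcript, voice_similarity):
    """Return a list of failures; an empty list means the episode may ship."""
    failures = []
    if transcript_match(script, transcript) < TRANSCRIPT_FLOOR:
        failures.append("transcript does not match script")
    if voice_similarity < SIMILARITY_FLOOR:
        failures.append("voice similarity below threshold")
    return failures
```

Running unspoken_tokens over a script before rendering would flag "80%" and "2026" so they can be rewritten as words, which is cheaper than finding out from a failed render.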

The publishing step is deliberately not automated. The pipeline ends at a tagged mp3, and I upload it to Spotify by hand. Partly that’s a “human looks at it before the world does” checkpoint, and partly it’s that Spotify’s episode-description editor drops characters when you paste into it programmatically — the kind of janky integration that isn’t worth automating around for a once-a-week task.

Why bother

I could have recorded these episodes with my actual mouth. It would have been faster to set up and the result would be, arguably, more “authentic.”

But the interesting problem was never “make a podcast.” It was: can I run a real voice-cloning model, on my own hardware, in my own voice, fast enough to be a tool rather than a demo — on a GPU so new that half the software stack doesn’t admit it exists yet? The answer turned out to be yes, and the path there was a tour through exactly the kind of thing I like: a brand-new accelerator, a software ecosystem that hasn’t caught up to it, and a stack of small, specific incompatibilities that each have a real fix once you stop expecting the happy path to work.

The podcast is the artifact. The engine is the point.

You can listen to the show here. The voice is mine. The work was the fun part.