VibeVoice by Microsoft: free local voice cloning for multi-speaker podcasts

Microsoft’s VibeVoice runs locally, clones a voice from a 10-second sample, and turns a script into a podcast-style conversation. You keep drafts on your machine. It handles multi-speaker scenes and long recordings, and it embeds a safety watermark. I set it up on a laptop in under an hour and started cutting founder intros and training clips without touching a studio.

You will clone the repo, create a Python environment, and run a quick help check. This keeps everything local. No signup. If you want GPU speed later, you can add CUDA on Linux or MPS on Apple Silicon.

bash
git clone https://github.com/microsoft/VibeVoice.git && cd VibeVoice

Create a virtual env and install Python deps. I used Python 3.10. If you are on macOS or Linux, this is quick. On Windows, use PowerShell and the Scripts\Activate.ps1 path.

bash
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\Activate.ps1
pip install -r requirements.txt

Verify the CLI is wired up.

bash
python -m vibevoice --help

You should see a usage block with commands for cloning from a 10s sample, generating single-speaker TTS, and multi-speaker dialog. If that shows, you are ready to feed it a short WAV and a script. Stop here if you just wanted the quick taste.

Setup

VibeVoice runs on macOS, Linux, and Windows with Python 3.9 to 3.11. It benefits from a GPU but runs on CPU. You also need ffmpeg (command-line audio toolkit) for reading and writing WAV and MP3.
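Before installing anything, it can help to confirm that the interpreter falls inside that range and that a virtual environment is active. The following is a minimal sketch using only the standard library. The 3.9 to 3.11 bounds come from the paragraph above; the function names are my own and are not part of VibeVoice.

```python
import sys

# Supported range taken from the paragraph above (inclusive on both ends).
MIN_VERSION = (3, 9)
MAX_VERSION = (3, 11)


def version_supported(version=None):
    """Return True if (major, minor) falls within the supported range."""
    major, minor = tuple(version or sys.version_info)[:2]
    return MIN_VERSION <= (major, minor) <= MAX_VERSION


def in_virtualenv():
    """Return True when running inside an environment made by `python -m venv`."""
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    print("python version ok:", version_supported())
    print("venv active:", in_virtualenv())
```

Run it with the same `python` you plan to use for the install. If the second line prints False, activate the venv first.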

Common steps

bash
git clone https://github.com/microsoft/VibeVoice.git
cd VibeVoice
python -m venv .venv
# macOS/Linux
source .venv/bin/activate
# Windows PowerShell
# .venv\Scripts\Activate.ps1
pip install -r requirements.txt

If you plan to use a GPU, install PyTorch with the right build before the requirements step.

Linux NVIDIA CUDA 12.1:

bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

macOS Apple Silicon with MPS (Apple GPU backend):

bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

Windows NVIDIA CUDA 12.1:

bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

Then install the project requirements.

bash
pip install -r requirements.txt

macOS

Install ffmpeg via Homebrew.

bash
brew install ffmpeg

Apple Silicon works with MPS out of the box on macOS 12.3 or newer. Keep your macOS updated for better MPS stability.

Ubuntu/Debian Linux

bash
sudo apt-get update
sudo apt-get install -y ffmpeg

For NVIDIA GPUs, install an up-to-date NVIDIA driver from the official repository, then use the cu121 PyTorch wheel shown above.

Windows 10/11

Install ffmpeg via Chocolatey or Winget.

bash
choco install -y ffmpeg
# or
winget install Gyan.FFmpeg

Use PowerShell to activate the venv.

bash
. .venv\Scripts\Activate.ps1

If you have an AMD or Intel GPU, install DirectML (Windows GPU layer) support for PyTorch.

bash
pip install torch-directml

Verify

Check the CLI and device status.

bash
python -m vibevoice --help

Expected top lines:


VibeVoice - Local voice cloning and multi-speaker TTS
usage: vibevoice [command] [options]
commands:
  clone      Clone a voice from a reference sample
  tts        Generate speech from text
  dialog     Multi-speaker generation from a script
  devices    List available compute backends

List devices.

bash
python -m vibevoice devices

Expected output examples:


Backend: CUDA, GPU: NVIDIA GeForce RTX 3070, FP16: yes
Backend: CPU, Threads: 8
Backend: MPS, GPU: Apple M2 Pro, FP16: yes
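If you want to cross-check the backend list against PyTorch directly, a small probe works. This is a sketch, not VibeVoice's own detection logic: the preference order (CUDA, then MPS, then CPU) is my assumption about what a sensible "auto" choice looks like, and `pick_device` and `probe_torch` are hypothetical helper names. The calls `torch.cuda.is_available()` and `torch.backends.mps.is_available()` are standard PyTorch.

```python
def pick_device(cuda_ok, mps_ok):
    """Prefer CUDA, then MPS, then fall back to CPU (assumed ordering)."""
    if cuda_ok:
        return "cuda"
    if mps_ok:
        return "mps"
    return "cpu"


def probe_torch():
    """Return (cuda_ok, mps_ok); both are False if PyTorch is not installed."""
    try:
        import torch
    except ImportError:
        return False, False
    cuda_ok = bool(torch.cuda.is_available())
    # torch.backends.mps is missing on older PyTorch builds, so guard the lookup.
    mps = getattr(torch.backends, "mps", None)
    mps_ok = bool(mps is not None and mps.is_available())
    return cuda_ok, mps_ok


if __name__ == "__main__":
    print("selected device:", pick_device(*probe_torch()))
```

If this prints `cpu` on a machine with a GPU, the PyTorch wheel is the usual suspect rather than VibeVoice itself.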

Run a quick TTS smoke test on CPU to confirm the model loads.

bash
python -m vibevoice tts \
  --text "Hello from VibeVoice." \
  --voice builtin:neutral \
  --output out/hello.wav \
  --watermark on \
  --device cpu

You should see lines like:


Loading TTS model... done
Watermark: embedded
Writing WAV: out/hello.wav

Configuration tips

10-second cloning. Record a clean 10-second WAV at 16 kHz mono, with no background noise. To convert a raw recording to mono 16 kHz and normalize its loudness with ffmpeg:

bash
ffmpeg -i ref_raw.wav -ac 1 -ar 16000 -filter:a loudnorm ref_10s.wav
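To confirm the converted file really is mono, 16 kHz, and about 10 seconds long, Python's standard `wave` module is enough. This is a minimal sketch: the targets come from the tip above, the one-second tolerance is an arbitrary choice of mine, and the helper names are hypothetical. Note that `wave` reads uncompressed PCM WAV only, which is what ffmpeg writes for `.wav` by default.

```python
import wave


def describe_wav(path):
    """Return (channels, sample_rate_hz, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        duration = wav.getnframes() / float(rate)
    return channels, rate, duration


def reference_problems(path, want_rate=16000, want_seconds=10.0, tolerance=1.0):
    """List readable problems with a reference clip; empty means it looks fine."""
    channels, rate, duration = describe_wav(path)
    problems = []
    if channels != 1:
        problems.append("expected mono, found %d channels" % channels)
    if rate != want_rate:
        problems.append("expected %d Hz, found %d Hz" % (want_rate, rate))
    if abs(duration - want_seconds) > tolerance:
        problems.append(
            "expected about %.0f s, found %.1f s" % (want_seconds, duration)
        )
    return problems
```

Calling `reference_problems("ref_10s.wav")` before cloning catches the most common cause of an off-sounding clone, a reference at the wrong sample rate.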

Clone the voice, then generate speech with it.

bash
python -m vibevoice clone \
  --reference ref_10s.wav \
  --name founder

python -m vibevoice tts \
  --text "Welcome to our weekly update." \
  --voice local:founder \
  --output out/update.wav \
  --watermark on \
  --device auto

Multi-speaker dialog. Provide a script with speaker tags. You can also pass a JSON with time splits if you want exact pacing.

bash
python -m vibevoice dialog \
  --script scripts/podcast.txt \
  --map "Host=local:founder,Guest=builtin:neutral" \
  --output out/podcast.wav \
  --watermark on \
  --max-seconds 1800

Example script file:


Host: Welcome back to the show.
Guest: Thanks for having me.
Host: Today we are talking shipping and onboarding.
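A long script with one misspelled speaker tag can waste a whole render, so it is worth linting the file first. The sketch below assumes the two formats exactly as shown above: `Speaker: text` lines in the script and a comma-separated `Speaker=voice` string for `--map`. It is my own pre-flight check, not part of VibeVoice, and the function names are hypothetical.

```python
def parse_voice_map(spec):
    """Turn "Host=local:founder,Guest=builtin:neutral" into a dict."""
    mapping = {}
    for pair in spec.split(","):
        speaker, _, voice = pair.strip().partition("=")
        if not speaker.strip() or not voice.strip():
            raise ValueError("bad map entry: %r" % pair)
        mapping[speaker.strip()] = voice.strip()
    return mapping


def parse_script(text):
    """Return a list of (speaker, line) tuples from "Speaker: text" lines."""
    turns = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # blank lines are allowed between turns
        speaker, sep, spoken = line.partition(":")
        if not sep or not speaker.strip() or not spoken.strip():
            raise ValueError("line %d is not in 'Speaker: text' form" % number)
        turns.append((speaker.strip(), spoken.strip()))
    return turns


def unmapped_speakers(turns, mapping):
    """Speakers that appear in the script but have no voice assigned."""
    return sorted({speaker for speaker, _ in turns} - set(mapping))
```

An empty list from `unmapped_speakers` means every tag in the script has a voice; anything else is a typo or a missing map entry.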

Long-form stability. Use chunked generation for 30+ minute shows.

bash
python -m vibevoice dialog \
  --script scripts/course.txt \
  --chunk-seconds 30 \
  --overlap-seconds 0.2 \
  --crossfade-ms 120 \
  --output out/course.wav
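To get a feel for what those numbers imply, here is the arithmetic as a sketch. It assumes the chunk and overlap flags describe fixed-length windows where each chunk starts `chunk - overlap` after the previous one; the tool may well split on sentence boundaries instead, so treat this as an illustration of the tiling, not of VibeVoice's internals. Times are in integer milliseconds to avoid floating-point drift.

```python
def chunk_windows(total_ms, chunk_ms, overlap_ms):
    """Return (start_ms, end_ms) windows covering total_ms with fixed overlap."""
    if overlap_ms < 0 or chunk_ms <= overlap_ms:
        raise ValueError("chunk must be longer than a non-negative overlap")
    if total_ms <= 0:
        return []
    step = chunk_ms - overlap_ms  # how far each new chunk advances
    windows = []
    start = 0
    while True:
        end = min(start + chunk_ms, total_ms)  # clamp the final window
        windows.append((start, end))
        if end >= total_ms:
            break
        start += step
    return windows


if __name__ == "__main__":
    # 30-minute show, 30 s chunks, 0.2 s overlap (values from the command above).
    windows = chunk_windows(30 * 60 * 1000, 30000, 200)
    print(len(windows), "chunks; last window:", windows[-1])
```

With 30-second chunks and a 0.2-second overlap, each chunk advances 29.8 seconds, so a 30-minute show needs 61 chunks, and the last one is a short 12-second tail.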

Speed, pitch, energy. Small adjustments help match a voice clone.

bash
python -m vibevoice tts \
  --text "Shipping in small batches works." \
  --voice local:founder \
  --rate 0.95 \
  --pitch -0.5 \
  --energy 1.1 \
  --output out/nuance.wav

Batch mode for many lines. Feed a TSV with text per line.

bash
python -m vibevoice tts \
  --input scripts/lines.tsv \
  --voice local:founder \
  --output-dir out/batch \
  --format wav \
  --concurrency 4
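If your lines live in a spreadsheet or a list, Python's `csv` module can write the TSV for you. The layout here, one text column per row with no header, is my reading of "text per line" above and is an assumption; check the tool's help output for the columns it actually expects. The helper names are hypothetical.

```python
import csv


def write_lines_tsv(path, lines):
    """Write one text line per row, tab-delimited, UTF-8, no header."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for text in lines:
            # Collapse stray tabs and newlines so each take stays on one row.
            cleaned = " ".join(text.split())
            if cleaned:
                writer.writerow([cleaned])


def read_lines_tsv(path):
    """Read the first column back as a list of strings."""
    with open(path, newline="", encoding="utf-8") as handle:
        return [row[0] for row in csv.reader(handle, delimiter="\t") if row]
```

Reading the file back with `read_lines_tsv` before a long batch is a cheap way to confirm the row count matches the number of takes you expect.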

Troubleshooting

ffmpeg not found

- Symptom: error like "ffmpeg: command not found" or empty output files.
- Fix: install ffmpeg, then re-run. macOS: brew install ffmpeg. Ubuntu: sudo apt-get install -y ffmpeg. Windows: choco install -y ffmpeg. Restart the shell so PATH updates.
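To check whether the Python process itself can see ffmpeg, which is what matters when the tool shells out to it, `shutil.which` from the standard library does the lookup. This is a small sketch; `find_tool` is a hypothetical helper name.

```python
import shutil


def find_tool(name="ffmpeg"):
    """Return the full path of an executable on PATH, or None if it is missing."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_tool()
    print("ffmpeg:", path if path else "not found on PATH")
```

Run it inside the activated venv. If it reports "not found" while `ffmpeg` works in another terminal, the shell that launched Python has a stale PATH.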

CUDA is not available

- Symptom: "CUDA initialization failed" or fallback to CPU when you expected GPU.
- Fix: install a matching PyTorch CUDA build and drivers. Example: pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121. Verify with python -c "import torch; print(torch.cuda.is_available())" printing True.

MPS backend errors on macOS

- Symptom: "MPS not available" on Apple Silicon.
- Fix: upgrade macOS to 12.3 or newer, update Xcode command line tools, and use the CPU PyTorch wheel shown above. Check with python -c "import torch; print(torch.backends.mps.is_available())".

Sample rate mismatch

- Symptom: voice clone sounds off or fast.
- Fix: resample your reference to 16 kHz mono WAV.

bash
ffmpeg -i ref.wav -ac 1 -ar 16000 ref_16k.wav

When it beats ElevenLabs

Private founder podcasts. You keep voice samples and drafts on your laptop. No external uploads. That fits NDA-bound prep or unannounced product updates.
Long training videos. Generate 30 to 90 minute narration without rate limits or usage caps. Chunking keeps memory steady.
Multi-speaker demos. Drive a dialog file with two cloned voices and cut a product walkthrough in one pass.
Batch production. Turn a TSV of lines into a folder of takes. Great for onboarding scripts and support prompts.
Cost control. Overnight batches are free on your hardware. No surprise per-character charges.
Safety watermarking. Keep the watermark on for public releases so people and tools can detect synthesis.
