Microsoft’s VibeVoice runs locally, clones a voice from a 10-second sample, and turns a script into a podcast-style conversation. Drafts stay on your machine. It handles multi-speaker scenes and long recordings, and it embeds a safety watermark. I set it up on a laptop in under an hour and started cutting founder intros and training clips without touching a studio.
You will clone the repo, create a Python environment, and run a quick help check. This keeps everything local. No signup. If you want GPU speed later, you can add CUDA on Linux or MPS on Apple Silicon.
```bash
git clone https://github.com/microsoft/VibeVoice.git && cd VibeVoice
```
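Create a virtual environment and install the Python dependencies. I used Python 3.10. On macOS or Linux this is quick. On Windows, activate from PowerShell with .venv\Scripts\Activate.ps1. If your checkout has no requirements.txt, follow the install command in the repo's README instead.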
```bash
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\Activate.ps1
pip install -r requirements.txt
```
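Before moving on, it can help to confirm that the virtual environment is actually active and to see which accelerator is visible. Below is a minimal sketch, assuming the install pulled in PyTorch (I have not confirmed that here, so the script falls back to CPU if it is missing). The helper names `in_virtualenv` and `pick_device` are mine, not part of VibeVoice.

```python
import sys


def in_virtualenv() -> bool:
    """True when the interpreter runs inside a venv (prefix differs from the base install)."""
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def pick_device() -> str:
    """Return "cuda", "mps", or "cpu" depending on what PyTorch reports."""
    try:
        import torch  # assumption: PyTorch is among the installed dependencies
    except ImportError:
        return "cpu"  # no PyTorch found, so there is nothing to accelerate

    if torch.cuda.is_available():
        return "cuda"  # NVIDIA GPU with a working CUDA build
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"  # Apple Silicon GPU via Metal
    return "cpu"


if __name__ == "__main__":
    version = f"{sys.version_info.major}.{sys.version_info.minor}"
    print(f"Python {version}, venv active: {in_virtualenv()}")
    print(f"Best available device: {pick_device()}")
```

If this reports `cpu` on a machine with a GPU, the usual cause is a CPU-only PyTorch build rather than anything specific to VibeVoice.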
Verify the CLI is wired up.
```bash
python -m vibevoice --help
```
You should see a usage block with commands for cloning from a 10-second sample, generating single-speaker TTS, and generating multi-speaker dialog. If that shows, you are ready to feed it a short WAV and a script. Stop here if you just wanted the quick taste.
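Since the whole flow starts from a short WAV, I like to check the sample before handing it to the tool. Here is a small sketch using only Python's standard `wave` module. It reads uncompressed PCM WAV files and has no VibeVoice dependency. The 8 to 30 second window is my own guess at a sensible range around the 10-second target, not a documented limit, and the function names are hypothetical.

```python
import sys
import wave


def wav_info(path: str) -> dict:
    """Read duration, sample rate, and channel count from a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return {
            "seconds": wav.getnframes() / float(rate),
            "sample_rate": rate,
            "channels": wav.getnchannels(),
        }


def is_usable_sample(path: str, min_seconds: float = 8.0, max_seconds: float = 30.0) -> bool:
    """Check that the clip length falls inside an assumed, adjustable window."""
    seconds = wav_info(path)["seconds"]
    return min_seconds <= seconds <= max_seconds


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("usage: python check_sample.py path/to/sample.wav")
    else:
        info = wav_info(sys.argv[1])
        print(f"{info['seconds']:.1f}s, {info['sample_rate']} Hz, {info['channels']} channel(s)")
        print("looks usable" if is_usable_sample(sys.argv[1]) else "outside the expected length")
```

Save it as `check_sample.py` (any name works) and run it against your recording. A clip that is far too short or too long is cheaper to catch here than after a generation run.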