How to clone your voice using GPT-SoVITS for FREE
Here is how you can clone your voice using GPT-SoVITS. With only one minute of your voice data you can train a good TTS model that replicates your unique tone. To begin, scroll down to the Windows section of the GPT-SoVITS project page, click the link to download the integrated package, and wait for the 6–7 GB file to finish downloading. Once it has downloaded, extract the archive to your desktop (you may need 7-Zip; just search for “7z” if you don’t have it). After extraction, double-click “go-webui.bat” to launch the web interface in your browser. If the interface appears in Chinese, open the .bat file in Notepad and remove the “zh_CN” flag so that it displays in English.
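Before moving on, it may help to confirm that your recordings really add up to about one minute. The sketch below is not part of GPT-SoVITS; it is a small standalone helper, and the function name total_wav_seconds is made up for this example. It assumes your recordings are uncompressed PCM .wav files in a single folder, since Python's built-in wave module cannot read other formats.

```python
import wave
from pathlib import Path


def total_wav_seconds(folder):
    """Sum the duration, in seconds, of every .wav file directly inside folder.

    Assumption: the files are uncompressed PCM WAV, which is the only
    kind the standard-library wave module can open.
    """
    total = 0.0
    for path in sorted(Path(folder).glob("*.wav")):
        with wave.open(str(path), "rb") as wav:
            # duration = number of frames / frames per second
            total += wav.getnframes() / wav.getframerate()
    return total
```

Calling total_wav_seconds with the path of your recordings folder returns the combined length in seconds; a result well below 60 suggests recording a little more before training.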
Next, prepare your audio for cloning. Copy the path of the folder that holds your recordings into the Audio Slicer input field and click Open Speech Slicing. Once you see “Speech Slicing Finished,” open the slicer_opt subfolder of the output folder to review the segmented clips. Then run the Speech Denoising step, which writes the cleaned segments to denoise_opt, and run Speech Recognition so that it transcribes those cleaned segments into denoise_opt.list. Load that list into the Speech-to-Text Proofreading tool and use the Audio Labeling WebUI to correct any mismatches between the audio and the text before saving your finalized transcript.
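A quick scripted check of the transcript list can catch obvious problems before you proofread by ear. The sketch below is a standalone helper, not a GPT-SoVITS tool, and the name check_list_file is hypothetical. It assumes each line of the .list file has four pipe-separated fields in the order audio_path|speaker_name|language|text; if your file looks different, adjust the field count accordingly.

```python
from pathlib import Path


def check_list_file(list_path):
    """Return (line_number, problem) tuples for a transcript .list file.

    Assumption: each non-empty line looks like
    audio_path|speaker_name|language|text
    """
    problems = []
    lines = Path(list_path).read_text(encoding="utf-8").splitlines()
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        # Split at most 3 times so a "|" inside the transcript is kept.
        fields = line.split("|", 3)
        if len(fields) != 4:
            problems.append((number, "expected 4 fields"))
            continue
        audio_path, _speaker, _language, text = fields
        if not text.strip():
            problems.append((number, "empty transcript"))
        if not Path(audio_path).is_file():
            problems.append((number, "audio file not found"))
    return problems
```

An empty result means every line has the assumed shape, a non-empty transcript, and an audio file that exists on disk; it says nothing about whether the transcript is accurate, so the proofreading step is still needed.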
Finally, switch to the GPT-SoVITS-TTS tab and assign any model name you like. In the Dataset Formatting tool, point the text labeling file to your denoise_opt.list (or to the folder itself) and click the Training Set One-Click Formatting button. Move to Fine-tuning, launch SoVITS training, then start GPT training, and once that completes go to 1C-inference. Choose the highest-value model, enable Parallel Inference Version, and open the TTS Inference WebUI. Upload one of your denoised audio clips as the reference, enter its reference text (editing the ASR output file if needed), switch the language to English, and click Start Inference. Remember that initial training can take two to three hours on a CPU (much faster with a GPU), while short text-to-audio conversions take only seconds. When it’s done, play back your cloned voice. It may not be perfect with limited samples or if you’re not a native speaker, but you’ll see how simple it is to refine your model further.
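If the model dropdown lists many checkpoints, picking the highest-value one can be done mechanically. The sketch below is only an illustration: it assumes that “highest-value” means the largest epoch number, and that checkpoint filenames contain an epoch marker such as “_e8” or “-e15” (for example myvoice_e8_s96.pth or myvoice-e15.ckpt, both hypothetical names). The function latest_checkpoint is made up for this example and is not part of GPT-SoVITS.

```python
import re

# Assumed naming: an epoch marker like "_e8" or "-e15" somewhere in the name.
_EPOCH = re.compile(r"[-_]e(\d+)")


def latest_checkpoint(filenames):
    """Return the filename with the largest epoch number, or None if none match."""
    best_name, best_epoch = None, -1
    for name in filenames:
        epochs = _EPOCH.findall(name)
        if not epochs:
            continue  # no epoch marker, so not a checkpoint we recognise
        # Use the last marker in case the model name itself contains "_e<digits>".
        epoch = int(epochs[-1])
        if epoch > best_epoch:
            best_name, best_epoch = name, epoch
    return best_name
```

Comparing the epochs as integers matters here: a plain alphabetical sort would place e8 after e12 and pick the wrong file.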
Please do give it a try. Thank you very much.