I ran across this post on running Whisper WebUI under Docker a while back and kept it running for some time. Something broke in a recent release, though, and I tend to prefer command-line tools for things, so I went looking for alternatives.
The tools Whisper WebUI runs under the hood have command-line equivalents available. In particular, there’s insanely-fast-whisper-cli. Getting it running wasn’t particularly difficult…if anything, it was easier than getting GPU compute running within Docker containers:
git clone https://github.com/ochen1/insanely-fast-whisper-cli
sudo mv insanely-fast-whisper-cli /opt
sudo chown -R $(whoami) /opt/insanely-fast-whisper-cli
python -m venv /opt/insanely-fast-whisper-cli
source /opt/insanely-fast-whisper-cli/bin/activate
pip install -r /opt/insanely-fast-whisper-cli/requirements.txt
pip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 --index-url https://download.pytorch.org/whl/cu126
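# Quote the heredoc delimiter so "$@" lands in the script literally instead of being expanded now
cat <<'EOF' | sudo tee /usr/local/bin/whisper-cli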
#!/usr/bin/env bash
source /opt/insanely-fast-whisper-cli/bin/activate
python /opt/insanely-fast-whisper-cli/insanely-fast-whisper.py "$@"
EOF
sudo chmod +x /usr/local/bin/whisper-cli
This uses a downgraded torch (v2.7.0) that I need to use whisper-cli with my GeForce GTX 1070. If you have a newer card, you can probably leave out the pip install torch==2.7.0... bit.
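If you're not sure whether the torch build you ended up with can actually see your card, something along these lines may help. It's only a sketch: check_torch_gpu is a made-up helper name, and the default interpreter path assumes the venv lives at /opt/insanely-fast-whisper-cli as above. It relies on nothing beyond torch.__version__, torch.cuda.is_available() and torch.cuda.get_device_name().

```shell
# Hypothetical helper: print the torch version and whether CUDA is usable.
# The optional first argument is the Python interpreter to ask; it defaults
# to the venv created above (an assumption about your install location).
check_torch_gpu() {
  local py="${1:-/opt/insanely-fast-whisper-cli/bin/python}"
  "$py" -c '
import torch
print("torch", torch.__version__)
if torch.cuda.is_available():
    print("cuda device:", torch.cuda.get_device_name(0))
else:
    print("cuda not available")
'
}
```

Run check_torch_gpu with no arguments to query the whisper venv, or pass another interpreter path to check a different venv.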
Once all this is in place, you can then use something like whisper-cli foo.avi to produce foo.srt.
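If you have a whole directory of videos, a small loop saves some typing. This is a rough sketch rather than anything the tool ships with: transcribe_all is a hypothetical function name, and it assumes the whisper-cli wrapper from above is on your PATH.

```shell
# Hypothetical helper: run whisper-cli on each file given, one at a time.
# Arguments that are not regular files are skipped with a warning; the loop
# stops and returns non-zero as soon as whisper-cli itself fails.
transcribe_all() {
  local f
  for f in "$@"; do
    if [ ! -f "$f" ]; then
      echo "skipping $f: not a regular file" >&2
      continue
    fi
    echo "transcribing $f" >&2
    whisper-cli "$f" || return 1
  done
}
```

Something like transcribe_all *.avi would then work through every .avi file in the current directory. Processing files one at a time keeps only a single job on the GPU.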
You might sometimes find that background music confuses Whisper. There’s another tool for that: vocal. Installation is even simpler:
sudo mkdir /opt/vocal
sudo chown -R $(whoami) /opt/vocal
python -m venv /opt/vocal
source /opt/vocal/bin/activate
pip install vocal
pip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 --index-url https://download.pytorch.org/whl/cu126
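# Quote the heredoc delimiter so "$@" lands in the script literally instead of being expanded now
cat <<'EOF' | sudo tee /usr/local/bin/vocali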
#!/usr/bin/env bash
source /opt/vocal/bin/activate
vocali "$@"
EOF
sudo chmod +x /usr/local/bin/vocali
Running vocali -i in.mkv -o in.mp3 will produce a file with all of the background music stripped out. Vocals will be retained, as will anything spoken in a normal voice.
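The two tools chain together naturally: strip the music first, then transcribe the cleaned audio. Here's a minimal sketch, assuming both wrappers from this post are on your PATH. The function name transcribe_without_music and the .vocals.mp3 suffix are my own inventions for illustration; the suffix is there so the cleaned file never overwrites the input.

```shell
# Hypothetical helper: strip background music with vocali, then feed the
# cleaned audio to whisper-cli. The intermediate file is the input name with
# its extension replaced by .vocals.mp3 (an arbitrary choice for this sketch).
# whisper-cli only runs if vocali succeeds.
transcribe_without_music() {
  local src="$1"
  local clean="${src%.*}.vocals.mp3"
  vocali -i "$src" -o "$clean" && whisper-cli "$clean"
}
```

Calling transcribe_without_music in.mkv writes in.vocals.mp3 and then transcribes that file, so expect the subtitles to be named after the cleaned audio rather than the original video.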