Commit 529d4f4

adding all we need for transcription
1 parent bbe2d9f commit 529d4f4

File tree

4 files changed (+961, -0 lines)


docs/whisper_transcription/README.md

Lines changed: 146 additions & 0 deletions
@@ -0,0 +1,146 @@
# Whisper Transcription + Summarization + Diarization API

This project provides a high-performance pipeline for **audio/video transcription**, **speaker diarization**, and **summarization** using [Faster-Whisper](https://github.com/guillaumekln/faster-whisper), Hugging Face LLMs (e.g. Mistral), and [pyannote.audio](https://github.com/pyannote/pyannote-audio). It exposes a **FastAPI-based REST API** and supports CLI usage as well.

---

## Features

- Transcribes audio using **Faster-Whisper** (multi-GPU support; see the library sketch below)
- Summarizes long transcripts using **Mistral-7B** by default
- Performs speaker diarization via **PyAnnote**
- Optional denoising using **Demucs + Noisereduce**
- Supports real-time **streaming API responses**
- Works on common audio/video formats: `.flac`, `.wav`, `.mp3`, `.m4a`, `.aac`, `.ogg`, `.webm`, `.opus`, `.mp4`, `.mov`, `.mkv`, `.avi`, etc.
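
Under the hood, transcription is driven by Faster-Whisper. The following is a minimal, library-level sketch for orientation only; it is not this project's code, and the model size and audio file are placeholders:

```python
# Illustrative Faster-Whisper usage (the pipeline's real code adds diarization,
# summarization, denoising, and streaming on top of this).
from faster_whisper import WhisperModel

model = WhisperModel("medium", device="cuda", compute_type="float16")
segments, info = model.transcribe("example.wav", beam_size=5)

print(f"Detected language: {info.language}")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```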
---

## Installation

### 1. Create virtual environment
```bash
python3 -m venv whisper_env
source whisper_env/bin/activate
```

### 2. Install PyTorch (with CUDA 12.1 for H100/A100)
```bash
pip install torch==2.2.2+cu121 torchaudio==2.2.2+cu121 -f https://download.pytorch.org/whl/torch_stable.html
```
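
Optionally, confirm that the CUDA build is picked up before moving on (a quick sanity check, not part of the project's scripts):

```python
# Optional sanity check: confirms PyTorch sees the GPU before running the pipeline.
import torch

print(torch.__version__)                 # should report a +cu121 build
print(torch.cuda.is_available())         # True if the CUDA runtime and driver are usable
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0)) # e.g. an H100 or A100
```
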
### 3. Install requirements
```bash
pip install -r requirements.txt
```

> Make sure `ffmpeg` is installed on your system:
```bash
sudo apt install ffmpeg
```
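
If in doubt, a short check that `ffmpeg` is actually reachable from your environment (optional, not part of the project):

```python
# Optional check: verify that ffmpeg is on PATH for the current environment.
import shutil

ffmpeg_path = shutil.which("ffmpeg")
print(ffmpeg_path or "ffmpeg not found; install it before running the pipeline")
```
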
---

## Usage

### CLI Transcription & Summarization

```bash
python faster_code_week1_v28.py \
  --input /path/to/audio_or_folder \
  --model medium \
  --output-dir output/ \
  --summarized-model mistralai/Mistral-7B-Instruct-v0.1 \
  --summary \
  --speaker \
  --denoise \
  --prop-decrease 0.7 \
  --hf-token YOUR_HUGGINGFACE_TOKEN \
  --streaming \
  --max-speakers 2 \
  --ground-truth ground_truth.txt
```

**Arguments:**

| Argument             | Description                                                                   |
|----------------------|-------------------------------------------------------------------------------|
| `--input`            | **Required.** Path to input file or directory of audio/video.                 |
| `--model`            | Whisper model to use (`base`, `small`, `medium`, `large`, `turbo`). Auto-detects if not specified. |
| `--output-dir`       | Directory to store output files. Defaults to a timestamped folder.            |
| `--summarized-model` | Hugging Face or local LLM for summarization. Default: `Mistral-7B`.           |
| `--denoise`          | Enable two-stage denoising (Demucs + noisereduce).                            |
| `--prop-decrease`    | Float in [0.0–1.0] controlling noise suppression strength. Default: `0.7`.    |
| `--summary`          | Enable summarization after transcription.                                     |
| `--speaker`          | Enable speaker diarization using PyAnnote.                                    |
| `--streaming`        | Stream results in real time, chunk by chunk.                                  |
| `--hf-token`         | Hugging Face token for gated model access.                                    |
| `--max-speakers`     | Limit the number of identified speakers. Optional.                            |
| `--ground-truth`     | Path to ground-truth `.txt` for WER evaluation (see the sketch below). Optional. |
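
When `--ground-truth` is supplied, the transcript is scored by word error rate; `jiwer` (listed under dependencies) is the usual tool for this. A minimal sketch, assuming plain-text reference and hypothesis files (the file paths are placeholders, and the pipeline's own evaluation may normalize text differently):

```python
# Minimal WER check with jiwer (illustrative only).
import jiwer

reference = open("ground_truth.txt", encoding="utf-8").read()
hypothesis = open("output/example.txt", encoding="utf-8").read()

wer = jiwer.wer(reference, hypothesis)
print(f"WER: {wer:.2%}")
```
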
---

### Start API Server

```bash
uvicorn whisper_api_server:app --host 0.0.0.0 --port 8000
```
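
For orientation, below is a heavily simplified sketch of what a `/transcribe` endpoint accepting these form fields could look like. It is not the actual `whisper_api_server.py`; it only illustrates the request shape using FastAPI's `UploadFile` and `Form` helpers:

```python
# Illustrative only: a toy FastAPI endpoint mirroring the form fields used in the
# example call below. The real whisper_api_server.py adds model caching,
# diarization, summarization, denoising, and streaming responses.
from fastapi import FastAPI, File, Form, UploadFile

app = FastAPI()

@app.post("/transcribe")
async def transcribe(
    audio_file: UploadFile = File(...),
    model: str = Form("medium"),
    summary: bool = Form(False),
    speaker: bool = Form(False),
    denoise: bool = Form(False),
    streaming: bool = Form(False),
    hf_token: str = Form(""),
    max_speakers: int = Form(0),
):
    data = await audio_file.read()
    # ... run Faster-Whisper (plus optional diarization/summarization) here ...
    return {"filename": audio_file.filename, "bytes_received": len(data)}
```
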
### Example API Call

```bash
curl -X POST http://<YOUR_IP>:8000/transcribe \
  -F "audio_file=@test.wav" \
  -F "model=medium" \
  -F "summary=true" \
  -F "speaker=true" \
  -F "denoise=false" \
  -F "streaming=true" \
  -F "hf_token=hf_xxx" \
  -F "max_speakers=2"
```
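
The same request from Python, using `requests` (equivalent to the `curl` call above; install `requests` separately if needed):

```python
# Python equivalent of the curl example above.
import requests

with open("test.wav", "rb") as f:
    resp = requests.post(
        "http://<YOUR_IP>:8000/transcribe",
        files={"audio_file": ("test.wav", f, "audio/wav")},
        data={
            "model": "medium",
            "summary": "true",
            "speaker": "true",
            "denoise": "false",
            "streaming": "true",
            "hf_token": "hf_xxx",
            "max_speakers": "2",
        },
        timeout=600,
    )

print(resp.status_code)
print(resp.text)
```
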
---

## Outputs

For each input file, the pipeline generates:

- `*.txt` — Transcript with speaker labels and timestamps
- `*.json` — Transcript + speaker segments + summary
- `transcription_log_*.log` — Full debug log for reproducibility
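
A small sketch for inspecting the `*.json` output. The exact schema is not documented here, so this only lists the top-level keys, and the path is a placeholder:

```python
# Inspect a result JSON without assuming its exact schema.
import json

with open("output/example.json", encoding="utf-8") as f:
    result = json.load(f)

if isinstance(result, dict):
    for key in result:
        print(key)
```
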
---

## Hugging Face Token

To enable **speaker diarization**, accept the model terms at:
[https://huggingface.co/pyannote/segmentation](https://huggingface.co/pyannote/segmentation)

Then generate a token at:
[https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)
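
For reference, diarization with pyannote.audio typically looks like the sketch below. The exact pipeline name loaded by this project is not stated in the README, so `pyannote/speaker-diarization` is an assumption here, as is the audio path:

```python
# Illustrative pyannote.audio usage (pipeline name and file are assumptions,
# not necessarily what this project loads internally).
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization",
    use_auth_token="YOUR_HUGGINGFACE_TOKEN",
)

diarization = pipeline("example.wav", max_speakers=2)
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s - {turn.end:.1f}s: {speaker}")
```
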
---
## Dependencies

Key Python packages:

- `faster-whisper`
- `transformers`
- `pyannote.audio`
- `librosa`, `pydub`, `noisereduce`
- `ffmpeg-python`, `demucs`
- `fastapi`, `uvicorn`, `jiwer`

---

## Notes

- The API **caches one Whisper model per variant**, so repeated requests don't reload weights (see the sketch below).
- **Diarization is performed globally** over the entire audio, not per chunk.
- **Denoising uses Demucs to isolate vocals**, which may be GPU-intensive.
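
One common way to implement such per-variant caching, sketched under the assumption that the server keys a dictionary by model name (the real implementation may differ):

```python
# Sketch of per-variant model caching (assumed structure, not the project's code).
from faster_whisper import WhisperModel

_model_cache: dict[str, WhisperModel] = {}

def get_model(variant: str = "medium") -> WhisperModel:
    """Return a cached WhisperModel, loading it only on first use."""
    if variant not in _model_cache:
        _model_cache[variant] = WhisperModel(
            variant, device="cuda", compute_type="float16"
        )
    return _model_cache[variant]
```
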
---
