
Feature Request: Improved Multi-Language Audio Transcription #3334

@ilhamsyahids


Problem

Whisper.cpp excels at transcribing audio in a single, specified language. However, when an audio input contains speech in multiple languages, transcription quality sometimes degrades significantly. This often manifests as:

  • Skipping speech segments: Portions of the audio, particularly when language switches occur, are not transcribed at all.
  • Unrelated results: The model may attempt to transcribe in the primary identified language, leading to nonsensical or incorrect text for segments in other languages.

Proposed Solutions

I propose adding functionality to whisper.cpp to handle multi-language audio inputs. I've explored a couple of approaches and believe they offer promising avenues for improvement:

  1. Segment-level Language Detection (VAD + Auto-Detect):

    • Concept: Utilize a Voice Activity Detection (VAD) mechanism to segment the audio into smaller, speech-containing chunks. Run language detection on each segment independently, then transcribe that segment using the detected language.
    • Current Efforts: I have already experimented with this approach externally by using VAD to split segments and then performing auto-language detection on each segment before passing it to whisper.cpp. While this provides a workaround, integrating it directly into whisper.cpp would offer a more streamlined and optimized solution.
  2. Constrained Language Probability for a List of Known Languages:

    • Concept: Instead of full auto-detection on every segment, allow the user to provide a predefined list of possible languages present in the audio (e.g., ["en", "fr", "de"]). For each speech segment, the model would then calculate probabilities only for these specified languages, selecting the most probable one for transcription. This could be particularly efficient if the number of potential languages is small.
    • Initial Scope: For an initial implementation, focusing on bilingual audio (e.g., providing two possible languages) would be a good starting point to demonstrate feasibility.
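To illustrate the constrained-selection idea in approach 2: assuming we already have per-language probabilities for a segment (whisper.cpp's language auto-detection produces such a distribution), restricting the choice to a user-supplied shortlist is a small filtering step. This is a Python sketch of the selection logic only; `pick_language` and its inputs are hypothetical names, not part of the whisper.cpp API.

```python
def pick_language(lang_probs: dict[str, float], allowed: list[str]) -> str:
    """Pick the most probable language from a user-supplied shortlist.

    lang_probs maps language codes to detection probabilities; allowed is
    the predefined list of languages the user expects, e.g. ["en", "fr"].
    """
    # Keep only the allowed languages; missing codes get probability 0.
    candidates = {lang: lang_probs.get(lang, 0.0) for lang in allowed}
    if not any(candidates.values()):
        raise ValueError("none of the allowed languages received probability mass")
    return max(candidates, key=candidates.get)

# Example: unconstrained auto-detect would pick "nl" here, but the caller
# knows the audio only contains English or French, so "fr" is selected.
probs = {"nl": 0.40, "fr": 0.35, "en": 0.20, "de": 0.05}
print(pick_language(probs, ["en", "fr"]))  # fr
```

The benefit over full auto-detection is that a spuriously high-scoring language outside the expected set can never be chosen, which is exactly the failure mode described above for code-switched audio.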

Benefits

  • Significantly improved transcription accuracy for multilingual audio.
  • Wider applicability in diverse real-world use cases.
  • Reduced need for external pre-processing steps.

I believe these approaches are technically feasible.
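To make the segmentation step in approach 1 concrete, here is a toy energy-based VAD sketch in Python. A real implementation would use a proper VAD model rather than an amplitude threshold; this only shows how speech chunks could be cut out before running per-segment language detection. All names and parameters here are illustrative assumptions, not existing whisper.cpp code.

```python
def split_speech_segments(samples, sample_rate, frame_ms=30, energy_thresh=0.01):
    """Split raw mono PCM (floats in [-1, 1]) into speech segments.

    Returns (start_sample, end_sample) pairs covering runs of frames whose
    mean absolute amplitude exceeds energy_thresh. A toy stand-in for a
    real VAD; boundaries snap to frame edges.
    """
    frame_len = sample_rate * frame_ms // 1000
    segments, start = [], None
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        is_speech = sum(abs(s) for s in frame) / len(frame) > energy_thresh
        if is_speech and start is None:
            start = i
        elif not is_speech and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(samples)))
    return segments

# Toy input at a 1 kHz sample rate: 1 s silence, 1 s "speech", 1 s silence.
sr = 1000
audio = [0.0] * sr + [0.5] * sr + [0.0] * sr
print(split_speech_segments(audio, sr))  # [(990, 2010)]
```

Each returned segment would then go through language detection (optionally constrained to the user's language list, as in approach 2) before being transcribed with the chosen language.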

Happy to contribute a PR to implement this feature. I would appreciate any guidance or feedback on the preferred implementation strategy and integration points within the codebase.
