
Feature Request: Improved Multi-Language Audio Transcription #3334

@ilhamsyahids


Problem

Whisper.cpp excels at transcribing audio in a single, specified language. However, when an audio input contains speech in multiple languages, transcription quality sometimes degrades significantly. This often manifests as:

  • Skipping speech segments: Portions of the audio, particularly when language switches occur, are not transcribed at all.
  • Unrelated results: The model may attempt to transcribe in the primary identified language, leading to nonsensical or incorrect text for segments in other languages.

Proposed Solutions

I propose adding functionality to whisper.cpp to handle multi-language audio inputs. I've explored a couple of approaches and believe they offer promising avenues for improvement:

  1. Segment-level Language Detection (VAD + Auto-Detect):

    • Concept: Utilize a Voice Activity Detection (VAD) mechanism to segment the audio into smaller, speech-containing chunks. Run language detection on each segment independently, then transcribe that segment using the detected language.
    • Current Efforts: I have already experimented with this approach externally by using VAD to split segments and then performing auto-language detection on each segment before passing it to whisper.cpp. While this provides a workaround, integrating it directly into whisper.cpp would offer a more streamlined and optimized solution.
  2. Constrained Language Probability for a List of Known Languages:

    • Concept: Instead of full auto-detection on every segment, allow the user to provide a predefined list of possible languages present in the audio (e.g., ["en", "fr", "de"]). For each speech segment, the model would then calculate probabilities only for these specified languages, selecting the most probable one for transcription. This could be particularly efficient if the number of potential languages is small.
    • Initial Scope: For an initial implementation, focusing on bilingual audio (e.g., providing two possible languages) would be a good starting point to demonstrate feasibility.
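To illustrate the constrained-selection idea in approach 2: assuming we already have per-language probabilities for a segment (whisper.cpp's language auto-detection produces such a distribution), restricting the choice to a user-supplied shortlist is a small filtering step. This is a Python sketch of the selection logic only; `pick_language` and its inputs are hypothetical names, not part of the whisper.cpp API.

```python
def pick_language(lang_probs: dict[str, float], allowed: list[str]) -> str:
    """Pick the most probable language from a user-supplied shortlist.

    lang_probs maps language codes to detection probabilities; allowed is
    the predefined list of languages the user expects, e.g. ["en", "fr"].
    """
    # Keep only the allowed languages; missing codes get probability 0.
    candidates = {lang: lang_probs.get(lang, 0.0) for lang in allowed}
    if not any(candidates.values()):
        raise ValueError("none of the allowed languages received probability mass")
    return max(candidates, key=candidates.get)

# Example: unconstrained auto-detect would pick "nl" here, but the caller
# knows the audio only contains English or French, so "fr" is selected.
probs = {"nl": 0.40, "fr": 0.35, "en": 0.20, "de": 0.05}
print(pick_language(probs, ["en", "fr"]))  # fr
```

The benefit over full auto-detection is that a spuriously high-scoring language outside the expected set can never be chosen, which is exactly the failure mode described above for code-switched audio.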

Benefits

  • Significantly improved transcription accuracy for multilingual audio.
  • Wider applicability in diverse real-world use cases.
  • Reduced need for external pre-processing steps.

I believe these approaches are technically feasible.
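To make the segmentation step in approach 1 concrete, here is a toy energy-based VAD sketch in Python. A real implementation would use a proper VAD model rather than an amplitude threshold; this only shows how speech chunks could be cut out before running per-segment language detection. All names and parameters here are illustrative assumptions, not existing whisper.cpp code.

```python
def split_speech_segments(samples, sample_rate, frame_ms=30, energy_thresh=0.01):
    """Split raw mono PCM (floats in [-1, 1]) into speech segments.

    Returns (start_sample, end_sample) pairs covering runs of frames whose
    mean absolute amplitude exceeds energy_thresh. A toy stand-in for a
    real VAD; boundaries snap to frame edges.
    """
    frame_len = sample_rate * frame_ms // 1000
    segments, start = [], None
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        is_speech = sum(abs(s) for s in frame) / len(frame) > energy_thresh
        if is_speech and start is None:
            start = i
        elif not is_speech and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(samples)))
    return segments

# Toy input at a 1 kHz sample rate: 1 s silence, 1 s "speech", 1 s silence.
sr = 1000
audio = [0.0] * sr + [0.5] * sr + [0.0] * sr
print(split_speech_segments(audio, sr))  # [(990, 2010)]
```

Each returned segment would then go through language detection (optionally constrained to the user's language list, as in approach 2) before being transcribed with the chosen language.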

Happy to contribute a PR to implement this feature. I would appreciate any guidance or feedback on the preferred implementation strategy and integration points within the codebase.
