Speech-to-Text (Whisper) and Text-to-Speech: architecture and use cases?

Speech-to-Text (STT / ASR — Automatic Speech Recognition) and Text-to-Speech (TTS) are two important audio modalities in AI apps.

Whisper (OpenAI, 2022), the de facto open STT model:
- Architecture: encoder-decoder Transformer. Audio → log-mel spectrogram → encoder → decoder emits text tokens.
- Multi-task training: 680K hours of weakly supervised audio-text pairs from the internet, part of it multilingual. A single model is trained jointly for transcription, translation into English, language identification, and voice activity detection.
- Multilingual: 99 languages (including Vietnamese).
- Robust: trained on noisy web data, so it handles accents, background noise, and technical jargon well.
- Model sizes: tiny (39M) → base (74M) → small (244M) → medium (769M) → large (1.5B); large-v3 is a later checkpoint at the large size. Accuracy vs. speed trade-off.
- Limitations: hallucinates during long silences, struggles with heavy accents and overlapping speakers, and is not natively realtime (it processes audio in 30-second windows).
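
The 30-second window mentioned above can be made concrete with a small sketch. It assumes 16 kHz mono input and a 10 ms hop between spectrogram frames, which are the values Whisper uses; the function names `split_into_chunks` and `frames_per_chunk` are hypothetical, and the fixed cuts are a simplification (the reference implementation moves its window using predicted timestamps rather than cutting blindly).

```python
from typing import List

SAMPLE_RATE = 16_000   # Whisper resamples input audio to 16 kHz
CHUNK_SECONDS = 30     # fixed window length the model sees
HOP_LENGTH = 160       # 10 ms hop between spectrogram frames
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS  # 480_000 samples


def split_into_chunks(samples: List[float]) -> List[List[float]]:
    """Split mono audio into 30-second windows, zero-padding the last one."""
    chunks = []
    for start in range(0, len(samples), CHUNK_SAMPLES):
        chunk = samples[start:start + CHUNK_SAMPLES]
        # Pad the final, shorter window with silence up to the full length.
        chunk = chunk + [0.0] * (CHUNK_SAMPLES - len(chunk))
        chunks.append(chunk)
    return chunks


def frames_per_chunk() -> int:
    """Number of log-mel frames produced for one 30-second window."""
    return CHUNK_SAMPLES // HOP_LENGTH
```

With these constants each window yields 3000 spectrogram frames, and a 45-second recording becomes two windows, the second half-filled with padding. Long padded silences like this are one place where the hallucination problem listed above tends to show up.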

Whisper variants / alternatives:
- faster-whisper (reimplementation on CTranslate2): up to 4x faster at the same accuracy.
- Distil-Whisper: distilled, about 6x faster with a small accuracy cost.
- Whisper-Turbo (large-v3-turbo, OpenAI 2024): large-v3 with the decoder cut from 32 layers to 4; smaller and roughly 8x faster than large.
- AssemblyAI Universal-2: commercial, realtime, diarization.
- Deepgram Nova: realtime, enterprise.
- Azure Speech, Google Speech-to-Text: managed services.
- Gemini: STT built into a multi-modal model.
- wav2vec 2.0 (Meta, self-supervised speech representations), SeamlessM4T (Meta, multilingual multi-modal translation).

Realtime STT (for voice assistants):
- Chunk audio → streaming STT → update partial transcript → finalize.
- Tools: Deepgram, AssemblyAI Streaming, WhisperLive; Pipecat for orchestrating the surrounding pipeline.
- Key metrics: Word Error Rate (WER) and latency (time to first word, time to final transcript).
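
Word Error Rate is the word-level edit distance between the reference transcript and the STT output, divided by the number of reference words: WER = (substitutions + deletions + insertions) / reference words. A minimal sketch follows; the function name `word_error_rate` is hypothetical, and the normalization (lowercasing and whitespace splitting only) is an assumption. Real evaluations usually also strip punctuation and normalize numbers.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

For the reference "turn on the kitchen light" and the hypothesis "turn on kitchen lights" there is one deletion and one substitution over five reference words, so the function returns 0.4. WER can exceed 1.0 when the hypothesis contains many inserted words.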

Text-to-Speech (TTS):

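Modern architectures:
- Autoregressive TTS (Tacotron, VALL-E, Tortoise): generates the acoustic output step by step. Tacotron predicts mel-spectrogram frames that a vocoder turns into a waveform; VALL-E predicts neural codec tokens instead.
- Non-autoregressive (FastSpeech, StyleTTS): generates frames in parallel, so it is faster.
- Diffusion and flow-matching TTS (NaturalSpeech 3, Voicebox): among the highest-quality results reported.
- End-to-end LLM-based (OpenAI gpt-4o-audio, Gemini Live): unified audio I/O.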

Voice cloning — VALL-E, XTTS: 3–10s reference audio → clone the voice for any text. Raises deepfake concerns → needs watermarking + consent.

2025 TTS providers:
- OpenAI TTS — 11 voices, cheap, good quality.
- ElevenLabs — highest quality, voice cloning, multi-lingual.
- Azure Speech / Google TTS — enterprise, many voices.
- PlayHT, Cartesia Sonic — realtime, low latency.
- Coqui XTTS-v2 — open source.
- Kokoro — open, lightweight.

Real-world use cases:

STT:
- Voice assistants: STT → LLM → TTS pipeline.
- Meeting transcription + summary (Otter.ai, Fathom, Fireflies).
- Call-center analytics — transcribe calls, sentiment, compliance checks.
- Accessibility — captioning livestreams.
- Content creation — transcribe podcasts → blog posts.
- Voice search.

TTS:
- Audiobook narration.
- IVR phone systems.
- Accessibility screen readers.
- Language learning pronunciation.
- Game / video narration.
- Voice AI agents (Vapi, Retell, Bland).

Challenges:
- Latency for voice assistants — target < 500ms round-trip (STT + LLM + TTS).
- Interrupt handling — user interrupts mid-TTS → cancel.
- Turn-taking — voice activity detection, endpointing.
- Emotion, prosody — TTS needs to hit context-appropriate tone.
- Noise robustness — STT in noisy environments.
- Privacy — audio contains voice, sensitive PII.
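
Turn-taking in the list above usually comes down to endpointing: deciding that the user has stopped talking. A minimal sketch of the simplest approach is an energy threshold with a hangover period. The function names, the threshold of 0.02, and the 25-frame hangover (500 ms at 20 ms per frame) are illustrative assumptions; production systems typically use a trained VAD model instead of raw energy.

```python
from typing import Iterable, List, Optional


def rms(frame: List[float]) -> float:
    """Root-mean-square energy of one audio frame (samples in [-1, 1])."""
    if not frame:
        return 0.0
    return (sum(x * x for x in frame) / len(frame)) ** 0.5


def find_end_of_turn(
    frames: Iterable[List[float]],
    energy_threshold: float = 0.02,
    hangover_frames: int = 25,
) -> Optional[int]:
    """Return the frame index where the turn is declared over, else None.

    The turn ends once speech has been heard and is followed by
    `hangover_frames` consecutive frames below the energy threshold.
    """
    heard_speech = False
    silent_run = 0
    for index, frame in enumerate(frames):
        if rms(frame) >= energy_threshold:
            heard_speech = True
            silent_run = 0          # any speech resets the silence counter
        elif heard_speech:
            silent_run += 1
            if silent_run >= hangover_frames:
                return index
    return None
```

The hangover length is the main tuning knob: a short value cuts users off during natural pauses, while a long value adds directly to the round-trip latency budget.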

Realtime voice AI agent pipeline:

Mic → VAD (detect speech) → streaming STT → 
     buffer finished utterance → LLM → 
     streaming TTS → speaker output

+ Interrupt handler (user starts speaking → cancel TTS)
+ Context memory (conversation history)
+ End-of-turn detection (model decides when it has enough context to respond)
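
The interrupt handler in this pipeline can be sketched with `asyncio`: playback runs as a task, and a barge-in event from the VAD cancels it. Everything here is a stand-in under stated assumptions: `speak` only simulates streaming TTS with a sleep, and `run_turn`, `demo`, and the timings are hypothetical names and values chosen for illustration.

```python
import asyncio
from typing import List


async def speak(sentences: List[str], played: List[str]) -> None:
    """Stand-in for streaming TTS: 'plays' one sentence at a time."""
    for sentence in sentences:
        await asyncio.sleep(0.1)   # pretend this is audio playback time
        played.append(sentence)


async def run_turn(sentences: List[str], barge_in: asyncio.Event) -> List[str]:
    """Play a reply, cancelling playback as soon as the user starts speaking."""
    played: List[str] = []
    tts_task = asyncio.create_task(speak(sentences, played))
    interrupt_task = asyncio.create_task(barge_in.wait())

    # Whichever finishes first wins: normal completion or a barge-in.
    await asyncio.wait(
        {tts_task, interrupt_task}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in (tts_task, interrupt_task):
        if not task.done():
            task.cancel()
    # Collect both tasks so the cancellation is fully processed.
    await asyncio.gather(tts_task, interrupt_task, return_exceptions=True)
    return played


async def demo() -> List[str]:
    barge_in = asyncio.Event()
    reply = ["Sure.", "The weather today is sunny.", "Tomorrow looks rainy."]
    # Simulate the VAD detecting user speech 150 ms into playback.
    asyncio.get_running_loop().call_later(0.15, barge_in.set)
    return await run_turn(reply, barge_in)
```

Running `asyncio.run(demo())` interrupts playback partway through the reply. The list of sentences actually played is worth keeping, because the conversation history should record what the user heard rather than the full text the LLM generated.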

End-to-end alternatives (emerging 2024–2025):
- GPT-4o Realtime API: audio in, audio out, with no separate STT + LLM + TTS stages. OpenAI reported audio response times averaging about 320 ms for GPT-4o.
- Gemini Live — similar, multi-modal.
- Pros: fewer moving parts, preserves prosody/emotion.
- Cons: less control, black box.
