Speech-to-Text (STT, also called ASR: Automatic Speech Recognition) and Text-to-Speech (TTS) are the two key audio modalities in AI applications.
Whisper (OpenAI, 2022), the de facto open STT model:
- Architecture: encoder-decoder Transformer. Audio → log-mel spectrogram → encoder → decoder generates text tokens.
- Multi-task training: 680K hours of multilingual audio-text pairs from the internet. Trained jointly for transcription, translation, language identification, and voice activity detection.
- Multilingual: 99 languages (including Vietnamese).
- Robust: trained on noisy web data, so it handles accents, background noise, and technical jargon well.
- Model sizes: tiny (39M) → base (74M) → small (244M) → medium (769M) → large (1.5B; latest revision large-v3). Accuracy vs. speed trade-off: larger models are more accurate but slower.
- Limitations: hallucinates during long silences, makes errors with heavy accents or overlapping speakers, and is not natively realtime (it processes audio in 30-second chunks).
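The 30-second chunking mentioned above can be illustrated with a minimal sketch. It assumes 16 kHz mono audio held as a plain list of floats; the function name `split_into_windows` is made up for illustration. Whisper's actual inference loop slides its window using predicted timestamps rather than cutting at fixed offsets, so treat this only as a picture of why audio longer than 30 s has to be processed piecewise.

```python
from typing import List

SAMPLE_RATE = 16_000      # Whisper operates on 16 kHz mono audio
CHUNK_SECONDS = 30        # fixed window length


def split_into_windows(samples: List[float],
                       sample_rate: int = SAMPLE_RATE,
                       chunk_seconds: int = CHUNK_SECONDS) -> List[List[float]]:
    """Split audio into fixed-length windows, zero-padding the last one."""
    window = sample_rate * chunk_seconds
    chunks = []
    for start in range(0, len(samples), window):
        chunk = samples[start:start + window]
        # Pad the final, shorter chunk with silence so every window has equal length.
        chunk = chunk + [0.0] * (window - len(chunk))
        chunks.append(chunk)
    return chunks


# 70 s of dummy audio -> three 30 s windows (the last one is mostly padding).
audio = [0.1] * (70 * SAMPLE_RATE)
windows = split_into_windows(audio)
print(len(windows), len(windows[0]))
```

Because each window is transcribed on its own, a streaming application has to wait for a window to fill (or pad it early), which is one reason Whisper needs extra tooling to feel realtime.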
Whisper variants / alternatives:
- faster-whisper (CTranslate2): up to 4x faster at the same accuracy.
- Distil-Whisper — distilled, 6x faster.
- Whisper Turbo (large-v3-turbo, OpenAI 2024): smaller than large (about 0.8B vs. 1.5B parameters) and roughly 8x faster.
- AssemblyAI Universal-2 — commercial, realtime, diarization.
- Deepgram Nova — realtime, enterprise.
- Azure Speech, Google Speech-to-Text — managed.
- Gemini: STT is built into its multimodal input.
- wav2vec 2.0 (Meta), SeamlessM4T (Meta multi-modal translation).
Realtime STT (for voice assistants):
- Chunk audio → streaming STT → partial transcript update → finalize.
- Tools: Deepgram, AssemblyAI Streaming, Whisper-Live, Pipecat.
- Key metrics: Word Error Rate (WER) and latency (time to first word, time to final transcript).
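WER is the word-level edit distance between the reference transcript and the STT output, divided by the number of reference words: (substitutions + deletions + insertions) / N. A minimal sketch follows; the function names are illustrative, and the normalization (lowercasing and whitespace splitting only) is an assumption, since production evaluations usually also strip punctuation and normalize numbers.

```python
from typing import List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions, deletions, insertions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion (reference word missing)
                            curr[j - 1] + 1,       # insertion (extra hypothesis word)
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate of a hypothesis against a reference transcript."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


# One substitution (sat -> sit) and one deletion (the) over 6 reference words.
print(wer("the cat sat on the mat", "the cat sit on mat"))
```

Note that WER can exceed 1.0 when the hypothesis contains many inserted words, so it is an error rate rather than a true percentage.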
Text-to-Speech (TTS):
Modern architectures:
- Autoregressive TTS (Tacotron, VALL-E, Tortoise): text → acoustic frames generated one at a time (mel-spectrogram frames in Tacotron, neural codec tokens in VALL-E) → vocoder or codec decoder.
- Non-autoregressive (FastSpeech, StyleTTS): generates frames in parallel, so it is faster.
- Diffusion-based TTS (NaturalSpeech 3; Voicebox uses the closely related flow matching): among the highest-quality approaches.
- End-to-end LLM-based — OpenAI gpt-4o-audio, Gemini Live — unified audio I/O.
Voice cloning (VALL-E, XTTS): a 3-10 s reference audio clip is enough to clone a voice for arbitrary text. This raises deepfake concerns, so watermarking and consent are needed.
TTS providers 2025:
- OpenAI TTS: 11 voices, inexpensive, good quality.
- ElevenLabs: top-tier quality, voice cloning, multilingual.
- Azure Speech / Google TTS: enterprise, many voices.
- PlayHT, Cartesia Sonic — realtime, low latency.
- Coqui XTTS-v2 — open source.
- Kokoro — open, lightweight.
Practical use cases:
STT:
- Voice assistant: STT → LLM → TTS pipeline.
- Meeting transcription + summary (Otter.ai, Fathom, Fireflies).
- Call center analytics — transcribe call, sentiment, compliance check.
- Accessibility — caption livestream.
- Content creation — podcast transcribe → blog.
- Voice search.
TTS:
- Audiobook narration.
- IVR phone system.
- Screen reader accessibility.
- Language learning pronunciation.
- Game / video narration.
- Voice AI agent (Vapi, Retell, Bland).
Challenges:
- Latency for voice assistants: target < 500 ms round trip (STT + LLM + TTS).
- Interrupt handling: when the user interrupts mid-TTS, cancel playback.
- Turn-taking — voice activity detection, endpointing.
- Emotion and prosody: TTS should speak with a tone that fits the context.
- Noise robustness: STT in noisy environments.
- Privacy: audio contains voices and sensitive PII.
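Interrupt handling (often called barge-in) is easiest to see in code. The sketch below uses only `asyncio`: `speak` is a hypothetical stand-in for streaming TTS playback, and the `user_started_speaking` event is assumed to be set by a VAD component that is not shown here. It illustrates the cancellation pattern under those assumptions and is not a production audio loop.

```python
import asyncio


async def speak(text: str, spoken: list) -> None:
    """Stand-in for streaming TTS playback: 'plays' one word every 50 ms."""
    for word in text.split():
        spoken.append(word)
        await asyncio.sleep(0.05)


async def respond_with_barge_in(text: str,
                                user_started_speaking: asyncio.Event) -> list:
    """Play a response, but cancel it as soon as the user starts talking."""
    spoken: list = []
    tts_task = asyncio.create_task(speak(text, spoken))
    interrupt_task = asyncio.create_task(user_started_speaking.wait())
    done, _ = await asyncio.wait({tts_task, interrupt_task},
                                 return_when=asyncio.FIRST_COMPLETED)
    if interrupt_task in done and not tts_task.done():
        tts_task.cancel()          # user barged in: stop playback immediately
    else:
        interrupt_task.cancel()    # response finished without interruption
    # Wait for both tasks to settle; cancellations are returned, not raised.
    await asyncio.gather(tts_task, interrupt_task, return_exceptions=True)
    return spoken


async def demo() -> list:
    user_started_speaking = asyncio.Event()
    # Simulate the (hypothetical) VAD firing 120 ms into playback.
    asyncio.get_running_loop().call_later(0.12, user_started_speaking.set)
    return await respond_with_barge_in(
        "here is a long answer that the user will interrupt",
        user_started_speaking)


spoken_words = asyncio.run(demo())
print(spoken_words)  # only the first few words are played before the cancel
```

In a real agent the same cancellation would also need to flush any audio already queued on the output device and abort the in-flight LLM generation, otherwise the user still hears a tail of stale speech.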
Realtime voice AI agent pipeline:
Mic → VAD (detect speech) → STT streaming →
buffer a complete sentence → LLM →
TTS streaming → Speaker output
+ Interrupt handler (user starts speaking → cancel TTS)
+ Context memory (conversation history)
+ End-of-turn detection (the model decides when it has enough context to respond)
End-to-end alternatives (emerging 2024-2025):
- GPT-4o Realtime API: audio in, audio out, with no need to split into STT + LLM + TTS; reported latency around 300 ms.
- Gemini Live: similar, multimodal.
- Advantage: ít moving parts, preserve prosody/emotion.
- Disadvantage: less control, black box.
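For contrast, the control that a cascaded STT → LLM → TTS pipeline retains shows up in steps such as sentence buffering, where the application decides exactly what text reaches the LLM. Here is a minimal sketch, assuming the STT stream delivers finalized word fragments with punctuation (real streaming APIs often send revisable partial transcripts instead). `fake_llm` and `fake_tts` are hypothetical placeholders, not real provider calls.

```python
from typing import Iterable, Iterator, List

SENTENCE_END = (".", "?", "!")


def buffer_sentences(fragments: Iterable[str]) -> Iterator[str]:
    """Group streamed transcript fragments into complete sentences."""
    buffer: List[str] = []
    for fragment in fragments:
        buffer.append(fragment)
        if fragment.endswith(SENTENCE_END):
            yield " ".join(buffer)
            buffer = []
    if buffer:  # flush whatever is left when the turn ends
        yield " ".join(buffer)


def fake_llm(sentence: str) -> str:
    """Hypothetical stand-in for an LLM call."""
    return f"You said: {sentence}"


def fake_tts(text: str) -> bytes:
    """Hypothetical stand-in for a TTS call; returns placeholder 'audio' bytes."""
    return text.encode("utf-8")


stt_fragments = ["what", "is", "the", "weather?", "and", "the", "time?"]
replies = [fake_llm(s) for s in buffer_sentences(stt_fragments)]
audio_chunks = [fake_tts(r) for r in replies]
print(replies)
```

Each stage here can be logged, filtered, or swapped independently, which is the control an end-to-end audio model gives up in exchange for fewer moving parts.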