Opus Audio Codec: Next-Gen Standard | WebRTC, Low Latency & FFmpeg
Key takeaways
Opus: low latency, speech/music modes, WebRTC integration, and FFmpeg—next-gen royalty-friendly audio in one read.
Introduction
Opus (IETF RFC 6716) combines speech coding (SILK lineage) and music coding (CELT lineage) in one codec. Bitrates of roughly 6–510 kbps, frame durations adjustable from 2.5 ms upward, and low algorithmic delay make it a strong fit for video calls, voice chat, and games where latency budgets are tight. It is widely known as royalty-friendly for browsers, mobile, and servers.
This article explains why Opus is the default WebRTC audio codec, what mode switching means, and how to encode with FFmpeg. It also contrasts Opus with AAC and MP3 from a product perspective.
After reading this post, you should be able to:
- Outline Opus history (SILK + CELT) and the hybrid design
- Pick bitrate and frame size for speech vs music
- Encode Opus in Ogg/WebM with FFmpeg
- Contrast WebRTC, browsers, and VoIP with AAC/MP3 roles
Table of contents
- Codec overview
- How compression works
- Practical encoding
- Performance comparison
- Real-world use cases
- Optimization tips
- Common problems
- Wrap-up
Codec overview
History and background
Opus merges Xiph.org's CELT with Skype's SILK in a hybrid design standardized by the IETF codec working group, giving one codec instead of separate speech and music tools. It was published as RFC 6716 in 2012, and adoption grew with WebRTC.
Technical characteristics
| Topic | Description |
|---|---|
| Compression | Speech: LPC-style; music: MDCT (CELT); internal mode switching |
| Sample rate | 8–48 kHz internally (8, 12, 16, 24, 48 kHz); 16 / 48 kHz common in VoIP |
| Bitrate | ~6–510 kbps; speech works at tens of kbps |
| Latency | 2.5–60 ms frame sizes (20 ms is a common default); tunable for low delay |
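Frame duration maps directly to samples per frame and, at a given bitrate, to payload bytes per packet. The Python sketch below works through that arithmetic. The function names are illustrative, and the byte counts are averages that assume a constant bitrate and ignore RTP/UDP/IP headers.

```python
def samples_per_frame(frame_ms: float, sample_rate_hz: int = 48000) -> int:
    """PCM samples per channel contained in one frame."""
    return round(sample_rate_hz * frame_ms / 1000)


def payload_bytes_per_frame(bitrate_bps: int, frame_ms: float) -> float:
    """Average encoded bytes per frame, assuming a constant bitrate."""
    return bitrate_bps * frame_ms / 1000 / 8
```

For example, `samples_per_frame(20)` is 960 samples at 48 kHz, and `payload_bytes_per_frame(32000, 20)` is 80 bytes. Halving the frame duration halves the payload but doubles the packet rate, so per-packet header overhead grows in relative terms.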
Modes: speech vs music
- Speech-heavy, low bitrate: the SILK path covers narrowband through wideband, and the hybrid mode extends speech to super-wideband and fullband; bits follow speech statistics.
- Music / fullband: CELT dominates; stereo music is commonly encoded at 96–128 kbps or more, and quality degrades quickly at much lower rates.
In realtime stacks, packet loss concealment (PLC) and jitter buffers sit alongside the codec, so the codec is only one part of the full WebRTC audio path.
How compression works
Psychoacoustic model
CELT paths use masking like other perceptual codecs. SILK emphasizes speech production and puts bits on important formants.
MDCT
The CELT path uses MDCT with short frames and low-delay windows—critical for conversational quality.
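To make the transform concrete, here is a textbook MDCT with a sine window and overlap-add, written in plain Python. It is a sketch of the general technique, not CELT's actual transform: CELT uses its own low-overlap window and a fast algorithm, while this version is a direct O(N²) sum. The function names and the 2/N inverse scaling are choices made for this example.

```python
import math


def sine_window(n_half: int) -> list[float]:
    """Sine window of length 2N; satisfies w[n]^2 + w[n+N]^2 = 1."""
    length = 2 * n_half
    return [math.sin(math.pi * (n + 0.5) / length) for n in range(length)]


def mdct(frame: list[float]) -> list[float]:
    """Forward MDCT: 2N time samples -> N coefficients."""
    n_half = len(frame) // 2
    return [
        sum(
            frame[n]
            * math.cos(math.pi / n_half * (n + 0.5 + n_half / 2) * (k + 0.5))
            for n in range(2 * n_half)
        )
        for k in range(n_half)
    ]


def imdct(coeffs: list[float]) -> list[float]:
    """Inverse MDCT: N coefficients -> 2N time-aliased samples (2/N scaling)."""
    n_half = len(coeffs)
    return [
        (2.0 / n_half)
        * sum(
            coeffs[k]
            * math.cos(math.pi / n_half * (n + 0.5 + n_half / 2) * (k + 0.5))
            for k in range(n_half)
        )
        for n in range(2 * n_half)
    ]


def analyze_synthesize(signal: list[float], n_half: int) -> list[float]:
    """Window, transform, invert, and overlap-add with a hop of N samples."""
    if len(signal) % n_half != 0:
        raise ValueError("signal length must be a multiple of n_half")
    window = sine_window(n_half)
    # Pad N zeros on each side so every real sample is covered by two frames.
    padded = [0.0] * n_half + list(signal) + [0.0] * n_half
    out = [0.0] * len(padded)
    for start in range(0, len(padded) - n_half, n_half):
        frame = [padded[start + n] * window[n] for n in range(2 * n_half)]
        rebuilt = imdct(mdct(frame))
        for n in range(2 * n_half):
            out[start + n] += rebuilt[n] * window[n]
    return out[n_half:n_half + len(signal)]
```

Each frame of 2N samples yields only N coefficients, yet overlapping frames by 50% cancels the time-domain aliasing, so `analyze_synthesize(signal, 8)` returns the input up to floating-point error. A real codec quantizes the coefficients between the two transforms, which is where the loss comes from.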
Bit allocation
Opus detects content and adjusts mode and bit split on its own. Users usually set bitrate, frame duration, and channel count directly, and influence the mode choice only indirectly through those settings.
Pipeline (conceptual)
flowchart TB
  IN["PCM input"]
  DET["Speech vs music path"]
  SILK["SILK-style processing"]
  CELT["CELT (MDCT) processing"]
  MIX["Bitstream pack"]
  OUT["Opus frame"]
  IN --> DET
  DET --> SILK
  DET --> CELT
  SILK --> MIX
  CELT --> MIX
  MIX --> OUT
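The path the encoder picked is visible in the bitstream. Per RFC 6716 (section 3.1), every Opus packet starts with a TOC byte whose top five bits select one of 32 configurations (mode, audio bandwidth, frame duration), followed by a stereo flag and a two-bit frame-count code. The sketch below decodes that byte; the function name `parse_toc` and the returned dictionary layout are made up for this example.

```python
def parse_toc(toc: int) -> dict:
    """Decode the first byte (TOC) of an Opus packet per RFC 6716 section 3.1."""
    config = (toc >> 3) & 0x1F       # top 5 bits: mode/bandwidth/duration
    stereo = bool((toc >> 2) & 0x1)  # 1 bit: 0 = mono, 1 = stereo
    frame_count_code = toc & 0x3     # 0: one frame, 1-2: two frames, 3: arbitrary

    if config < 12:                  # configs 0..11: SILK-only
        mode = "SILK"
        bandwidth = ("NB", "MB", "WB")[config // 4]
        frame_ms = (10, 20, 40, 60)[config % 4]
    elif config < 16:                # configs 12..15: Hybrid
        mode = "Hybrid"
        bandwidth = ("SWB", "FB")[(config - 12) // 2]
        frame_ms = (10, 20)[(config - 12) % 2]
    else:                            # configs 16..31: CELT-only
        mode = "CELT"
        bandwidth = ("NB", "WB", "SWB", "FB")[(config - 16) // 4]
        frame_ms = (2.5, 5, 10, 20)[(config - 16) % 4]

    return {
        "mode": mode,
        "bandwidth": bandwidth,
        "frame_ms": frame_ms,
        "stereo": stereo,
        "frame_count_code": frame_count_code,
    }
```

For instance, `parse_toc(0xFC)` reports a CELT-only, fullband, 20 ms stereo packet, while `parse_toc(0x78)` reports a hybrid, fullband, 20 ms mono packet. Logging this field per packet is a cheap way to see how often a stream switches modes.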
Practical encoding
Ogg Opus: mono voice/podcast ~32 kbps
ffmpeg -i input.wav -c:a libopus -b:a 32k -ac 1 -ar 48000 voice.opus
Stereo music 128 kbps (common starting point)
ffmpeg -i input.wav -c:a libopus -b:a 128k -ac 2 -ar 48000 music.opus
Higher music quality 160–192 kbps
ffmpeg -i input.wav -c:a libopus -b:a 192k -ac 2 -ar 48000 music_hq.opus
WebM (with video passthrough)
ffmpeg -i input.mkv -c:v copy -c:a libopus -b:a 96k -ac 2 output.webm
Parameter guide
- -frame_duration: When supported, shorter frames help ultra-low latency; see ffmpeg -h encoder=libopus.
- VBR: Often default; some networks need near-CBR.
- 48 kHz aligns with WebRTC conventions for voice.
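When these parameters are set from a script, it helps to build the FFmpeg argument list in one place and validate it before running anything. The sketch below does that in Python. It assumes the libopus encoder options `-application`, `-frame_duration`, and `-vbr` as documented by FFmpeg; confirm that your build exposes them with `ffmpeg -h encoder=libopus`. The function name and defaults are illustrative.

```python
import shlex

# Frame durations (ms) accepted here; an assumption matching common libopus builds.
ALLOWED_FRAME_MS = (2.5, 5, 10, 20, 40, 60)


def build_opus_command(
    src: str,
    dst: str,
    bitrate: str = "32k",
    channels: int = 1,
    application: str = "voip",   # voip, audio, or lowdelay
    frame_duration: float = 20,
    vbr: str = "on",             # off, on, or constrained
) -> list[str]:
    """Return an ffmpeg argument list for a libopus encode; nothing is executed."""
    if frame_duration not in ALLOWED_FRAME_MS:
        raise ValueError(f"unsupported frame duration: {frame_duration}")
    if application not in ("voip", "audio", "lowdelay"):
        raise ValueError(f"unknown application: {application}")
    if vbr not in ("off", "on", "constrained"):
        raise ValueError(f"unknown vbr mode: {vbr}")
    return [
        "ffmpeg", "-i", src,
        "-c:a", "libopus",
        "-b:a", bitrate,
        "-ac", str(channels),
        "-ar", "48000",
        "-application", application,
        "-frame_duration", str(frame_duration),
        "-vbr", vbr,
        dst,
    ]
```

`shlex.join(build_opus_command("input.wav", "voice.opus"))` gives a shell-safe string for logging, and the same list can be passed to `subprocess.run`. For a near-CBR link, `vbr="constrained"` or `vbr="off"` are the settings to try first.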
Quality vs size
Speech-only can work at 12–40 kbps; music below ~96 kbps often shows obvious artifacts. For archival music, start at 128 kbps stereo and ABX from there.
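The size side of the trade-off is simple arithmetic: average bitrate times duration. A small Python helper makes the comparison explicit. The function name is illustrative, and the result ignores container overhead and VBR variation, so treat it as an estimate.

```python
def estimated_size_mb(bitrate_kbps: float, duration_s: float) -> float:
    """Approximate audio payload size in megabytes (10^6 bytes)."""
    bits = bitrate_kbps * 1000 * duration_s
    return bits / 8 / 1_000_000
```

A one-hour mono podcast at 32 kbps comes to about 14.4 MB (`estimated_size_mb(32, 3600)`), while a four-minute stereo track at 128 kbps is about 3.84 MB (`estimated_size_mb(128, 240)`).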
Performance comparison
Compression
- Low-bitrate speech: Opus is very strong vs legacy speech codecs.
- Music file distribution: AAC vs Vorbis vs Opus depends on bitrate, encoder, and taste—Opus’s edge is often realtime + low delay.
Speed
libopus is fast on embedded and mobile; WebRTC stacks add DSP optimizations.
MOS / quality metrics
Voice: PESQ/POLQA; music: MUSHRA-style tests. Production needs E2E tests including jitter and loss.
Real-world use cases
Streaming
On-demand music still often uses AAC. Discord-style voice and game chat center on Opus.
Mobile
WebRTC video calls use Opus for the audio track by default. Upload-only recording may still favor AAC or FLAC.
VoIP and WebRTC
Opus is effectively mandatory in modern WebRTC—look for opus/48000/2 in SDP. Packet time, FEC, and NACK shape perceived quality.
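Checking an SDP blob for the Opus entry is easy to automate. The Python sketch below finds the `opus/48000/2` rtpmap line and collects the matching `a=fmtp` parameters. The sample SDP, including payload type 111 and the `minptime` and `useinbandfec` values, is an illustrative fragment rather than output from a specific browser, and `find_opus` is a hypothetical helper name.

```python
import re

# Illustrative SDP fragment; real offers contain many more lines.
SAMPLE_SDP = (
    "m=audio 9 UDP/TLS/RTP/SAVPF 111 0\r\n"
    "a=rtpmap:111 opus/48000/2\r\n"
    "a=fmtp:111 minptime=10;useinbandfec=1\r\n"
    "a=rtpmap:0 PCMU/8000\r\n"
)


def find_opus(sdp: str):
    """Return Opus payload type and fmtp parameters, or None if Opus is absent."""
    rtpmap = re.search(
        r"^a=rtpmap:(\d+) opus/48000/2\s*$", sdp, re.MULTILINE | re.IGNORECASE
    )
    if rtpmap is None:
        return None
    payload_type = int(rtpmap.group(1))

    params = {}
    fmtp = re.search(rf"^a=fmtp:{payload_type} (.+?)\s*$", sdp, re.MULTILINE)
    if fmtp:
        # Parameters are separated by ';' and written as key=value.
        for item in fmtp.group(1).split(";"):
            key, _, value = item.strip().partition("=")
            if key:
                params[key] = value
    return {"payload_type": payload_type, "fmtp": params}
```

`find_opus(SAMPLE_SDP)` returns payload type 111 with `useinbandfec` set to `"1"`, which is a quick way to confirm in tests that in-band FEC was actually negotiated.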
Browsers
Modern browsers support WebRTC Opus. For file playback, check Ogg Opus and WebM support (see MDN).
Optimization tips
Smaller files without trashing quality
- Speech-only: try mono + 24–48 kbps.
- Music: cutting bitrate too far destroys stereo imaging—guard 128 kbps with ABX.
Faster encoding
Batch jobs gain more from parallel files than micro-tuning presets:
parallel ffmpeg -y -i {} -c:a libopus -b:a 128k -ac 2 {.}.opus ::: *.wav
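Where GNU parallel is not available, the same per-file parallelism can be sketched in Python with a thread pool that launches one ffmpeg process per input. This is a minimal version under stated assumptions: `ffmpeg` with libopus is on the PATH, outputs are written next to the inputs, and the helper names and the default of four workers are arbitrary choices.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def opus_job(src: Path, bitrate: str = "128k") -> list[str]:
    """ffmpeg argument list for one file; the .opus output sits next to the input."""
    return [
        "ffmpeg", "-y", "-i", str(src),
        "-c:a", "libopus", "-b:a", bitrate, "-ac", "2",
        str(src.with_suffix(".opus")),
    ]


def encode_all(sources: list[Path], workers: int = 4) -> list[int]:
    """Run one ffmpeg process per file, `workers` at a time; return exit codes."""
    def run(src: Path) -> int:
        return subprocess.run(opus_job(src), capture_output=True).returncode

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run, sources))
```

A call such as `encode_all(sorted(Path(".").glob("*.wav")))` mirrors the one-liner above. Threads are sufficient here because the heavy work happens in the ffmpeg child processes, not in Python.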
Validation
opusinfo (opus-tools) can verify headers before release.
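For a scripted sanity check without opus-tools, the identification header can be read directly. Per RFC 7845, an Ogg Opus file begins with an Ogg page whose first packet starts with the magic string `OpusHead`. The Python sketch below parses that packet from the first bytes of a file. It does not verify the Ogg page CRC or later pages, so it complements `opusinfo` and does not replace it; `read_opus_head` is a hypothetical helper name.

```python
import struct


def read_opus_head(data: bytes) -> dict:
    """Parse the OpusHead packet from the first Ogg page of an Ogg Opus file."""
    # Ogg page header: 27 fixed bytes, then one lacing byte per segment.
    if len(data) < 27 or data[:4] != b"OggS":
        raise ValueError("not an Ogg stream")
    segment_count = data[26]
    payload_start = 27 + segment_count

    # OpusHead fixed part: magic(8) version(1) channels(1) pre-skip(2)
    # input sample rate(4) output gain(2, signed) mapping family(1) = 19 bytes.
    head = data[payload_start:payload_start + 19]
    if len(head) < 19 or head[:8] != b"OpusHead":
        raise ValueError("first packet is not OpusHead")
    _, version, channels, pre_skip, input_rate, gain, mapping = struct.unpack(
        "<8sBBHIhB", head
    )
    return {
        "version": version,
        "channels": channels,
        "pre_skip": pre_skip,
        "input_sample_rate": input_rate,
        "output_gain": gain,
        "mapping_family": mapping,
    }
```

Reading the first 64 bytes of a file is usually enough, for example `read_opus_head(open("voice.opus", "rb").read(64))`. Note that `input_sample_rate` records the rate of the original source material; decoding still typically happens at 48 kHz.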
Common problems
Compatibility
- Older Windows media players may have limited Ogg support, so wide audiences sometimes need MP3/AAC as well.
- Ogg Opus vs Opus in WebM are different container paths—do not confuse them.
Quality
- Do not expect music quality at VoIP bitrates; set a minimum bitrate per use case.
- Heavy mic DSP (noise suppression, AGC) can damage audio before the codec—tune preprocessing.
Licensing
Opus is widely treated as royalty-friendly; follow BSD license attribution and internal policies. Seek legal sign-off when patents are a concern.
Wrap-up
Summary
- Opus is a speech + music hybrid with tunable frame sizes—ideal for low-latency realtime.
- It is the de facto WebRTC audio codec for games and collaboration.
- File distribution works, but split minimum bitrates for speech vs music.
When to choose Opus
- Realtime voice, video, games: Opus first; tune with network and jitter buffers.
- Broad device music streaming: often AAC pipelines; experiment with Opus for new products.
- Voice archives / podcasts: mono low-bitrate Opus can save bandwidth and hosting cost for RSS-delivered episodes.
References
- RFC 6716: Definition of the Opus Audio Codec
- Opus: https://opus-codec.org/
- FFmpeg libopus encoder options: ffmpeg -h encoder=libopus