System Design: YouTube — Video Upload and Transcoding Pipeline
November 22, 2025 · 8 min read
How YouTube converts a single raw video upload into dozens of optimized formats and resolutions — the transcoding pipeline, adaptive streaming, and global delivery.
YouTube receives 500 hours of video every minute. Every uploaded video must be transcoded into multiple resolutions (360p, 720p, 1080p, 4K), multiple codecs (H.264, VP9, AV1), and packaged for adaptive bitrate streaming — all before the video goes live. Here's how that pipeline works.
Upload and Chunked Transfer
Large video files are uploaded in chunks (typically 5-10MB each) using resumable upload protocols. If a connection drops mid-upload, only the missing chunks need to be re-sent. Each chunk is written to a distributed blob store (GCS/S3). Once all chunks arrive, they're reassembled into the raw source file and a TranscodeVideo job is queued.
// Resumable upload: track which chunks arrived
const uploadSession = {
videoId: 'abc123',
totalChunks: 48,
receivedChunks: new Set<number>(),
}
async function receiveChunk(chunkIndex: number, data: Buffer) {
await storage.put(`raw/abc123/chunk_${chunkIndex}`, data)
uploadSession.receivedChunks.add(chunkIndex)
if (uploadSession.receivedChunks.size === uploadSession.totalChunks) {
await assembleAndQueue(uploadSession.videoId)
}
}

Transcoding: Parallelized by Resolution
Transcoding is CPU-intensive and the slowest step. The key insight: each output segment and resolution is independent. YouTube splits the raw video into segments (2-10 second GOPs — Groups of Pictures), then transcodes each segment in parallel across a fleet of worker machines. A 1-hour video split into 2-second segments yields 1800 segments, each transcoded at every target resolution in parallel.
- 360p (640×360): low bandwidth, mobile data / poor connections
- 480p (854×480): standard quality
- 720p (1280×720): HD — most common default
- 1080p (1920×1080): Full HD
- 1440p / 4K: for premium content and large screens
- Each resolution encoded in H.264 (compatibility) and VP9/AV1 (better compression for modern browsers)
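The segment × resolution fan-out described above can be sketched as a job planner — one independent transcode job per (segment, resolution) pair. The types and function names here are illustrative, not YouTube's actual API:

```typescript
type Resolution = '360p' | '480p' | '720p' | '1080p'

interface TranscodeJob {
  videoId: string
  segmentIndex: number
  resolution: Resolution
}

// Plan one job per (segment, resolution) pair. Segment boundaries align
// with GOP keyframes, so every job can be transcoded independently.
function planJobs(
  videoId: string,
  durationSec: number,
  segmentSec: number,
  resolutions: Resolution[],
): TranscodeJob[] {
  const segments = Math.ceil(durationSec / segmentSec)
  const jobs: TranscodeJob[] = []
  for (let i = 0; i < segments; i++) {
    for (const resolution of resolutions) {
      jobs.push({ videoId, segmentIndex: i, resolution })
    }
  }
  return jobs
}

// A 1-hour video at 2s segments → 1800 segments × 4 resolutions = 7200 jobs
const jobs = planJobs('abc123', 3600, 2, ['360p', '480p', '720p', '1080p'])
```

Each job would then be pulled by a worker that runs an encoder (e.g. ffmpeg) against the raw segment in blob storage; the planner only enqueues work.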
Adaptive Bitrate Streaming (DASH / HLS)
The output isn't a single video file — it's a manifest file (MPD for DASH, .m3u8 for HLS) that lists all available quality levels and their segment URLs. The player downloads the manifest, monitors available bandwidth every few seconds, and requests segments at the highest quality that fits. This is why YouTube seamlessly switches from 1080p to 480p when your WiFi degrades — it just switches which segment URL it requests next.
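The client-side switching decision is essentially a filter over the manifest's representations: pick the highest bitrate that fits measured throughput, with some headroom so a small dip doesn't stall playback. A minimal sketch, assuming the player has already parsed the manifest into a list (names and the 0.8 headroom factor are hypothetical):

```typescript
interface Representation {
  id: string
  bandwidth: number // bits per second, as declared in the manifest
}

// Pick the highest-bandwidth representation that fits the measured
// throughput, keeping headroom so a brief dip doesn't cause a rebuffer.
function pickRepresentation(
  reps: Representation[],
  measuredBps: number,
  headroom = 0.8,
): Representation {
  const usable = measuredBps * headroom
  const sorted = [...reps].sort((a, b) => a.bandwidth - b.bandwidth)
  let choice = sorted[0] // worst case: fall back to the lowest quality
  for (const r of sorted) {
    if (r.bandwidth <= usable) choice = r
  }
  return choice
}

const reps = [
  { id: '360p', bandwidth: 500_000 },
  { id: '720p', bandwidth: 2_500_000 },
  { id: '1080p', bandwidth: 8_000_000 },
]
```

With 4 Mbps measured, usable throughput is 3.2 Mbps, so the player requests 720p segments; if throughput drops below 625 kbps it falls back to 360p, the lowest available rung.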
# DASH manifest (simplified)
# Player reads this and picks quality based on bandwidth
<AdaptationSet mimeType="video/mp4">
<Representation id="360p" bandwidth="500000" width="640" height="360" />
<Representation id="720p" bandwidth="2500000" width="1280" height="720" />
<Representation id="1080p" bandwidth="8000000" width="1920" height="1080" />
</AdaptationSet>
# Each Representation has segment URLs:
# /segments/abc123/720p/seg_001.m4s
# /segments/abc123/720p/seg_002.m4s ...

Thumbnail and Metadata Extraction
In parallel with transcoding, a separate pipeline extracts: thumbnail candidates at regular intervals (user picks or auto-selected by a visual quality model), speech-to-text for auto-captions (Whisper-class model), chapter detection, content moderation flags, and video fingerprint (Content ID matching for copyright). All of these run as independent workers consuming from the same job queue.
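The "independent workers, same queue" structure can be sketched as one handler per task type, so a slow captioning job never blocks thumbnail extraction. All names here are hypothetical; handler bodies are stubs:

```typescript
type TaskType = 'thumbnails' | 'captions' | 'chapters' | 'moderation' | 'fingerprint'

interface MetadataTask {
  videoId: string
  type: TaskType
}

// One handler per task type; each runs as an independent consumer, so a
// failure or slowdown in one pipeline is isolated from the others.
const handlers: Record<TaskType, (videoId: string) => Promise<void>> = {
  thumbnails: async (_v) => { /* extract candidate frames at fixed intervals */ },
  captions: async (_v) => { /* run speech-to-text on the audio track */ },
  chapters: async (_v) => { /* detect scene/topic boundaries */ },
  moderation: async (_v) => { /* flag policy-violating content */ },
  fingerprint: async (_v) => { /* compute a Content ID fingerprint */ },
}

// Fan out one task per pipeline when a new upload lands.
function fanOut(videoId: string): MetadataTask[] {
  return (Object.keys(handlers) as TaskType[]).map((type) => ({ videoId, type }))
}
```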
Global CDN Delivery
- Transcoded segments are stored in GCS and replicated to CDN edge nodes globally
- Popular videos are pre-warmed to edge nodes nearest to high-traffic regions
- Cache key: segment URL includes resolution and segment index — highly cacheable
- First few seconds of a video are prioritized for transcoding to minimize playback start delay
- Cold-start optimization: 360p becomes available first (~30s after upload), higher resolutions follow
- Live streaming uses a shorter segment duration (2s vs 10s) to reduce latency at the cost of more manifest fetches
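The cold-start ordering in the last two bullets amounts to a priority function over transcode jobs: lower resolutions and earlier segments first, so 360p playback can begin while the rest of the ladder is still encoding. A sketch with hypothetical weights:

```typescript
// Lower score = transcode sooner. Resolution dominates the score, so all
// 360p segments finish before any 1080p segment starts; within a
// resolution, earlier segments (the start of the video) come first.
const resolutionRank: Record<string, number> = {
  '360p': 0,
  '480p': 1,
  '720p': 2,
  '1080p': 3,
}

function priority(segmentIndex: number, resolution: string): number {
  return resolutionRank[resolution] * 100_000 + segmentIndex
}

const queue = [
  { seg: 0, res: '1080p' },
  { seg: 1, res: '360p' },
  { seg: 0, res: '360p' },
]
queue.sort((a, b) => priority(a.seg, a.res) - priority(b.seg, b.res))
// Order after sorting: 360p seg 0, 360p seg 1, 1080p seg 0
```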