Introduction
The previous article looked at the basics of GStreamer and its Rust bindings, showing a simple pipeline that displayed a test video on the screen. In real-world scenarios, media streams require processing before reaching the sink element — the final destination in a pipeline. This processing relies on two key steps: decoding and encoding. Decoding extracts raw data from a source file, while encoding prepares it for storage or delivery. These steps let us manipulate the raw stream and package it into a new file or format for users.
Understanding foundational concepts like container formats, raw files, and encoding is essential for building effective pipelines. This knowledge helps us select the right GStreamer elements and resolve pipeline issues.
What we'll cover
We'll briefly introduce container formats and raw files, with a deeper look at encoding. Knowing how encoding works clarifies both the start (decoding a source) and end (encoding for output) of a pipeline.
In essence, encoding compresses raw media into a smaller format using codecs — special algorithms like H.264 for video or MP3 for audio — to balance file size and quality. This process is typically lossy, meaning some data is discarded, though the loss is subtle enough not to disrupt the viewing experience. Decoding reverses this, restoring the media for playback. The art of encoding is to find a good balance between file size and perceptual quality.
Since encoding and related topics can be complex, we'll keep this article concise by focusing on the basics of container formats, raw files, and encoding. GStreamer simplifies all this by offering elements to manage containers, decoding, and encoding — details we'll dive into next time.
Container
A container format is a file format that organizes multiple data streams — like audio, video, subtitles, and metadata — into a single file. Think of it as a digital package, or a Rust struct that holds different fields along with information about how they're stored and synchronized. Muxing combines these streams into the container, while demuxing separates them during playback, allowing a media player to decode and render each part correctly.
Popular container formats include:
- MP4 (MPEG-4 Part 14): A versatile format widely used for online distribution and supported by most devices and browsers.
- WebM: An open-source container optimized for the web.
- MKV (Matroska): Known for flexibility, supporting multiple streams and high-quality playback.
Unlike codecs (e.g. H.264 or AAC), which compress and decompress data, containers focus on structure. For example, an MP4 might contain H.264 video and AAC audio, with quality and file size determined by those codecs, not the container itself. In tools like GStreamer, a demuxer splits these streams for processing — something you could control with Rust code using gstreamer-rs.
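To make that concrete, here's a minimal gstreamer-rs sketch that demuxes an MP4 file. It assumes a local file named input.mp4 (a placeholder, not anything from this series) and a recent gstreamer-rs release — the parse helper lives in gst::parse in newer versions and was called gst::parse_launch in older ones. filesrc, qtdemux, queue, and fakesink are standard GStreamer elements; the demuxed video stream is simply discarded here rather than decoded.

```rust
// A minimal sketch: demux an MP4 and drop its video stream into fakesink.
// Assumes a local file named input.mp4.
use gstreamer as gst;
use gst::prelude::*;

fn main() {
    gst::init().expect("failed to initialize GStreamer");

    // qtdemux splits the MP4 container into its elementary streams;
    // demux.video_0 requests the first demuxed video stream.
    let pipeline = gst::parse::launch(
        "filesrc location=input.mp4 ! qtdemux name=demux \
         demux.video_0 ! queue ! fakesink",
    )
    .expect("failed to build pipeline");

    pipeline
        .set_state(gst::State::Playing)
        .expect("failed to set pipeline to Playing");

    // Wait until the stream ends or an error is reported on the bus.
    let bus = pipeline.bus().expect("pipeline has no bus");
    for msg in bus.iter_timed(gst::ClockTime::NONE) {
        use gst::MessageView;
        match msg.view() {
            MessageView::Eos(..) => break,
            MessageView::Error(err) => {
                eprintln!("Error: {}", err.error());
                break;
            }
            _ => (),
        }
    }

    pipeline
        .set_state(gst::State::Null)
        .expect("failed to set pipeline to Null");
}
```

Swapping fakesink for a decoder and a video sink would turn this into a simple player, which is exactly the kind of element chain we'll build up in the next article.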
Modern browsers like Chrome, Firefox, and Safari support MP4 playback via the HTML5 <video> element, typically with codecs like H.264. WebM is also widely supported for its open-source nature, though codec compatibility can vary across browsers.
Raw files
Raw files are uncompressed media data generated directly by capture devices such as cameras, microphones, or other sensors that record audio or video. Raw files retain the maximum quality and detail of the original signal, which makes them well suited for media processing.
For video, common raw formats include .yuv and .raw, while for audio, uncompressed .wav and .aiff are typical examples. Our focus here remains on encoding and decoding, so we'll only touch on raw files briefly, but we'll explore them further in the context of GStreamer processing in a future article.
Video
A raw video file is characterized by several important factors:
- Color space. The mathematical model that describes how colors are represented. The two most common are RGB (red, green, blue) and YUV (where Y is luminance, and U and V are chrominance).
- Chroma subsampling. A technique that reduces the color information (chroma) relative to the brightness information (luma).
- Bit depth. The number of bits used to represent the color of each pixel component. Common values are 8-bit, 10-bit, and 12-bit.
- Frame rate. The number of individual images displayed each second, measured in frames per second (fps).
- Resolution. The number of pixels in each dimension (width and height) that form a video frame. Common resolutions include 1920×1080 (Full HD), 3840×2160 (4K UHD), and 7680×4320 (8K).
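These factors together determine how much data a raw video stream carries. Here's a back-of-the-envelope calculation in Rust for Full HD, 8-bit video at 30 fps; the parameters are example values, and the 1.5 bytes per pixel figure follows from 4:2:0 chroma subsampling (one full-resolution luma plane plus two quarter-resolution chroma planes).

```rust
// Rough raw-video data-rate estimate: 1920x1080, 8-bit YUV 4:2:0, 30 fps.
fn main() {
    let (width, height) = (1920u64, 1080u64);
    let bytes_per_pixel = 1.5; // 8-bit 4:2:0: 1 luma byte + 0.5 chroma bytes
    let fps = 30.0;

    let frame_bytes = (width * height) as f64 * bytes_per_pixel;
    let bytes_per_second = frame_bytes * fps;

    println!("one frame:  {:.1} MB", frame_bytes / 1e6);                 // ~3.1 MB
    println!("one second: {:.1} MB", bytes_per_second / 1e6);            // ~93 MB
    println!("one minute: {:.1} GB", bytes_per_second * 60.0 / 1e9);     // ~5.6 GB
}
```

Roughly 93 MB for a single second of Full HD footage is exactly why raw video is rarely stored or transmitted as-is.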
Audio
A raw audio file is characterized by several important factors:
- Sample rate. The number of audio samples captured per second, measured in Hertz (Hz) or kilohertz (kHz).
- Bit depth. The number of bits used to represent each audio sample.
- Channels. The number of separate audio streams recorded and played back simultaneously.
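The same kind of estimate works for raw audio. Below is a small sketch for CD-quality PCM (44.1 kHz, 16-bit, stereo); these are conventional example values, not tied to any particular file.

```rust
// Raw-audio data rate for CD-quality PCM: 44.1 kHz, 16-bit, stereo.
fn main() {
    let sample_rate = 44_100.0; // samples per second, per channel
    let bit_depth = 16.0;       // bits per sample
    let channels = 2.0;         // stereo

    let bits_per_second = sample_rate * bit_depth * channels;
    let bytes_per_second = bits_per_second / 8.0;

    println!("{:.1} KB/s", bytes_per_second / 1e3);                   // ~176.4 KB/s
    println!("{:.1} MB per minute", bytes_per_second * 60.0 / 1e6);   // ~10.6 MB
}
```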
Encoding
As we mentioned before, raw files are uncompressed, which makes them extremely large and impractical for tasks like storage or streaming. Encoding is the process of compressing these raw files into smaller, more manageable sizes while preserving acceptable quality. This compression is typically lossy, meaning some data from the original file is discarded, but this trade-off is essential for efficiency in applications like video streaming or archiving.
There are numerous encoding formats available for both audio and video. H.264 is the most commonly used video compression standard today, renowned for its balance of quality and efficiency. For audio, AAC (Advanced Audio Coding) is a widely adopted standard for compressing raw audio files. Beyond these, other standards like H.265 for video or MP3 for audio also play significant roles in modern encoding, offering alternatives depending on specific needs.
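To put rough numbers on this trade-off: uncompressed 1080p video at 30 fps amounts to roughly 93 MB per second (see the earlier calculation), while a typical H.264 encode of the same content might target something around 8 Mbps, or about 1 MB per second — a reduction on the order of 100×. The exact figure depends heavily on the content and the encoder settings.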
To make this easier to grasp, think of encoding like summarizing a book: it keeps the main storyline (the essential data) but skips over minor details to save space. In video encoding, this is achieved by exploiting spatial redundancy, reusing similar data within a single frame. Techniques like motion estimation and compensation exploit temporal redundancy by predicting movement between consecutive frames.
While encoding involves complex algorithms we can't fully explore in one article, this section covers the basic concepts to give you a clear starting point. Understanding these fundamentals might spark ideas for optimizing video streaming quality in your future projects.
In a nutshell, video encoding reduces file sizes by removing redundancy, and the techniques used depend on whether the redundancy is within a frame (spatial) or between frames (temporal).
Motion estimation and compensation
Motion estimation and compensation are methods that predict how objects move between consecutive frames. Imagine you're making a flipbook: each page is a frame of your video.
Motion estimation is like figuring out how much the drawing on one page moved to the next. You look at a part of the drawing (a block of pixels) and find where it moved to on the next page, noting the direction and distance of that movement (the motion vector).
Motion compensation then uses that movement information to predict what the next page should look like. Instead of drawing the whole thing again, you just shift the drawing from the previous page according to the movement you found. Then, you only need to record the small differences between your prediction and the actual drawing on the next page.
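To show what this looks like in code, here's a toy block-matching sketch in Rust. It is not how any real encoder implements motion search (production encoders use much smarter strategies than exhaustive search), and the frame size, block size, and search range are arbitrary values chosen for the example.

```rust
// Toy block-matching motion estimation: for one 8x8 block of the current
// frame, exhaustively search a small window in the previous frame for the
// position with the lowest sum of absolute differences (SAD).

const W: usize = 64; // hypothetical frame width
const H: usize = 64; // hypothetical frame height
const B: usize = 8;  // block size
const R: isize = 4;  // search range in pixels

// SAD between the previous-frame block at (px, py) and the current-frame
// block at (cx, cy); None if the candidate block falls outside the frame.
fn sad(prev: &[u8], cur: &[u8], px: isize, py: isize, cx: usize, cy: usize) -> Option<u64> {
    if px < 0 || py < 0 || px as usize + B > W || py as usize + B > H {
        return None;
    }
    let (px, py) = (px as usize, py as usize);
    let mut total = 0u64;
    for y in 0..B {
        for x in 0..B {
            let p = prev[(py + y) * W + (px + x)] as i64;
            let c = cur[(cy + y) * W + (cx + x)] as i64;
            total += (p - c).unsigned_abs();
        }
    }
    Some(total)
}

// Returns the motion vector (dx, dy) pointing from the current block at
// (cx, cy) back to its best match in the previous frame.
fn estimate_motion(prev: &[u8], cur: &[u8], cx: usize, cy: usize) -> (isize, isize) {
    let mut best = (0, 0);
    let mut best_sad = u64::MAX;
    for dy in -R..=R {
        for dx in -R..=R {
            if let Some(s) = sad(prev, cur, cx as isize + dx, cy as isize + dy, cx, cy) {
                if s < best_sad {
                    best_sad = s;
                    best = (dx, dy);
                }
            }
        }
    }
    best
}

fn main() {
    // Previous frame: a bright square at (16, 16); current frame: the same
    // square shifted 2 pixels right and 1 pixel down.
    let mut prev = vec![0u8; W * H];
    let mut cur = vec![0u8; W * H];
    for y in 0..B {
        for x in 0..B {
            prev[(16 + y) * W + (16 + x)] = 255;
            cur[(17 + y) * W + (18 + x)] = 255;
        }
    }
    let (dx, dy) = estimate_motion(&prev, &cur, 18, 17);
    println!("motion vector: ({dx}, {dy})"); // expected: (-2, -1)
}
```

Motion compensation is then just applying that vector: shift the matched block from the previous frame into place and store only the (here zero) residual difference.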
The interactive demonstration illustrates the core concepts of motion estimation and compensation. Observe how the square and circle move from the previous frame to the current frame.
In real implementations, motion estimation and compensation enable video encoders to categorize frames into three types — I-frames, P-frames, and B-frames — based on how they use these predictions. Each type plays a distinct role in balancing compression efficiency and video quality, determining how frames depend on one another.
- I-frame (intra-coded frame). A complete frame encoded independently, containing all the information needed to display it, and able to serve as a reference for other frames. It doesn't rely on motion estimation or compensation.
- P-frame (predictive frame). A frame encoded as the differences from the previous I-frame or P-frame, using motion estimation and compensation; it requires that earlier frame to be decoded first.
- B-frame (bidirectional frame). A frame encoded using motion estimation and compensation to predict differences from both the previous and next I-frame or P-frame, requiring both for decoding.
These frame types are organized into a Group of Pictures (GOP), a sequence of frames that begins with an I-frame followed by a mix of P-frames and B-frames. The GOP structure arranges frames to optimize compression efficiency and enable random access.
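As a small illustration, the sketch below models a hypothetical GOP pattern and prints which reference frames each frame type needs before it can be decoded. The IBBPBBPBB pattern is only an example; real encoders choose GOP structures based on their settings and the content.

```rust
// A tiny model of a GOP pattern (in display order) and the reference
// frames each frame type depends on.
#[derive(Debug, Clone, Copy)]
enum FrameType { I, P, B }

fn references_needed(frame: FrameType) -> &'static str {
    match frame {
        FrameType::I => "none (self-contained)",
        FrameType::P => "the previous I- or P-frame",
        FrameType::B => "the previous and next I- or P-frame",
    }
}

fn main() {
    use FrameType::*;
    let gop = [I, B, B, P, B, B, P, B, B]; // illustrative pattern only
    for (i, f) in gop.iter().enumerate() {
        println!("frame {i}: {:?}-frame, needs {}", f, references_needed(*f));
    }
}
```

Because B-frames depend on a later reference, decoders actually process frames in a different order than they are displayed, which is part of what the GOP structure has to account for.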
Transform coding and quantization
Transform coding and quantization are fundamental techniques for spatial compression, designed to reduce data size by selectively discarding information that is less perceptually significant. The idea rests on a characteristic of human vision: we are less sensitive to high-frequency details, so the controlled loss of those details doesn't noticeably affect the viewing experience.
The process consists of two primary stages. First, transform coding converts the raw data into a frequency-based representation. Then, quantization reduces the precision of these frequency coefficients, discarding less significant details to achieve compression. Finally, the quantized coefficients are encoded for storage or transmission.
Discrete Cosine Transform
The transform coding stage is often accomplished using mathematical transforms, such as the Discrete Cosine Transform (DCT). While other transforms exist, we will focus on the DCT in this article. The core idea of the DCT is to concentrate the signal's energy into a relatively small number of coefficients, so we can then compress the data by discarding some of the high-frequency components.
To understand the concept, consider a simplified analogy from geometry. In a 2D Cartesian coordinate system, any vector can be decomposed into its horizontal and vertical components. If we reduce the magnitude of the horizontal component, we effectively reduce the information content of the vector along that axis. Feel free to explore the interactive demo below to better grasp this concept.
Similarly, the DCT decomposes a signal into its constituent components. We can then reduce the precision of the high-frequency coefficients, which represent fine details in the image or audio. As the demo below shows, the original signal can be decomposed into different frequency components.
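If you want to see those components computed rather than visualized, here's a naive one-dimensional DCT-II in Rust (without the normalization factors real codecs apply). Feeding it a smooth, slowly varying signal shows most of the magnitude ending up in the first few coefficients; the input values are made up for the example.

```rust
// Naive 1-D DCT-II to illustrate energy compaction.
use std::f64::consts::PI;

fn dct_ii(input: &[f64]) -> Vec<f64> {
    let n = input.len();
    (0..n)
        .map(|k| {
            input
                .iter()
                .enumerate()
                .map(|(i, &x)| x * (PI / n as f64 * (i as f64 + 0.5) * k as f64).cos())
                .sum::<f64>()
        })
        .collect()
}

fn main() {
    // A smooth ramp, similar to a gently varying row of pixels.
    let signal: Vec<f64> = (0..8).map(|i| 100.0 + 5.0 * i as f64).collect();
    let coeffs = dct_ii(&signal);
    for (k, c) in coeffs.iter().enumerate() {
        println!("coefficient {k}: {c:8.2}");
    }
    // Most of the magnitude sits in the first coefficients; the
    // high-frequency ones are small and can be quantized aggressively.
}
```

Real codecs apply a two-dimensional version of this transform to small blocks of pixels, usually with fast integer approximations, but the energy-compaction behaviour is the same.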
Quantization
Quantization is the process of simplifying the list of frequency components we get from the previous stage. As we discussed, some frequency components are more important than others, so we can make the less important ones smaller, or even discard them. It's like rounding numbers: if you have 10.99, you might round it to 11. In quantization, we round the frequency components, so the weak ones become even weaker while the strong ones stay mostly the same. This reduces the amount of information needed to store the picture or video, making it smaller.
However, if we simplify too much, the picture or video might look blurry or blocky, because we've lost some of the finer details. Again, there are no fixed, universally applicable parameters; the art of encoding is to find a good balance between file size and perceptual quality.
You can try the interactive demo below to see quantization in action. Here's what it demonstrates: In image and audio compression (e.g., JPEG and MP3), the DCT transform separates a signal into frequency components, and quantization selectively reduces their precision.
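If you prefer code to sliders, here's a minimal sketch of that rounding step in Rust. The coefficients and the step size are made-up values for illustration; real codecs use per-frequency quantization tables rather than a single step.

```rust
// Quantize DCT-style coefficients: divide by a step size and round, then
// multiply back. Small (high-frequency) coefficients collapse to zero,
// which is where most of the compression comes from.
fn quantize(coeffs: &[f64], step: f64) -> Vec<i32> {
    coeffs.iter().map(|&c| (c / step).round() as i32).collect()
}

fn dequantize(quantized: &[i32], step: f64) -> Vec<f64> {
    quantized.iter().map(|&q| q as f64 * step).collect()
}

fn main() {
    // Hypothetical coefficients: one strong low-frequency term, weak high-frequency ones.
    let coeffs = [940.0, -64.4, 0.3, -7.1, 0.0, -2.4, 0.1, -0.6];
    let step = 16.0; // larger step = more compression, more loss

    let q = quantize(&coeffs, step);
    let restored = dequantize(&q, step);

    println!("quantized: {q:?}");        // mostly zeros after the first entries
    println!("restored:  {restored:?}"); // close to the original, fine detail lost
}
```

The long runs of zeros left after quantization are what the final lossless coding stage exploits to shrink the data so effectively.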
What next?
In this article, we discussed the fundamental concepts of encoding and decoding, including raw file formats, motion estimation and compensation, and transform coding and quantization. We also explored the Discrete Cosine Transform (DCT) and its application in image and audio compression.
I hope this article has provided you with a solid understanding of the principles behind encoding and decoding. Building on this foundation, we will introduce the encoding and decoding elements of GStreamer in the next article.