T2. MPEG & MPEG-2
Last update on: 24-12-2021

Basic concepts: aspect ratio and video
Let's start this second theory lesson by explaining another couple of basic concepts about images (and videos):
The aspect ratio describes the proportional relationship between the width and height of an image, video, or pixel. If we apply it to screens, we call it the Display Aspect Ratio (DAR). It is written as two numbers separated by a colon, for example 16:9 or 4:3.
The concept can also be applied to pixels: the Pixel Aspect Ratio (PAR) describes the relationship between a pixel's width and height, that is, the pixel's shape.
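As a small illustration, here is a minimal Python sketch (the function name is ours, not from the lesson) that reduces a resolution to its simplest ratio:

```python
from math import gcd

def aspect_ratio(width, height):
    """Reduce width:height to its simplest form, e.g. 1920x1080 -> '16:9'."""
    d = gcd(width, height)
    return f"{width // d}:{height // d}"

print(aspect_ratio(1920, 1080))  # 16:9
print(aspect_ratio(720, 480))    # 3:2 (storage ratio; the DAR differs if PAR != 1)
```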
Before we explore the video formats, let's agree on what a video is. We can define a video as a succession of frames in time (which we can see as another dimension), shown at a rate of n frames per second: this n is the framerate, measured in frames per second (fps).
Furthermore, we define bitrate as the number of bits per second needed to show a video:

bitrate = width × height × bitDepth × fps

For example, a video at 30 fps with 24 bits/pixel and a 480x240 resolution needs 480 × 240 × 24 × 30 = 82,944,000 bits per second, or 82.944 Mbps, if we don't employ any kind of compression.
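As a quick sanity check of that arithmetic, here is a minimal Python sketch (the function name is ours, not part of any standard):

```python
def raw_bitrate(width, height, bit_depth, fps):
    """Bits per second of uncompressed video: width * height * bitDepth * fps."""
    return width * height * bit_depth * fps

bps = raw_bitrate(480, 240, 24, 30)
print(bps, "bits/s")       # 82944000
print(bps / 1e6, "Mbps")   # 82.944
```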
When the bitrate is nearly constant, we call it CBR (constant bit rate); when it varies, VBR (variable bit rate).
There are also techniques to double the perceived framerate without consuming extra bandwidth. The first one used was interlaced video, which sends half of the lines of the screen in one pass and the other half in the next, skipping every other line. Nowadays progressive scan video is the norm: it displays, stores, or transmits moving images by drawing all the lines of each frame in sequence.
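To make the interlaced/progressive distinction concrete, here is a small sketch, assuming a frame is stored as a NumPy array of scanlines (this is our illustration, not part of the lesson):

```python
import numpy as np

def split_fields(frame):
    """Split a progressive frame (H, W) into its two interlaced fields:
    the top field holds the even scanlines, the bottom field the odd ones.
    Each field carries half the lines, so it needs half the bandwidth."""
    top_field = frame[0::2, :]     # lines 0, 2, 4, ...
    bottom_field = frame[1::2, :]  # lines 1, 3, 5, ...
    return top_field, bottom_field
```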
Once these concepts are settled, we can start talking about video compression methods.
MPEG
As we explained in the previous lesson, standardization is usually carried out by groups of experts in the subject. In this case, the Moving Picture Experts Group (MPEG) was established by ISO in 1988 and published its first standard (named, of course, MPEG-1) in 1993.
MPEG-1 is a standard format for compressing and storing digital audio and video. It is a lossy compression method; unlike with images, practically all video compression formats in common use are lossy, at least for now.
It was designed to compress VHS-quality raw digital video and CD audio down to 1.5 Mbit/s (26:1 and 6:1 compression ratios, respectively). VHS had a resolution of only about 360x240 pixels, which would be considered low quality today. MPEG-1 also enabled the Video CD format, which never became widely successful.
The conceptual idea behind MPEG-1 was to put the complexity into a smart encoder running on specialized hardware, paired with a simple decoder that cheap consumer hardware could run at home.
In the previous lesson, we saw how JPEG can compress an image (or frame). MPEG doesn't compress each frame independently; instead, it uses interframe (temporal) compression. The following diagram describes the method's workflow, which will be detailed in the next sections.
Reduction of resolution
This process may downscale the video, and it converts the pixels from RGB to YUV (as in JPEG).
Additionally, it performs chroma subsampling, which consists of encoding the chroma (U and V, or Cb and Cr) information at a lower resolution than the luma (Y), taking advantage of the human visual system's lower acuity for color differences than for luminance. The human eye contains about 120 million rod cells (which detect luminance) and only about 6 million cone cells (which detect color), hence the reduction of chroma.
The subsampling amount is described as a:x:y, where a is the horizontal sampling reference (usually 4), x the chroma samples in the first row, and y the number of changes between the first and second row of pixels. Common subsampling schemes used in modern codecs are: 4:4:4 (no subsampling), 4:2:2, 4:1:1, 4:2:0, 4:1:0 and 3:1:1.
As we can see, the human eye can hardly tell the difference between the four examples.
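Here is a minimal sketch of these two steps, assuming frames are NumPy arrays with RGB values in [0, 1]; the BT.601 coefficients below are a common choice for MPEG-era video, and the function names are ours:

```python
import numpy as np

def rgb_to_yuv(rgb):
    """Convert an RGB frame (H, W, 3) to Y, Cb, Cr planes (BT.601 coefficients)."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y  =  0.299    * r + 0.587    * g + 0.114    * b
    cb = -0.168736 * r - 0.331264 * g + 0.5      * b
    cr =  0.5      * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr

def subsample_420(chroma):
    """4:2:0 subsampling: average each 2x2 block into one chroma sample,
    leaving the luma plane at full resolution (4x fewer chroma samples)."""
    h, w = chroma.shape
    even = chroma[:h - h % 2, :w - w % 2]  # crop to even dimensions
    return even.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
```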
Motion estimation
To see how motion estimation works, we need to understand the three types of frames that we can find, and how they work in MPEG. Let's assume we have a 30fps movie, and the following are the first 4 frames.
We can see there is a lot of repetition across the frames: the blue background and the various still objects. We can divide the frames into three categories:
I Frame (intraframe, keyframe)
An I-frame is a self-contained frame. It doesn't rely on any other frame to be rendered, looking similar to a static photo. The first frame is usually an I-frame, and we can also see them inserted regularly in between other types of frames.
P frame (predicted)
It takes advantage of the fact that the current picture can almost always be rendered from the previous frame. For instance, in the second frame of our example, the only change is the ball moving. We can rebuild frame 2 using just the difference from, and a reference to, the previous frame.
B frame (Bidirectional predicted frame)
Using the same concept, a B-frame can reference not only the previous frame but also the following one: it is predicted from both the past and the future frame.
The frame types are used together to provide better compression. When using this technique, MPEG sequences and reorders the frames with a pattern such as IBBBPBBBP. An I-frame is the most 'expensive' frame, followed by a P-frame, and a B-frame is the cheapest. Therefore, we can obtain a very convincing and compressed result by encoding only part of the video's information.
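To get a feel for the savings, here is a toy Python estimate; the per-frame sizes below are made-up illustrative numbers, not values from the MPEG-1 standard:

```python
# Hypothetical average frame sizes in kilobits (illustrative only):
COST = {"I": 150, "P": 50, "B": 20}

gop = "IBBBPBBBP"  # the repeating pattern mentioned above
avg = sum(COST[f] for f in gop) / len(gop)
print(f"average: {avg:.1f} kbit/frame vs {COST['I']} kbit/frame for all-I coding")
```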
On the other hand, if some information is lost, we will not be able to recover the frames correctly and artifacts will appear:
Motion vector
Motion estimation examines the movement of objects in an image sequence and tries to obtain vectors representing the estimated motion. In this part of the encoding process, we generate the motion vector data that will later be consumed by the decoder.
Note: motion estimation and motion compensation are treated as a single block in some resources, but they are two different processes. Motion compensation is the use of motion estimation to achieve compression: if you can describe the motion, then you only have to describe the changes that remain after compensating for that motion.
The following sections are explained without going into much depth on the formulas or mathematical operations behind them; if you want to learn more, check the full slides in PDF at the end of the lesson.
The motion vector represents the movement of a block of the image between two frames.
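Here is a minimal sketch of exhaustive block matching over grayscale frames; the block size, search range, and SAD cost are common textbook choices, not fixed by MPEG-1:

```python
import numpy as np

def estimate_motion(ref, cur, block=16, search=8):
    """For each block of the current frame, search a +/- `search` pixel window
    in the reference frame and keep the displacement (dy, dx) that minimises
    the sum of absolute differences (SAD)."""
    h, w = cur.shape
    vectors = {}
    for by in range(0, h - block + 1, block):
        for bx in range(0, w - block + 1, block):
            target = cur[by:by + block, bx:bx + block].astype(np.int32)
            best, best_v = None, (0, 0)
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    y, x = by + dy, bx + dx
                    if y < 0 or x < 0 or y + block > h or x + block > w:
                        continue  # candidate block falls outside the frame
                    cand = ref[y:y + block, x:x + block].astype(np.int32)
                    sad = np.abs(target - cand).sum()
                    if best is None or sad < best:
                        best, best_v = sad, (dy, dx)
            vectors[(by, bx)] = best_v
    return vectors
```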
Motion compensation
This process is split into three parts:
- Motion analysis: checks that the motion estimated in the previous step is correct.
- Prediction and differentiation: the current frame is predicted using the estimated vectors and the previous frame, and the prediction error (the difference between the prediction and the actual frame) is calculated. Once we have the motion vector for each block, we perform block matching: we apply the vector to the original block and match it against the next frame, for every block (see the sketch after this list). As before, further information can be found in the slides PDF.
- Encoding: performed as in the JPEG format: DCT -> Quantize -> Run-length encoding -> Entropy encoding (Huffman).
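Continuing the `estimate_motion` sketch above, motion compensation applies the vectors to the reference frame and keeps only the residual (the prediction error) for encoding; again, this is our illustration, not the normative MPEG-1 process:

```python
import numpy as np

def compensate(ref, vectors, block=16):
    """Build the motion-compensated prediction of the current frame by
    copying each reference block displaced by its motion vector."""
    pred = np.zeros_like(ref)
    for (by, bx), (dy, dx) in vectors.items():
        pred[by:by + block, bx:bx + block] = \
            ref[by + dy:by + dy + block, bx + dx:bx + dx + block]
    return pred

def residual(cur, pred):
    """Prediction error: only this (plus the vectors) is DCT-coded."""
    return cur.astype(np.int32) - pred.astype(np.int32)
```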
Here's a real-life example of an MPEG encoder:
What about audio?
MPEG-1 has its own audio codec based on psychoacoustics. MPEG-1 Layer I (.mp1) is defined in ISO/IEC 11172-3, whose first version was published in 1993. It supports the following sampling rates and bitrates:
- Sampling rates: 32, 44.1 and 48 kHz
- Bitrates: 32, 64, 96, 128, 160, 192, 224, 256, 288, 320, 352, 384, 416 and 448 kbit/s.
MPEG-2
The same group of experts developed an enhanced version of the format in 1996. It offered higher resolutions such as 720x480 and 1280x720 at 60 fps, with full CD-quality audio. This was sufficient for all the major TV standards, including NTSC and HDTV. MPEG-2 was used by the DVD-Video format and could compress a 2-hour video into a few gigabytes. Some of the improvements were the following:
- Coding performance related to supporting interlaced video input:
- Field/frame prediction modes.
- Field/frame DCT coding syntax
- Downloadable quantization matrix and alternative scan order
- Scalability extension
- Non-compression enhancements:
- Syntax to facilitate 3:2 pull-down in the decoder
- Pan and scan codes with 1/16 pixel resolution
- Display flags indicating chromaticity, subcarrier amplitude, and phase (for NTSC/PAL/SECAM source material)
- Improved version of the audio codec:
- Extensions known as MPEG-BC, backward compatible with MPEG-1 Audio and defined in ISO/IEC 13818-3
- Introduction of MPEG multichannel (backward compatible 5.1 channel surround sound)
- Additional sampling rates (16, 22.05 and 24 kHz) and bitrates (8, 16, 24, 32, 40, 48, 56, 64, 80, 96, 112, 128, 144 and 160 kbit/s). These sampling rates are half of those in MPEG-1 and were added to maintain higher sound quality when encoding audio at lower bitrates.
MPEG-2 is used in DVD-Video, HDV, camcorder formats such as MOD, TOD and XDCAM, in the digital TV standards, and even in the first Blu-ray videos.
Resources
Download Slides T2
Download Practical 2