JPEG compression

JPEG is a clever lossy image compression algorithm that exploits a bias in human perception and the structure of natural images. It changes basis before encoding so that the new representation concentrates the signal humans are most sensitive to; the remaining detail can then be discarded without badly degrading visual fidelity.

A digital image is a mapping from two-dimensional spatial coordinates to pixel values:
$$I: D \subset \mathbb{Z}^2 \to \mathcal{C}.$$

Here $D$ is a finite grid of pixel locations and $\mathcal{C}$ is the set of possible values. In the $RGB$ model, $\mathcal{C}=\{0,\dots,255\}^3$, representing three 8-bit color channels.
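To make the definition concrete, here is a minimal sketch, assuming NumPy and a hypothetical blank image, of how a 960 × 720 RGB image can be held as an array and how its raw size follows from the three 8-bit channels.

```python
import numpy as np

# Hypothetical all-black image: 720 rows, 960 columns, 3 channels of 8 bits each.
height, width = 720, 960
image = np.zeros((height, width, 3), dtype=np.uint8)

raw_bytes = image.nbytes          # 720 * 960 * 3 bytes, one byte per channel value
raw_mib = raw_bytes / 2**20       # size in units of 2**20 bytes

print(raw_bytes, round(raw_mib, 2))
```

Each pixel location in the grid maps to one triple `image[row, col]`, which is the value of the image at that coordinate.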

A property of human vision is that we are more sensitive to luminance (brightness) than to chrominance (color differences). Edges and fine detail in brightness are noticeable, while small variations in color are much harder to perceive. JPEG exploits this by separating brightness from color: instead of working directly in $RGB$, each pixel is converted to the $Y'CbCr$ color space:

$$(R,G,B) \rightarrow (Y', C_b, C_r).$$

[Figure: original RGB image, 960 px × 720 px, raw RGB size 1.98 MB, with a view of the R channel.]

Equivalently, the chroma channels encode how color deviates from the luminance signal: $C_b$ is proportional to $B - Y'$ and $C_r$ to $R - Y'$.

The $RGB \to Y'CbCr$ conversion is linear:

$$Y' = 0.299R + 0.587G + 0.114B$$

$$C_b = 128 + (-0.168736R - 0.331264G + 0.5B)$$

$$C_r = 128 + (0.5R - 0.418688G - 0.081312B)$$

These weights are chosen so that $Y'$ approximates perceived brightness from the $RGB$ primaries used by the image standard, so the coefficients reflect how much each channel contributes to brightness rather than color alone. Green receives the largest weight because human vision is most sensitive in the middle of the visible spectrum, and because fine spatial detail is carried much more strongly there. The chroma terms in parentheses are signed and centered on zero, so the constant $128$ shifts them into the standard $0$–$255$ range used by 8-bit images.
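The conversion can be sketched as a single matrix product. This is a minimal illustration assuming NumPy; the names `M`, `OFFSET`, `rgb_to_ycbcr`, and `ycbcr_to_rgb` are hypothetical, and only the coefficients come from the equations above.

```python
import numpy as np

# Rows hold the Y', Cb, and Cr weights from the conversion equations.
M = np.array([
    [ 0.299,     0.587,     0.114   ],
    [-0.168736, -0.331264,  0.5     ],
    [ 0.5,      -0.418688, -0.081312],
])
OFFSET = np.array([0.0, 128.0, 128.0])   # chroma is shifted by 128, luma is not

def rgb_to_ycbcr(rgb):
    """Convert an (..., 3) array of RGB values to float Y'CbCr."""
    rgb = np.asarray(rgb, dtype=np.float64)
    return rgb @ M.T + OFFSET

def ycbcr_to_rgb(ycc):
    """Invert the linear map; no rounding or clipping is applied here."""
    ycc = np.asarray(ycc, dtype=np.float64)
    return (ycc - OFFSET) @ np.linalg.inv(M).T

gray = rgb_to_ycbcr([128, 128, 128])   # neutral gray: both chroma values sit at 128
red = rgb_to_ycbcr([255, 0, 0])        # saturated red: Cr is pushed to the top of the range
print(gray, red)
```

The results are left as floats. A real encoder would round them and clip to the 8-bit range, which matters for saturated colors such as pure red.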

Once luminance and chroma are separated, JPEG can reduce the spatial resolution of the chroma channels before applying further compression. This step is called chroma subsampling.

In 4:4:4, every pixel keeps its own chroma values. In 4:2:2, chroma is shared horizontally across neighboring pixels. In 4:2:0, chroma is shared across small $2 \times 2$ neighborhoods. In 4:1:1, chroma is shared across groups of four horizontal pixels.

[Figure: sample grids for the Y', Cb, and Cr channels.]

Because the human visual system is less sensitive to fine color detail, the reconstructed image still looks almost identical even though many chroma samples have been removed.
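As a rough sketch of 4:2:0, the snippet below shares one chroma sample across each $2 \times 2$ neighborhood. The text does not say how the shared value is chosen; averaging on the way down and repeating samples on the way up are assumptions made here for illustration, and the function names are hypothetical.

```python
import numpy as np

def subsample_420(chroma):
    """Average each 2x2 neighborhood into one sample (even dimensions assumed)."""
    h, w = chroma.shape
    return chroma.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample_420(chroma_small):
    """Rebuild a full-size plane by repeating each sample over its 2x2 neighborhood."""
    return chroma_small.repeat(2, axis=0).repeat(2, axis=1)

# Hypothetical 2x4 Cb plane: a slowly varying patch next to a flat patch.
cb = np.array([[100.0, 102.0, 50.0, 50.0],
               [104.0, 106.0, 50.0, 50.0]])
cb_small = subsample_420(cb)       # one value per 2x2 neighborhood
cb_back = upsample_420(cb_small)   # same shape as cb again

# Sample counts for a 960 x 720 image: full-resolution Y' plus two quarter-size chroma planes.
full_samples = 960 * 720 * 3
sub_samples = 960 * 720 + 2 * (480 * 360)
print(cb_small, sub_samples / full_samples)
```

Under these assumptions 4:2:0 keeps half as many samples as the full-resolution image before any further compression is applied.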

[Figure: chroma-subsampled image, shown with the 4:4:4 reconstruction. Raw RGB: 1.98 MB; subsampled: 1.98 MB.]

After chroma subsampling, JPEG no longer stores pixel values directly. Instead, across the $Y'$, $C_b$, and $C_r$ channels, each $8 \times 8$ block is transformed into a weighted sum of cosine waves using the discrete cosine transform (DCT). The coefficient in the top-left represents the average brightness of the block, while coefficients farther to the right or lower down correspond to progressively higher horizontal or vertical spatial frequencies. In the basis functions below, $u$ is the horizontal frequency index and $v$ is the vertical frequency index.

$$B_{u,v}(x,y) \propto \cos\left(\frac{(2x+1)u\pi}{16}\right)\times\cos\left(\frac{(2y+1)v\pi}{16}\right)$$

DCT alone does not reduce the number of stored values: an $8 \times 8$ block still produces $64$ coefficients. What changes is the representation. Because natural images are locally smooth, most of the signal lies in low spatial frequencies. After the transform, a small number of coefficients contain most of the block's energy, while many high-frequency coefficients are close to zero.
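The energy compaction can be checked numerically. The sketch below, assuming NumPy, builds an orthonormal DCT matrix from the cosine basis and applies it to a hypothetical smooth block. Subtracting 128 before the transform follows baseline JPEG practice and is not described in the text above; the function names are illustrative.

```python
import numpy as np

N = 8

def dct_matrix(n=N):
    """Orthonormal DCT-II matrix: row u samples cos((2x+1) u pi / (2n))."""
    x = np.arange(n)
    u = x.reshape(-1, 1)
    T = np.cos((2 * x + 1) * u * np.pi / (2 * n))
    T[0, :] *= np.sqrt(1 / n)
    T[1:, :] *= np.sqrt(2 / n)
    return T

T = dct_matrix()

def dct2(block):
    """2D DCT; result[v, u] pairs vertical index v with horizontal index u."""
    return T @ block @ T.T

def idct2(coeffs):
    """Inverse 2D DCT (T is orthonormal, so its transpose is its inverse)."""
    return T.T @ coeffs @ T

# Hypothetical smooth block: a gentle horizontal ramp, level-shifted by 128.
block = np.tile(np.linspace(100, 140, N), (N, 1)) - 128
coeffs = dct2(block)

energy = coeffs ** 2
low_freq_share = energy[:2, :2].sum() / energy.sum()
print(coeffs.shape, round(low_freq_share, 3))
```

The transform still returns 64 numbers, but for a block like this nearly all of the energy lands in the few lowest-frequency coefficients.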

[Interactive figure: 8x8 Y' block 1 of 10800, showing each basis function multiplied by its coefficient, starting from the original block.]

When the coefficients are quantized, many high-frequency terms are rounded to $0$. Because the human visual system is less sensitive to fine, high-frequency detail, these coefficients can be discarded with little visible change to the image.

JPEG quantization divides each coefficient by an entry in a quantization matrix and rounds:

$$\hat{C}_{u,v} = \operatorname{round}\!\left(\frac{C_{u,v}}{Q_{u,v}}\right).$$

In practice, $Q$ is usually taken from a standard JPEG luminance quantization table and then scaled by a quality factor. Larger values of $Q_{u,v}$ cause more aggressive rounding. Since the entries of the matrix get larger toward higher spatial frequencies, those coefficients are much more likely to collapse to $0$.
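A minimal sketch of this step, assuming NumPy, is shown below. The base table is the standard luminance table from Annex K of the JPEG specification. The text does not say how the quality factor is applied; the libjpeg-style scaling rule used in `scaled_table` is an assumption, and the coefficient row is hypothetical.

```python
import numpy as np

# Standard JPEG luminance quantization table (Annex K of the JPEG specification).
Q50 = np.array([
    [16, 11, 10, 16,  24,  40,  51,  61],
    [12, 12, 14, 19,  26,  58,  60,  55],
    [14, 13, 16, 24,  40,  57,  69,  56],
    [14, 17, 22, 29,  51,  87,  80,  62],
    [18, 22, 37, 56,  68, 109, 103,  77],
    [24, 35, 55, 64,  81, 104, 113,  92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103,  99],
])

def scaled_table(quality):
    """Scale the base table with a libjpeg-style quality mapping (assumed here)."""
    scale = 5000 // quality if quality < 50 else 200 - 2 * quality
    table = (Q50 * scale + 50) // 100
    return np.clip(table, 1, 255)

def quantize(coeffs, table):
    """Divide by the table entry and round to the nearest integer."""
    return np.round(coeffs / table).astype(int)

def dequantize(quantized, table):
    """Rescale quantized values; the rounding error is not recoverable."""
    return quantized * table

Q70 = scaled_table(70)
row = np.array([-151.0, 24.0, -31.0, 12.0, -17.0, -7.0, 0.3, 4.0])  # hypothetical first row
q_row = quantize(row, Q70[0])
print(Q70[0], q_row)
```

Note that `np.round` rounds exact halves to the nearest even integer, which may differ from the rounding rule a particular encoder uses.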

[Interactive figure: 8x8 Y' block 1 of 10800 at JPEG quality 70, with panels for the DCT coefficients, the quantization matrix, and the quantized coefficients.]

Quantization matrix at quality 70:

$$Q = \begin{bmatrix}
10 & 7 & 6 & 10 & 14 & 24 & 31 & 37 \\
7 & 7 & 8 & 11 & 16 & 35 & 36 & 33 \\
8 & 8 & 10 & 14 & 24 & 34 & 41 & 34 \\
8 & 10 & 13 & 17 & 31 & 52 & 48 & 37 \\
11 & 13 & 22 & 34 & 41 & 65 & 62 & 46 \\
14 & 21 & 33 & 38 & 49 & 62 & 68 & 55 \\
29 & 38 & 47 & 52 & 62 & 73 & 72 & 61 \\
43 & 55 & 57 & 59 & 67 & 60 & 62 & 59
\end{bmatrix}$$

For example, the top-left DCT coefficient of this block, $-151$, is divided by $Q_{0,0} = 10$ and rounds to $-15$.

47 of 64 coefficients are zero after quantization.

After quantization, JPEG still has to store the remaining coefficients efficiently. It does this by scanning the $8 \times 8$ block in a zigzag pattern so that low frequencies appear first and the long tail of high-frequency zeros gets grouped together.

The first zigzag entry is the DC coefficient, which stores the average brightness of the block and is encoded separately as a difference from the previous block's DC term. The remaining AC coefficients are encoded with symbols of the form $(\mathrm{run}, \mathrm{size})$, where run is the number of zeros before the next nonzero value and size is the bit-width category of that next value. Each symbol is written as its Huffman code from the standard JPEG luminance Huffman tables, followed by the amplitude bits for the coefficient itself.
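The scan and the symbol construction can be sketched as follows, assuming NumPy and a hypothetical quantized block. The end-of-block symbol (0, 0) and the sixteen-zero symbol (15, 0) are details of baseline JPEG that the text above does not cover; the function names are illustrative.

```python
import numpy as np

def zigzag_order(n=8):
    """(row, col) pairs along anti-diagonals, alternating direction."""
    order = []
    for s in range(2 * n - 1):
        diag = [(r, s - r) for r in range(n) if 0 <= s - r < n]
        # Even diagonals run from bottom-left to top-right, odd ones the other way.
        order.extend(diag[::-1] if s % 2 == 0 else diag)
    return order

def size_category(value):
    """Number of bits needed for |value| (0 for a zero value)."""
    return int(abs(value)).bit_length()

def amplitude_bits(value):
    """Plain binary for positive values, one's complement for negative values."""
    size = size_category(value)
    if size == 0:
        return ""
    if value < 0:
        value += (1 << size) - 1
    return format(value, f"0{size}b")

def ac_symbols(block):
    """Encode the 63 AC coefficients as ((run, size), value) pairs."""
    scan = [int(block[r][c]) for r, c in zigzag_order()][1:]
    symbols, run = [], 0
    for value in scan:
        if value == 0:
            run += 1
            continue
        while run > 15:                  # (15, 0) stands for sixteen zeros
            symbols.append(((15, 0), 0))
            run -= 16
        symbols.append(((run, size_category(value)), value))
        run = 0
    if run > 0:
        symbols.append(((0, 0), 0))      # end of block: everything left is zero
    return symbols

# Hypothetical quantized block with nonzero values only near the top-left corner.
block = np.zeros((8, 8), dtype=int)
block[0, :5] = [-15, 3, -5, 1, -1]
block[1, :4] = [10, 5, -3, -2]
block[2, 1:3] = [6, -4]
block[3, :2] = [2, 1]
block[4, 0] = -1

symbols = ac_symbols(block)
print(symbols[:4], amplitude_bits(-5))
```

Because the zigzag scan pushes the high-frequency zeros to the end, the whole tail of this block collapses into a single end-of-block symbol.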

[Interactive figure: quantized block with zigzag scan, for 8x8 Y' block 1 of 10800 at JPEG quality 70, shown at zigzag step 16 of 64.]

Entropy-coded symbols up to that step, written as (run, size) value • Huffman code + amplitude bits:

DC Δ0 • 00
(0,2) +3 • 01 + 11
(0,4) +10 • 1011 + 1010
(1,3) +5 • 1111001 + 101
(0,3) -5 • 100 + 010
(0,1) +1 • 00 + 1
(0,2) -3 • 01 + 00
(0,3) +6 • 100 + 110
(0,2) +2 • 01 + 10
(0,1) -1 • 00 + 0
(0,1) +1 • 00 + 1
(0,3) -4 • 100 + 011
(0,2) -2 • 01 + 01
(0,1) -1 • 00 + 0

Fixed-width symbol stream: 136 bits

JPEG Huffman-coded stream: 66 bits
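The Huffman-coded bit count can be reproduced from the symbols listed above. The sketch below is plain Python and uses only the handful of Huffman codes that appear in that listing, not the full standard tables; the variable names are hypothetical.

```python
# Subset of the luminance Huffman codes, limited to the symbols in the listing.
DC_CODES = {0: "00"}
AC_CODES = {
    (0, 1): "00",
    (0, 2): "01",
    (0, 3): "100",
    (0, 4): "1011",
    (1, 3): "1111001",
}

def amplitude_bits(value):
    """Plain binary for positive values, one's complement for negative values."""
    size = abs(value).bit_length()
    if size == 0:
        return ""
    if value < 0:
        value += (1 << size) - 1
    return format(value, f"0{size}b")

dc_diff = 0   # the listing shows a DC difference of 0
ac = [
    ((0, 2), 3), ((0, 4), 10), ((1, 3), 5), ((0, 3), -5), ((0, 1), 1),
    ((0, 2), -3), ((0, 3), 6), ((0, 2), 2), ((0, 1), -1), ((0, 1), 1),
    ((0, 3), -4), ((0, 2), -2), ((0, 1), -1),
]

# A size-0 DC difference needs its Huffman code only, with no amplitude bits.
bits = DC_CODES[abs(dc_diff).bit_length()] + amplitude_bits(dc_diff)
for symbol, value in ac:
    bits += AC_CODES[symbol] + amplitude_bits(value)

print(len(bits))
```

Frequent symbols such as (0, 1) and (0, 2) get two-bit codes, which is where most of the saving over a fixed-width encoding comes from.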

After entropy coding, the image file contains a compact description of the DCT coefficients rather than the pixel values themselves. Decoding just runs the same pipeline in reverse. The coefficients are recovered from the entropy codes, rescaled by the quantization matrix, and passed through an inverse DCT to reconstruct each $8 \times 8$ block. The chroma channels are then upsampled, and the image is converted from $Y'CbCr$ back to $RGB$ for display.
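The block-level part of decoding can be sketched as a round trip, assuming NumPy. The quantization table here is a hypothetical one whose steps simply grow with frequency, not the standard table, and the 128 level shift follows baseline JPEG practice rather than anything stated above.

```python
import numpy as np

N = 8
x = np.arange(N)

# Orthonormal DCT-II matrix: row u samples cos((2x+1) u pi / 16).
T = np.cos((2 * x + 1) * x.reshape(-1, 1) * np.pi / (2 * N)) * np.sqrt(2 / N)
T[0, :] = np.sqrt(1 / N)

# Hypothetical quantization table whose steps grow with frequency.
Q = 8 + 4 * (x.reshape(-1, 1) + x)

def encode_block(block):
    """Level shift, forward DCT, then divide by Q and round."""
    return np.round(T @ (block - 128.0) @ T.T / Q).astype(int)

def decode_block(quantized):
    """Rescale by Q, inverse DCT, undo the level shift, round to 8-bit pixels."""
    coeffs = quantized * Q
    pixels = T.T @ coeffs @ T + 128.0
    return np.clip(np.round(pixels), 0, 255).astype(np.uint8)

# Hypothetical smooth block: horizontal ramp plus a slight vertical gradient.
block = np.add.outer(np.arange(N) * 2.0, np.linspace(100, 140, N))

quantized = encode_block(block)
restored = decode_block(quantized)

zero_count = int((quantized == 0).sum())
max_error = np.abs(restored.astype(float) - block).max()
print(zero_count, max_error)
```

The restored block is close to the original but not identical: the rounding in `encode_block` is the only lossy step in this sketch, and decoding cannot undo it.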