Article: 17005 of comp.compression From: pacalet@elec.enst.fr (Renaud Pacalet) Newsgroups: comp.compression Subject: Re: What is Discrete Cosine Transforms ? Date: 12 Jan 1995 10:06:16 GMT Organization: Ecole Nationale Superieure des Telecommunications, Paris France [ The following post was made in response to the question what is a "discrete cosine transform" and how is it used in JPEG for compression? ] Sorry in advance for my english, etc. I will try to make it understandable. In JPEG, MPEG-I, MPEG-II and H261, the DCT is a two-dimentional one, applied on 8x8 blocks of pixels. The result of a DCT on such a block is a 8x8 block of coefficients. It can be described as a matrix product: F = TC * X * C Where X is the original pixel block, F is the DCT result of X, C and TC are 8x8 matrix of constants. Let A(i,j) be the element of the matrix A that is located on row i and column j. The matrix C is defined by: C(i,j) = K(j) / 2 * cos((2 * i + 1) * j * PI / 16) where K(0) = sqrt(1 / 2) K(j) = 1 when j != 0 TC(i,j) = C(j,i) The DCT is said to be an orthogonal transforme because TC * C = I (and C * TC = I), where I(i,i) = 1 and I(i,j) = 0 when i != j. So, as you can see, X = C * F * TC This second transform is the inverse DCT (IDCT). The original form of block X can be restored from F, its DCT form. In fact, this is not true, because the DCT coefficients are usually rounded to integer values and the rouding noise makes the restored block be different from the original one. This transform is not lossless. But, as you can be as accurate as you want in your computations and keep as many digits as you need in the result, this loss may be considered as neglictable (and it gives you the "lossless" version of JPEG). What is the DCT used for ? Compression. This may look strange to you because starting from a 8x8 block of natural values range 0 to 255, all you gain with DCT is another 8x8 block of integer values range -721 to 721 ! Where is the compression ? Well, consider the DCT as the real part of a Fourrier transform. Your pixel block comes from the picture domain. The DCT form of the same block belongs to the frequency domain. In the upper left corner area of the DCT block you find the low frequency coefficients. In the lower right corner area you find the high frequency coefficients. That is, if your pixel block is from a uniform area of the picture, the DCT form will have only one non-zero coefficient in the upper left corner. If it comes from a part of the picture with no high frequencies (a very common situation in real pictures) most of the coefficients will have zero value. This fact is the first compression in DCT based coding schemes because when transferring (or storing) the transformed picture, you dont scan it in a usual way: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 but in something like: 0 1 5 6 14 15 27 28 2 4 7 13 16 26 29 42 3 8 12 17 25 30 41 43 9 11 18 24 31 40 44 53 10 19 23 32 39 45 52 54 20 22 33 38 46 51 55 60 21 34 37 47 50 56 59 61 35 36 48 49 57 58 62 63 It gives you a 64 long series of values with a lot of consecutive zeros by the end. Such series are efficiently encoded with a run-length method. Take as an example the following DCT block: 129 12 9 0 0 1 0 1 28 8 34 5 1 0 0 0 17 0 0 2 0 0 0 0 6 1 0 0 0 0 1 0 0 2 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 4 0 0 7 0 0 8 1 5 0 1 0 0 0 It becomes a 24 long series of events (an event beeing a non-zero value and a number of zero values): (129,0) (12,0) (28,0) (17,0) (8,0) (9,1) (34,1) (6,1) (1,1) (5,1) (1,0) (1,0) (2,1) (2,8) (1,6) (8,0) (1,0) (4,1) (1,4) (1,1) (1,1) (5,7) (7,0) (1,6) that you can encode again using a Huffman like method. The main idea is to optimize the encoding of (almost) uniform areas of the picture. In a second (lossy) processing you can gain even more. The human eye beeing not so sensitive to high frequencies than to low frequencies, you can apply a strong quantization transform to high frequencies (forcing them to zero and increasing the number of consecutive zeros in the final series, making the run-length encoding even more efficient). This is the second idea: the DCT makes it easier to remove useless informations from a picture. In our example, after quantization has been applied, we could have a 13 long series of events, with almost no visible difference for a human eye: 129 12 9 0 0 0 0 0 28 8 34 5 1 0 0 0 17 0 0 2 0 0 0 0 6 1 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 (129,0) (12,0) (28,0) (17,0) (8,0) (9,1) (34,1) (6,1) (1,1) (5,2) (1,0) (2,1) (2,44) For instance, if your picture represents a guy wearing a striped suit with very narrow black and white lines on it (this is a high frequency area of the picture), you can replace the suit by a uniform grey one, very easy to encode and it will make no difference to you, espacially if your TV set is that one your grand mother gave to you years ago. In MPEG, the DCT is applied after differential coding has been applied to the blocks (this is called motion compensation), so the "pixels" are between -255 and 255. The motion compensation reduces the amount of informations to be coded by using informations of a previous picture (and even following picture in some modes, amazing, dont you think ?) in the stream. What else to say ? The DCT is not the best transform we could use for the purpose of compression but it's easy to implement in hardware devices. Sorry if this is too long, hope it helps anyway. +-------------------------------------------+-----------------------+ | Renaud Pacalet | Mes idees et opinions | | Departement Electronique, Ecole Nationale | n'engagent que moi. | | Superieure des Telecommunications | Elles ne constituent | | 46, Rue Barrault 75634 Paris Cedex 13 | en aucun cas le | | Tel : +33 1 45 81 78 08 | reflet d'une | | Fax : +33 1 45 80 40 36 | politique de mon | | Email : pacalet@elec.enst.fr | employeur. | +-------------------------------------------+-----------------------+