\documentclass[a4paper,11pt]{article}
\usepackage{latexsym}
\usepackage[dvips]{graphicx}
%\author{Professeur N. Cerf}
\title{Information Theory and Data Compression with the Lempel-Ziv Code}
\addtolength{\voffset}{0.0cm}
\addtolength{\textheight}{1.0cm}
\addtolength{\hoffset}{-1cm}
\addtolength{\marginparwidth}{0.0cm}
\addtolength{\textwidth}{2.0cm}
\newcommand{\bitem}{\item[$\bullet$]}
\newcommand{\citem}{\item[$\circ$]}
\newcommand{\mod}{\mathop{\mathrm{mod}}}
\newcommand{\fract}{\mathop{\mathrm{frac}}}
\newcommand{\erf}{\mathop{\mathrm{erf}}}

\begin{document}
\date{}
\maketitle

\section{Introduction}

\paragraph{}
Lempel and Ziv invented two data compression codes, commonly referred to as LZ77 and LZ78 after their publication years \cite{lz77, lz78}. Here we discuss only LZ78, which we simply call the ``Lempel-Ziv code''.

\paragraph{}
The Lempel-Ziv code compresses a source sequence. Its main advantages are:
\begin{itemize}
\bitem The compression is universal: the probability distribution of the source symbols does not need to be known in advance.
\bitem The code is asymptotically optimal, \emph{i.e.}, the number of bits per source symbol tends to the entropy of the source.
\bitem The algorithm is easy to implement.
\end{itemize}

\section{Principle}

\paragraph{}
The source sequence (or string) is divided into the shortest substrings that have not occurred before. For example, $\mathtt{1011010100010}$ is divided into $\mathtt{1},\mathtt{0},\mathtt{11},\mathtt{01},\mathtt{010},\mathtt{00},\mathtt{10}$.
Each time a substring is extracted, one must take the previous substrings into account and read enough symbols so as not to recreate a substring that has already occurred. For $\mathtt{010}$ in our example, $\mathtt{0}$ already exists and so does $\mathtt{01}$; hence $\mathtt{010}$ is the shortest substring that has not occurred before. Note that every substring has a previous substring as prefix; otherwise the extracted substring would not be minimal.

\paragraph{}
Once the sequence is divided, the Lempel-Ziv code transforms each substring into a pair that contains:
\begin{itemize}
\bitem the position of the longest previous substring that is a prefix of the current substring, and
\bitem the added character that makes the current substring unique.
\end{itemize}

\paragraph{}
For instance, the substrings $\mathtt{1},\mathtt{0},\mathtt{11},\mathtt{01},\mathtt{010},\mathtt{00},\mathtt{10}$ are encoded as $(0,\mathtt{1})$, $(0,\mathtt{0})$, $(1,\mathtt{1})$, $(2,\mathtt{1})$, $(4,\mathtt{0})$, $(2,\mathtt{0})$, $(1,\mathtt{0})$. By convention the substrings are numbered from left to right, starting at 1; position 0 denotes the empty substring.

\paragraph{}
In other words, the Lempel-Ziv code transforms a sequence of elements of an alphabet $\mathcal{A}$ into a new sequence of elements of $\mathbf{Z} \times \mathcal{A}$. In the example above the alphabet is binary.

\paragraph{}
As discussed above, the position also needs to be encoded (here in binary, or more generally over the alphabet $\mathcal{A}$). The longer the source sequence, the more bits are needed to encode the position. Fortunately, one can verify that asymptotically the number of bits per symbol tends to the entropy of the source.
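\paragraph{}
To make this transformation concrete, here is a minimal Python sketch that parses a binary string into the pairs described above (the name \verb|lz78_encode| and the treatment of a leftover suffix are illustrative choices of ours, not a reference implementation):
\begin{verbatim}
def lz78_encode(source):
    """Parse `source` into LZ78 pairs (position, new character)."""
    dico = {"": 0}           # entry 0 is the empty substring
    pairs = []
    current = ""             # prefix matched so far
    for char in source:
        if current + char in dico:
            current += char  # still a known substring: keep extending
        else:
            # `current` is the longest known prefix; emit its position
            # together with the character that makes the new substring
            # unique, then record the new dictionary entry.
            pairs.append((dico[current], char))
            dico[current + char] = len(dico)
            current = ""
    if current:
        # Leftover suffix already present in the dictionary; one
        # possible convention (another is a special end-of-sequence
        # character) is to emit its position with no new character.
        pairs.append((dico[current], ""))
    return pairs

print(lz78_encode("1011010100010"))
# [(0, '1'), (0, '0'), (1, '1'), (2, '1'), (4, '0'), (2, '0'), (1, '0')]
\end{verbatim}
On the example sequence it reproduces exactly the seven pairs listed above.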
\paragraph{}
We finally remark that the fundamental drawback of the Lempel-Ziv code is that it may require a large amount of memory to store all past substrings; asymptotically, this memory requirement is unbounded.

\section{Implementation}

\paragraph{}
To encode with Lempel-Ziv, one proceeds as follows, maintaining a dictionary that contains all past substrings in numbered entries.
\begin{enumerate}
\item One initializes the dictionary by putting the empty substring at position 0.
\item One extracts a substring of $n+1$ bits\footnote{For simplicity we speak of bits; the method generalizes easily to other alphabets.} from the sequence to be encoded. This substring must be the shortest substring not yet present in the dictionary, its first $n$ bits corresponding to an existing dictionary entry.
\item At the output, one generates a code that represents the number of that entry together with the $(n+1)$-th bit of the substring obtained in step 2.
\item One adds the extracted substring to the dictionary, at the first empty entry\footnote{Note that the entry number in the dictionary equals the order of appearance of the substring, in other words its position.}.
\item One repeats steps 2 to 4 until the entire source sequence is encoded.
\end{enumerate}

\paragraph{}
We remark:
\begin{itemize}
\bitem To treat the end of the sequence properly, a special character is appended at the end.
\bitem The decoding follows directly from the encoding: it suffices that the decoder update its dictionary in the same way as the encoder (a sketch follows the table below).
\end{itemize}

\paragraph{}
In our example with the sequence $\mathtt{1011010100010}$, the algorithm works as follows:
\begin{center}
\begin{tabular}{lll}
Substring & Output & Dictionary \\ \hline
 & & \verb|dico[0] = ""| \\
$\mathtt{1}$ & $(0, \mathtt{1})$ & \verb|dico[1] = "1"| \\
$\mathtt{0}$ & $(0, \mathtt{0})$ & \verb|dico[2] = "0"| \\
$\mathtt{11}$ & $(1, \mathtt{1})$ & \verb|dico[3] = "11"| \\
$\mathtt{01}$ & $(2, \mathtt{1})$ & \verb|dico[4] = "01"| \\
$\mathtt{010}$ & $(4, \mathtt{0})$ & \verb|dico[5] = "010"| \\
$\mathtt{00}$ & $(2, \mathtt{0})$ & \verb|dico[6] = "00"| \\
$\mathtt{10}$ & $(1, \mathtt{0})$ & \verb|dico[7] = "10"|
\end{tabular}
\end{center}
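\paragraph{}
As announced, the decoder simply rebuilds the dictionary in the same way as the encoder, so no side information is needed. A matching Python sketch (again illustrative; \verb|lz78_decode| is our own name) follows:
\begin{verbatim}
def lz78_decode(pairs):
    """Reconstruct the source string from (position, character) pairs."""
    dico = [""]                 # entry 0 is the empty substring
    output = []
    for position, char in pairs:
        substring = dico[position] + char
        dico.append(substring)  # same update rule as the encoder
        output.append(substring)
    return "".join(output)

pairs = [(0, '1'), (0, '0'), (1, '1'), (2, '1'),
         (4, '0'), (2, '0'), (1, '0')]
print(lz78_decode(pairs))       # prints 1011010100010
\end{verbatim}
Feeding it the pairs from the table above reconstructs the original sequence $\mathtt{1011010100010}$.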
\section{In theory...}

\paragraph{}
For detailed theoretical explanations, it helps to consult \cite{coverthomas}. We state here the final result, which establishes the asymptotic optimality of the Lempel-Ziv code. For a stationary and ergodic source $\{ X_i \}_{-\infty}^{+\infty}$ one can prove that
\begin{equation}
\limsup_{n \to \infty} \frac{1}{n} \, l(X_1, X_2, \dots, X_n) \le H(\mathcal{X}) \quad \mbox{with probability 1},
\end{equation}
where $l(X_1, X_2, \dots, X_n)$ is the length of the Lempel-Ziv codeword associated with $X_1 \dots X_n$ and $H(\mathcal{X})$ is the entropy of the source. Since no uniquely decodable code can use fewer bits per symbol than the entropy on average, this bound shows that the Lempel-Ziv code is asymptotically optimal.

\section{In practice...}

\paragraph{}
In practice, one often applies the LZW variant due to Welch \cite{lzw}, used for instance by the Unix utility \verb|compress| and by the GIF image format. For more details, one can consult \cite{nelson}.

\begin{thebibliography}{99}
\bibitem{lz77} J. Ziv and A. Lempel. A universal algorithm for sequential data compression. \emph{IEEE Transactions on Information Theory}, 23(3), 1977.
\bibitem{lz78} J. Ziv and A. Lempel. Compression of individual sequences via variable-rate coding. \emph{IEEE Transactions on Information Theory}, 24(5), 1978.
\bibitem{coverthomas} Thomas M. Cover and Joy A. Thomas. \emph{Elements of Information Theory}. Wiley, 1991.
\bibitem{lzw} Terry Welch. A technique for high-performance data compression. \emph{IEEE Computer}, 17(6), 1984.
\bibitem{nelson} Mark Nelson. \emph{The Data Compression Book}. M\&T Publishing, 1992.
\end{thebibliography}

\end{document}