\documentclass[a4paper,11pt]{article}
\usepackage{amsmath}
\usepackage{amstext}
\usepackage{amsthm}
\usepackage{mathtools}
\usepackage{braket}
\usepackage{graphicx,subcaption}
\usepackage[percent]{overpic}
\usepackage{bbm}
\usepackage{bm}
\usepackage{threeparttable}
\usepackage{tabularx}
\usepackage{multirow}
\usepackage{booktabs}
\usepackage[nohyperlinks]{acronym}
\usepackage{microtype}
\usepackage{pifont}
\usepackage{pdfpages,diagbox}
\usepackage[utf8]{inputenc}
\usepackage[labelformat=empty,labelsep=none]{caption}
\usepackage{enumerate}
\usepackage{xcolor}
\usepackage{hyperref}
\definecolor{ulbblue}{cmyk}{1,0.75,0.2,0.18}
\hypersetup{colorlinks,
  citecolor=ulbblue,
  linkcolor=ulbblue,
  urlcolor=ulbblue
}
\makeatletter
\renewcommand\@pnumwidth{1.8em}
\makeatother
\usepackage[T1]{fontenc}
\usepackage{textcomp}
\usepackage{fix-cm}
\usepackage[charter,cal=cmcal,sfscaled=false]{mathdesign}
\usepackage[paper=a4paper,innermargin=4cm,outermargin=3cm,top=3.5cm,bottom=3.5cm]{geometry}
\setlength{\headheight}{13.6pt}
\usepackage{setspace}
\usepackage{fancyhdr}

\theoremstyle{definition}
\newtheorem{exercise}{Exercise}

\begin{document}

\noindent
\fbox{\parbox{\textwidth}{%
INFO-H-451\hfill 2014--2015\\
\centering
\textsc{\Large Information and Coding Theory}\\
Exercise Sheet 4%
}}

\section*{}

\begin{exercise}~\textbf{Lempel-Ziv code.}
\begin{enumerate}[(a)]
\item Consider a source with alphabet $\{\mathtt{A}, \mathtt{B}, \mathtt{C}, \mathtt{\_}\}$. Encode the sequence $\mathtt{AA\_ABABBABC\_ABABC}$ with the Lempel-Ziv code. How many bits are necessary to transmit the encoded sequence? What happens at the last $\mathtt{ABC}$? What would happen if the sequence were $\mathtt{AA\_ABABBABC\_ACABC}$?
\item Consider a source with alphabet $\{\mathtt{A},\mathtt{B}\}$. Encode the sequence $\mathtt{ABAAAAAAAAAAAAAABB}$ with the Lempel-Ziv code. Give the number of bits necessary to transmit the encoded sequence and compare it with a naive encoding.
\item Consider a source with alphabet $\{\mathtt{A},\mathtt{B}\}$. Encode the sequence $\mathtt{ABAAAAAAAAAAAAAAAAAAAA}$ with the Lempel-Ziv code. Give the number of bits necessary to transmit the sequence and compare it with a naive encoding.
\item The sequence below was encoded with the Lempel-Ziv code. Reconstruct the original sequence.
\end{enumerate}
\begin{align*}
0~&,~\mathtt{A} \\
1~&,~\mathtt{B} \\
2~&,~\mathtt{C} \\
0~&,~\mathtt{\_} \\
2~&,~\mathtt{B} \\
0~&,~\mathtt{B} \\
6~&,~\mathtt{B} \\
7~&,~\mathtt{B} \\
0~&,~\mathtt{.}
\end{align*}
\end{exercise}

\begin{exercise}~Each object in a set of $n$ objects can be either faulty or intact. The random variable $X_i$ takes the value 1 if the $i$-th object is faulty and 0 otherwise. We assume that the variables $X_1, X_2, \dots, X_n$ are independent, with $\mathrm{Prob}\{X_i=1\}=p_i$ and $p_1>p_2>\cdots>p_n>1/2$. The problem is to determine the set of all faulty objects with an optimal method.
\begin{enumerate}[(a)]
\item How can one find the optimal sequence of yes/no questions that identifies all faulty objects?
\item What is the last question that has to be asked? Which two sets can be distinguished with this question?
\end{enumerate}
\end{exercise}

\begin{exercise}~Alice chooses an object from a set and Bob tries to identify the chosen object with a series of yes/no questions.
We assume that Bob uses an optimal code (i.e., one with minimum expected length) with respect to the probability distribution from which Alice chose the object. We observe that, on average, Bob needs to ask 35 questions to identify the object chosen by Alice. Find a lower bound on the number of objects in the set.
\end{exercise}

\begin{exercise}~There are $n$ coins, one of which may be counterfeit. A counterfeit coin has a weight different from that of an authentic coin. In order to find the counterfeit coin, weighings are performed.
\begin{enumerate}[(a)]
\item What probability distribution of the counterfeit coin maximizes the Shannon entropy? Treat the absence of a counterfeit coin as an extra element of the probability distribution.
\item Establish the inequalities that bound the expected number of weighings from below and from above when the optimal method is used.
\item Deduce the maximum number of coins $n$ for a given average number of weighings $k$.
\item Show that this maximum can be attained.
\end{enumerate}
\end{exercise}

\begin{exercise}~A random variable $X$ can take $m$ possible values and has entropy $H(X)$. Let $C$ be an instantaneous $5$-ary code for the source $X$, with expected length $L(C)=H(X)/\log_2 5=H_5(X)$.
\begin{enumerate}[(a)]
\item Show that each symbol of $X$ occurs with a probability of the form $5^{-n}$, where $n$ is a positive integer.
\item Show that $m$ must satisfy $m=4k+1$, where $k$ is a positive integer.
\end{enumerate}
\end{exercise}

\section*{}
\begin{center}
\textbf{\color{ulbblue}The exercises and solutions are available at \url{http://quic.ulb.ac.be/teaching}}
\end{center}

\newpage\noindent
\fbox{\parbox{\textwidth}{%
INFO-H-451\hfill 2014--2015\\
\centering
\textsc{\Large Information and Coding Theory}\\
Introduction to the Lempel-Ziv code%
}}

\section{Background}

Abraham Lempel and Jacob Ziv invented two data compression codes, commonly referred to as LZ77 and LZ78 after the years in which they appeared~\cite{lz77, lz78}. Here, ``Lempel-Ziv code'' refers exclusively to LZ78. The Lempel-Ziv code is a source coding technique with the following advantages:
\begin{itemize}
\item The compression is universal: the probability distribution of the source symbols does not need to be known in advance.
\item The code is asymptotically optimal, i.e., the number of bits per source symbol tends to the entropy of the source.
\item The code is easy to implement as an algorithm.
\end{itemize}

\section{Principle}

The source sequence (or string) is divided into successive substrings of minimal length that have not occurred before. For example, $\mathtt{1011010100010}$ is divided into $\mathtt{1},\mathtt{0},\mathtt{11},\mathtt{01},\mathtt{010},\mathtt{00},\mathtt{10}$. Each time a substring is extracted, enough memory is needed to store the previously extracted substrings, to ensure that no substring is repeated. For $\mathtt{010}$ in our example, $\mathtt{0}$ has already occurred, and so has $\mathtt{01}$; as a result, $\mathtt{010}$ is the shortest substring that has not occurred before. Note that each substring has a previously extracted substring as a prefix: if this were not the case, the extracted substring would not be the shortest.
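This splitting rule translates directly into code. As an illustration, here is a minimal Python sketch (the function name \verb|lz78_parse| is our own, not part of the original notes):

\begin{verbatim}
def lz78_parse(sequence):
    """Split a string into the successive shortest
    substrings that have not occurred before."""
    seen = {''}          # substrings extracted so far
    phrases = []
    current = ''
    for symbol in sequence:
        current += symbol
        if current not in seen:   # shortest new substring found
            seen.add(current)
            phrases.append(current)
            current = ''
    return phrases       # a trailing substring that already occurred
                         # is dropped here; the end-of-sequence
                         # character introduced below avoids this case
\end{verbatim}

For the running example, \verb|lz78_parse("1011010100010")| returns \verb|['1', '0', '11', '01', '010', '00', '10']|, as above.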
Once the sequence is split, the Lempel-Ziv code transforms each substring into a pair comprising:
\begin{enumerate}
\item the position of the longest previously extracted substring that is a prefix of the current substring;
\item the suffix that makes the current substring unique.
\end{enumerate}
For instance, the substrings $\mathtt{1},\mathtt{0},\mathtt{11},\mathtt{01},\mathtt{010},\mathtt{00},\mathtt{10}$ are encoded as $(0,\mathtt{1})$, $(0,\mathtt{0})$, $(1,\mathtt{1})$, $(2,\mathtt{1})$, $(4,\mathtt{0})$, $(2,\mathtt{0})$, $(1,\mathtt{0})$. By convention, the substrings are numbered from left to right, starting from 1; the substring at position 0 is the empty string. In other words, the Lempel-Ziv code transforms a sequence of elements of an alphabet $\mathcal{A}$ into a new sequence of elements of $\mathbf{N} \times \mathcal{A}$. In the example above, the alphabet is binary.

As noted above, the position also needs to be encoded (here in binary, but in general it can be encoded in an arbitrary alphabet). The longer the source sequence, the more bits are needed to encode the position. Fortunately, it can be verified that, asymptotically, the number of bits per symbol tends to the entropy of the source. Note also that the fundamental disadvantage of the Lempel-Ziv code is the potentially large memory needed to store all the extracted substrings; asymptotically, this memory is unbounded.

\section{Implementation}

Encoding with the Lempel-Ziv code requires a dictionary that contains all past substrings. The algorithm below is then used:
\begin{enumerate}
\item The dictionary is initialized by putting the empty string at position 0.
\item Extract a substring of $n+1$ bits\footnote{Bits are used for simplicity; the method generalizes easily to any other alphabet.}. This substring must be the shortest one that is not yet present in the dictionary, with its first $n$ bits corresponding to an existing entry of the dictionary.
\item Generate a code consisting of the position of that entry in the dictionary followed by the $(n+1)$-th bit of the substring obtained in the previous step.
\item Add the extracted substring to the dictionary, at the first empty entry\footnote{Note that the index of an entry in the dictionary equals the order of appearance of the substring, in other words its position.}.
\item Repeat steps 2 to 4 until the entire source sequence is encoded.
\end{enumerate}
Comments:
\begin{itemize}
\item To treat the end of the sequence properly, a special character is appended at the end.
\item The decoding can be deduced directly from the encoding: it suffices for the decoder to update its dictionary in the same way as the encoder.
\end{itemize}
In our example with the sequence $\mathtt{1011010100010}$, the algorithm works as follows:
\begin{center}
\begin{tabular}{lll}
Substring & Output & Dictionary \\
\hline
 & & \verb|dico[0] = ""| \\
$\mathtt{1}$ & $(0, \mathtt{1})$ & \verb|dico[1] = "1"| \\
$\mathtt{0}$ & $(0, \mathtt{0})$ & \verb|dico[2] = "0"| \\
$\mathtt{11}$ & $(1, \mathtt{1})$ & \verb|dico[3] = "11"| \\
$\mathtt{01}$ & $(2, \mathtt{1})$ & \verb|dico[4] = "01"| \\
$\mathtt{010}$ & $(4, \mathtt{0})$ & \verb|dico[5] = "010"| \\
$\mathtt{00}$ & $(2, \mathtt{0})$ & \verb|dico[6] = "00"| \\
$\mathtt{10}$ & $(1, \mathtt{0})$ & \verb|dico[7] = "10"|
\end{tabular}
\end{center}
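To make the whole procedure concrete, here is a minimal Python sketch of the encoder and the corresponding decoder (again, the names \verb|lz78_encode| and \verb|lz78_decode| are ours). For clarity, the dictionary is kept as a plain list of strings that mirrors the table above, rather than as an efficient data structure:

\begin{verbatim}
def lz78_encode(sequence, end='.'):
    """Encode a string as a list of (position, symbol) pairs.
    `end` must be a character absent from `sequence`, so that
    the last extracted substring is guaranteed to be new."""
    dico = ['']                          # dico[0] is the empty string
    pairs = []
    current = ''
    for symbol in sequence + end:
        if current + symbol in dico:     # known substring: extend it
            current += symbol
        else:                            # shortest new substring found
            pairs.append((dico.index(current), symbol))
            dico.append(current + symbol)
            current = ''
    return pairs

def lz78_decode(pairs):
    """Invert the code; the decoder rebuilds the same dictionary."""
    dico = ['']
    out = []
    for position, symbol in pairs:
        entry = dico[position] + symbol  # prefix entry + new suffix
        dico.append(entry)
        out.append(entry)
    return ''.join(out)
\end{verbatim}

On the running example, \verb|lz78_encode("1011010100010")| returns the pairs of the table above followed by the end-of-sequence pair \verb|(0, '.')|, and \verb|lz78_decode| recovers the source string (with the end character still attached). The linear searches \verb|in dico| and \verb|dico.index| make this sketch quadratic; a practical implementation would store the dictionary in a trie or a hash table.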
\section{In theory...}

For a detailed theoretical explanation, it is helpful to read Section 13.4 of~\cite{coverthomas}. We state here only the final result, which establishes the asymptotic optimality of the Lempel-Ziv code. For a stationary and ergodic source $\{X_i\}_{i=-\infty}^{+\infty}$, it can be proven that
\begin{equation}
\limsup_{n \to \infty} \frac{1}{n}\, l(X_1, X_2, \dots, X_n) = H(\mathcal{X}) \quad \text{with probability 1},
\end{equation}
where $l(X_1, X_2, \dots, X_n)$ is the length of the Lempel-Ziv codeword associated with $X_1 \dots X_n$ and $H(\mathcal{X})$ is the entropy rate of the source.

\section{In practice...}

In practice, the most widely used variant is the Lempel-Ziv-Welch (LZW) code, a refinement of LZ78 due to Terry Welch~\cite{lzw}. Examples of its use include the Unix command \verb|compress| and the GIF image format. See \cite{nelson} for much more detail.

\begin{thebibliography}{99}
\bibitem{lz77} J. Ziv and A. Lempel. A universal algorithm for sequential data compression. \emph{IEEE Transactions on Information Theory}, 23(3), 1977.
\bibitem{lz78} J. Ziv and A. Lempel. Compression of individual sequences via variable-rate coding. \emph{IEEE Transactions on Information Theory}, 24(5), 1978.
\bibitem{coverthomas} Thomas M. Cover and Joy A. Thomas. \emph{Elements of Information Theory}, 2nd ed. Wiley, 2006.
\bibitem{lzw} Terry Welch. A technique for high-performance data compression. \emph{IEEE Computer}, 17(6), 1984.
\bibitem{nelson} Mark Nelson. \emph{The Data Compression Book}. M\&T Publishing, 1992.
\end{thebibliography}

\end{document}