EE5390 Course Information


Welcome to EE5390 (Source Coding), 2022 edition. This is a follow-up course to EE2340 (Information Sciences) and EE5847 (Information Theory), and covers the basic concepts of data compression. Classes will be online, primarily on Google Meet/Classroom.

Google Classroom links will be shared with registered students. If you have not received these links by the first class, please send me an email.

1 Prerequisites

  • Comfort with probability and random processes, and mathematical arguments
  • Ability to program in python/C/C++
  • EE2340 (Information Sciences) or EE5847 (Information theory)

2 Assessment (tentative):

Each student will be expected to

  • attend classes regularly and participate in discussions
  • solve in-class quizzes
  • solve homework assignments, roughly one per week (collaboration is fine as long as it is explicitly declared, but you must write the solutions in your own words)
  • present a paper/demo on data compression and submit a report. Details will be announced later.
Attendance and class participation  10%
Quizzes                             10%
Homeworks                           50%
Paper/demo/final exam               30%

NOTE: Requests for makeup quizzes will not be entertained unless you have obtained prior permission.

3 Instructor:

Name Dr. Shashank Vatedka
Office 214/D, Block-C
Email shashankvatedka@ee.iith.ac.in

4 Class timings:

  • Slot E: Tuesdays 10:00-11:00, Thursdays 12:00-13:00 and Fridays 9:00-10:00
  • Class venue: Google classroom
  • Quiz schedule: in regular classes

5 Textbook and references:

Primarily, the material will be drawn from

  • Elements of Information Theory (EIT), Thomas M. Cover and Joy A. Thomas, second edition, Wiley (Amazon link). A limited number of copies are available in the university library. While the first edition has all the material we need for the course, the second edition has a much-expanded problem set.
  • Lecture notes/slides, as and when needed.

Other references:

  • "A mathematical theory of communication", Claude Shannon, Bell System Technical Journal, 1948. The classic paper that started the field.
  • Information Theory, Inference, and Learning Algorithms, David MacKay (soft copy available for free on the author's website). A very nice book, but it gives a very different flavour than what will be covered in the course. Suggested for additional reading.
  • A Student's Guide to Coding and Information Theory, Stefan Moser (Amazon link).
  • Information Theory: Coding Theorems for Discrete Memoryless Systems, Imre Csiszár and János Körner. An advanced textbook; recommended reading after completing this course as well as EE6317.

6 Tentative syllabus:

Recap of basic concepts in information theory; Source coding theorem; Fixed versus variable length compression; Typical sets and the asymptotic equipartition property; Proof of the source coding theorem; Variable length compression: nonsingular, uniquely decodable and prefix-free codes; Kraft inequality; Optimal codes and bounds on expected length; Huffman codes and optimality; Shannon-Fano-Elias codes and optimality; Universal codes; Arithmetic codes; Lempel-Ziv codes and optimality; Burrows Wheeler Transform (if time permits).
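
To give a flavour of two of these topics, here is a small Python sketch (illustrative only, not part of the course material): it builds a Huffman code for a toy source using `heapq`, then checks the Kraft inequality and the bound H(X) ≤ L < H(X) + 1 on the expected codeword length.

```python
import heapq
import math

def huffman(probs):
    """Binary Huffman code for {symbol: probability}; returns {symbol: codeword}."""
    # heap entries: (probability, tiebreak counter, {symbol: partial codeword})
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)   # merge the two least probable subtrees
        p2, _, c2 = heapq.heappop(heap)
        merged = {sym: "0" + w for sym, w in c1.items()}
        merged.update({sym: "1" + w for sym, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, count, merged))
        count += 1
    return heap[0][2]

probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
code = huffman(probs)
kraft = sum(2 ** -len(w) for w in code.values())         # Kraft sum, must be <= 1
H = -sum(p * math.log2(p) for p in probs.values())       # source entropy
L = sum(p * len(code[sym]) for sym, p in probs.items())  # expected codeword length
print(code, kraft, H, L)
```

For the dyadic distribution above, Huffman is exact: the Kraft sum is 1 and the expected length equals the entropy (1.75 bits).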

7 Tentative schedule

Introduction (Class 1): class notes
  - The data compression problem
  - Lossy/lossless and fixed/variable length compression
  - Recap/preliminaries
Fixed-length compression of memoryless sources (Classes 2-3): EIT Chapter 3
  - Discrete memoryless sources
  - Fixed and variable length codes
  - Typical sets and their properties
  - Source coding theorem for memoryless sources
Variable-length compression (Classes 4-5): EIT Chapter 5
  - Nonsingular, uniquely decodable and prefix-free codes (class notes)
  - Source coding theorem for variable-length codes
  - Bounds on optimal code length
  - Kraft inequality, tree codes
Huffman and Shannon-Fano-Elias codes (Classes 6-8): EIT Chapter 5
  - Huffman codes, construction
  - Optimality
  - Shannon-Fano-Elias codes
Universal codes (Class 9)
  - Introduction (EIT Section 13.3)
  - Arithmetic codes (slides)
  - Properties
The Burrows-Wheeler Transform (Class 10)
The Lempel-Ziv algorithms (Classes 11-12): EIT Section 13.4
  - LZ77
  - LZ78
  - Properties, optimality (EIT Section 13.5.1)
  - Trie interpretation
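
The LZ78 parsing in the last rows above can be sketched in a few lines of Python (a toy illustration, not a reference implementation): each phrase is the longest previously seen phrase extended by one fresh symbol.

```python
def lz78_parse(s):
    """LZ78 parsing: each phrase = (index of longest previous phrase, new symbol)."""
    dictionary = {"": 0}   # phrase -> index; index 0 is the empty phrase
    phrases = []
    w = ""
    for ch in s:
        if w + ch in dictionary:
            w += ch                                # keep extending the match
        else:
            phrases.append((dictionary[w], ch))    # emit (index, innovation symbol)
            dictionary[w + ch] = len(dictionary)
            w = ""
    if w:  # leftover at end of input: w is already a known phrase
        phrases.append((dictionary[w[:-1]], w[-1]))
    return phrases

def lz78_decode(phrases):
    """Invert lz78_parse by rebuilding the phrase dictionary."""
    dictionary = [""]
    out = []
    for idx, ch in phrases:
        dictionary.append(dictionary[idx] + ch)
        out.append(dictionary[-1])
    return "".join(out)

print(lz78_parse("ABBABBABBBAABABA"))
```

The number of phrases divided by the source length governs the compression rate, which is where the optimality analysis in EIT Section 13.5.1 begins.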

8 Lectures

All recorded lectures are available via this playlist.

9 Suggested list of papers/projects

You may work individually or in a group of two. Each group must give a short presentation (around 15 minutes, including questions) and submit a report. You may also choose a project, in which case you are expected to give a working demo (code).

The list of topics below is only indicative of what is expected. You may have to read more background material, and you are also free to choose a related topic not listed below.

Good sources for the latest work on data compression are the annual International Symposium on Information Theory (ISIT, which covers information theory more broadly), the Data Compression Conference (DCC) and the IEEE Transactions on Information Theory. The arXiv preprint server (https://arxiv.org/list/cs.IT/recent) also contains unpublished or yet-to-be-published work, and sometimes more detailed versions of published papers.

9.1 Compression of large collections of genomes

Genomes of two individuals of the same species are highly related: one DNA sequence can often be obtained from another by a small number of edit operations (insertions, deletions, transpositions), so they are perhaps not well modeled by standard distributions. Traditional compression algorithms do not do a very good job of compressing large collections of DNA sequences.

  • Deorowicz et al, "Genome compression: a novel approach for large collections," Bioinformatics
  • Deorowicz et al, "GDC2: Compression of large collections of genomes," Scientific reports
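
The underlying point can be seen with a quick experiment (a sketch using Python's zlib, not one of the tools above): for two nearly identical sequences that fit inside zlib's 32 KB window, joint compression encodes the second copy as a long back-reference, whereas genome-scale collections far exceed that window, which is why specialized tools are needed.

```python
import random
import zlib

random.seed(0)
# a random "genome" and a copy with a handful of point mutations
genome = "".join(random.choice("ACGT") for _ in range(10000))
mutated = list(genome)
for _ in range(20):
    pos = random.randrange(len(mutated))
    mutated[pos] = random.choice("ACGT")
mutated = "".join(mutated)

separate = len(zlib.compress(genome.encode())) + len(zlib.compress(mutated.encode()))
joint = len(zlib.compress((genome + mutated).encode()))
print(separate, joint)   # joint is far smaller: the repeat fits in zlib's window
```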

9.2 Compression of FASTQ files

Determining the DNA sequence of an individual is a very interesting process in itself. The most popular sequencing technique (for estimating the DNA sequence) is called shotgun sequencing. It yields short subsequences (reads) of the actual DNA sequence, with errors. These reads, together with quality scores (each a measure of how reliable a read is), are stored in a format called FASTQ. Compressing such files is of great interest.

  • Deorowicz and Grabowski, "Compression of DNA sequence reads in FASTQ format," Bioinformatics
  • Ochoa et al, "QualComp: a new lossy compressor for quality scores based on rate distortion theory," BMC Bioinformatics
  • Chandak et al, "SPRING: a next-generation compressor for FASTQ data"
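
A FASTQ record occupies four lines: an identifier, the read, a '+' separator, and per-base quality scores. A common first step in FASTQ compressors is to split the file into these streams and model each with its own statistics; here is a minimal sketch (the `split_streams` helper is a hypothetical name, not from the papers above).

```python
# a toy FASTQ record: identifier, read, '+' separator, per-base quality scores
record = "@read1\nACGTACGTAC\n+\nIIIIIHHHGG\n"

def split_streams(fastq_text):
    """Split FASTQ text into identifier, sequence and quality streams.
    FASTQ compressors typically compress each stream separately."""
    ids, seqs, quals = [], [], []
    lines = fastq_text.strip().split("\n")
    for i in range(0, len(lines), 4):
        ids.append(lines[i])
        seqs.append(lines[i + 1])
        quals.append(lines[i + 3])
    return ids, seqs, quals

print(split_streams(record))
```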

9.3 Reconstructing strings from random traces

How easy is it to reconstruct an unknown string if we are only given short substrings of it (without being told which locations they are taken from)? This is called the trace reconstruction problem, a fascinating topic closely related to the problem of reconstructing DNA from shotgun reads.

  • Batu et al, "Reconstructing a string from random traces," SODA 2004
  • Acharya et al, "String reconstruction from its substring compositions," https://arxiv.org/pdf/1403.2439.pdf
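
As a warm-up, here is a sketch of a much easier variant than the problems in these papers: reassembling a string from its shuffled length-k substrings by chaining unique (k-1)-symbol overlaps. It assumes all (k-1)-grams of the source are distinct, an assumption the trace reconstruction literature does not get to make.

```python
import random

def kmers(s, k):
    """All length-k substrings of s, returned in scrambled order."""
    frags = [s[i:i + k] for i in range(len(s) - k + 1)]
    random.shuffle(frags)
    return frags

def reconstruct(frags):
    """Chain fragments via (k-1)-symbol overlaps.
    Assumes all (k-1)-grams of the source string are distinct."""
    k = len(frags[0])
    suffixes = {f[1:] for f in frags}
    # the leftmost fragment is the only one whose prefix is nobody's suffix
    s = next(f for f in frags if f[:-1] not in suffixes)
    remaining = [f for f in frags if f != s]
    while remaining:  # exactly one fragment extends the current suffix
        nxt = next(f for f in remaining if f[:-1] == s[-(k - 1):])
        s += nxt[-1]
        remaining.remove(nxt)
    return s

random.seed(0)
original = "THEQUICKBROWNFXJMPSVLAZYDG"
print(reconstruct(kmers(original, 4)) == original)
```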

9.4 Compression with random insertions and deletions

This is also closely related to the above problems, but also occurs in cloud synchronization. If we have two files, one of which is an updated version of the other (and hence highly correlated), then how do we compress this jointly? In particular, if the second file is obtained by randomly inserting/deleting text from the first, then what is the minimum rate of compression of the second file given that both the compressor and decompressor have access to the first?

9.5 Zstandard compression algorithm

Zstandard is optimized for compression and decompression speed; see https://facebook.github.io/zstd for details. The exact parameter values are not important to study here. Focus on the basic concepts and on what the building blocks of the algorithm are.

9.6 Genomic compression standards

MPEG-G is a new set of standards for the storage and compression of sequencing data.

9.7 Graph compression

What if the source is not a sequence but has a different structure, such as a graph? This arises, for instance, in compressing the details of users of a social network, where each user is a vertex and two users are connected if they are "friends."

9.8 Universal compression of memoryless sources over unknown alphabets

In most circumstances we can handle a source whose distribution is unknown. But what happens when the alphabet itself is unknown, and could be arbitrarily large? In this scenario, it might not be possible to obtain a reliable estimate of the input distribution.

9.9 Processing over compressed data

Traditional compression algorithms are designed only for storage. Suppose we have a large database that we compress with gzip, zip, etc. to save space. To query this database, we must first decompress it; only then can we run the query. Can we design a compression algorithm such that queries can be made directly on the compressed file?
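
As a toy instance of this idea, run-length encoding supports random access directly on the compressed representation: precompute prefix sums of the run lengths once, then answer "what is the i-th symbol?" with a binary search, without ever expanding the string. A minimal sketch (illustrative only):

```python
import bisect
from itertools import groupby

def rle(s):
    """Run-length encode: 'aaab' -> [('a', 3), ('b', 1)]."""
    return [(ch, len(list(g))) for ch, g in groupby(s)]

def access(runs, ends, i):
    """Return the i-th symbol using only the compressed representation."""
    return runs[bisect.bisect_right(ends, i)][0]

s = "aaabccccdddddddb"
runs = rle(s)
ends, total = [], 0
for _, n in runs:          # prefix sums of run lengths, computed once
    total += n
    ends.append(total)
print(access(runs, ends, 0), access(runs, ends, 5))   # prints: a c
```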

9.10 Other topics

10 General comments regarding presentations/reports

You are expected to prepare a short (15-minute) recorded presentation summarizing the paper you have chosen. You are required to present slides (or equivalent) conveying the problem statement, the important technical results, the key ideas used to obtain those results, and your opinions on the paper (is it interesting? what are the limitations? what could be improved? …).

In addition to learning cool stuff in information theory, you should also learn to present your work. The final presentation is an opportunity to develop your presentation skills: you must be able to convey your ideas and thoughts clearly to another person. Here are some general comments on making good presentations:

  • Make sure that you acknowledge all the references and material that you used (papers, code, etc.). While this does not matter much for the purposes of this class, it is very important to keep in mind when you are delivering presentations at bigger venues.
  • In a presentation, always introduce yourself and your team members/collaborators in the very beginning.
  • If much of the presentation is based on one or two papers/references, explicitly state which references you are using on the first or second slide.
  • Avoid saying "We did this/we showed that…" when you are in fact talking about some other authors' work. Unless you have derived something/programmed something yourself, always say "they/the authors did/showed/observed that…"
  • A presentation/report is like telling a story: even before preparing the slides/document, imagine that you want to explain the work to a friend in 15-20 mins, identify exactly what you want to convey, and only then proceed.
  • Make sure that you explain the problem clearly.
  • When reading a paper, you should first understand what the problem is, what makes it challenging, and why it is worthwhile solving the problem. Then, you should identify what is new about the paper, and what the contributions are. To understand the rest of the paper, you should then try to understand the constructions/proof techniques/ideas. Your presentation should also bring out these aspects. See this article and this one on how to read a paper.
  • There is no single best rule for presentations, but generally in any presentation, the viewer/audience's attention is maximum in the first 5-10 minutes and then gradually decreases. So it is helpful to get to the punchline in the first few minutes.
  • Citations should be treated differently in papers and slides. When you refer to a particular paper/book/website in your slides (e.g., when you say that result A can be found in paper B), mention the paper/book/website as a footnote rather than writing [1] and putting the reference on the last slide.
  • Do not put too much text on your slides; each slide should contain at most 3-5 points. Absolutely avoid long sentences.
  • Do not copy text directly from the paper.
  • Do not read out your slides word for word.
  • Similarly, do not put too many lemmas on a slide; each slide should convey a single idea. Unless it is very important, avoid long derivations and heavy math in a short presentation. It is more important to convey ideas and intuition, while still retaining significant technical content; explaining via examples is therefore very useful.
  • Make sure that the fonts/diagrams/labels/plots are clear. Label all the axes on your plots.
  • Speak clearly, and make sure that the audio quality is fine. Check your mic volume. Do a couple of dry runs.
  • If you are using Google meet to record, then make sure that you "pin" the slides since meet might automatically switch displaying you/your icon instead of the presentation. 
  • Make sure that you check and double-check the slides and report for typos and for technical and grammatical errors.
  • If your group has more than one person, then the presentation must be shared equally.

Author: Shashank Vatedka

Created: 2022-03-10 Thu 15:10