EE5390 Course Information


Welcome to EE5390 (Source Coding), 2022 edition. This is a follow-up course to EE2340 (Information Sciences) and EE5847 (Information Theory), and covers the basic concepts of data compression. Classes will be online, primarily on Google Meet/Classroom.

Google Classroom links will be shared with registered students. If you have not received these links by the first class, please send me an email.

1 Prerequisites

  • Comfort with probability and random processes, and mathematical arguments
  • Ability to program in python/C/C++
  • EE2340 (Information Sciences) or EE5847 (Information theory)

2 Assessment (tentative):

Each student will be expected to

  • attend classes regularly and participate in discussions
  • solve in-class quizzes
  • solve homework assignments, roughly one per week (collaboration is fine as long as it is explicitly declared, but you must write the solutions in your own words)
  • present a paper/demo on data compression and submit a report. Details will be announced later.
Attendance and class participation  10%
Quizzes                             10%
Homeworks                           50%
Paper/demo/final exam               30%

NOTE: Requests for makeup quizzes will not be entertained unless you have obtained prior permission.

3 Instructor:

Name Dr. Shashank Vatedka
Office 214/D, Block-C
Email shashankvatedka@ee.iith.ac.in

4 Class timings:

  • Slot E: Tuesdays 10:00-11:00, Thursdays 12:00-13:00 and Fridays 9:00-10:00
  • Class venue: Google classroom
  • Quiz schedule: in regular classes

5 Textbook and references:

Primarily, the material will be drawn from

  • Elements of Information Theory (EIT), Thomas M. Cover and Joy A. Thomas, second edition, Wiley (Amazon link). A limited number of copies are available in the university library. While the first edition has all the material we need for the course, the second edition has a much-expanded problem set.
  • Lecture notes/slides, as and when needed.

Other references:

  • "A mathematical theory of communication", Claude Shannon, Bell System Technical Journal, 1948. The classic paper that started the field.
  • Information Theory, Inference, and Learning Algorithms, David MacKay (soft copy available for free on the author's website). A very nice book, but it gives a very different flavour than what will be covered in the course. Suggested for additional reading.
  • A Student's Guide to Coding and Information Theory, Stefan Moser (Amazon link).
  • Information Theory: Coding Theorems for Discrete Memoryless Systems, Imre Csiszár and János Körner. An advanced textbook; recommended reading after completing this course as well as EE6317.

6 Tentative syllabus:

Recap of basic concepts in information theory; Source coding theorem; Fixed versus variable length compression; Typical sets and the asymptotic equipartition property; Proof of the source coding theorem; Variable length compression: nonsingular, uniquely decodable and prefix-free codes; Kraft inequality; Optimal codes and bounds on expected length; Huffman codes and optimality; Shannon-Fano-Elias codes and optimality; Universal codes; Arithmetic codes; Lempel-Ziv codes and optimality; Burrows Wheeler Transform (if time permits).
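
To give a flavour of two of these topics, here is a small Python sketch (illustrative only, not part of the course material): it builds a Huffman code for a toy source using `heapq`, then checks the Kraft inequality and the bound H(X) ≤ L < H(X) + 1 on the expected codeword length.

```python
import heapq
import math

def huffman(probs):
    """Binary Huffman code for {symbol: probability}; returns {symbol: codeword}."""
    # heap entries: (probability, tiebreak counter, {symbol: partial codeword})
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)   # merge the two least probable subtrees
        p2, _, c2 = heapq.heappop(heap)
        merged = {sym: "0" + w for sym, w in c1.items()}
        merged.update({sym: "1" + w for sym, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, count, merged))
        count += 1
    return heap[0][2]

probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
code = huffman(probs)
kraft = sum(2 ** -len(w) for w in code.values())         # Kraft sum, must be <= 1
H = -sum(p * math.log2(p) for p in probs.values())       # source entropy
L = sum(p * len(code[sym]) for sym, p in probs.items())  # expected codeword length
print(code, kraft, H, L)
```

For the dyadic distribution above, Huffman is exact: the Kraft sum is 1 and the expected length equals the entropy (1.75 bits).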

7 Tentative schedule

Introduction (Class 1): class notes
  - The data compression problem
  - Lossy/lossless and fixed/variable length compression
  - Recap/preliminaries
Fixed-length compression of memoryless sources (Classes 2-3): EIT Chapter 3
  - Discrete memoryless sources
  - Fixed and variable length codes
  - Typical sets and their properties
  - Source coding theorem for memoryless sources
Variable-length compression (Classes 4-5): EIT Chapter 5
  - Nonsingular, uniquely decodable and prefix-free codes (class notes)
  - Source coding theorem for variable-length codes
  - Bounds on optimal code length
  - Kraft inequality, tree codes
Huffman and Shannon-Fano-Elias codes (Classes 6-8): EIT Chapter 5
  - Huffman codes, construction
  - Optimality
  - Shannon-Fano-Elias codes
Universal codes (Class 9)
  - Introduction (EIT Section 13.3)
  - Arithmetic codes (slides)
  - Properties
The Burrows-Wheeler Transform (Class 10)
The Lempel-Ziv algorithms (Classes 11-12): EIT Section 13.4
  - LZ77
  - LZ78
  - Properties, optimality (EIT Section 13.5.1)
  - Trie interpretation
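
The LZ78 parsing in the last rows above can be sketched in a few lines of Python (a toy illustration, not a reference implementation): each phrase is the longest previously seen phrase extended by one fresh symbol.

```python
def lz78_parse(s):
    """LZ78 parsing: each phrase = (index of longest previous phrase, new symbol)."""
    dictionary = {"": 0}   # phrase -> index; index 0 is the empty phrase
    phrases = []
    w = ""
    for ch in s:
        if w + ch in dictionary:
            w += ch                                # keep extending the match
        else:
            phrases.append((dictionary[w], ch))    # emit (index, innovation symbol)
            dictionary[w + ch] = len(dictionary)
            w = ""
    if w:  # leftover at end of input: w is already a known phrase
        phrases.append((dictionary[w[:-1]], w[-1]))
    return phrases

def lz78_decode(phrases):
    """Invert lz78_parse by rebuilding the phrase dictionary."""
    dictionary = [""]
    out = []
    for idx, ch in phrases:
        dictionary.append(dictionary[idx] + ch)
        out.append(dictionary[-1])
    return "".join(out)

print(lz78_parse("ABBABBABBBAABABA"))
```

The number of phrases divided by the source length governs the compression rate, which is where the optimality analysis in EIT Section 13.5.1 begins.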

8 Lectures

All recorded lectures are available via this playlist.

9 Suggested list of papers/projects

You may work individually or in a group of two. Each group must give a short presentation (around 15 minutes, including questions) and submit a report. You may also choose a project, in which case you are expected to give a working demo (code).

The list of topics below is only indicative of what is expected. You may have to read more background material, and you are also free to choose a related topic not listed below.

Good sources for the latest work on data compression are the annual International Symposium on Information Theory (ISIT, which covers information theory more broadly), the Data Compression Conference (DCC) and the IEEE Transactions on Information Theory. The arXiv preprint server (https://arxiv.org/list/cs.IT/recent) also contains unpublished or yet-to-be-published work, and sometimes more detailed versions of published papers.

9.1 Compression of large collections of genomes

Genomes of two individuals of the same species are highly related: one DNA sequence can often be obtained from another by a small number of edit operations (insertions, deletions, transpositions), so they are perhaps not well modeled by standard distributions. Traditional compression algorithms do not do a very good job of compressing large collections of DNA sequences.

  • Deorowicz et al, "Genome compression: a novel approach for large collections," Bioinformatics
  • Deorowicz et al, "GDC2: Compression of large collections of genomes," Scientific reports
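
The underlying point can be seen with a quick experiment (a sketch using Python's zlib, not one of the tools above): for two nearly identical sequences that fit inside zlib's 32 KB window, joint compression encodes the second copy as a long back-reference, whereas genome-scale collections far exceed that window, which is why specialized tools are needed.

```python
import random
import zlib

random.seed(0)
# a random "genome" and a copy with a handful of point mutations
genome = "".join(random.choice("ACGT") for _ in range(10000))
mutated = list(genome)
for _ in range(20):
    pos = random.randrange(len(mutated))
    mutated[pos] = random.choice("ACGT")
mutated = "".join(mutated)

separate = len(zlib.compress(genome.encode())) + len(zlib.compress(mutated.encode()))
joint = len(zlib.compress((genome + mutated).encode()))
print(separate, joint)   # joint is far smaller: the repeat fits in zlib's window
```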

9.2 Compression of FASTQ files

Determining the DNA sequence of an individual is a very interesting process in itself. The most popular sequencing technique (for estimating the DNA sequence) is called shotgun sequencing. It yields short subsequences (reads) of the actual DNA sequence, with errors. These reads, together with quality scores (each a measure of how reliable a read is), are stored in a format called FASTQ. Compressing such files is of great interest.

  • Deorowicz and Grabowski, "Compression of DNA sequence reads in FASTQ format," Bioinformatics
  • Ochoa et al, "QualComp: a new lossy compressor for quality scores based on rate distortion theory," BMC Bioinformatics
  • Chandak et al, "SPRING: a next-generation compressor for FASTQ data"
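
A FASTQ record occupies four lines: an identifier, the read, a '+' separator, and per-base quality scores. A common first step in FASTQ compressors is to split the file into these streams and model each with its own statistics; here is a minimal sketch (the `split_streams` helper is a hypothetical name, not from the papers above).

```python
# a toy FASTQ record: identifier, read, '+' separator, per-base quality scores
record = "@read1\nACGTACGTAC\n+\nIIIIIHHHGG\n"

def split_streams(fastq_text):
    """Split FASTQ text into identifier, sequence and quality streams.
    FASTQ compressors typically compress each stream separately."""
    ids, seqs, quals = [], [], []
    lines = fastq_text.strip().split("\n")
    for i in range(0, len(lines), 4):
        ids.append(lines[i])
        seqs.append(lines[i + 1])
        quals.append(lines[i + 3])
    return ids, seqs, quals

print(split_streams(record))
```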

9.3 Reconstructing strings from random traces

How easy is it to reconstruct an unknown string if we are only given short substrings of it (without being told which locations they are taken from)? This is called the trace reconstruction problem, a fascinating topic closely related to the problem of reconstructing DNA from shotgun reads.

  • Batu et al, "Reconstructing a string from random traces," SODA 2004
  • Acharya et al, "String reconstruction from its substring compositions," https://arxiv.org/pdf/1403.2439.pdf
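
As a warm-up, here is a sketch of a much easier variant than the problems in these papers: reassembling a string from its shuffled length-k substrings by chaining unique (k-1)-symbol overlaps. It assumes all (k-1)-grams of the source are distinct, an assumption the trace reconstruction literature does not get to make.

```python
import random

def kmers(s, k):
    """All length-k substrings of s, returned in scrambled order."""
    frags = [s[i:i + k] for i in range(len(s) - k + 1)]
    random.shuffle(frags)
    return frags

def reconstruct(frags):
    """Chain fragments via (k-1)-symbol overlaps.
    Assumes all (k-1)-grams of the source string are distinct."""
    k = len(frags[0])
    suffixes = {f[1:] for f in frags}
    # the leftmost fragment is the only one whose prefix is nobody's suffix
    s = next(f for f in frags if f[:-1] not in suffixes)
    remaining = [f for f in frags if f != s]
    while remaining:  # exactly one fragment extends the current suffix
        nxt = next(f for f in remaining if f[:-1] == s[-(k - 1):])
        s += nxt[-1]
        remaining.remove(nxt)
    return s

random.seed(0)
original = "THEQUICKBROWNFXJMPSVLAZYDG"
print(reconstruct(kmers(original, 4)) == original)
```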

9.4 Compression with random insertions and deletions

This is also closely related to the above problems, but also occurs in cloud synchronization. If we have two files, one of which is an updated version of the other (and hence highly correlated), then how do we compress this jointly? In particular, if the second file is obtained by randomly inserting/deleting text from the first, then what is the minimum rate of compression of the second file given that both the compressor and decompressor have access to the first?

9.5 Zstandard compression algorithm

Zstandard is optimized for compression and decompression speed; see https://facebook.github.io/zstd for details. The exact parameter values are not important to study here. Focus on the basic concepts and on what the building blocks of the algorithm are.

9.6 Genomic compression standards

MPEG-G is a new set of standards for the storage and compression of sequencing data.

9.7 Graph compression

What if the source is not a sequence but has a different structure, such as a graph? This arises, for instance, in compressing the details of users of a social network, where each user is a vertex and two users are connected if they are "friends."

9.8 Universal compression of memoryless sources over unknown alphabets

In most circumstances we can handle a source whose distribution is unknown. But what happens when the alphabet itself is unknown, and could be arbitrarily large? In this scenario, it might not be possible to obtain a reliable estimate of the input distribution.

9.9 Processing over compressed data

Traditional compression algorithms are designed only for storage. Suppose we have a large database that we compress with gzip, zip, etc. to save space. To query this database, we must first decompress it; only then can we run the query. Can we design a compression algorithm such that queries can be made directly on the compressed file?
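
As a toy instance of this idea, run-length encoding supports random access directly on the compressed representation: precompute prefix sums of the run lengths once, then answer "what is the i-th symbol?" with a binary search, without ever expanding the string. A minimal sketch (illustrative only):

```python
import bisect
from itertools import groupby

def rle(s):
    """Run-length encode: 'aaab' -> [('a', 3), ('b', 1)]."""
    return [(ch, len(list(g))) for ch, g in groupby(s)]

def access(runs, ends, i):
    """Return the i-th symbol using only the compressed representation."""
    return runs[bisect.bisect_right(ends, i)][0]

s = "aaabccccdddddddb"
runs = rle(s)
ends, total = [], 0
for _, n in runs:          # prefix sums of run lengths, computed once
    total += n
    ends.append(total)
print(access(runs, ends, 0), access(runs, ends, 5))   # prints: a c
```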

9.10 Other topics

10 General comments regarding presentations/reports

You are expected to prepare a short (15-minute) recorded presentation summarizing the paper you have chosen. You are required to present slides (or equivalent) conveying the problem statement, the important technical results, the key ideas used to obtain those results, and your opinions on the paper (is it interesting? what are the limitations? what could be improved? …).

In addition to learning cool stuff in information theory, you should also learn to present your work. The final presentation is an opportunity to develop your presentation skills: you must be able to convey your ideas and thoughts clearly to another person. Here are some general comments on making good presentations:

  • Make sure that you acknowledge all the references and material that you used (papers, code, etc.). While this does not matter much for the purposes of this class, it is very important to keep in mind when you are delivering presentations at bigger venues.
  • In a presentation, always introduce yourself and your team members/collaborators in the very beginning.
  • If much of the presentation is based on one or two papers/references, explicitly state which references you are using on the first or second slide.
  • Avoid saying "We did this/we showed that…" when you are in fact talking about some other authors' work. Unless you have derived something/programmed something yourself, always say "they/the authors did/showed/observed that…"
  • A presentation/report is like telling a story: even before preparing the slides/document, imagine that you want to explain the work to a friend in 15-20 mins, identify exactly what you want to convey, and only then proceed.
  • Make sure that you explain the problem clearly.
  • When reading a paper, you should first understand what the problem is, what makes it challenging, and why it is worthwhile solving the problem. Then, you should identify what is new about the paper, and what the contributions are. To understand the rest of the paper, you should then try to understand the constructions/proof techniques/ideas. Your presentation should also bring out these aspects. See this article and this one on how to read a paper.
  • There is no single best rule for presentations, but generally in any presentation, the viewer/audience's attention is maximum in the first 5-10 minutes and then gradually decreases. So it is helpful to get to the punchline in the first few minutes.
  • Citations should be treated differently in papers and slides. When you refer to a particular paper/book/website in your slides (e.g., when you say that result A can be found in paper B), mention the paper/book/website as a footnote rather than writing [1] and putting the reference on the last slide.
  • Do not put too much text on your slides; each slide should contain at most 3-5 points. Absolutely avoid long sentences.
  • Do not copy text directly from the paper.
  • Do not read out your slides word for word.
  • Similarly, do not put too many lemmas on a slide; each slide should convey a single idea. Unless it is very important, avoid long derivations and heavy math in a short presentation. It is more important to convey ideas and intuition, while still retaining significant technical content; explaining via examples is therefore very useful.
  • Make sure that the fonts/diagrams/labels/plots are clear. Label all the axes on your plots.
  • Speak clearly, and make sure that the audio quality is fine. Check your mic volume. Do a couple of dry runs.
  • If you are using Google meet to record, then make sure that you "pin" the slides since meet might automatically switch displaying you/your icon instead of the presentation. 
  • Make sure that you check and double-check the slides and report for typos and for technical and grammatical errors.
  • If your group has more than one person, then the presentation must be shared equally.

Author: Shashank Vatedka

Created: 2022-03-10 Thu 15:10