Shashank Vatedka

Research

I work in the broad areas of information theory and coding, with interests in data compression, security, and statistical inference.

Much of my work is mathematical in nature, and I expect students who want to work with me to be comfortable with probability, linear algebra/matrix theory and programming in C/C++/python.

Distributed inference and learning with limited communication

Classical approaches to statistical inference and machine learning assume that all data is available at a central location. However, recent concerns about privacy have brought about new approaches such as federated learning.

If we take the example of learning user preferences (rating movies, for example), it would be helpful if a central server could learn the model without accessing all user data. In addition to privacy issues, there is also the cost of communicating data from each user to the server.

Take the example of a wireless sensor network where we have multiple low-cost sensors measuring certain parameters, and a central server (sometimes called a fusion center) wants to infer something from these measurements (say, the average of all the measurements). The link from the sensors to the server will in general be noisy and communication-limited. In such a scenario, how do we design a set-up that allows reliable inference at the server?
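
As a toy illustration (not taken from the papers linked below), here is a minimal Python sketch of one-bit distributed mean estimation: the sensor readings are assumed to lie in \([0,1]\), each sensor sends a single stochastically quantized bit, and the fusion center averages the received bits to form an unbiased estimate of the mean.

```python
import numpy as np

rng = np.random.default_rng(0)

n_sensors = 1000
# Hypothetical sensor readings, assumed to lie in [0, 1].
readings = rng.uniform(0.2, 0.8, size=n_sensors)

# Each sensor sends a single bit, equal to 1 with probability equal to its
# reading. This stochastic quantizer is unbiased: E[bit] = reading.
bits = rng.random(n_sensors) < readings

# The fusion center averages the received bits to estimate the mean.
estimate = bits.mean()
true_mean = readings.mean()
print(f"true mean = {true_mean:.4f}, one-bit estimate = {estimate:.4f}")
```

Each sensor communicates one bit instead of a full-precision real number; the price is a larger variance in the estimate, and the interesting questions concern the trade-off between the number of bits communicated and the estimation error.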

See this paper and this one for recent work on distributed mean estimation with limited communication. See this paper for the broad scope.

Compression with locality constraints

We are familiar with utilities like zip and rar to compress files. These are what we call lossless compressors, since we can recover the original (what I call the “raw” file) from the compressed zip or rar file.

These compressors do a good job for standard applications where the main goal is archival: compress all the files, and then recover all of them at some later point in time. Think of backups. However, for applications where we have huge amounts of data (e.g., genomic/astronomical/time-series data) occupying several terabytes, we want to compress the data to save space, but we also want to make frequent accesses/edits/computations on the data directly in the compressed domain. As a toy example, let us say that I have 10,000 pdf files that I want to compress in order to save space. If I compress all of these to create a single zip file, then every time I want to read even a single pdf file, I have to decompress all 10,000 files. This is wasteful in terms of time and computational resources. I could instead compress each of the 10,000 files separately, but it turns out that compression algorithms work better on large amounts of data (try comparing the compression ratios for a 1 kB text file versus a 10 MB text file).
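
The last point, that compressors tend to do better on larger inputs, is easy to observe with a general-purpose compressor such as zlib. A quick illustrative experiment (using synthetic, repetitive text, so the numbers are only indicative):

```python
import zlib

# Synthetic "text": repeated English-like content so the compressor has
# statistical structure to exploit.
unit = b"the quick brown fox jumps over the lazy dog. "
small = unit * 20         # roughly 1 kB
large = unit * 200_000    # roughly 9 MB

for name, data in [("small", small), ("large", large)]:
    compressed = zlib.compress(data, level=9)
    ratio = len(compressed) / len(data)
    print(f"{name}: {len(data)} bytes -> {len(compressed)} bytes "
          f"(ratio {ratio:.4f})")
```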

The question therefore is, can we obtain the same level of compression that we could obtain by compressing the 10,000 files together, but still be able to recover individual files at speeds close to the case where we compressed them separately? It turns out that this is theoretically possible: one can compress data of length \(n\) at rates close to entropy and yet decompress short fragments of size \(s\) in \(O(s)\) time rather than \(O(n)\) time! One then wonders what other processing can be done directly in the compressed domain: we want to save space, but not at the cost of processing time and computational complexity. Can we get the best of both: optimal compression and almost no extra computational requirements for processing the compressed data directly?
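
One simple (and generally suboptimal) baseline, shown here purely for illustration and not the construction from our papers, is to compress the data in fixed-size blocks, so that reading a short fragment only requires decompressing the blocks that overlap it:

```python
import zlib

BLOCK_SIZE = 4096  # bytes per independently compressed block (arbitrary choice)

def compress_blocks(data: bytes) -> list[bytes]:
    """Compress the data as a list of independently deflated blocks."""
    return [zlib.compress(data[i:i + BLOCK_SIZE])
            for i in range(0, len(data), BLOCK_SIZE)]

def read_fragment(blocks: list[bytes], start: int, length: int) -> bytes:
    """Recover data[start:start+length] by decompressing only the blocks
    that overlap the requested range."""
    first = start // BLOCK_SIZE
    last = (start + length - 1) // BLOCK_SIZE
    chunk = b"".join(zlib.decompress(blocks[b]) for b in range(first, last + 1))
    offset = start - first * BLOCK_SIZE
    return chunk[offset:offset + length]

data = b"the quick brown fox jumps over the lazy dog. " * 10_000
blocks = compress_blocks(data)
assert read_fragment(blocks, 123_456, 30) == data[123_456:123_486]
```

Smaller blocks make random access faster but hurt the compression rate, since each block is compressed without any knowledge of the others; the schemes discussed in the references below aim for near-optimal compression while retaining this kind of local decodability.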

See this video of a talk I gave at a summer school for a more technical introduction to the work we’ve done. Also see this video for more recent work. See this paper and this one for more details.

Communication in the presence of adversaries

Consider the standard problem of communicating reliably across a noisy channel. Suppose that Alice wants to send a \(k\)-bit message to Bob. The standard approach is to model the effects of noise as a random process, and the most common model is the discrete memoryless channel wherein the input and output of the channel are discrete sequences and the noise is assumed to be an independent and identically distributed (iid) process. This is the most fundamental (and well-studied) setup, and most communication systems today are designed based on these assumptions. As a toy model, assume that Alice can send a sequence of bits through the channel and Bob receives the same number of bits, but some of them could be flipped. The binary symmetric channel (abbreviated as BSC) with crossover probability \(p\) is one where each bit transmitted by Alice may be independently flipped (\(0\) is received as \(1\) and \(1\) received as \(0\)) with probability \(p\), or received as-is with probability \(1-p\).
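
The BSC is easy to simulate; the following minimal Python sketch (with an arbitrary choice of \(p\)) flips each transmitted bit independently with probability \(p\):

```python
import numpy as np

rng = np.random.default_rng(1)

def bsc(bits: np.ndarray, p: float) -> np.ndarray:
    """Pass a 0/1 array through a binary symmetric channel:
    each bit is flipped independently with probability p."""
    flips = rng.random(bits.shape) < p
    return bits ^ flips.astype(bits.dtype)

x = rng.integers(0, 2, size=20)   # Alice's transmitted bits
y = bsc(x, p=0.1)                 # what Bob receives
print("sent:             ", x)
print("received:         ", y)
print("flipped positions:", np.flatnonzero(x != y))
```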

To protect the message from the corruptions introduced by the channel, Alice uses an error correcting code (or channel code): an algorithm that converts the \(k\) message bits into \(n\) codeword bits in such a way that even after the channel corrupts the codeword, Bob is still able to recover the original message with high confidence. Since the channel is random, it is impossible to recover the message correctly all the time, but the hope is that the probability that the message is decoded incorrectly is small. In fact, most communication systems (including our cell phones and WiFi devices) employ state-of-the-art error correcting codes, which ensure that we can converse over long distances or watch videos online even though our devices are communicating across very noisy media. Claude Shannon (way back in 1948) showed that it is possible to have near-perfect communication (for large \(k\)) as long as the rate \(k/n\) is less than a quantity called the channel capacity. He also showed, mathematically, that if the rate \(k/n\) is above the channel capacity, then reliable communication is impossible. His proof established this as the fundamental limit of reliable communication, but it was only relatively recently (post 1990) that practical algorithms were developed that allow us to communicate near-error-free at rates close to the channel capacity.
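
For the BSC with crossover probability \(p\), the channel capacity has a closed form, \(C = 1 - h(p)\) bits per channel use, where \(h\) is the binary entropy function. A small sketch that evaluates it for a few values of \(p\):

```python
import numpy as np

def binary_entropy(p: float) -> float:
    """Binary entropy h(p) in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return float(-p * np.log2(p) - (1 - p) * np.log2(1 - p))

def bsc_capacity(p: float) -> float:
    """Capacity of the binary symmetric channel: C = 1 - h(p) bits/use."""
    return 1.0 - binary_entropy(p)

for p in [0.01, 0.05, 0.11, 0.25]:
    print(f"p = {p:.2f}: capacity = {bsc_capacity(p):.3f} bits per channel use")
```

For example, at \(p = 0.11\) the capacity is roughly \(0.5\), so no code with rate above one half can be decoded reliably over this channel.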

All of this theory works well when the channel is random and acts independently of the transmitted signal. Now, suppose that the noise is not random, but is instead added by a malicious adversary/jammer (let us call him James). In the toy example, let us say that James is allowed to flip any of the \(n\) transmitted bits, but at most \(np\) of them (for some \(0\le p\le 0.5\)). Then, what is the maximum rate at which we can communicate while ensuring that Bob is able to almost always recover the message correctly? If James can choose which bits to flip based only on knowledge of the channel code being used, then it turns out that it is possible to operate at rates close to the capacity of the BSC. However, if James also knows the transmitted codeword (i.e., he operates in a full-duplex mode where he can send and receive at the same time), then the optimal rate is still unknown.

We work on various channel models involving jamming adversaries, where the adversary has partial information about the signal transmitted by Alice. In some cases, we can characterize the maximum rate of reliable communication. See this paper and this one for some of our work. We also study a variant of the problem where Bob only needs to output a small list of messages, with the guarantee that the transmitted message is in this list. See this paper and this one for some of our results on the list decoding problem.

Information-theoretic security

My PhD work was on secure communication and secret key generation under information-theoretic security constraints.

The secret key generation problem is the following: Suppose that multiple users have private access to correlated data. These could be correlated sensor measurements, or anything else for that matter. Assuming that these users are allowed to communicate with each other through a noiseless public channel, how do we design a communication protocol such that at the end of the communication, the users can agree on a common string (the secret key) \(K\), while a computationally unbounded eavesdropper who has access only to the public communication learns essentially nothing about \(K\)? Specifically, we designed a protocol and theoretically characterized the achievable secret key rates when the correlated data are distributed according to a Gaussian Markov tree, and the secrecy constraint is that the mutual information between \(K\) and all of the public communication is arbitrarily small for a large number of data samples (sometimes called the strong secrecy constraint). See this paper to know more.
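
In symbols, writing \(\mathbf{F}\) for the entire public communication and \(n\) for the number of correlated samples observed by the users (the notation here is only for illustration), the strong secrecy constraint mentioned above reads
\[
\lim_{n \to \infty} I(K; \mathbf{F}) = 0,
\]
in contrast to the weak secrecy constraint, which only requires the normalized quantity \(\frac{1}{n} I(K; \mathbf{F})\) to vanish.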

We also looked at the following problem: two users \(A\) and \(B\) want to exchange messages through a relay. There is no direct link between the users, and the channel from the users to the relay is a Gaussian multiple-access channel (MAC). In such a setting, is it possible to have an information-theoretically secure exchange of messages? In particular, if user \(A\) has a message \(M_A\) and user \(B\) has a message \(M_B\), both uniformly distributed over the message set, is it possible to design channel codes such that at the end of the communication the messages are reliably exchanged, but the relay (honest-but-curious: it follows the given protocol, but tries to passively gather as much information as possible) is unable to gain any information about the individual messages? We showed that it is indeed possible to ensure that the signals received by the relay and the individual messages are pairwise independent. See this paper to know more.
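
A toy way to see why such independence is even possible (this is only a caricature in Python, not the actual coding scheme): if the relay could somehow be made to observe only the XOR of the two message bits, then what it sees is statistically independent of each individual message.

```python
import numpy as np

rng = np.random.default_rng(2)

trials = 500_000
m_a = rng.integers(0, 2, size=trials)   # user A's uniformly random message bit
m_b = rng.integers(0, 2, size=trials)   # user B's uniformly random message bit
relay_view = m_a ^ m_b                  # caricature of what the relay learns

# If relay_view is independent of m_a, its distribution should not change
# when we condition on the value of m_a.
print("P(relay_view = 1)           ~", relay_view.mean())
print("P(relay_view = 1 | m_a = 0) ~", relay_view[m_a == 0].mean())
print("P(relay_view = 1 | m_a = 1) ~", relay_view[m_a == 1].mean())
```

The actual construction has to achieve an analogous property for the noisy, real-valued signals received over the Gaussian MAC, which is considerably more involved.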

Lattice codes