ECE 598IS - Fundamental Limits in Data Science (Spring 2022)

Department of Electrical and Computer Engineering
University of Illinois at Urbana-Champaign
Lectures: Tuesday and Thursday, 9:30AM - 10:50AM, location: 4070 ECEB
Instructor: Ilan Shomorony, ilans@illinois.edu. (Office hour upon request)

Announcements

  • On 01/27 (second in-person lecture), we will meet in 4070 ECEB.
  • On 01/25 (first in-person lecture), we will meet in 3026 ECEB.
  • Information about scribing can be found here
  • The lectures in the first week (01/18 and 01/20) will be held via Zoom (Zoom info)


Summary:
This course will expose students to the probabilistic modeling of several data science problems and to the characterization of fundamental performance limits (e.g., scaling laws and information-theoretic feasibility). Students will learn techniques from high-dimensional statistics, information theory, graph theory, and random matrix theory through a combination of standard lectures and reading research papers. The first part of the course will explore the use of information-theoretic tools in high-dimensional statistics. We will then focus on applications in three broad classes of data science problems: clustering, sequence reconstruction and dimensionality reduction. The mathematical models studied will be motivated by specific practical data science problems, particularly from the field of genomics.

Prerequisite: Main prerequisite is ECE 534 (Random Processes) or equivalent. ECE 563 (Information Theory) or some prior exposure to information theory thinking is recommended.

Textbook: No textbook will be required. Lectures will be based on the following sources (and on research papers):
  • Martin Wainwright, "High-Dimensional Statistics: A Non-Asymptotic Viewpoint"
  • Yihong Wu, "Information-theoretic Methods for High-dimensional Statistics" (lecture notes)
  • Yihong Wu and Jiaming Xu, "Statistical inference on graphs" (lecture notes)
  • John Duchi, “Information Theory and Statistics” (lecture notes)
  • Emanuel Abbe, “Community Detection and Stochastic Block Models”
  • Rodrigues and Eldar, “Information-Theoretic Methods in Data Science”

Topics:

Part 1: Statistics and Information Theory
  • Basics of Statistical Decision Theory
  • Bayes risk and minimax risk
  • Divergence measures and inequalities
  • Minimax lower bounds (Le Cam and Fano methods)
  • Recovery of sparse vectors
Part 2: Fundamental Limits in Data Science Problems
  • Inference on graphs
    • Spectral methods and random matrix perturbations
    • The planted clique problem
    • Stochastic Block Model
  • Sequence reconstruction
    • Information-theoretic limits of sequence assembly
    • Trace reconstruction problem
  • Dimensionality reduction
    • Sparse PCA
    • Min-hashing and sketching

Grading

  • Scribing - 15%
  • Homework - 40%
  • Paper presentation - 20%
  • Class participation - 5%
  • Project - 20%