# Course Websites

## ECE 365 - Data Science and Engineering

### Last offered Spring 2021

#### Goals

Big Data is all around us. Petabytes of data are collected by Google and Facebook. Twenty-four hours of video are uploaded to YouTube every minute. Making sense of all this data in the relevant context is a critical question. The goal of the course is to give students a holistic understanding of how this data is collected, represented, stored, retrieved, and analyzed to arrive at appropriate outcomes for the underlying context.

#### Topics

The course is divided into three parts, with the first part focusing on foundations of machine learning, and the remaining two on specific application areas. Each application topic is covered at four discrete levels.

- We start with the context: where the data comes from, how it is acquired, and what biases and noise levels it carries. This leads to statistical and physical models of the acquired data.
- Appropriate data representation mechanisms and distributed storage and computing architectures are discussed next. Different compression/coding methods are appropriate for different types of data. Images, videos, genomic data, medical imaging data, and smart grid data each bring their own unique characteristics, which can be harnessed for efficient representation.
- Once data is stored and represented efficiently, we look for the right statistical and algorithmic tools to analyze it. Spectral methods (including Fourier methods and PCA), clustering algorithms, SVMs, and mining algorithms are studied in the specific context of the data.
- Finally, the analyzed data leads to inferences or visualizations appropriate to the physical problem we started out with. This closes the loop, bringing utility to the original setting and context in which the data was acquired.
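
As a rough illustration of the analysis step, here is a minimal sketch of k-means clustering (Lloyd's algorithm) in plain Python. It is not taken from the course materials: the function name `kmeans`, the toy 2-D points, and the hand-picked initial centers are all assumptions made for this example.

```python
import math

def kmeans(points, centers, iterations=10):
    """Lloyd's algorithm on 2-D points with caller-supplied initial centers."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Update step: each center moves to the mean of its assigned points.
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(old)  # keep an empty cluster's center unchanged
        if new_centers == centers:
            break  # converged: centers no longer move
        centers = new_centers
    return centers, labels

# Toy data (hypothetical): two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
print(centers, labels)
```

In practice the initial centers are usually chosen at random or with a seeding heuristic; fixing them here keeps the example deterministic.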

Examples of application topics include: Machine Learning for Power Systems, Biological Data Analytics, Audio and Video Data Analytics, and Social Network Analytics.

#### Detailed Description and Outline

The Course Plan for the Spring 2019 offering is listed below. The application topics can change from semester to semester.

Course Plan

Part 1 (Weeks 1-5): Foundations of Machine Learning

Lecture 1: Introduction to the course; Review of Linear Algebra and Probability

Lecture 2: k-Nearest Neighbor Classifiers and Bayes Classifiers

Lecture 3: Linear Classifiers and Linear Discriminant Analysis

Lecture 4: Naïve Bayes, Kernel Tricks

Lecture 5: Logistic Regression, SVM and Model Selection

Lecture 6: K-Means Clustering and Applications

Lecture 7: Linear Regression and Applications

Lecture 8: SVD and Eigen-Decomposition

Lecture 9: Principal Component Analysis

Lecture 10: Optimization Techniques for Machine Learning, Q&A

Labs (Weeks 1-5)

Lab 1: Introduction to Python and the Canopy environment

Lab 2: Linear Classification: k-NN and LDA

Lab 3: Linear Classification: SVM

Lab 4: Clustering and Linear Regression

Lab 5: Eigen-Decompositions, SVD and PCA

Grading: 30% pre-lab quizzes (in class), 70% labs and lab reports.

Part 2 (Weeks 6-10): Smart Grid

Lecture 1: Introduction to power systems, basics of neural networks

Lecture 2: Neural networks and load prediction

Lecture 3: Power flow equations

Lecture 4: SVM for detecting corrupt power system measurements

Lecture 5: Detecting network structure

Lecture 6: Basics of electricity markets, virtual bidding

Lecture 7: Trading strategies for virtual bidding

Lecture 8: Wrapping up virtual bidding, understanding customer data

Lecture 9: Logistic regression for customer data analysis

Lecture 10: Customer billing and cost savings from solar

Labs

Lab 1: Day-ahead load prediction in ERCOT markets

Lab 2: Detecting bad sensors in power system measurements

Lab 3: Virtual bidding in NYISO’s markets

Lab 4: Analyze customer data from Austin, Texas.

Grading: 30% pre-lab quizzes (in class), 70% labs and lab reports.

Part 3 (Weeks 11-15): Biological Data Analytics

Lecture 1: Introduction to bioinformatics. Biological data.

Lecture 2: Sequence alignment. Global vs local alignment. Dynamic programming.

Lecture 3: The Smith-Waterman and Needleman-Wunsch algorithms. BLAST.

Lecture 4: Suffix trees and the Burrows-Wheeler transform. Bowtie2.

Lecture 5: Dynamic programming for sequence folding prediction. Vienna and Mfold. Stochastic grammars for folding models.

Lecture 6: Sanger sequencing. Overview of Next Generation and Third Generation Sequencing technologies.

Lecture 7: Basics of graph theory. Genome assembly via de Bruijn Graphs. EULER and IDBA_UD.

Lecture 8: Statistical read error-correction for Illumina, PacBio and Oxford Nanopore sequencers. Quake.

Lecture 9: Biological data repositories and databases.

Lecture 10: Biological data compression. Reference-based compression. CRAM. Context-tree weighting.

Labs

Lab 1: Sequence alignment and applications of BLAST.

Lab 2: Bowtie and DNA forensics.

Lab 3: Genome assembly. Influence of sequencing errors on assembler accuracy.

Lab 4: -Omics data compression.

Lab 5: Genomic sequence amplification and primer selection.

Grading: 30% pre-lab quizzes (in class), 70% labs and lab reports.

#### Computer Usage

All the labs are computer based using software packages such as Python and R.

#### Lab Projects

See Detailed Description and Outline.

#### Topical Prerequisites

Probability

Basic linear algebra

#### Texts

No textbook.

#### Required, Elective, or Selected Elective

Elective

#### Instructional Objectives

At the end of this course, the student will be able to apply the machine learning and data science tools gained in this course to several different types of problems involving data analytics in engineering systems and beyond. The student will also consider the broader societal impacts of the solutions, e.g., fairness in machine learning algorithms. (4)

Examples of the problems considered include:

- Given a set of labelled images corresponding to handwritten digits, the student will be able to design a classifier that effectively classifies a new image from outside the data set. (1) The student will learn systematic ways to choose the best classifier among a set of choices, through the process of training, validation and testing. (1) (2) (6) (7)
- The student will learn practical applications of data analysis in system and market operations for the power grid. Physics and other practical considerations often dictate what properties to expect from data in such problems. (1) Students will learn how to exploit these properties to choose the right tool for classification and regression tasks. (2) Tools used for this part expand on the ones learnt in the first part of the course, e.g., logistic regression and support vector machines, thus allowing the students to appreciate the theory in action. (6) Finally, the student will learn how to interpret the results based on the application context, and also understand the implications of the results in a broader societal context. (4) (7)
- The student will be able to apply the machine learning and data science tools learnt in the first part of the course to test statistical hypotheses about molecular biology (genomics). To do that, the student will first learn basic concepts of molecular biology (genomics). (1) Then, the student will: (i) apply data normalization techniques to genomic sequencing data, (ii) perform statistical analyses on the preprocessed data, and (iii) form biological hypotheses based on statistical tests. (2) Specifically, the machine learning concepts that the student will use to solve such problems are: data standardization, linear regression, design matrices, Expectation-Maximization, t-tests, hypothesis testing and multiple testing correction, among others. (6)
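
The training, validation and testing process mentioned in the first example can be sketched as follows. This is a minimal illustration under stated assumptions, not the course's lab code: it uses synthetic two-class 2-D data in place of digit images, a plain-Python k-NN classifier, and hypothetical names (`knn_predict`, `accuracy`, `candidates`). The candidate values of k play the role of the "set of choices".

```python
import math
import random
from collections import Counter

def knn_predict(train, x, k):
    """Majority vote among the k training points nearest to x."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(train, data, k):
    """Fraction of points in `data` that k-NN (fit on `train`) labels correctly."""
    return sum(knn_predict(train, x, k) == y for x, y in data) / len(data)

# Synthetic two-class data (an assumption standing in for real digit images):
# class 0 is centered at (0, 0), class 1 at (4, 4), both with unit variance.
rng = random.Random(0)
data = [((rng.gauss(4.0 * label, 1.0), rng.gauss(4.0 * label, 1.0)), label)
        for label in (0, 1) for _ in range(100)]
rng.shuffle(data)

# Split once into training, validation and test sets.
train, val, test = data[:120], data[120:160], data[160:]

candidates = [1, 3, 5, 7]  # odd k avoids tied votes with two classes
# Model selection uses only the validation set.
val_scores = {k: accuracy(train, val, k) for k in candidates}
best_k = max(candidates, key=lambda k: val_scores[k])
# The test set is used once, for the final performance estimate.
test_accuracy = accuracy(train, test, best_k)
print(best_k, val_scores[best_k], test_accuracy)
```

Keeping the test set out of the selection step is the point of the three-way split: the reported test accuracy is then not biased by the choice of k.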

Title | Section | CRN | Type | Hours | Times | Days | Location | Instructor
---|---|---|---|---|---|---|---|---
Data Science and Engineering | BB1 | 69291 | OLB | 0 | 1600 - 1650 | W | | Shubham Gangil
Data Science and Engineering | BB2 | 69292 | OLB | 0 | 1700 - 1750 | W | | Shubham Gangil
Data Science and Engineering | BB3 | 69293 | OLB | 0 | 1800 - 1850 | W | | Shubham Gangil
Data Science and Engineering | BD | 69290 | OLC | 3 | 1530 - 1650 | T R | | Venugopal V. Veeravalli, Suma Bhat, Ilan Shomorony
Data Science and Engineering | ZJ1 | 73162 | ONL | 3 | 0900 - 0950 | R | | Venugopal V. Veeravalli, Suma Bhat, Ilan Shomorony
Data Science and Engineering | ZJ1 | 73162 | ONL | 3 | 1430 - 1550 | W F | | Venugopal V. Veeravalli, Suma Bhat, Ilan Shomorony