1. General note on CS 537 homework assignments: In each homework, a proposition will be shared (e.g., “AI will kill us all”). Groups will adopt a view (pro/anti) regarding this proposition. The homework for each group will be to support their adopted view with arguments. Groups will debate their arguments in class in front of an audience. A class poll on the proposition (pro/anti) will be conducted before and after the debate. The goal of each side in the debate is to recruit people to their view. To ensure that each debate has a somewhat "more neutral" audience, we shall carry out two debates each time; each debate is assigned to half the groups, while the other half serves as the “audience” for that debate. In each debate, the debating groups will try to “recruit” members of other groups to their view (importantly, including the audience). Presumably, convincing the audience is easier, since they might not have as strong an initial opinion on the issue and may thus be more easily influenced by the arguments made in the debate.

  3. Homework 4:
  4. Odd numbered groups (Debate #7): You are developing an advanced calorie counting application (counting calories expended during various physical activities) that achieves significantly higher accuracy at estimating the user's energy expenditure across a wide set of exercise activities and general activities of daily living. To facilitate future expansion, the architecture comprises a self-supervised encoder followed by a fine-tuned regression head that maps from the latent space onto a calorie expenditure estimate. Which encoder design would you choose, DiffPhys or PhyMask? Support your view with 1 to 3 arguments. Submit your answer via canvas (see canvas link on class homepage) by noon of the day of the debate (Tuesday 3/10). The first line of the answer should indicate the view: "DiffPhys" or "PhyMask". The rest of the text should present the arguments. Be prepared to defend your arguments in class against counter arguments, so while I am fine with you forming your opinion with the help of AI, make sure that you own it and can defend it in real time against opposing arguments.

  6. Even numbered groups (Debate #8): A key challenge in IoT is the alignment of sensor data of different modalities into the same latent space, referred to as multimodal alignment. Two issues make multimodal alignment difficult in IoT contexts. First, unlike the case with generative AI use cases, where we might want to generate, say, images from text (via a shared latent representation) by essentially "making up" image features to fill in details not represented in the text, in IoT the goal is to faithfully represent the environment. Making up information not explicitly sensed is not desirable. Second, it is often the case that individual input training data samples contain only a subset of sensing modalities per sample. In short, some data are not aligned, although collectively the samples cover all modalities. What encoder architecture would your group choose for modality alignment that best meets the needs of IoT data: an encoder based on contrastive learning or one based on masked auto-encoding? Support your view with 1 to 3 arguments. Submit your answer via canvas (see canvas link on class homepage) by noon of the day of the debate (Tuesday 3/10). The first line of the answer should indicate the view: "Contrastive" or "Masked Auto-Encoding". The rest of the text should present the arguments. Be prepared to defend your arguments in class against counter arguments, so while I am fine with you forming your opinion with the help of AI, make sure that you own it and can defend it in real time against opposing arguments.

        Archived:

  • Homework 3:
  • Odd numbered groups (Debate #5): An AI application is used to automate the control of a chemical reaction process. The application reads multiple continuous real-time streams of (low-dimensional) chemical sensor data as input and computes actuation commands to control various process parameters as output. You need to design an appropriate tokenizer for the input streams. Options are fixed-length tokenization (every N samples of a stream constitute a token) or variable-length tokenization (a token comprises X samples, where X is a variable that is adjusted to smaller values during periods of high data volatility and higher values during periods of relative stability). You have the freedom to design the variable-length tokenization scheme or choose N for the fixed-length tokenization scheme. As a group, which scheme would you use? Support your view with 1 to 3 arguments. Submit your answer via canvas (see canvas link on class homepage) by noon of the day of the debate (Tuesday 3/3). The first line of the answer should indicate the view: "Fixed-length tokenization" or "Variable-length tokenization". The rest of the text should present the arguments. Be prepared to defend your arguments in class against counter arguments, so while I am fine with you forming your opinion with the help of AI, make sure that you own it and can defend it in real time against opposing arguments.
  • Concluding comments: The debate concluded that variable-length tokenization is better because many systems exhibit long periods of inactivity interspersed with more active intervals. Variable-length tokenization can ensure that inactive periods are encoded into fewer tokens, whereas active periods are allotted more tokens. Thus, tokens will, in general, convey similar amounts of information, as opposed to representing equal numbers of raw data samples. In turn, savings in the number of tokens (during quiet periods) will lead to reductions in overall resource consumption. A dissenting argument employed a control-theoretic observation. Namely, it is well known in control theory that controller design is greatly simplified when the input to the controller constitutes a fixed-rate sampling of the controlled process. Variable-rate inputs greatly complicate control dynamics. Thus, fixed-rate tokenization will likely allow for a significantly simpler controller (neural network) design without jeopardizing closed-loop performance. In highly dynamic systems (e.g., open-loop unstable systems) there are no "quiet periods". Thus, the advantages of variable-length tokenization are overshadowed by those of fixed-length tokenization.
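    The volatility-adaptive scheme discussed above can be sketched in a few lines. This is only a minimal illustration, assuming a rolling standard deviation as the volatility measure; the function name, parameter values, and threshold heuristic are all illustrative choices, not part of the assignment:

    ```python
    import numpy as np

    def variable_length_tokenize(stream, min_len=8, max_len=128,
                                 vol_window=64, vol_threshold=None):
        """Split a 1-D sensor stream into variable-length tokens: short tokens
        during volatile periods, long tokens during quiet periods. The rolling
        standard deviation used as the volatility measure, and all parameter
        values, are illustrative choices."""
        stream = np.asarray(stream, dtype=float)
        if vol_threshold is None:
            vol_threshold = np.std(stream)  # crude global reference level
        tokens, i = [], 0
        while i < len(stream):
            # local volatility = std of the recent window ending at sample i
            local_vol = np.std(stream[max(0, i - vol_window):i + 1])
            length = min_len if local_vol > vol_threshold else max_len
            tokens.append(stream[i:i + length])
            i += length
        return tokens

    # A long quiet stretch followed by a volatile burst: the quiet part is
    # covered by few long tokens, the burst by many short ones.
    rng = np.random.default_rng(0)
    signal = np.concatenate([np.zeros(1024), rng.normal(0.0, 5.0, 256)])
    tokens = variable_length_tokenize(signal)
    ```

    On this example, the token count is far below what fixed-length tokenization at the fine granularity (one token per 8 samples) would produce, which is exactly the resource saving argued above.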

 

  • Even numbered groups (Debate #6): You are to design an encoder (that translates input data into a suitable latent representation) for the application mentioned in Debate #5 above. The chemical process exhibits long-range dependencies, where the current state may depend in part on past conditions from as far back as 3 hours ago. The sampling rate is 100 samples per second. Would you use an encoder that is based on transformers (self-attention modules) or based on structured state space models? As a group, which encoder would you use? Support your view with 1 to 3 arguments. Submit your answer via canvas (see canvas link on class homepage) by noon of the day of the debate (Tuesday 3/3). The first line of the answer should indicate the view: "Transformer" or "Structured State Space Model". The rest of the text should present the arguments. Be prepared to defend your arguments in class against counter arguments, so while I am fine with you forming your opinion with the help of AI, make sure that you own it and can defend it in real time against opposing arguments.
  • Concluding comments: All groups concluded that structured state space models are better. This is a generally valid answer. No one explored the possibility of using adaptive (variable-length) tokenization as a means to counteract the inherent limitations of transformers. If the tokenizer is smart enough to only tokenize "relevant events", it might be that the complexity of downstream self-attention can be significantly reduced.

  • Homework 2:

 

  • Debate #3: Consider the specific case of AI-assisted human exercise activity recognition using accelerometer measurements of wearable devices such as smart watches, fitbit-like devices, etc. (Examples of exercise activities include push-ups, squats, weight-lifting, bench-pressing, running, walking, jogging, rowing, stair climbing, stretches, zumba, pilates, etc.) As a group, would you train your AI to use raw time-series accelerometer data as input or spectrogram data as input to minimize the cost of training? Support your view with 1 to 3 arguments. Email me (zaher@illinois.edu) those arguments as a bullet list in the body of an email by noon of the day of the debate (Tuesday 2/24). The subject line should be: "CS 537, <G#>, <View>", where <G#> is your group number (such as G1 or G5) and <View> is the word time-series or spectrogram. One email per group (with CC to group members) is enough. Be prepared to defend your arguments in class against counter arguments, so while I am fine with you forming your opinion with the help of AI, make sure that you own it and can defend it in real time against opposing arguments.
  • Concluding comments: The debate concluded that spectrograms are better in the context of recognizing repetitive exercise activity, as argued by G3. Several reasons were mentioned:
    • 1. Concise representation: For repetitive phenomena, spectrograms are a much more concise representation of the data. As such, they offer a computational advantage when representing rhythmic and/or periodic activity, such as exercise. The sparser representation leads to a simpler and more economic model.
    • 2. Noise tolerance: Spectrograms, by averaging the instances of the periodic activities within each time window, also inherently filter out noise in measurements of the specific instances. They also make it easier to remove noise that exists in frequency bands that are less relevant to the main signal. For example, a walking person takes 2-3 steps per second. If the spectrogram has components at much higher frequencies (say 30 Hz), they are likely not attributed to the human activity and can be removed as noise.
  • Opposing opinions mentioned several good arguments as well. For example:
    • 1. Phase information: Spectrograms do not carry phase information, which is true but is not a fundamental limitation, because the Fourier transform (used to compute the spectrograms) does inherently include phase (which can therefore be used together with the magnitude if need be).
    • 2. Approximation of single instances: Spectrograms are inherently an approximation. The approximation makes it harder to analyze individual instances of the activity, such as individual strokes, punches, tennis serves, baseball pitches, golf swings, etc. Thus, if applied more broadly to sports (not just repetitive exercise activity recognition), they are inferior for in-depth performance analysis and improvement. For high-end athlete training systems (that focus on analyzing and perfecting each stroke, punch, pitch, or swing), time-series analysis is better.
    • 3. Two-dimensional: It was also argued that spectrograms, being 2-dimensional, are a more resource-consuming representation than the original time-series, which is one-dimensional. This happens to be incorrect. For an N-sample time-series, there are at most N coefficients in the short-time Fourier transform, so the two representations are identical in the amount of storage needed.
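    A small numpy sketch (with illustrative sampling rate, frequencies, and window size) makes two of the points above concrete: the short-time Fourier transform of N real samples occupies about N real numbers of storage, and noise in frequency bands above the human-movement range can be removed simply by zeroing those bins:

    ```python
    import numpy as np

    fs = 30.0                      # accelerometer sampling rate in Hz (illustrative)
    t = np.arange(0, 60, 1 / fs)   # one minute of data
    steps = np.sin(2 * np.pi * 2.5 * t)         # ~2.5 steps/s walking rhythm
    noise = 0.3 * np.sin(2 * np.pi * 12.0 * t)  # high-frequency interference
    x = steps + noise

    win = 64                                    # non-overlapping STFT windows
    frames = x[: len(x) // win * win].reshape(-1, win)
    spec = np.fft.rfft(frames, axis=1)          # win // 2 + 1 complex coeffs per frame

    # Storage check: N real samples yield about N real numbers of STFT data
    # (per frame: win // 2 + 1 complex values, with DC and Nyquist purely real).
    n_reals = spec.shape[0] * (2 * spec.shape[1] - 2)
    assert n_reals == frames.size

    # Noise removal: zero all bins above a plausible human-movement band (~5 Hz).
    freqs = np.fft.rfftfreq(win, 1 / fs)
    spec[:, freqs > 5.0] = 0.0
    cleaned = np.fft.irfft(spec, n=win, axis=1).ravel()
    ```

    The reconstructed `cleaned` signal retains the walking rhythm while the 12 Hz interference band has been suppressed.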
  • Debate #4: For the above AI-assisted human exercise activity recognition problem, your company decides to use an auto-encoder architecture as the neural network backbone. You are tasked with finding an appropriate loss function to train the auto-encoder weights/parameters. Would you use reconstruction loss or masked auto-encoding loss? Support your view with 1 to 3 arguments. Email me (zaher@illinois.edu) those arguments as a bullet list in the body of an email by noon of the day of the debate (Tuesday 2/24). The subject line should be: "CS 537, <G#>, <View>", where <G#> is your group number (such as G2 or G4) and <View> is the words reconstruction or masked auto-encoding. One email per group (with CC to group members) is enough. Be prepared to defend your arguments in class against counter arguments, so while I am fine with you forming your opinion with the help of AI, make sure that you own it and can defend it in real time against opposing arguments.
  • Concluding comments: All groups correctly identified masked auto-encoders as superior. There was no debate.

  • Homework 1:
  • Debate #1: As a group, take a pro or anti position with respect to the following proposition: "Contrastive Learning is generally better than Masked Auto Encoding as a self-supervised representation learning framework for IoT applications that feature rare or esoteric data modalities". Support your view with 1 to 3 arguments. Email me (zaher@illinois.edu) those arguments as a bullet list in the body of an email by noon of the day of the debate (Tuesday 2/17). The subject line should be: "CS 537, <G#>, <View>", where <G#> is your group number (such as G1 or G4) and <View> is the word pro or anti. One email per group (with CC to group members) is enough. Be prepared to defend your arguments in class against counter arguments, so while I am fine with you forming your opinion with the help of AI, make sure that you own it and can defend it in real time against opposing arguments.
  • Concluding comments:

    First: Key considerations in choosing between encoder training techniques:

    1. Training resource efficiency: Clearly, there is a preference for more efficient encoder training. However, when the amount of training data is limited, resource efficiency becomes less of a concern, since training is a one-time process anyway, and given a relatively small data set, the training overhead is not a limiting factor.
    2. Training data efficiency: Many IoT sensing modalities are highly specialized. Thus, vast amounts of training data are not readily available. In this case, a key consideration is the data efficiency of encoder training. Encoders that learn from less data are preferred.   
    3. Inference efficiency: Clearly, faster inference is preferred. 

    Second: Key differences between contrastive learning and masked auto-encoding:

    1. Learning Objectives:

    • Contrastive learning is trained by contrasting similar and dissimilar sample pairs. Thus, it creates representations of inputs that explicitly encode "similarity/difference" between samples. More similar samples get closer in the latent space. In short, it encodes inter-sample relations.
    • Masked auto-encoding is trained on individual samples (not sample pairs). It learns to extract higher-level semantics from each sample. In short, it encodes intra-sample semantics. 

    2. Inductive Bias:

    • Contrastive learning needs to be told the meaning of "similarity" in the application domain of interest. In other words, the design of the loss function explicitly teaches the encoder to ignore certain types of differences (or augmentations) while emphasizing others. Thus, it offers a notion of similarity that is biased by the design of the loss function (this bias is often called inductive bias).
    • Masked auto-encoding does not explicitly inject inductive bias (except very indirectly in the design of the masking policy).
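    As a concrete illustration of the two learning objectives above (a hedged numpy sketch, not any canonical implementation; function names and the temperature value are illustrative): the contrastive (InfoNCE-style) loss is defined over sample pairs, while the masked auto-encoding loss is defined per sample.

    ```python
    import numpy as np

    def info_nce_loss(z1, z2, temperature=0.1):
        """Minimal InfoNCE-style contrastive loss sketch. Each row of z1 is
        'similar' to the same row of z2 (e.g., an augmented view of the same
        sample) and 'dissimilar' to all other rows, so the loss encodes
        inter-sample relations. The temperature value is illustrative."""
        z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
        z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
        logits = z1 @ z2.T / temperature  # pairwise cosine similarities
        # Row-wise cross-entropy with the diagonal (positive pair) as the label.
        log_probs = logits - np.log(np.sum(np.exp(logits), axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    def masked_reconstruction_loss(x, x_hat, mask):
        """Masked auto-encoding loss sketch: mean squared error computed only
        on masked positions of each sample; no inter-sample relations involved."""
        return np.sum(((x - x_hat) * mask) ** 2) / np.sum(mask)

    # Embeddings whose two views are aligned incur a much lower contrastive
    # loss than embeddings paired with unrelated random views.
    rng = np.random.default_rng(0)
    z = rng.normal(size=(16, 8))
    aligned_loss = info_nce_loss(z, z + 0.01 * rng.normal(size=(16, 8)))
    random_loss = info_nce_loss(z, rng.normal(size=(16, 8)))
    ```

    Note that the inductive bias enters `info_nce_loss` entirely through how the "view" pairs (z1, z2) are constructed, i.e., through the augmentation design, which is exactly where domain expertise is needed.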

    Third: Pros and Cons of the two approaches (summary of your answers):

    • In contrastive learning, one has to be very careful in choosing the "right" augmentations and the right loss function, not to teach the encoder the "wrong" notion of similarity. Proper augmentation design is non-trivial and requires domain expertise. If a contrastive loss function is not defined sufficiently well, the encoder will not perform well. In contrast, MAE can be applied without knowing the physical domain or having to define specific notions of similarity. 
    • In principle, having to think of application-specific augmentations and loss functions (to encode the "right" notion of similarity in contrastive learning) goes against the philosophy of self-supervised learning that emphasizes learning from data (unassisted by domain expertise).
    • From a task perspective, contrastive learning is particularly well-suited for discrimination and classification because it explicitly shapes embedding geometry around similarity relationships. For IoT applications that extend beyond classification (e.g., regression, localization, tracking, etc.), the binary similar/dissimilar contrast might not be the best training method.
    • Contrastive learning is often thought to be more resource-consuming during training. 

    Fourth: Putting it all together - Concluding Remarks

    • Since the key bottleneck in specialized IoT sensing modalities is the data bottleneck, approaches that reduce the size of training data are highly advantageous. It is in this context that the "disadvantages" of contrastive learning become its strongest asset: While the design of augmentations and loss functions needs domain expertise, with a good understanding of the application domain, it is possible to inject application bias in contrastive learning that significantly reduces the need for large training data. This is because it is easier to train something that works for one domain, as opposed to something that is domain agnostic (i.e., does not quite know what domain it will be applied in, and thus must learn something more general). This ability to specialize and thus save on training data is a strong argument in favor of contrastive learning for many IoT modalities.
    • The affinity of contrastive learning to classification/categorization tasks as opposed to regression tasks remains a potential concern.   
    • The argument about the resource efficiency of training is less consequential when the training data set is not very large (as is usually the case with specialized IoT data modalities). Hence, this is not a big consideration in making a choice.
    • While contrastive learning and masked auto-encoders differ in how they train the encoder, both can be applied to train the same neural network (encoder) architecture. Thus, there is no inherent difference in inference efficiency. 
    • In short, considerations of data efficiency dominate, often making contrastive learning a great choice.

  • Debate #2: Do the same with respect to the following proposition: "In IoT contexts, a transformer-based encoder architecture will generally outperform smaller and simpler neural networks (that feature a combination of convolutional and recurrent layers) at creating representations suitable for downstream classification tasks".  
  • Concluding comments: The debate largely concluded that the proposition is false, at least for the time being. Smaller and simpler models have the advantage of needing less data to train, which is a big deal in IoT systems that are plagued by a data bottleneck. Training a bigger model with insufficient data results in overfitting (i.e., lack of ability to generalize from training data to testing data) and thus poor performance compared to models that do not overfit. A dissenting opinion was that the data bottleneck may not last forever. As more IoT data become available for training, it will eventually become feasible to train larger models without overfitting, at which point such models will outperform simpler ones in capabilities. Another dimension considered was cost. It is not always clear that using a larger model is the best option for highly resource-limited IoT systems. Such models may need to run remotely in the cloud due to their high computational demand, which may impose its own performance penalties, especially in applications that require fast and reliable action. A dissenting opinion may be that with advances in available network bandwidth, the latency of remote access is going down. Thus, running models remotely will become increasingly more feasible even for latency-sensitive applications.

 

Why are IoT Applications and Cyber-physical Systems Important?

The global industrial sector is poised to undergo a fundamental structural change akin to the industrial revolution as we usher in the Internet of Things.

As the 'Internet of Things' becomes more pervasive in our lives, precise timing will be critical for these systems to be more responsive, reliable and efficient.

To realize the full potential of CPS, we will have to rebuild computing and networking abstractions. These abstractions will have to embrace physical dynamics and computation in a unified way.