
CS440/ECE448 Fall 2024

Assignment 10: Neural Nets and PyTorch

Due date: Monday November 11th, 11:59pm

image from Wikipedia

The goal of this assignment is to employ neural networks, nonlinear and multi-layer extensions of the linear perceptron, to classify images into four categories: ship, automobile, dog, or frog. That is, your ultimate goal is to create a classifier that can tell what each picture depicts.

In part 1, you will create a 1980s-style shallow neural network. In part 2, you will improve this network using more modern techniques such as changing the activation function, changing the network architecture, or changing other initialization details. There is also an optional part where you can submit your architecture to a leaderboard. The leaderboard is only for bragging rights and is worth no points.

You will be using the PyTorch and NumPy libraries to implement these models. The PyTorch library will do most of the heavy lifting for you, but it is still up to you to implement the right high-level instructions to train the model.

You will need to consult the PyTorch documentation, linked multiple times on this page, to help you with implementation details. We strongly recommend this PyTorch Tutorial, specifically the Training a Classifier section, since it walks you through building a classifier very similar to the one in this assignment. We have made a notebook that walks you through the basics of PyTorch. You can find it here. (If your browser shows you a wall of mystery code, save it as a file with the ipynb extension and then open it in Jupyter.) We strongly recommend reviewing this notebook, as it includes a reshaping example, demonstrates the construction of PyTorch models (closely resembling the one required here), and provides an example of a convolutional network, which you will need for Part 2. There are also other guides out there such as this one.

Template package and submission

The template package contains your starter code and also your training and development datasets (packed into a single file).

In the code package, you should see these files:

reader.py - This file is responsible for reading in the data set. It creates a giant NumPy array of feature vectors corresponding to each image.
mp10.py - This is the main file that starts the program and computes the accuracy and confusion matrix using your implementation.
neuralnet_part1.py - This is the file where you will be doing all of your work for part 1.
neuralnet_part2.py - This is the file where you will be doing all of your work for part 2.

Modify only neuralnet_part1.py and neuralnet_part2.py.

Run python3 mp10.py -h in your terminal to find out more about how to run the program. There are two graded parts to this assignment, and you need to run them separately (but submit them together to Gradescope).

You will need to import torch and numpy. Otherwise, you should use only modules from the standard python library. Do not use torchvision.

The autograder doesn't have a GPU, so it will fail if you attempt to use CUDA.

Dataset

The dataset consists of 3000 31x31 colored (RGB) images (a modified subset of the CIFAR-10 dataset, provided by Alex Krizhevsky). This set is split for you into 2250 training examples (which are a mostly balanced sample of cars, boats, frogs and dogs) and 750 development examples.

The function load_dataset() in reader.py will unpack the dataset file, returning images and labels for the training and development sets. Each of these items is a Tensor.

Training and dataloaders

Two of the inputs to your main training function (fit) are the number of epochs and the batch size. You should run your training process for that many epochs on the training set. In each epoch your code needs to process ALL the training data, batch by batch, each batch having the specified size (failing to include every batch in each epoch will seriously harm your performance as there will be too few gradient updates to get decent performance). If you try to do this by hand, you'll run into issues such as what to do when the dataset size isn't a multiple of the batch size.

To make this simple, you will use a PyTorch dataloader. A dataloader is a way for you to handle loading and transforming data before it enters your network for training or prediction. It will let you write code that looks like you're just looping through the dataset, with the division into batches happening automatically. Details on how to use a dataloader can be found in this tutorial by Shervine Amidi. For your convenience, we have provided an auxiliary function (get_dataset_from_arrays(X,Y)) in utils.py that converts tensors of features (X) and labels (Y) into a simple torch dataset class that can be loaded into a dataloader. The dataloader will return a dictionary, not a tuple. You have to get the labels and features out of that dictionary.

To get consistent output from the autograder, make sure to set "shuffle = False" on your dataloader.
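As a rough sketch of the batching behavior described above, the following uses a stand-in dataset class that returns dictionaries. The class name DictDataset and the keys "features" and "labels" are assumptions made for illustration; check utils.py for what get_dataset_from_arrays actually returns.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class DictDataset(Dataset):
    """Hypothetical stand-in for the dataset built by get_dataset_from_arrays."""
    def __init__(self, X, Y):
        self.X, self.Y = X, Y

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        # The key names here are assumed, not taken from utils.py.
        return {"features": self.X[idx], "labels": self.Y[idx]}

X = torch.randn(10, 2883)          # 10 fake flattened images
Y = torch.randint(0, 4, (10,))     # 10 fake labels in {0, 1, 2, 3}

loader = DataLoader(DictDataset(X, Y), batch_size=4, shuffle=False)

batch_sizes = []
for batch in loader:
    features, labels = batch["features"], batch["labels"]
    batch_sizes.append(features.shape[0])

# The final batch is smaller because 10 is not a multiple of 4.
print(batch_sizes)
```

The loop never has to handle the leftover examples explicitly; the dataloader yields a short final batch on its own.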

Confusion Matrix

The top-level program mp10.py returns three types of feedback about your model.

The accuracy on the dev set
A confusion matrix for the dev set
The total number of parameters in your network

A confusion matrix is a very useful tool for evaluating multi-class classification problems, as it helps in identifying possible sources of imbalance in your dataset - and can offer precious insights into possible biases in your training procedure.

Specifically, in a classification problem with k classes, a confusion matrix has k rows and k columns. Each row corresponds to the ground-truth label of the data points and each column to the label predicted by your classifier. Each entry of the matrix counts the occurrences of the corresponding pair (ground_truth_label, predicted_label). In other words, the entries on the diagonal of this square matrix count correctly classified points, and all other entries count mistakes. For instance, if the entry at [0,1] is large, your classifier tends to mistake points belonging to class 0 for points belonging to class 1. Further details can be found on this Wikipedia page.
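To make the row/column convention concrete, here is a minimal NumPy sketch. The helper confusion_matrix and the label arrays are made up for illustration; this is not the code that mp10.py uses.

```python
import numpy as np

def confusion_matrix(true_labels, predicted_labels, num_classes):
    """Rows index the ground-truth class, columns the predicted class."""
    matrix = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(true_labels, predicted_labels):
        matrix[t, p] += 1
    return matrix

true_labels      = np.array([0, 0, 0, 1, 1, 2, 3, 3])
predicted_labels = np.array([0, 1, 1, 1, 1, 2, 3, 0])

cm = confusion_matrix(true_labels, predicted_labels, num_classes=4)

# Correct predictions sit on the diagonal, so accuracy is trace / total.
accuracy = np.trace(cm) / cm.sum()
print(cm)
print(accuracy)
```

In this toy example the entry at [0,1] is 2: two points of class 0 were mistaken for class 1.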

Part 1

Classical Shallow Network

The basic neural network model consists of a sequence of hidden layers sandwiched between an input and output layer. Input is fed into it from the input layer and the data is passed through the hidden layers and out to the output layer. Induced by every neural network is a function $F_{W}$ which is given by propagating the data through the layers.

To make things more precise, in lecture you learned of a function $f_{w}(x) = \sum_{i=1}^n w_i x_i + b$ . In this assignment, given weight matrices $W_1,W_2$ with $W_1 \in \mathbb{R}^{h \times d}$ , $W_2 \in \mathbb{R}^{4 \times h}$ and bias vectors $b_1 \in \mathbb{R}^{h}$ and $b_2 \in \mathbb{R}^{4}$ , you will learn a function $F_{W}$ defined as $F_{W} (x) = W_2\sigma(W_1 x + b_1) + b_2$ where $\sigma$ is your activation function. In Part 1, you should use either of the sigmoid or ReLU activation functions.

For this dataset, you have 2883 input values, one for each channel of each pixel in an image. That is, $d=(31)^2(3) = 2883$ . You should be able to pass the Part 1 tests with no more than 200 hidden units. That is, you should have $h \le 200$ .
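One possible way to express $F_{W}(x) = W_2\sigma(W_1 x + b_1) + b_2$ in PyTorch is sketched below. The value h = 128 and the choice of ReLU are arbitrary illustrative assumptions, not recommended settings.

```python
import torch
import torch.nn as nn

d, h, num_classes = 2883, 128, 4   # h = 128 is an arbitrary choice with h <= 200

model = nn.Sequential(
    nn.Linear(d, h),               # W_1 x + b_1
    nn.ReLU(),                     # activation sigma
    nn.Linear(h, num_classes),     # W_2 (.) + b_2
)

x = torch.randn(5, d)              # a fake batch of 5 flattened images
scores = model(x)                  # shape (5, 4): one raw score per class

# Total number of trainable values: d*h + h + h*4 + 4
num_params = sum(p.numel() for p in model.parameters())
print(scores.shape, num_params)
```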

Training and Development

Training: To train the neural network you need to minimize the empirical risk $\mathcal{R}(W)$, which is defined as the mean loss determined by some loss function. For this assignment you should use cross entropy for that loss function. In the case of binary classification, the empirical risk is given by $\mathcal{R}(W) = -\frac{1}{n}\sum_{i=1}^n \left[ y_i \log \hat y_i + (1-y_i) \log (1 - \hat y_i) \right]$, where $y_i$ are the labels and $\hat y_i$ are determined by $\hat y_i = \sigma(F_{W}(x_i))$, where $\sigma(x) = \frac{1}{1+e^{-x}}$ is the sigmoid function. For this assignment, you won't have to implement these functions yourself; you can use the built-in PyTorch functions. Notice that because PyTorch's CrossEntropyLoss applies a softmax (the multi-class generalization of the sigmoid) internally, you do not need to explicitly include an activation function in the last layer of your network.
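The following sketch illustrates why the last layer needs no activation: CrossEntropyLoss takes raw scores and does the normalization (a log-softmax over the class dimension) itself. The scores and labels here are random placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(6, 4)            # raw network outputs, no activation applied
labels = torch.randint(0, 4, (6,))    # integer class labels in {0, 1, 2, 3}

# Built-in loss: log-softmax internally, then the mean over the batch.
builtin = nn.CrossEntropyLoss()(logits, labels)

# The same quantity written out by hand.
log_probs = F.log_softmax(logits, dim=1)
manual = -log_probs[torch.arange(6), labels].mean()

print(builtin.item(), manual.item())
```

The two values agree up to floating-point error, so adding your own softmax or sigmoid before this loss would normalize the scores twice.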

Development: After you have trained your neural network model, you will have your model decide the class of each image in the development set. This is done by evaluating your network $F_{W}$ on each example in the development set, and then taking the index of the maximum of the four outputs (argmax).
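A small sketch of the argmax step, using hand-written scores in place of real network outputs:

```python
import torch

# Three fake examples, four class scores each.
scores = torch.tensor([[ 2.0, -1.0,  0.5,  0.1],
                       [-0.3,  0.2,  0.1,  1.7],
                       [ 0.0,  3.1, -2.0,  0.4]])

predicted = torch.argmax(scores, dim=1)   # index of the largest score in each row
predicted_np = predicted.numpy()          # NumPy array of class labels
print(predicted_np)
```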

Data Standardization: Convergence speed and accuracy can be improved greatly by simply standardizing your input data, that is, subtracting the sample mean and dividing by the sample standard deviation. More precisely, you can alter your data matrix $X$ by simply setting $X:=(X-\mu)/\sigma$ . Notice that you are standardizing a feature value across all images, not standardizing a feature value relative to the other features in the same image. Be sure to standardize the dev data exactly as you standardized the training data. This standardization should be done in the fit() function, not the forward() function. Do this right at the start!
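A minimal sketch of per-feature standardization follows. It assumes one reading of "exactly as you standardized the training data", namely reusing the training mean and standard deviation on the dev set; the small eps guard is also an added assumption.

```python
import torch

torch.manual_seed(0)
train_set = torch.rand(20, 2883) * 255   # fake training features
dev_set = torch.rand(8, 2883) * 255      # fake development features

# One mean and one standard deviation per feature (column), not per image.
mu = train_set.mean(dim=0, keepdim=True)
sigma = train_set.std(dim=0, keepdim=True)

eps = 1e-8  # assumed guard against a zero standard deviation
train_std = (train_set - mu) / (sigma + eps)
dev_std = (dev_set - mu) / (sigma + eps)   # assumption: reuse training statistics
```

After this, each column of train_std has mean close to 0 and standard deviation close to 1.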

Notice that the autograder will pass in the number of training epochs and the batch size. You don't control those. However, you do control the neural net's learning rate. If you are confident about a model you have implemented but are not able to pass the accuracy thresholds on Gradescope, try increasing the learning rate. Be aware, however, that using a very high learning rate may worsen performance since the model may begin to oscillate around the optimal parameter settings.

With the above model design and tips, you should expect around 0.62 dev-set accuracy.

Digging into the Skeleton Code

Each of the files neuralnet_part1.py and neuralnet_part2.py gives you a NeuralNet class which subclasses torch.nn.Module. This class consists of __init__(), forward(), and step() functions. The main function fit() will use these to train the network and then classify the images from the test/development set.

`__init__()`

__init__() is where you will need to construct the network architecture. There are two ways to do this:

Use the Linear and Sequential objects. Keep in mind that Linear uses a Kaiming He uniform initialization for the weight matrices and, by default, also draws the bias terms from a small uniform range rather than setting them to zero.
Alternatively, you can explicitly define weight matrices W1, W2, ... and bias terms b1, b2, ... by defining them as Tensors. This approach is more hands on and will allow you to choose your own initialization. For this assignment, however, Kaiming He uniform initialization should suffice and should be a good choice.

Additionally, you should initialize an optimizer object in this function to use to optimize your network in the step() function.

Look at the examples in the PyTorch Tutorial.

`forward()`

forward() should perform a forward pass through your network. This means it should explicitly evaluate $F_{W}(x)$ . This can be done by simply calling your Sequential object defined in __init__() or (if you opted to define tensors explicitly) by multiplying through the weight matrices with your data.

`step()`

step() should perform the gradient update through one batch of training data (not the entire set of training data). You can do this by either calling loss_fn(yhat,y).backward() then updating the weights directly yourself, or you can use an optimizer object that you may have initialized in __init__() to help you update the network. Be sure to call zero_grad() on your optimizer in order to clear the gradient buffer.

When you return the loss_value from this function, make sure to convert it to a plain number. This allows proper garbage collection to take place, so that your program won't consume excessive amounts of memory. Two options:

Return loss_value.item(). This works if it is just a single number.
Or use float(loss_value.detach().cpu().numpy()). This separates the loss value from the computations that led up to it (detach is really important, otherwise you will run out of memory), moves it to the CPU (e.g. if you are using a GPU locally), converts it to a NumPy array, and finally converts that to a plain float.

Remember that Gradescope won't have a GPU.
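The pieces described for __init__(), forward(), and step() could fit together roughly as follows. The class name TinyNet, the constructor arguments, the hidden size, and the choice of SGD are all illustrative assumptions; the skeleton's NeuralNet class defines its own signature, which you should follow.

```python
import torch
import torch.nn as nn
import torch.optim as optim

class TinyNet(nn.Module):
    """Hypothetical stand-in for the NeuralNet class in the skeleton."""
    def __init__(self, lrate, in_size, hidden_size, out_size):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(in_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, out_size),
        )
        self.loss_fn = nn.CrossEntropyLoss()
        self.optimizer = optim.SGD(self.parameters(), lr=lrate)

    def forward(self, x):
        return self.model(x)

    def step(self, x, y):
        self.optimizer.zero_grad()               # clear gradients from the last batch
        loss = self.loss_fn(self.forward(x), y)
        loss.backward()                          # compute gradients for this batch
        self.optimizer.step()                    # update the weights
        return loss.item()                       # plain Python float, detached

torch.manual_seed(0)
net = TinyNet(lrate=0.01, in_size=2883, hidden_size=32, out_size=4)
x = torch.randn(16, 2883)                        # one fake batch
y = torch.randint(0, 4, (16,))
loss_value = net.step(x, y)
print(loss_value)
```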

`fit()`

fit() should construct a NeuralNet object, and iteratively call the neural net's step() function to train the network. The inputs to fit() tell you the batch size and how many training epochs you should use. fit() should then run the neural net on the development set and return 3 things: a list of the losses for each epoch of training, a numpy array with the estimated class labels (0, 1, 2, or 3) for the dev set, and the trained NeuralNet network.
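A rough sketch of the training loop and the three return values is given below. The function name fit_sketch, its arguments, and the inline model are assumptions; the real fit() has its own signature. For brevity the sketch uses TensorDataset, which yields (x, y) tuples rather than the dictionaries described earlier, omits standardization, and records the summed batch loss as each epoch's loss, which is just one possible convention.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def fit_sketch(train_x, train_y, dev_x, epochs, batch_size, lrate=0.01):
    """Hypothetical training loop; the real fit() signature may differ."""
    model = nn.Sequential(
        nn.Linear(train_x.shape[1], 32), nn.ReLU(), nn.Linear(32, 4)
    )
    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=lrate)

    # TensorDataset yields (x, y) tuples; the assignment's helper yields dicts.
    loader = DataLoader(TensorDataset(train_x, train_y),
                        batch_size=batch_size, shuffle=False)

    losses = []
    for _ in range(epochs):
        epoch_loss = 0.0
        for x, y in loader:                  # every batch, in every epoch
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        losses.append(epoch_loss)

    with torch.no_grad():                    # no gradients needed for prediction
        predictions = torch.argmax(model(dev_x), dim=1).numpy()
    return losses, predictions, model

torch.manual_seed(0)
train_x, train_y = torch.randn(50, 2883), torch.randint(0, 4, (50,))
dev_x = torch.randn(10, 2883)
losses, predictions, model = fit_sketch(train_x, train_y, dev_x,
                                        epochs=3, batch_size=16)
```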

Part 2

CNN Network

In this part, you will try to improve your performance by employing modern machine learning techniques, specifically using Convolutional Neural Networks (CNNs) followed by a couple of fully connected layers. These include, but are not limited to, the following:

Choice of activation function: Some possible candidates include Tanh, ELU, softplus, and LeakyReLU. You may find that choosing the right activation function will lead to significantly faster convergence, improved performance overall, or even both.
Using Convolutional Neural Networks: For image classification tasks, it is often best to use convolutional neural networks, which are tailored specifically to signal processing tasks such as image recognition. You should improve your results using convolutional layers in your network, followed by a couple of fully connected layers. Note: The input to a CNN layer must be 4-dimensional: (batch_size, channel_num, height, width). Since your original input to the network is 2D (batch_size, 31*31*3), you'll need to reshape it before feeding it to the CNN layer.
Variation of CNN Parameters: To get the CNN layers to work better, you can experiment with various parameters such as the number of convolutional layers, the number of filters in each layer, the kernel size, stride, padding, and the use of pooling layers. Adjusting these parameters can help improve the performance and accuracy of your network.

Using CNN Layers with Reshaped Input

In this section, you will learn how to reshape your input and feed it into a CNN layer. When working with Convolutional Neural Networks (CNNs), the input must be 4-dimensional, typically of the form:

(batch_size, num_channels, height, width)

Previously, you were using 2D inputs of shape (batch_size, 31*31*3). To make it compatible with a CNN layer, you need to reshape the input to (batch_size, 3, 31, 31), where:

batch_size: Number of examples processed together in a batch.
num_channels: Number of color channels (3 for RGB).
height and width: Spatial dimensions of the input.

PyTorch Example of Reshaping Input

Below is an example of how to reshape your input using PyTorch:


import torch
import torch.nn as nn

# Example input of shape (batch_size, 31*31*3)
batch_size = 64
input_tensor = torch.randn(batch_size, 31 * 31 * 3)

# Reshape input to (batch_size, 3, 31, 31)
reshaped_tensor = input_tensor.view(batch_size, 3, 31, 31)

# Define a simple CNN layer
cnn_layer = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1)

# Pass the reshaped input through the CNN layer
output = cnn_layer(reshaped_tensor)

print("Output shape:", output.shape)  # (batch_size, 16, 31, 31)

How CNN Layers Change the Input Dimensions

When the input passes through a CNN layer, the height, width, and number of channels can change based on the layer's parameters. A convolutional layer applies a sliding filter over the input, extracting features and reducing or changing its spatial size. The number of filters used in the layer determines the number of output channels. Below is the formula to calculate the output size for the height or width after convolution:

Output Size = ((Input Size + 2 * Padding - Dilation * (Kernel Size - 1) - 1) // Stride) + 1

Let's break down each parameter in this formula:

Input Size: Original height or width of the input.
Kernel Size: Size of the filter (e.g., 3x3, 5x5).
Dilation: Spacing between kernel elements, where a dilation of 1 means adjacent elements are used, and higher values spread them further apart.
Padding: Extra pixels added around the border of the input to control the output size. Zero-padding ensures that features near the edges are considered.
Stride: Steps the filter takes as it moves over the input. A stride of 1 means the filter shifts by one pixel at a time, while larger strides reduce the output size.

Example: Suppose the input size is (batch_size, 3, 31, 31), with a convolutional layer having:

Kernel Size = 3
Padding = 1
Dilation = 1
Stride = 1

The output size for height and width will be:

((31 + 2*1 - 1*(3 - 1) - 1) // 1) + 1 = 31

Thus, the output will maintain the same spatial dimensions, resulting in (batch_size, 16, 31, 31), where 16 is the number of filters used in the CNN layer. Each filter learns to detect a specific feature, such as edges or textures, allowing the network to capture various aspects of the input data.
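The formula can be checked against an actual layer. The helper conv_output_size below is a hypothetical name introduced for this sketch; the second case (stride 2, no padding) is an extra illustration beyond the worked example.

```python
import torch
import torch.nn as nn

def conv_output_size(input_size, kernel_size, stride=1, padding=0, dilation=1):
    """Height or width after a convolution, using the formula above."""
    return (input_size + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

# Worked example from the text: kernel 3, padding 1, stride 1 keeps size 31.
same = conv_output_size(31, kernel_size=3, stride=1, padding=1)

# A stride of 2 with no padding roughly halves the spatial size.
halved = conv_output_size(31, kernel_size=3, stride=2, padding=0)

# Compare the second case with what PyTorch actually produces.
layer = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=2, padding=0)
out = layer(torch.randn(2, 3, 31, 31))
print(same, halved, out.shape)
```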

The leaderboard

For your own enjoyment, we have also provided an anonymous leaderboard for this MP. Even after you have full points on the MP, you may wish to try even more things to improve your performance on the hidden test set by tuning your network better, training it for longer, using dropout, data augmentation, etc. For the leaderboard, you can submit the net.model and state_dict.state created with your best trained model (after running mp10.py --leaderboard; the model architecture is taken from neuralnet_leaderboard.py so it is independent of your main submission) alongside neuralnet_leaderboard.py (note that these need not implement the same thing, as you could wish to do fancy things that would be too slow for the autograder). We will not train this specific network on the autograder, so if you wish to go wild with augmentations, costly transformations, or more complex architectures, you're welcome to do so. Just do not exceed the 500k parameter limit, or your entry will be invalid. Also, please do not use additional external data, for the sake of fairness (i.e. using a ResNet backbone trained on ImageNet would be very unfair and counterproductive).

Some tips

Try to make your classification accuracy as high as possible, subject to the constraint that you may use at most 500,000 total parameters. This means that if you take every floating point value in all of your weights including bias terms, i.e. as returned by this pytorch utility function, you only use at most 500,000 floating point values.

You should be able to get an accuracy of at least 0.79 on the development set.

If you're using a convolutional net, you need to reshape your data in the forward() method, and not the fit() method. The autograder will call your forward function on data with shape (N,2883). That's probably not what your CNN is expecting. It's very helpful to print out the shape of key objects before/after each layer when trying to debug dimension issues. (The .view() and .reshape() methods of tensors are very useful here.)
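As a sketch of reshaping inside forward(), here is a small convolutional network that accepts (N, 2883) input. The class name SmallCNN and all layer sizes are arbitrary assumptions chosen to stay well under the parameter limit; they are not tuned and are not claimed to reach the target accuracy.

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    """Hypothetical architecture; layer sizes are arbitrary, not tuned."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # (N, 16, 31, 31)
            nn.ReLU(),
            nn.MaxPool2d(2),                              # (N, 16, 15, 15)
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # (N, 32, 15, 15)
            nn.ReLU(),
            nn.MaxPool2d(2),                              # (N, 32, 7, 7)
        )
        self.classifier = nn.Sequential(
            nn.Linear(32 * 7 * 7, 64),
            nn.ReLU(),
            nn.Linear(64, 4),
        )

    def forward(self, x):
        x = x.view(-1, 3, 31, 31)        # (N, 2883) -> (N, 3, 31, 31)
        x = self.features(x)
        x = x.flatten(start_dim=1)       # (N, 32 * 7 * 7)
        return self.classifier(x)

net = SmallCNN()
out = net(torch.randn(8, 2883))          # flat input, as the autograder would pass
num_params = sum(p.numel() for p in net.parameters())
print(out.shape, num_params)
```

Counting parameters this way is a quick check that a design stays within the 500,000 limit before you submit.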

Apparently it's still possible to be using a 32-bit environment. This may be ok. However, be aware that recent versions of PyTorch are optimized for a 64-bit environment and that is what Gradescope is using.

Be sure to use data standardization, as in Part 1.