The first thing you need to do is to download mp09.zip. Unlike several previous MPs, you will need to complete the code in several .py files.
This file (mp09_notebook.ipynb
) will walk you through the whole MP, giving you instructions and debugging tips as you go.
Throughout this MP, you will implement many of the operations in the Transformer architecture. The implementation will mostly follow the instructions in the original paper "Attention is All You Need" by Vaswani et al, 2017. (https://arxiv.org/abs/1706.03762). Much of the class structures and code have already been completed for you, and you are only required to fill in the missing parts as instructed by the comments in the code and in this notebook.
The repository, once unzipped, is structured like this:
├─ data - Folder that contains either synthetic or real data and the expected outputs, for use by the tests.
│
├─ tests - Folder that contains tests used by grade.py
│
├─ mha.py - Implements multi-head attention; you need to implement some part of it
│
├─ pe.py - Implements positional encoding; you need to implement some part of it
│
├─ encoder.py - Implements the Transformer encoder layer and the encoder; you need to implement some part of it
│
├─ decoder.py - Implements the Transformer decoder layer and the decoder; you need to implement some part of it.
│
├─ transformer.py - Implements the full Transformer model that ties the encoder and the decoder together; you need to implement some part of it
│
├─ grade.py - Runs the tests on the visible data and the expected outputs given to you
│
├─ trained_de_en_state_dict.pt - A trained Transformer encoder-decoder checkpoint for checking your implementation.
We suggest that you create a new environment with Python 3.10 for this MP, to avoid conflicts with other versions of PyTorch and its dependencies that you might already have installed in your current Python environment; if you use Anaconda, you can do this by
conda create -n transformer_mp_torch_2.0.1 python=3.10
conda activate transformer_mp_torch_2.0.1
Then, in this environment,
For OSX, run
pip install torch==2.0.1
For Linux and Windows, run
pip install torch==2.0.1 --index-url https://download.pytorch.org/whl/cpu
Then you can install the rest of the dependencies with
pip install gradescope-utils
pip install editdistance
pip install numpy
pip install jupyterlab
You can now re-open this notebook by launching jupyter lab
in the newly created environment.
Note: for our provided visible test outputs to match yours, you must install the PyTorch version torch==2.0.1
. Otherwise, if you use another version of PyTorch, local testing may show small discrepancies. The GradeScope auto-grader will NOT show any discrepancies, as it generates the expected test outputs (on the hidden set) automatically with the same package versions that your submission uses.
The multi-head attention mechanism forms the backbone of Transformer encoder and decoder layers. The paper summarizes the operation of multi-head attention as: $$ \begin{aligned} \operatorname{MultiHead}(Q, K, V) & =\operatorname{Concat}\left(\operatorname{head}_1, \ldots, \operatorname{head}_h\right) W^O \\ \text{where } \operatorname{head}_i & =\operatorname{Attention}\left(Q W_i^Q, K W_i^K, V W_i^V\right)\end{aligned} $$
In essence, the multi-head attention module takes three inputs: the query matrix ($Q$), the key matrix ($K$), and the value matrix ($V$). $Q$, $K$, and $V$ go through $h$ different linear transformations, resulting in $QW_i^Q$, $K W_i^K$, $V W_i^V$ for $i \in \{1,\cdots, h\}$. For simplicity, we will assume that $Q \in \mathbb{R}^{T_q \times d_{model}}$, $K \in \mathbb{R}^{T_k \times d_{model}}$, $V \in \mathbb{R}^{T_k \times d_{model}}$, $W_i^Q \in \mathbb{R}^{d_{model} \times d_{k}}$, $W_i^K \in \mathbb{R}^{d_{model} \times d_{k}}$, $W_i^V \in \mathbb{R}^{d_{model} \times d_{k}}$, and $d_k \times h = d_{model}$.
For each different set of $Q_i := QW_i^Q$, $ K_i := K W_i^K$, $ V_i := V W_i^V$, scaled-dot product attention is computed as: $$\operatorname{Attention}(Q_i, K_i, V_i)=\operatorname{softmax}\left(\frac{Q_i K_i^T}{\sqrt{d_k}}\right) V_i$$ where $d_k$ is the dimension of the individual attention heads.
Finally, $\operatorname{Attention}(Q_i, K_i, V_i)$ from different heads $i$ are concatenated, and the concatenated result goes through another linear transformation $W^O \in \mathbb{R}^{d_{model} \times d_{model}}$.
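As a concrete illustration of the shapes involved, here is a minimal, self-contained sketch of scaled dot-product attention for a single head; the tensor names and sizes are made up for illustration and are not taken from mha.py:

```python
import torch
import torch.nn.functional as F

T_q, T_k, d_k = 4, 6, 8                             # illustrative sizes
Q_i = torch.randn(T_q, d_k)                         # Q W_i^Q
K_i = torch.randn(T_k, d_k)                         # K W_i^K
V_i = torch.randn(T_k, d_k)                         # V W_i^V

scores = Q_i @ K_i.transpose(-2, -1) / d_k ** 0.5   # T_q x T_k attention scores
weights = F.softmax(scores, dim=-1)                 # each row sums to 1
head_i = weights @ V_i                              # T_q x d_k attended output
print(head_i.shape)                                 # torch.Size([4, 8])
```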
The figure from pg.4 of the paper shows an illustration of the operations involved.
from IPython.display import Image
Image("img/mha.PNG")
In mha.py
, you will need to implement two functions: def compute_mh_qkv_transformation(self, Q, K, V)
and def compute_scaled_dot_product_attention(self, query, key, value, key_padding_mask = None, attention_mask = None)
in the class class MultiHeadAttention(nn.Module)
. They will be combined in def forward(self, query, key, value, key_padding_mask = None, attention_mask = None)
to implement all of the operations involved in the multi-head attention mechanism. The model parameters, especially $W^Q$, $W^K$, $W^V$, and $W^O$, are defined and given in def __init__(self, d_model, num_heads)
, and should not be modified. You should not import any helpers beyond those already given to you.
Note:
The transformations $W^Q$, $W^K$, $W^V$, and $W^O$ are given as nn.Linear
objects in class MultiHeadAttention(nn.Module)
. In def compute_mh_qkv_transformation(self, Q, K, V)
, you need to use torch.Tensor.contiguous().view()
to reshape the last dimension from a single dimension of size d_model
to two dimensions num_heads x d_k
(hint: the reverse operation has been defined at the last line of def compute_scaled_dot_product_attention(self, query, key, value, key_padding_mask = None, attention_mask = None)
, already given to you), and then use torch.Tensor.transpose()
to get the expected output shape. In def compute_scaled_dot_product_attention(self, query, key, value, key_padding_mask = None, attention_mask = None)
, you will also need to correctly apply the masking operations. There are two different masks, the key_padding_mask
and the attention_mask
. Both are used to disallow attention to certain regions of the input. Please read the function definitions to figure out how to use them.
from mha import MultiHeadAttention
help(MultiHeadAttention.compute_mh_qkv_transformation)
Help on function compute_mh_qkv_transformation in module mha: compute_mh_qkv_transformation(self, Q, K, V) Transform query, key and value using W_q, W_k, W_v and split Input: Q (torch.Tensor) - Query tensor of size B x T_q x d_model. K (torch.Tensor) - Key tensor of size B x T_k x d_model. V (torch.Tensor) - Value tensor of size B x T_v x d_model. Note that T_k = T_v. Output: q (torch.Tensor) - Transformed query tensor B x num_heads x T_q x d_k. k (torch.Tensor) - Transformed key tensor B x num_heads x T_k x d_k. v (torch.Tensor) - Transformed value tensor B x num_heads x T_v x d_k. Note that T_k = T_v Note that d_k * num_heads = d_model
help(MultiHeadAttention.compute_scaled_dot_product_attention)
Help on function compute_scaled_dot_product_attention in module mha: compute_scaled_dot_product_attention(self, query, key, value, key_padding_mask=None, attention_mask=None) This function calculates softmax(Q K^T / sqrt(d_k))V for the attention heads; further, a key_padding_mask is given so that padded regions are not attended, and an attention_mask is provided so that we can disallow attention for some part of the sequence Input: query (torch.Tensor) - Query; torch tensor of size B x num_heads x T_q x d_k, where B is the batch size, T_q is the number of time steps of the query (aka the target sequence), num_head is the number of attention heads, and d_k is the feature dimension; key (torch.Tensor) - Key; torch tensor of size B x num_head x T_k x d_k, where in addition, T_k is the number of time steps of the key (aka the source sequence); value (torch.Tensor) - Value; torch tensor of size B x num_head x T_v x d_k; where in addition, T_v is the number of time steps of the value (aka the source sequence);, and we assume d_v = d_k Note: We assume T_k = T_v as the key and value come from the same source in the Transformer implementation, in both the encoder and the decoder. key_padding_mask (None/torch.Tensor) - If it is not None, then it is a torch.IntTensor/torch.LongTensor of size B x T_k, where for each key_padding_mask[b] for the b-th source in the batch, the non-zero positions will be ignored as they represent the padded region during batchify operation in the dataloader (i.e., disallowed for attention) while the zero positions will be allowed for attention as they are within the length of the original sequence attention_mask (None/torch.Tensor) - If it is not None, then it is a torch.IntTensor/torch.LongTensor of size 1 x T_q x T_k or B x T_q x T_k, where again, T_q is the length of the target sequence, and T_k is the length of the source sequence. An example of the attention_mask is used for decoder self-attention to enforce auto-regressive property during parallel training; suppose the maximum length of a batch is 5, then the attention_mask for any input in the batch will look like this for each input of the batch. 0 1 1 1 1 0 0 1 1 1 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 As the key_padding_mask, the non-zero positions will be ignored and disallowed for attention while the zero positions will be allowed for attention. Output: x (torch.Tensor) - torch tensor of size B x T_q x d_model, which is the attended output
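The two hints above (reshaping into heads, and masking before the softmax) can be illustrated in isolation. The snippet below is a hedged sketch with made-up shapes, not the required implementation; it only demonstrates how torch.Tensor.contiguous().view() and torch.Tensor.transpose() split the heads, and how a mask with the convention "non-zero = disallowed" can be applied with masked_fill before the softmax:

```python
import torch
import torch.nn.functional as F

B, T, d_model, num_heads = 2, 5, 16, 4
d_k = d_model // num_heads

x = torch.randn(B, T, d_model)
# Split the last dimension into (num_heads, d_k), then swap the head and time axes:
x_heads = x.contiguous().view(B, T, num_heads, d_k).transpose(1, 2)  # B x num_heads x T x d_k

# Masking before softmax: positions marked non-zero are sent to -inf so they get zero weight.
scores = torch.randn(B, num_heads, T, T)                  # illustrative attention scores
key_padding_mask = torch.tensor([[0, 0, 0, 1, 1],
                                 [0, 0, 0, 0, 1]])        # B x T_k, non-zero = padded
scores = scores.masked_fill(key_padding_mask[:, None, None, :].bool(), float("-inf"))
weights = F.softmax(scores, dim=-1)                       # padded keys receive zero attention
```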
If you believe you have implemented it correctly, you can run python grade.py to see if you have passed the tests related to mha.py
. We have defined four tests (out of 10, including 2 for EC) that you should have passed: test_mha_no_mask
, which tests the basic operation of mha.py
without masking involved, test_mha_key_padding_mask
, which, in addition, tests mha.py
with key_padding_mask
, test_mha_key_padding_mask_attention_mask
, which, in addition, tests mha.py
with key_padding_mask
and attention_mask
, and test_mha_different_query_and_key
, which tests mha.py
with query and key (value) of different lengths in the temporal dimension.
For now, the other six tests are expected to fail or error out.
!python grade.py
EEE....FEE ====================================================================== ERROR: test_encoder_decoder_predictions (test_visible.TestStep.test_encoder_decoder_predictions) ---------------------------------------------------------------------- Traceback (most recent call last): File "/mnt/c/Users/junru/Downloads/transformer_mp/src_test/tests/test_visible.py", line 311, in test_encoder_decoder_predictions output = model(src = src, tgt = trg, src_lengths = src_lengths, tgt_lengths = trg_lengths) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jerome-ni/anaconda3/envs/test_transformer_mp_torch_2.0.1/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl return forward_call(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/c/Users/junru/Downloads/transformer_mp/src_test/transformer.py", line 230, in forward enc_output, src_padding_mask = self.forward_encoder(src, src_lengths) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/c/Users/junru/Downloads/transformer_mp/src_test/transformer.py", line 155, in forward_encoder src_embedded = self.dropout(self.positional_encoding(self.encoder_embedding(src))) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jerome-ni/anaconda3/envs/test_transformer_mp_torch_2.0.1/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl return forward_call(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jerome-ni/anaconda3/envs/test_transformer_mp_torch_2.0.1/lib/python3.11/site-packages/torch/nn/modules/dropout.py", line 59, in forward return F.dropout(input, self.p, self.training, self.inplace) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jerome-ni/anaconda3/envs/test_transformer_mp_torch_2.0.1/lib/python3.11/site-packages/torch/nn/functional.py", line 1252, in dropout return _VF.dropout_(input, p, training) if inplace else _VF.dropout(input, p, training) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ TypeError: dropout(): argument 'input' (position 1) must be Tensor, not NoneType ====================================================================== ERROR: test_encoder_decoder_states (test_visible.TestStep.test_encoder_decoder_states) ---------------------------------------------------------------------- Traceback (most recent call last): File "/mnt/c/Users/junru/Downloads/transformer_mp/src_test/tests/test_visible.py", line 365, in test_encoder_decoder_states output = model(src = src, tgt = trg, src_lengths = src_lengths, tgt_lengths = trg_lengths) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jerome-ni/anaconda3/envs/test_transformer_mp_torch_2.0.1/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl return forward_call(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/c/Users/junru/Downloads/transformer_mp/src_test/transformer.py", line 230, in forward enc_output, src_padding_mask = self.forward_encoder(src, src_lengths) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/c/Users/junru/Downloads/transformer_mp/src_test/transformer.py", line 155, in forward_encoder src_embedded = self.dropout(self.positional_encoding(self.encoder_embedding(src))) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jerome-ni/anaconda3/envs/test_transformer_mp_torch_2.0.1/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl return forward_call(*args, **kwargs) 
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jerome-ni/anaconda3/envs/test_transformer_mp_torch_2.0.1/lib/python3.11/site-packages/torch/nn/modules/dropout.py", line 59, in forward return F.dropout(input, self.p, self.training, self.inplace) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jerome-ni/anaconda3/envs/test_transformer_mp_torch_2.0.1/lib/python3.11/site-packages/torch/nn/functional.py", line 1252, in dropout return _VF.dropout_(input, p, training) if inplace else _VF.dropout(input, p, training) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ TypeError: dropout(): argument 'input' (position 1) must be Tensor, not NoneType ====================================================================== ERROR: test_encoder_output (test_visible.TestStep.test_encoder_output) ---------------------------------------------------------------------- Traceback (most recent call last): File "/mnt/c/Users/junru/Downloads/transformer_mp/src_test/tests/test_visible.py", line 272, in test_encoder_output output_encoder, _ = model.forward_encoder(src = src, src_lengths = src_lengths) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/c/Users/junru/Downloads/transformer_mp/src_test/transformer.py", line 155, in forward_encoder src_embedded = self.dropout(self.positional_encoding(self.encoder_embedding(src))) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jerome-ni/anaconda3/envs/test_transformer_mp_torch_2.0.1/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl return forward_call(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jerome-ni/anaconda3/envs/test_transformer_mp_torch_2.0.1/lib/python3.11/site-packages/torch/nn/modules/dropout.py", line 59, in forward return F.dropout(input, self.p, self.training, self.inplace) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jerome-ni/anaconda3/envs/test_transformer_mp_torch_2.0.1/lib/python3.11/site-packages/torch/nn/functional.py", line 1252, in dropout return _VF.dropout_(input, p, training) if inplace else _VF.dropout(input, p, training) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ TypeError: dropout(): argument 'input' (position 1) must be Tensor, not NoneType ====================================================================== ERROR: test_decoder_inference_cache_extra_credit (test_visible_ec.TestStep.test_decoder_inference_cache_extra_credit) ---------------------------------------------------------------------- Traceback (most recent call last): File "/mnt/c/Users/junru/Downloads/transformer_mp/src_test/tests/test_visible_ec.py", line 162, in test_decoder_inference_cache_extra_credit output_list, decoder_cache = model.inference(src = src, src_lengths = src_lengths, max_output_length = MAX_INFERENCE_LENGTH) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/c/Users/junru/Downloads/transformer_mp/src_test/transformer.py", line 254, in inference enc_output, _ = self.forward_encoder(src, src_lengths) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/c/Users/junru/Downloads/transformer_mp/src_test/transformer.py", line 155, in forward_encoder src_embedded = self.dropout(self.positional_encoding(self.encoder_embedding(src))) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jerome-ni/anaconda3/envs/test_transformer_mp_torch_2.0.1/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl return forward_call(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 
File "/home/jerome-ni/anaconda3/envs/test_transformer_mp_torch_2.0.1/lib/python3.11/site-packages/torch/nn/modules/dropout.py", line 59, in forward return F.dropout(input, self.p, self.training, self.inplace) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jerome-ni/anaconda3/envs/test_transformer_mp_torch_2.0.1/lib/python3.11/site-packages/torch/nn/functional.py", line 1252, in dropout return _VF.dropout_(input, p, training) if inplace else _VF.dropout(input, p, training) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ TypeError: dropout(): argument 'input' (position 1) must be Tensor, not NoneType ====================================================================== ERROR: test_decoder_inference_outputs_extra_credit (test_visible_ec.TestStep.test_decoder_inference_outputs_extra_credit) ---------------------------------------------------------------------- Traceback (most recent call last): File "/mnt/c/Users/junru/Downloads/transformer_mp/src_test/tests/test_visible_ec.py", line 109, in test_decoder_inference_outputs_extra_credit output_list, decoder_cache = model.inference(src = src, src_lengths = src_lengths, max_output_length = MAX_INFERENCE_LENGTH) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/c/Users/junru/Downloads/transformer_mp/src_test/transformer.py", line 254, in inference enc_output, _ = self.forward_encoder(src, src_lengths) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/c/Users/junru/Downloads/transformer_mp/src_test/transformer.py", line 155, in forward_encoder src_embedded = self.dropout(self.positional_encoding(self.encoder_embedding(src))) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jerome-ni/anaconda3/envs/test_transformer_mp_torch_2.0.1/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl return forward_call(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jerome-ni/anaconda3/envs/test_transformer_mp_torch_2.0.1/lib/python3.11/site-packages/torch/nn/modules/dropout.py", line 59, in forward return F.dropout(input, self.p, self.training, self.inplace) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jerome-ni/anaconda3/envs/test_transformer_mp_torch_2.0.1/lib/python3.11/site-packages/torch/nn/functional.py", line 1252, in dropout return _VF.dropout_(input, p, training) if inplace else _VF.dropout(input, p, training) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ TypeError: dropout(): argument 'input' (position 1) must be Tensor, not NoneType ====================================================================== FAIL: test_pe (test_visible.TestStep.test_pe) ---------------------------------------------------------------------- Traceback (most recent call last): File "/mnt/c/Users/junru/Downloads/transformer_mp/src_test/tests/test_visible.py", line 236, in test_pe self.assertAlmostEqual(torch.sum(torch.abs(pe.pe - self.pe_test_data["pe"])).item(), 0, places = 4, msg='Positional Encoding has incorrect encoding entries') AssertionError: 270.70733642578125 != 0 within 4 places (270.70733642578125 difference) : Positional Encoding has incorrect encoding entries ---------------------------------------------------------------------- Ran 10 tests in 1.604s FAILED (failures=1, errors=5)
In order for the model to make use of the order of the sequence, some information about the relative or absolute position of the tokens in the sequence must be injected into the input embeddings at the bottoms of the encoder and decoder stacks. These positional encodings have the same dimension $d_{model}$ as the embeddings and the rest of the encoder and decoder modules. The original paper defines a simple positional encoding as sine and cosine functions of different frequencies: $$ \begin{aligned} PE_{(pos, 2i)} & =\sin \left(pos / 10000^{2 i / d_{model}}\right) \\ PE_{(pos, 2i+1)} & =\cos \left(pos / 10000^{2 i / d_{model}}\right) \end{aligned} $$ where $pos$ is the position and $i$ indexes the feature dimension, so that $2i$ and $2i+1$ range over $\{0, 1, \cdots, d_{model}-1\}$. That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from $2\pi$ to $10000 \times 2 \pi$.
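For intuition, the sinusoidal table can be computed directly from the formula above. The following is only a minimal sketch with illustrative sizes; pe.py may store self.pe with an extra batch dimension, so follow the comments in the provided code rather than copying this verbatim:

```python
import torch

d_model, max_seq_length = 8, 10                 # illustrative sizes
pos = torch.arange(max_seq_length).unsqueeze(1) # max_seq_length x 1 column of positions
two_i = torch.arange(0, d_model, 2)             # even feature indices 0, 2, 4, ...

pe = torch.zeros(max_seq_length, d_model)
pe[:, 0::2] = torch.sin(pos / 10000.0 ** (two_i / d_model))  # PE(pos, 2i)
pe[:, 1::2] = torch.cos(pos / 10000.0 ** (two_i / d_model))  # PE(pos, 2i+1)
print(pe.shape)                                 # torch.Size([10, 8])
```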
In pe.py
, you will need to fill in the missing code to calculate the positional encoding, self.pe
, in def __init__(self, d_model, max_seq_length)
and implement the function def forward(x)
in the class class PositionalEncoding(nn.Module)
.
Note:
The temporal dimension of x in def forward(x) will never exceed max_seq_length, so you can safely slice into self.pe.
If you believe you have implemented everything in this section correctly, you can run python grade.py
to see if you have passed the test related to pe.py
. We have defined one test for pe.py
that you should have passed: test_pe
, which initializes a PositionalEncoding
object, tests whether you have filled in self.pe
correctly, and whether def forward(x)
works with some random test data.
For now, the other five tests are expected to fail or error out (assuming you have already passed the four tests for mha.py
).
!python grade.py
EEF.....EE ====================================================================== ERROR: test_encoder_decoder_predictions (test_visible.TestStep.test_encoder_decoder_predictions) ---------------------------------------------------------------------- Traceback (most recent call last): File "/mnt/c/Users/junru/Downloads/transformer_mp/src_test/tests/test_visible.py", line 311, in test_encoder_decoder_predictions output = model(src = src, tgt = trg, src_lengths = src_lengths, tgt_lengths = trg_lengths) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jerome-ni/anaconda3/envs/test_transformer_mp_torch_2.0.1/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl return forward_call(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/c/Users/junru/Downloads/transformer_mp/src_test/transformer.py", line 233, in forward dec_output = self.forward_decoder(enc_output, src_padding_mask, tgt, tgt_lengths) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/c/Users/junru/Downloads/transformer_mp/src_test/transformer.py", line 208, in forward_decoder return dec_output ^^^^^^^^^^ NameError: name 'dec_output' is not defined ====================================================================== ERROR: test_encoder_decoder_states (test_visible.TestStep.test_encoder_decoder_states) ---------------------------------------------------------------------- Traceback (most recent call last): File "/mnt/c/Users/junru/Downloads/transformer_mp/src_test/tests/test_visible.py", line 365, in test_encoder_decoder_states output = model(src = src, tgt = trg, src_lengths = src_lengths, tgt_lengths = trg_lengths) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jerome-ni/anaconda3/envs/test_transformer_mp_torch_2.0.1/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl return forward_call(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/c/Users/junru/Downloads/transformer_mp/src_test/transformer.py", line 233, in forward dec_output = self.forward_decoder(enc_output, src_padding_mask, tgt, tgt_lengths) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/c/Users/junru/Downloads/transformer_mp/src_test/transformer.py", line 208, in forward_decoder return dec_output ^^^^^^^^^^ NameError: name 'dec_output' is not defined ====================================================================== ERROR: test_decoder_inference_cache_extra_credit (test_visible_ec.TestStep.test_decoder_inference_cache_extra_credit) ---------------------------------------------------------------------- Traceback (most recent call last): File "/mnt/c/Users/junru/Downloads/transformer_mp/src_test/tests/test_visible_ec.py", line 162, in test_decoder_inference_cache_extra_credit output_list, decoder_cache = model.inference(src = src, src_lengths = src_lengths, max_output_length = MAX_INFERENCE_LENGTH) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/c/Users/junru/Downloads/transformer_mp/src_test/transformer.py", line 274, in inference tgt_embedded = self.dropout(self.positional_encoding(self.decoder_embedding(tgt))) ^^^ NameError: name 'tgt' is not defined ====================================================================== ERROR: test_decoder_inference_outputs_extra_credit (test_visible_ec.TestStep.test_decoder_inference_outputs_extra_credit) 
---------------------------------------------------------------------- Traceback (most recent call last): File "/mnt/c/Users/junru/Downloads/transformer_mp/src_test/tests/test_visible_ec.py", line 109, in test_decoder_inference_outputs_extra_credit output_list, decoder_cache = model.inference(src = src, src_lengths = src_lengths, max_output_length = MAX_INFERENCE_LENGTH) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/c/Users/junru/Downloads/transformer_mp/src_test/transformer.py", line 274, in inference tgt_embedded = self.dropout(self.positional_encoding(self.decoder_embedding(tgt))) ^^^ NameError: name 'tgt' is not defined ====================================================================== FAIL: test_encoder_output (test_visible.TestStep.test_encoder_output) ---------------------------------------------------------------------- Traceback (most recent call last): File "/mnt/c/Users/junru/Downloads/transformer_mp/src_test/tests/test_visible.py", line 279, in test_encoder_output self.assertAlmostEqual(torch.sum(torch.abs(ref_enc_item - enc_item)).item(), 0, places = 4, msg=f'The encoder output for the sample #{item_idx} in the batch #{it_idx} is not correct') AssertionError: 646.3253784179688 != 0 within 4 places (646.3253784179688 difference) : The encoder output for the sample #0 in the batch #0 is not correct ---------------------------------------------------------------------- Ran 10 tests in 1.817s FAILED (failures=1, errors=4)
The Transformer Encoder consists of a stack of Transformer encoder layers. Each encoder layer has two sets of sub-layer operations, applied one set after another. The first set involves the multi-head self-attention mechanism, and the second set involves a simple, position-wise, fully connected feed-forward network. In the paper, for each of the two sub-layer operation sets, a dropout operation is immediately applied to the output of the sub-layer, followed by a residual connection around each of the two sub-layers, followed by layer normalization. That is, letting $x$ denote the input to the sub-layer, the output of each sub-layer is $\operatorname{LayerNorm}(x + \operatorname{DropOut}(\operatorname{Sublayer}(x)))$, where $\operatorname{Sublayer}(x)$ is the function implemented by the sub-layer itself.
For the first $\operatorname{Sublayer}(x)$, the encoder multi-head self-attention mechanism, all of the keys, values, and queries come from the same place, and are either the encoder input (for the first layer), or the output of the previous layer (for subsequent layers). Each position in the encoder self-attention mechanism can attend to all positions in the previous layer of the encoder (or the input for the first layer). Note that in the actual implementation, if we are padding each source input in the batch to the same length, we still need to use a mask to prevent the encoder self-attention mechanism from attending to positions beyond the original input length.
For the second $\operatorname{Sublayer}(x)$, the position-wise, fully-connected feed-forward network, the network is applied to each position separately and identically. This consists of two linear transformations with a ReLU activation in between, as $\operatorname{FFN}(x)=\max \left(0, x W_1+b_1\right) W_2+b_2$, where for the inner layer, $W_1 \in \mathbb{R}^{d_{model} \times d_{ff}}$, $b_1 \in \mathbb{R}^{d_{ff}}$ and for the outer layer, $W_2 \in \mathbb{R}^{d_{ff} \times d_{model}}$, $b_2 \in \mathbb{R}^{d_{model}}$.
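A minimal sketch of the position-wise FFN with illustrative sizes (in encoder.py the two linear maps are already provided as self.fc1 and self.fc2, with self.activation_fn as the ReLU, so this is only for intuition):

```python
import torch
import torch.nn as nn

d_model, d_ff = 16, 64                 # illustrative sizes
fc1 = nn.Linear(d_model, d_ff)         # W_1, b_1
fc2 = nn.Linear(d_ff, d_model)         # W_2, b_2

x = torch.randn(2, 5, d_model)         # B x T x d_model
ffn_out = fc2(torch.relu(fc1(x)))      # applied to every position separately and identically
print(ffn_out.shape)                   # torch.Size([2, 5, 16])
```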
The figure below from pg. 3 of the original paper shows the Transformer encoder architecture.
from IPython.display import Image
Image("img/encoder.PNG")
In encoder.py
, you will need to complete the def forward(self, x,self_attn_padding_mask = None)
defined in the class class TransformerEncoderLayer(nn.Module)
, which implements a single layer in the Transformer encoder. In def __init__(self, embedding_dim, ffn_embedding_dim, num_attention_heads, dropout_prob)
, we have already pre-defined some model submodules and hyperparameters as:
self.embedding_dim - This is the model dimension, aka d_model
self.self_attn - The self-attention mechanism, built using the MultiHeadAttention implemented earlier
self.self_attn_layer_norm - Layer norm for the self-attention layer's output
self.activation_fn - The ReLU activation for position-wise feed-forward network
self.fc1 - The parameters will be used for W_1 and b_1 in the position-wise feed-forward network
self.fc2 - The parameters will be used for W_2 and b_2 in the position-wise feed-forward network
self.dropout - The DropOut regularization module to be applied immediately after self-attention module and FFN module
As described earlier, in def forward(self, x,self_attn_padding_mask = None)
, you are simply asked to implement:
$$\operatorname{LayerNorm}(x + \operatorname{DropOut}(\operatorname{Self-Attention}(x)))$$
followed by
$$\operatorname{LayerNorm}(x + \operatorname{DropOut}(\operatorname{FFN}(x)))$$
where $\operatorname{FFN}(x)=\max \left(0, x W_1+b_1\right) W_2+b_2$.
You should use all the model parameters already given to you in class TransformerEncoderLayer(nn.Module)
, and should not need to define or use other parameters or helper functions.
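The residual + dropout + layer-norm pattern itself is the same for both sub-layers. Here is a hedged sketch of that pattern with hypothetical stand-in modules (sublayer_block, dropout, and layer_norm are made-up names; in your forward you should use the self.* modules listed above instead):

```python
import torch
import torch.nn as nn

d_model = 16
dropout = nn.Dropout(p=0.1)            # stand-in for self.dropout
layer_norm = nn.LayerNorm(d_model)     # stand-in for the layer-norm modules the class already defines

def sublayer_block(x, sublayer):
    """Post-norm residual block: LayerNorm(x + Dropout(Sublayer(x)))."""
    return layer_norm(x + dropout(sublayer(x)))

x = torch.randn(2, 5, d_model)
ffn = nn.Sequential(nn.Linear(d_model, 64), nn.ReLU(), nn.Linear(64, d_model))
y = sublayer_block(x, ffn)             # same shape as x: B x T x d_model
print(y.shape)
```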
The entire Transformer encoder has already been implemented for you, which simply stacks the Transformer Encoder Layer implemented earlier together to form a Transformer Encoder. Please read the class TransformerEncoder(nn.Module)
in encoder.py
as you will need to invoke its forward
function later.
from encoder import TransformerEncoder, TransformerEncoderLayer
help(TransformerEncoderLayer.forward)
Help on function forward in module encoder: forward(self, x, self_attn_padding_mask=None) Applies the self attention module + Dropout + Add & Norm operation, and the position-wise feedforward network + Dropout + Add & Norm operation. Note that LayerNorm is applied after the self-attention, and another time after the ffn modules, similar to the original Transformer implementation. Input: x (torch.Tensor) - input tensor of size B x T x embedding_dim from the encoder input or the previous encoder layer; serves as input to the TransformerEncoderLayer's self attention mechanism. self_attn_padding_mask (None/torch.Tensor) - If it is not None, then it is a torch.IntTensor/torch.LongTensor of size B x T, where for each self_attn_padding_mask[b] for the b-th source in the batch, the non-zero positions will be ignored as they represent the padded region during batchify operation in the dataloader (i.e., disallowed for attention) while the zero positions will be allowed for attention as they are within the length of the original sequence Output: x (torch.Tensor) - the encoder layer's output, of size B x T x embedding_dim, after the self attention module + Dropout + Add & Norm operation, and the position-wise feedforward network + Dropout + Add & Norm operation.
To finish your Transformer encoder (after reading the implementation in class TransformerEncoder(nn.Module)
), you now need to implement the def forward_encoder(self, src, src_lengths)
in class Transformer(nn.Module)
, which is defined in transformer.py
. The
__init__(self, src_vocab_size, tgt_vocab_size, sos_idx, eos_idx, d_model, num_heads, num_layers, d_ff, max_seq_length, dropout_prob)
defines some model submodules and hyperparameters that are related to forward_encoder
as:
self.encoder_embedding - Encoder input embedding that converts discrete tokens into continuous vectors; this will be already invoked for you in forward_encoder to get the encoder's input src_embedded.
self.positional_encoding - Positional Encoding used by the Transformer encoder; this will be already invoked for you in forward_encoder to get the encoder's input src_embedded.
self.encoder - Creates an instance of TransformerEncoder.
self.dropout - For applying additional dropout after positional encoding; this will be already invoked for you in forward_encoder to get the encoder's input src_embedded.
All you need to do in def forward_encoder(self, src, src_lengths)
is to create the padding mask for the encoder input correctly, and get the output from the TransformerEncoder
correctly. You will find the helper function def length_to_padding_mask(lengths, device="cpu", dtype=torch.long)
to be useful for creating the padding mask (aka encoder_padding_mask
, or self_attn_padding_mask
).
help(TransformerEncoder.forward)
Help on function forward in module encoder: forward(self, x, encoder_padding_mask=None) Applies the encoder layers in self.layers one by one, followed by an optional output layer if it exists Input: x (torch.Tensor) - input tensor of size B x T x embedding_dim; input to the TransformerEncoderLayer's self attention mechanism encoder_padding_mask (None/torch.Tensor) - If it is not None, then it is a torch.IntTensor/torch.LongTensor of size B x T, where for each encoder_padding_mask[b] for the b-th source in the batch, the non-zero positions will be ignored as they represent the padded region during batchify operation in the dataloader (i.e., disallowed for attention) while the zero positions will be allowed for attention as they are within the length of the original sequence Output: x (torch.Tensor) - the Transformer encoder's output, of size B x T x embedding_dim, if output layer is None, or of size B x T x output_layer_size, if there is an output layer.
from transformer import Transformer, length_to_padding_mask
help(Transformer.forward_encoder)
Help on function forward_encoder in module transformer: forward_encoder(self, src, src_lengths) Applies the Transformer encoder to src, where each sequence in src has been padded to the max(src_lengths) Input: src (torch.Tensor) - Encoder's input tensor of size B x T_e x d_model src_lengths (torch.Tensor) - A 1D iterable of Long/Int of length B, where the b-th length in src_lengths corresponds to the actual length of src[b] (beyond that is the pre-padded region); T_e = max(src_lengths) Output: enc_output (torch.Tensor) - the Transformer encoder's output, of size B x T_e x d_model src_padding_mask (torch.Tensor) - the encoder_padding_mask/key_padding_mask used by the Transformer encoder's self-attention; this should be created from src_lengths
help(length_to_padding_mask)
Help on function length_to_padding_mask in module transformer: length_to_padding_mask(lengths, device='cpu', dtype=torch.int64) Convert a list/1D tensor/1D array of length in to padding masks used by the encoder and the decoder's attention mechanism For example, length_to_padding_mask([3, 4, 5]) will return a torch.tensor of dtype, on the device, as: [[0, 0, 0, 1, 1], [0, 0, 0, 0, 1], [0, 0, 0, 0, 0]] Input: lengths (List/torch.Tensor/np.array) - a 1D iterable List/torch.Tensor/np.array device (str/torch.Tensor.device): where the return tensor will be located, say, "cpu" or "cuda" or torch.Tensor.device dtype (torch.dtype) - result dtype Output: ret (torch.Tensor) - a padding mask of size len(lengths) x max(lengths), with non-zero positions indicating locations out of bounds , of "dtype", on the "device"
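For example, reproducing the docstring's example (this assumes transformer.py is importable, as in the cells above):

```python
from transformer import length_to_padding_mask

src_lengths = [3, 4, 5]
src_padding_mask = length_to_padding_mask(src_lengths)
print(src_padding_mask)
# tensor([[0, 0, 0, 1, 1],
#         [0, 0, 0, 0, 1],
#         [0, 0, 0, 0, 0]])
```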
If you believe you have implemented everything in this section correctly, you can run python grade.py
to see if you have passed the test related to the Transformer encoder implementation, assuming ALL your previous tests have passed. We have defined one test for checking your Transformer encoder implementation that you should have passed: test_encoder_output
, which initializes a Transformer
object, loads the model weights from a de-en neural machine translation checkpoint trained_de_en_state_dict.pt
, and invokes the forward_encoder
function on some German sentences (that are converted to discrete index sequences) to check against the intermediate encoder output pre-generated by our implementation.
For now, the other four tests are expected to fail or error out (assuming you have already passed the four tests for mha.py
and the one test for pe.py
).
Note: if you are getting a very small error that causes you to fail the test, please double-check that you have the correct torch version installed. Our outputs on the visible set are pre-generated using Python 3.10
and torch==2.0.1
, so at the very least you need to make sure that you are using torch==2.0.1
for this. Newer or older PyTorch versions are known to have slightly different implementations of the same internal modules, which can result in slightly different computed results. If in doubt, you can also submit to GradeScope for testing. The auto-grader on GradeScope will NOT have this issue, as the solutions are generated during submission with the same platform and package versions as your implementation uses, but your code will be tested on the hidden set instead.
!python grade.py
EE......EE ====================================================================== ERROR: test_encoder_decoder_predictions (test_visible.TestStep.test_encoder_decoder_predictions) ---------------------------------------------------------------------- Traceback (most recent call last): File "/mnt/c/Users/junru/Downloads/transformer_mp/src_test/tests/test_visible.py", line 311, in test_encoder_decoder_predictions output = model(src = src, tgt = trg, src_lengths = src_lengths, tgt_lengths = trg_lengths) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jerome-ni/anaconda3/envs/test_transformer_mp_torch_2.0.1/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl return forward_call(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/c/Users/junru/Downloads/transformer_mp/src_test/transformer.py", line 233, in forward dec_output = self.forward_decoder(enc_output, src_padding_mask, tgt, tgt_lengths) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/c/Users/junru/Downloads/transformer_mp/src_test/transformer.py", line 208, in forward_decoder return dec_output ^^^^^^^^^^ NameError: name 'dec_output' is not defined ====================================================================== ERROR: test_encoder_decoder_states (test_visible.TestStep.test_encoder_decoder_states) ---------------------------------------------------------------------- Traceback (most recent call last): File "/mnt/c/Users/junru/Downloads/transformer_mp/src_test/tests/test_visible.py", line 365, in test_encoder_decoder_states output = model(src = src, tgt = trg, src_lengths = src_lengths, tgt_lengths = trg_lengths) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jerome-ni/anaconda3/envs/test_transformer_mp_torch_2.0.1/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl return forward_call(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/c/Users/junru/Downloads/transformer_mp/src_test/transformer.py", line 233, in forward dec_output = self.forward_decoder(enc_output, src_padding_mask, tgt, tgt_lengths) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/c/Users/junru/Downloads/transformer_mp/src_test/transformer.py", line 208, in forward_decoder return dec_output ^^^^^^^^^^ NameError: name 'dec_output' is not defined ====================================================================== ERROR: test_decoder_inference_cache_extra_credit (test_visible_ec.TestStep.test_decoder_inference_cache_extra_credit) ---------------------------------------------------------------------- Traceback (most recent call last): File "/mnt/c/Users/junru/Downloads/transformer_mp/src_test/tests/test_visible_ec.py", line 162, in test_decoder_inference_cache_extra_credit output_list, decoder_cache = model.inference(src = src, src_lengths = src_lengths, max_output_length = MAX_INFERENCE_LENGTH) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/c/Users/junru/Downloads/transformer_mp/src_test/transformer.py", line 274, in inference tgt_embedded = self.dropout(self.positional_encoding(self.decoder_embedding(tgt))) ^^^ NameError: name 'tgt' is not defined ====================================================================== ERROR: test_decoder_inference_outputs_extra_credit (test_visible_ec.TestStep.test_decoder_inference_outputs_extra_credit) 
---------------------------------------------------------------------- Traceback (most recent call last): File "/mnt/c/Users/junru/Downloads/transformer_mp/src_test/tests/test_visible_ec.py", line 109, in test_decoder_inference_outputs_extra_credit output_list, decoder_cache = model.inference(src = src, src_lengths = src_lengths, max_output_length = MAX_INFERENCE_LENGTH) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/c/Users/junru/Downloads/transformer_mp/src_test/transformer.py", line 274, in inference tgt_embedded = self.dropout(self.positional_encoding(self.decoder_embedding(tgt))) ^^^ NameError: name 'tgt' is not defined ---------------------------------------------------------------------- Ran 10 tests in 1.876s FAILED (errors=4)
The Transformer Decoder consists of a stack of Transformer decoder layers. Each decoder layer has three sets of sub-layer operations, applied one set after another. The first set involves the decoder multi-head self-attention mechanism, the second set involves the multi-head encoder-decoder attention mechanism, and the third set involves a simple, position-wise, fully connected feed-forward network. A dropout operation is immediately applied to the output of each sub-layer, followed by a residual connection around each of the three sub-layers, followed by layer normalization. That is, letting $x$ denote the input to the sub-layer, the output of each sub-layer is $\operatorname{LayerNorm}(x + \operatorname{DropOut}(\operatorname{Sublayer}(x)))$, where $\operatorname{Sublayer}(x)$ is the function implemented by the sub-layer itself.
For the first $\operatorname{Sublayer}(x)$, the decoder multi-head self-attention mechanism, all of the keys, values, and queries come from the same place, and are either the decoder input (for the first layer) or the output of the previous decoder layer (for subsequent layers). However, unlike encoder self-attention, we prevent each temporal position in the temporal dimension from attending to subsequent positions. We have already taken care of this scenario with the attention_mask
argument in the forward
function of the MultiHeadAttention
module. This masking ensures that the predictions for position $i+1$ can depend only on the known outputs at positions up to and including $i$. If we are padding each decoder input in the batch to the same length, we still need to use a mask to additionally avoid the decoder self-attention mechanism from attending to positions beyond the original input length.
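The triangular pattern shown in the docstrings can be built with torch.triu; the subsequent_mask helper in transformer.py, introduced later in this section, already does this for you, so the snippet below is purely illustrative of the convention (non-zero = disallowed):

```python
import torch

T_d = 5
# Keep the strict upper triangle: 1 = future position (disallowed), 0 = allowed.
self_attn_mask = torch.triu(torch.ones(T_d, T_d, dtype=torch.long), diagonal=1)
print(self_attn_mask)
# tensor([[0, 1, 1, 1, 1],
#         [0, 0, 1, 1, 1],
#         [0, 0, 0, 1, 1],
#         [0, 0, 0, 0, 1],
#         [0, 0, 0, 0, 0]])
```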
For the second $\operatorname{Sublayer}(x)$, the encoder-decoder multi-head attention mechanism, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend to all positions in the input sequence. We still need to use a mask to avoid attending to positions beyond the original input sequence length.
For the third $\operatorname{Sublayer}(x)$, the position-wise, fully-connected feed-forward network, the network is applied to each position separately and identically. This consists of two linear transformations with a ReLU activation in between, as $\operatorname{FFN}(x)=\max \left(0, x W_1+b_1\right) W_2+b_2$, where for the inner layer, $W_1 \in \mathbb{R}^{d_{model} \times d_{ff}}$, $b_1 \in \mathbb{R}^{d_{ff}}$ and for the outer layer, $W_2 \in \mathbb{R}^{d_{ff} \times d_{model}}$, $b_2 \in \mathbb{R}^{d_{model}}$.
The figure below from pg. 3 of the original paper shows the entire Transformer encoder-decoder architecture, as part of the encoder-decoder attention module's input comes from the Transformer encoder.
from IPython.display import Image
Image("img/encoder_decoder.PNG")
In decoder.py
, you will need to complete the def forward(self, x, encoder_out = None, encoder_padding_mask = None, self_attn_padding_mask = None,self_attn_mask = None)
defined in the class class TransformerDecoderLayer(nn.Module)
, which implements a single decoder layer in the Transformer decoder. In def __init__(self, embedding_dim, ffn_embedding_dim, num_attention_heads, dropout_prob, no_encoder_attn=False)
, we have already pre-defined some model submodules and hyperparameters as:
self.embedding_dim - This is the model dimension, aka d_model
self.self_attn - The decoder self-attention mechanism, built using the MultiHeadAttention implemented earlier
self.self_attn_layer_norm - Layer norm for the decoder self-attention layer's output
self.encoder_attn - If an encoder-decoder architecture is built, the encoder-decoder attention mechanism, built using the MultiHeadAttention implemented earlier
self.encoder_attn_layer_norm - Layer norm for the encoder-decoder attention layer's output
self.activation_fn - The ReLU activation for position-wise feed-forward network
self.fc1 - The parameters will be used for W_1 and b_1 in the position-wise feed-forward network
self.fc2 - The parameters will be used for W_2 and b_2 in the position-wise feed-forward network
self.dropout - The DropOut regularization module to be applied immediately after self-attention module, encoder-decoder attention module and FFN module
As described earlier, in def forward(self, x, encoder_out = None, encoder_padding_mask = None, self_attn_padding_mask = None,self_attn_mask = None)
, you are simply asked to implement:
$$ x = \operatorname{LayerNorm}(x + \operatorname{DropOut}(\operatorname{Decoder-Self-Attention}(x)))$$
followed by
$$ x = \operatorname{LayerNorm}(x + \operatorname{DropOut}(\operatorname{Encoder-Decoder-Attention}(x, encoder\_out)))$$
followed by
$$ x = \operatorname{LayerNorm}(x + \operatorname{DropOut}(\operatorname{FFN}(x)))$$
where $\operatorname{FFN}(x)=\max \left(0, x W_1+b_1\right) W_2+b_2$.
You should use all the model parameters already given to you in class TransformerDecoderLayer(nn.Module)
, and should not need to define or use other parameters or helper functions.
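The only structurally new piece relative to the encoder layer is the middle sub-layer, where the query comes from the decoder stream while the key and value come from the encoder output. Below is a hedged sketch of just that call, assuming you have already completed mha.py (encoder_attn here is a stand-in for self.encoder_attn, and the shapes are illustrative):

```python
import torch
from mha import MultiHeadAttention

d_model, num_heads = 16, 4
encoder_attn = MultiHeadAttention(d_model, num_heads)  # stand-in for self.encoder_attn

x = torch.randn(2, 3, d_model)             # decoder stream, B x T_d x d_model
encoder_out = torch.randn(2, 7, d_model)   # encoder output, B x T_e x d_model

# Query from the decoder, key/value from the encoder output;
# the encoder_padding_mask would be passed as key_padding_mask here.
attended = encoder_attn(query=x, key=encoder_out, value=encoder_out)
# attended: B x T_d x d_model (check mha.py's forward for its exact return value)
```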
The entire Transformer decoder has already been implemented for you, which simply stacks the TransformerDecoderLayer implemented earlier together to form a TransformerDecoder. Please read the class TransformerDecoder(nn.Module)
in decoder.py
as you will need to invoke its forward
function later.
from decoder import TransformerDecoder, TransformerDecoderLayer
help(TransformerDecoderLayer.forward)
Help on function forward in module decoder: forward(self, x, encoder_out=None, encoder_padding_mask=None, self_attn_padding_mask=None, self_attn_mask=None) Applies the self attention module + Dropout + Add & Norm operation, the encoder-decoder attention + Dropout + Add & Norm operation (if self.encoder_attn is not None), and the position-wise feedforward network + Dropout + Add & Norm operation. Note that LayerNorm is applied after the self-attention operation, after the encoder-decoder attention operation and another time after the ffn modules, similar to the original Transformer implementation. Input: x (torch.Tensor) - input tensor of size B x T_d x embedding_dim from the decoder input or the previous encoder layer, where T_d is the decoder's temporal dimension; serves as input to the TransformerDecoderLayer's self attention mechanism. encoder_out (None/torch.Tensor) - If it is not None, then it is the output from the TransformerEncoder as a tensor of size B x T_e x embedding_dim, where T_e is the encoder's temporal dimension; serves as part of the input to the TransformerDecoderLayer's self attention mechanism (hint: which part?). encoder_padding_mask (None/torch.Tensor) - If it is not None, then it is a torch.IntTensor/torch.LongTensor of size B x T_e, where for each encoder_padding_mask[b] for the b-th source in the batched tensor encoder_out[b], the non-zero positions will be ignored as they represent the padded region during batchify operation in the dataloader (i.e., disallowed for attention) while the zero positions will be allowed for attention as they are within the length of the original sequence self_attn_padding_mask (None/torch.Tensor) - If it is not None, then it is a torch.IntTensor/torch.LongTensor of size B x T_d, where for each self_attn_padding_mask[b] for the b-th source in the batched tensor x[b], the non-zero positions will be ignored as they represent the padded region during batchify operation in the dataloader (i.e., disallowed for attention) while the zero positions will be allowed for attention as they are within the length of the original sequence self_attn_mask (None/torch.Tensor) - If it is not None, then it is a torch.IntTensor/torch.LongTensor of size 1 x T_d x T_d or B x T_d x T_d. It is used for decoder self-attention to enforce auto-regressive property during parallel training; suppose the maximum length of a batch is 5, then the attention_mask for any input in the batch will look like this for each input of the batch. 0 1 1 1 1 0 0 1 1 1 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 The non-zero positions will be ignored and disallowed for attention while the zero positions will be allowed for attention. Output: x (torch.Tensor) - the decoder layer's output, of size B x T_d x embedding_dim, after the self attention module + Dropout + Add & Norm operation, the encoder-decoder attention + Dropout + Add & Norm operation (if self.encoder_attn is not None), and the position-wise feedforward network + Dropout + Add & Norm operation.
To finish your Transformer decoder (after reading the implementation in class TransformerDecoder(nn.Module)
), you now need to implement the def forward_decoder(self, enc_output, src_padding_mask, tgt, tgt_lengths)
in class Transformer(nn.Module)
, which is defined in transformer.py
. The
__init__(self, src_vocab_size, tgt_vocab_size, sos_idx, eos_idx, d_model, num_heads, num_layers, d_ff, max_seq_length, dropout_prob)
defines some model submodules and hyperparameters that are related to forward_decoder
as:
self.decoder_embedding - Decoder input embedding that converts discrete tokens into continuous vectors; this will be already invoked for you in forward_decoder to get the decoder's input tgt_embedded.
self.positional_encoding - Positional Encoding used by the Transformer decoder; this will be already invoked for you in forward_decoder to get the decoder's input tgt_embedded.
self.decoder - Creates an instance of TransformerDecoder.
self.dropout - For applying additional dropout after positional encoding; this will be already invoked for you in forward_decoder to get the decoder's input tgt_embedded.
self.sos_idx - Every encoder and decoder sequence starts with this token index (useful in extra-credit)
self.eos_idx - Every encoder and decoder sequence ends with this token index (useful in extra-credit)
All you need to do in def forward_decoder(self, enc_output, src_padding_mask, tgt, tgt_lengths)
is to: (a) create the padding mask and the attention mask for the decoder input correctly; (b) pass tgt_embedded
, which is your decoder (target) sequence input, together with enc_output
and src_padding_mask
, which are the return values from your forward_encoder
, into the forward
call of the TransformerDecoder
object; and (c) return the decoder output.
You will find the helper function def length_to_padding_mask(lengths, device="cpu", dtype=torch.long)
to be useful for creating the padding mask for padded decoder input (aka self_attn_padding_mask
). You will find the helper function def subsequent_mask(size, device="cpu", dtype=torch.long)
to be useful for creating the additional decoder auto-regressive mask for self-attention (aka self_attn_mask
).
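As a hedged illustration of how the two helpers fit together (tgt_lengths is a made-up example; the exact mask shapes the decoder expects are documented in the help output below):

```python
import torch
from transformer import length_to_padding_mask, subsequent_mask  # provided helpers

tgt_lengths = torch.tensor([2, 4])                      # hypothetical target lengths
T_d = int(tgt_lengths.max())

self_attn_padding_mask = length_to_padding_mask(tgt_lengths)   # B x T_d, non-zero = padded
self_attn_mask = subsequent_mask(T_d).unsqueeze(0)             # 1 x T_d x T_d, non-zero = future
print(self_attn_padding_mask.shape, self_attn_mask.shape)
```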
help(TransformerDecoder.forward)
Help on function forward in module decoder: forward(self, x, decoder_padding_mask=None, decoder_attention_mask=None, encoder_out=None, encoder_padding_mask=None) Applies the encoder layers in self.layers one by one, followed by an optional output layer if it exists Input: x (torch.Tensor) - input tensor of size B x T_d x embedding_dim; input to the TransformerDecoderLayer's self attention mechanism decoder_padding_mask (None/torch.Tensor) - If it is not None, then it is a torch.IntTensor/torch.LongTensor of size B x T_d, where for each decoder_padding_mask[b] for the b-th source in the batched tensor x[b], the non-zero positions will be ignored as they represent the padded region during batchify operation in the dataloader (i.e., disallowed for attention) while the zero positions will be allowed for attention as they are within the length of the original sequence decoder_attention_mask (None/torch.Tensor) - If it is not None, then it is a torch.IntTensor/torch.LongTensor of size 1 x T_d x T_d or B x T_d x T_d. It is used for decoder self-attention to enforce auto-regressive property during parallel training; suppose the maximum length of a batch is 5, then the attention_mask for any input in the batch will look like this for each input of the batch. 0 1 1 1 1 0 0 1 1 1 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 The non-zero positions will be ignored and disallowed for attention while the zero positions will be allowed for attention. encoder_out (None/torch.Tensor) - If it is not None, then it is the output from the TransformerEncoder as a tensor of size B x T_e x embedding_dim, where T_e is the encoder's temporal dimension; serves as part of the input to the TransformerDecoderLayer's self attention mechanism (hint: which part?). encoder_padding_mask (None/torch.Tensor) - If it is not None, then it is a torch.IntTensor/torch.LongTensor of size B x T_e, where for each encoder_padding_mask[b] for the b-th source in the batch, the non-zero positions will be ignored as they represent the padded region during batchify operation in the dataloader (i.e., disallowed for attention) while the zero positions will be allowed for attention as they are within the length of the original sequence Output: x (torch.Tensor) - the Transformer decoder's output, of size B x T_d x embedding_dim, if output layer is None, or of size B x T_d x output_layer_size, if there is an output layer.
help(Transformer.forward_decoder)
Help on function forward_decoder in module transformer: forward_decoder(self, enc_output, src_padding_mask, tgt, tgt_lengths) Applies the Transformer decoder to tgt and enc_output (possibly as used during training to obtain the next token prediction under teacher-forcing), where sequences in enc_output are associated with src_padding_mask, and each sequence in tgt has been padded to the max(tgt_lengths) Input: enc_output (torch.Tensor) - the Transformer encoder's output, of size B x T_e x d_model src_padding_mask (torch.Tensor) - the encoder_padding_mask/key_padding_mask associated with enc_output. It is a torch.IntTensor/torch.LongTensor of size B x T_e, where for each src_padding_mask[b] for the b-th source in the batched tensor enc_output[b], the non-zero positions will be ignored as they represent the padded region during batchify operation in the dataloader (i.e., disallowed for attention) while the zero positions will be allowed for attention as they are within the length of the original sequence tgt (torch.Tensor) - Decoder's input tensor of size B x T_d x d_model tgt_lengths (torch.Tensor) - A 1D iterable of Long/Int of length B, where the b-th length in tgt_lengths corresponds to the actual length of tgt[b] (beyond that is the pre-padded region); T_d = max(tgt_lengths) Output: dec_output (torch.Tensor) - the Transformer's final output from the decoder of size B x T_d x tgt_vocab_size, as there is an output layer.
from transformer import subsequent_mask
help(subsequent_mask)
Help on function subsequent_mask in module transformer:

subsequent_mask(size, device='cpu', dtype=torch.int64)
    Create a mask for subsequent steps (size x size); this may be useful for creating decoder attention masks for parallel auto-regressive training.

    subsequent_mask(3) will return a torch.Tensor of dtype, on the device, as:
    [[0, 1, 1],
     [0, 0, 1],
     [0, 0, 0]]

    Input:
    size (int) - size of mask
    device (str/torch.Tensor.device) - where the returned tensor will be located, say, "cpu" or "cuda" or torch.Tensor.device
    dtype (torch.dtype) - result dtype

    Output:
    torch.Tensor - mask for subsequent steps with shape size x size, of "dtype", on the "device"
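For reference, a function with this behavior could be written with torch.triu. The following is only a sketch of the idea (under the assumption that a strictly-upper-triangular matrix of ones is what is wanted); the provided transformer.py may implement it differently.

import torch

def subsequent_mask_sketch(size: int, device: str = "cpu", dtype: torch.dtype = torch.int64) -> torch.Tensor:
    # ones strictly above the diagonal mark future positions that must not be attended to
    return torch.triu(torch.ones(size, size, device=device, dtype=dtype), diagonal=1)

print(subsequent_mask_sketch(3))
# tensor([[0, 1, 1],
#         [0, 0, 1],
#         [0, 0, 0]])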
If you believe you have implemented everything in this section correctly, you can run python grade.py
to see if you have passed the tests related to the Transformer decoder implementation, assuming ALL your previous tests have passed. We have defined two tests that you should now pass: test_encoder_decoder_predictions
, which initializes a Transformer
object, loads the model weights from a de-en neural machine translation checkpoint trained_de_en_state_dict.pt
, and invokes the forward_decoder
function on some German sentences (that are converted to discrete index sequences), which, conditioned on previous tokens in the parallel English sentences, predicts the next tokens in English. We take your next token prediction outputs of the unpadded regions to check against the next token predictions pre-generated by our implementation. This test will also throw an error if the next token prediction accuracy does not match ours, which indicates some major bugs. The second test, test_encoder_decoder_states,
works similarly, except that we are checking your pre-softmax output layer states from the decoder instead of discrete predictions.
For now, the other two tests are expected to either fail or error out, as they are for extra credit (assuming you have passed the four tests for mha.py
, the one test for pe.py
, and the one test for your Transformer encoder implementation).
Note: if you are getting a very small error that causes you to fail the test, please double-check that you have the correct torch version installed. Our outputs on the visible set are pre-generated using Python 3.10
and torch==2.0.1
, so at the very least you need to make sure that you are using torch==2.0.1
for this. Newer or older PyTorch versions are known to have slightly different implementations of the same internal modules, which can result in slightly different computed results. If in doubt, you can also submit to GradeScope for testing. The auto-grader on GradeScope will NOT have this issue, as the expected outputs are generated at submission time with the same platform and package versions that your implementation uses, but your code will be tested on the hidden set instead.
!python grade.py
........EE
======================================================================
ERROR: test_decoder_inference_cache_extra_credit (test_visible_ec.TestStep.test_decoder_inference_cache_extra_credit)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/mnt/c/Users/junru/Downloads/transformer_mp/src_test/tests/test_visible_ec.py", line 162, in test_decoder_inference_cache_extra_credit
    output_list, decoder_cache = model.inference(src = src, src_lengths = src_lengths, max_output_length = MAX_INFERENCE_LENGTH)
  File "/mnt/c/Users/junru/Downloads/transformer_mp/src_test/transformer.py", line 274, in inference
    tgt_embedded = self.dropout(self.positional_encoding(self.decoder_embedding(tgt)))
NameError: name 'tgt' is not defined
======================================================================
ERROR: test_decoder_inference_outputs_extra_credit (test_visible_ec.TestStep.test_decoder_inference_outputs_extra_credit)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/mnt/c/Users/junru/Downloads/transformer_mp/src_test/tests/test_visible_ec.py", line 109, in test_decoder_inference_outputs_extra_credit
    output_list, decoder_cache = model.inference(src = src, src_lengths = src_lengths, max_output_length = MAX_INFERENCE_LENGTH)
  File "/mnt/c/Users/junru/Downloads/transformer_mp/src_test/transformer.py", line 274, in inference
    tgt_embedded = self.dropout(self.positional_encoding(self.decoder_embedding(tgt)))
NameError: name 'tgt' is not defined
----------------------------------------------------------------------
Ran 10 tests in 2.064s

FAILED (errors=2)
For submission, make sure you are uploading all relevant Python modules onto GradeScope. Again, these are:
mha.py
pe.py
encoder.py
decoder.py
transformer.py
In the main part of the MP, we implemented the forward
function of the Transformer decoder. That implementation is mainly used during training: given paired source and target sequences, we feed the source sequence into the encoder, feed the entire target sequence as the decoder's input, and predict the left-shifted target sequence with the decoder auto-regressive attention mask applied (conditioned on the encoder's output). This training routine is called "teacher-forcing".
This is depicted in the following figure, where we use $[x_1, x_2, x_3, x_4]$ to denote the source sequence and $[y_1, y_2, y_3]$ to denote the paired target sequence, as one would find in most sequence-to-sequence parallel datasets. We introduce the $<sos>$ token as the start-of-sentence sentinel and the $<eos>$ token as the end-of-sentence sentinel. Note that with the decoder auto-regressive attention mask implemented earlier (and passed to the decoder during training), the decoder can always attend to the entire encoder states $[\bar{x}_1, \bar{x}_2, \bar{x}_3, \bar{x}_4]$, but its self-attention is restricted: when predicting $y_1$, it can only access the decoder state of $<sos>$; when predicting $y_2$, only the decoder states of $<sos>$ and $y_1$; when predicting $y_3$, only the decoder states of $<sos>$, $y_1$ and $y_2$; and when predicting $<eos>$, the decoder states of $<sos>$, $y_1$, $y_2$ and $y_3$.
Note that even if the decoder makes a mistake at the output layer during training, say it predicts $y_2$ as $\tilde{y}_2$, the prediction of $y_3$ is still conditioned on the decoder states generated by $<sos>$, $y_1$ and $y_2$, rather than on $<sos>$, $y_1$ and $\tilde{y}_2$. This is where the name "teacher-forcing" comes from.
from IPython.display import SVG
SVG("img/encoder_decoder_teacher_forcing.SVG")
After training the encoder-decoder model, we use it for inference by providing the model with just the source sequence $[x_1, x_2, x_3, x_4]$. Now the decoder must be put into a truly auto-regressive mode: it works in a step-by-step fashion, conditioning the prediction for time step $T+1$ on the predicted tokens from $t=1, 2,\cdots, T$ as:
enc_out = Encoder([x_1, x_2, x_3, x_4])
y_hat = [<sos>]
while the last token of y_hat is not <eos>:
    y_hat_i = Decoder(y_hat, enc_out)
    append y_hat_i to the end of y_hat
return y_hat
which is depicted by the following figure.
from IPython.display import SVG
SVG("img/encoder_decoder_inference.SVG")
To complete the extra credit, you will need to complete the missing code in three functions.
The first function is def forward_one_step_ec(self, x, encoder_out = None, encoder_padding_mask = None, self_attn_padding_mask = None, self_attn_mask = None, cache = None)
in class TransformerDecoderLayer(nn.Module)
, within decoder.py
. Please read the function helper and the comments within.
help(TransformerDecoderLayer.forward_one_step_ec)
Help on function forward_one_step_ec in module decoder:

forward_one_step_ec(self, x, encoder_out=None, encoder_padding_mask=None, self_attn_padding_mask=None, self_attn_mask=None, cache=None)
    Applies the self attention module + Dropout + Add & Norm operation, the encoder-decoder attention + Dropout + Add & Norm operation (if self.encoder_attn is not None), and the position-wise feedforward network + Dropout + Add & Norm operation, but for just a single time step at the last time step. Note that LayerNorm is applied after the self-attention operation, after the encoder-decoder attention operation, and another time after the ffn modules, similar to the original Transformer implementation.

    Input:
    x (torch.Tensor) - input tensor of size B x T_d x embedding_dim from the decoder input or the previous decoder layer, where T_d is the decoder's temporal dimension; serves as input to the TransformerDecoderLayer's self attention mechanism. You need to correctly slice x in the function below so that it is only calculating a one-step (one frame in length in the temporal dimension) decoder output for the last time step.

    encoder_out (None/torch.Tensor) - If it is not None, then it is the output from the TransformerEncoder as a tensor of size B x T_e x embedding_dim, where T_e is the encoder's temporal dimension; serves as part of the input to the TransformerDecoderLayer's self attention mechanism (hint: which part?).

    encoder_padding_mask (None/torch.Tensor) - If it is not None, then it is a torch.IntTensor/torch.LongTensor of size B x T_e, where for each encoder_padding_mask[b] for the b-th source in the batched tensor encoder_out[b], the non-zero positions will be ignored as they represent the padded region during the batchify operation in the dataloader (i.e., disallowed for attention), while the zero positions will be allowed for attention as they are within the length of the original sequence

    self_attn_padding_mask (None/torch.Tensor) - If it is not None, then it is a torch.IntTensor/torch.LongTensor of size B x T_d, where for each self_attn_padding_mask[b] for the b-th source in the batched tensor x[b], the non-zero positions will be ignored as they represent the padded region during the batchify operation in the dataloader (i.e., disallowed for attention), while the zero positions will be allowed for attention as they are within the length of the original sequence. If it is not None, then you need to correctly slice it in the function below so that it corresponds to the self_attn_padding_mask for calculating a one-step (one frame in length in the temporal dimension) decoder output for the last time step.

    self_attn_mask (None/torch.Tensor) - If it is not None, then it is a torch.IntTensor/torch.LongTensor of size 1 x T_d x T_d or B x T_d x T_d. It is used for decoder self-attention to enforce the auto-regressive property during parallel training; suppose the maximum length of a batch is 5, then the attention_mask for each input of the batch will look like this:

    0 1 1 1 1
    0 0 1 1 1
    0 0 0 1 1
    0 0 0 0 1
    0 0 0 0 0

    The non-zero positions will be ignored and disallowed for attention while the zero positions will be allowed for attention. If it is not None, then you need to correctly slice it in the function below so that it corresponds to the self_attn_mask for calculating a one-step (one frame in length in the temporal dimension) decoder output for the last time step.

    cache (torch.Tensor) - the output from this decoder layer previously computed up until the previous time step before the last; hence it is of size B x (T_d-1) x embedding_dim. It is to be concatenated with the single time-step output calculated in this function before being returned.

    Returns:
    x (torch.Tensor) - Output tensor of size B x T_d x embedding_dim, which is a concatenation of cache (previously computed up until the previous time step before the last) and the newly computed one-step decoder output for the last time step.
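Before comparing this to the full forward pass, the following hedged sketch shows what "one step at the last time step" means in terms of slicing. The names ag_x and slice_last_step are illustrative only and are not required by the MP.

import torch

def slice_last_step(x: torch.Tensor, self_attn_mask: torch.Tensor = None):
    # the last decoder frame is the only query; all T_d frames of x remain the keys/values
    ag_x = x[:, -1:, :]                        # (B, 1, embedding_dim)
    ag_mask = None
    if self_attn_mask is not None:
        # keep only the last query row; it still spans all T_d key positions
        ag_mask = self_attn_mask[:, -1:, :]    # (1 or B) x 1 x T_d
    return ag_x, ag_mask

The self-attention would then run with (query=ag_x, key=x, value=x), followed by the encoder-decoder attention and the feed-forward block on the one-frame slice, and the resulting one-frame output is concatenated with cache before being returned.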
It is very similar to the forward(self, x, encoder_out=None, encoder_padding_mask=None, self_attn_padding_mask=None, self_attn_mask=None) in class TransformerDecoderLayer(nn.Module), except for a few things:
- x is still an input tensor of size (B, T_d, embedding_dim); when calculating the decoder self-attention, you need to slice the input x and the input masks so that the last frame ag_x, of shape (B, 1, embedding_dim), attends to all temporal positions of x, and the outputs of the decoder self-attention, the encoder-decoder attention, and the position-wise feedforward network are all of shape (B, 1, embedding_dim) (see the sketch above).
- There is an additional argument cache. This is the output of the current decoder layer computed up until and including the previous time step; hence, its temporal dimension is one less than the temporal dimension of the decoder input argument x. Once you have computed the layer output of the current time step, you need to concatenate your computed one-frame-length output of size (B, 1, embedding_dim) to the temporal dimension of cache, and return it as the return value.

The second function you need to write is def forward_one_step_ec(self, x, decoder_padding_mask = None, decoder_attention_mask = None, encoder_out = None, encoder_padding_mask = None, cache = None)
in class TransformerDecoder(nn.Module)
, within decoder.py
. Please read the function helper and the comments within.
import decoder
import importlib
importlib.reload(decoder)
help(decoder.TransformerDecoder.forward_one_step_ec)
Help on function forward_one_step_ec in module decoder:

forward_one_step_ec(self, x, decoder_padding_mask=None, decoder_attention_mask=None, encoder_out=None, encoder_padding_mask=None, cache=None)
    Forward one step.

    Input:
    x (torch.Tensor) - input tensor of size B x T_d x embedding_dim; input to the TransformerDecoderLayer's self attention mechanism

    decoder_padding_mask (None/torch.Tensor) - If it is not None, then it is a torch.IntTensor/torch.LongTensor of size B x T_d, where for each decoder_padding_mask[b] for the b-th source in the batched tensor x[b], the non-zero positions will be ignored as they represent the padded region during the batchify operation in the dataloader (i.e., disallowed for attention), while the zero positions will be allowed for attention as they are within the length of the original sequence

    decoder_attention_mask (None/torch.Tensor) - If it is not None, then it is a torch.IntTensor/torch.LongTensor of size 1 x T_d x T_d or B x T_d x T_d. It is used for decoder self-attention to enforce the auto-regressive property during parallel training; suppose the maximum length of a batch is 5, then the attention_mask for each input of the batch will look like this:

    0 1 1 1 1
    0 0 1 1 1
    0 0 0 1 1
    0 0 0 0 1
    0 0 0 0 0

    The non-zero positions will be ignored and disallowed for attention while the zero positions will be allowed for attention.

    encoder_out (None/torch.Tensor) - If it is not None, then it is the output from the TransformerEncoder as a tensor of size B x T_e x embedding_dim, where T_e is the encoder's temporal dimension; serves as part of the input to the TransformerDecoderLayer's self attention mechanism (hint: which part?).

    encoder_padding_mask (None/torch.Tensor) - If it is not None, then it is a torch.IntTensor/torch.LongTensor of size B x T_e, where for each encoder_padding_mask[b] for the b-th source in the batch, the non-zero positions will be ignored as they represent the padded region during the batchify operation in the dataloader (i.e., disallowed for attention), while the zero positions will be allowed for attention as they are within the length of the original sequence

    cache (None/List[torch.Tensor]) - If it is not None, then it is a list of cache tensors of each decoder layer calculated until and including the previous time step; hence, if it is not None, then each tensor in the list is of size B x (T_d-1) x embedding_dim; the list length is equal to len(self.layers), or the number of decoder layers.

    Output:
    y (torch.Tensor) - Output tensor from the Transformer decoder consisting of a single time step, of size B x 1 x embedding_dim, if output layer is None, or of size B x 1 x output_layer_size, if there is an output layer.

    new_cache (List[torch.Tensor]) - List of cache tensors of each decoder layer for use by the auto-regressive decoding of the next time step; each tensor is of size B x T_d x embedding_dim; the list length is equal to len(self.layers), or the number of decoder layers.
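Below is a hedged sketch of the per-layer cache bookkeeping this function needs. The argument names follow the docstrings above, but treat it as an illustration rather than the graded implementation; the optional output layer and any other pre/post-processing in decoder.py are deliberately omitted.

from typing import List, Optional
import torch

def decoder_one_step_sketch(layers, x, encoder_out=None, encoder_padding_mask=None,
                            decoder_padding_mask=None, decoder_attention_mask=None,
                            cache: Optional[List[torch.Tensor]] = None):
    new_cache = []
    for i, layer in enumerate(layers):
        layer_cache = None if cache is None else cache[i]      # B x (T_d-1) x embedding_dim
        x = layer.forward_one_step_ec(x,
                                      encoder_out=encoder_out,
                                      encoder_padding_mask=encoder_padding_mask,
                                      self_attn_padding_mask=decoder_padding_mask,
                                      self_attn_mask=decoder_attention_mask,
                                      cache=layer_cache)       # B x T_d x embedding_dim
        new_cache.append(x)                                    # reused at the next time step
    y = x[:, -1:, :]   # keep only the newly computed frame; apply the output layer here, if any
    return y, new_cache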
It is very similar to forward(self, x, decoder_padding_mask = None, decoder_attention_mask = None, encoder_out = None, encoder_padding_mask = None) in class TransformerDecoder(nn.Module), except for a few things:
- Instead of calling the forward function of TransformerDecoderLayer, you need to call into the forward_one_step_ec that you just completed for TransformerDecoderLayer (see the sketch above).
- The return value is a tuple (y, new_cache). y is a single output frame in the temporal dimension. To correctly return new_cache, you need to store the list of layer outputs of size (B, T_d, embedding_dim) returned by each decoder layer separately, so that this new_cache can be used by the function caller (described immediately below) for auto-regressively decoding the next time step (and, of course, the new_cache returned by that next call into forward_one_step_ec would be of size (B, T_d + 1, embedding_dim)).

The last function you need to complete is def inference(self, src, src_lengths, max_output_length)
in class Transformer(nn.Module)
, within transformer.py
. Most of the function has been completed for you. The only things you need to figure out are how to initialize tgt, the starting decoder input sequence, how to extend your decoder predictions, and when to break out of the decoding loop. We assume that every output sequence starts with self.sos and ends with self.eos
. Please refer to the figure and the pseudo-code for auto-regressive inference at the start of this section.
help(Transformer.inference)
Help on function inference in module transformer:

inference(self, src, src_lengths, max_output_length)
    Applies the entire Transformer encoder-decoder to src and target, possibly as used during inference to auto-regressively obtain the next token; each sequence in src has been padded to max(src_lengths)

    Input:
    src (torch.Tensor) - Encoder's input tensor of size B x T_e x d_model

    src_lengths (torch.Tensor) - A 1D iterable of Long/Int of length B, where the b-th length in src_lengths corresponds to the actual length of src[b] (beyond that is the pre-padded region); T_e = max(src_lengths)

    Output:
    decoded_list (List[torch.Tensor]) - a list of auto-regressively obtained decoder output token predictions; the b-th item of decoded_list should be the output from src[b], and each of the sequence predictions in decoded_list is of a possibly different length.

    decoder_layer_cache_list (List[List[torch.Tensor]]) - a list of decoder_layer_cache; the b-th item of decoder_layer_cache_list should be the decoder_layer_cache for src[b], which itself is a list of torch.Tensor, as returned by self.decoder.forward_one_step_ec (see the function definition there for more details) when the auto-regressive inference ends for src[b].
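As a hint, the missing pieces amount to something like the following hedged sketch; the helper names init_tgt and should_stop are made up for illustration, and the real code in transformer.py is structured differently.

import torch

def init_tgt(sos_id: int, batch_size: int, device) -> torch.Tensor:
    # the decoder input starts as a single <sos> token per sequence
    return torch.full((batch_size, 1), sos_id, dtype=torch.long, device=device)

def should_stop(next_token: int, eos_id: int, cur_len: int, max_output_length: int) -> bool:
    # stop once <eos> is predicted, or when the length budget max_output_length is reached
    return next_token == eos_id or cur_len >= max_output_length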
If you believe you have implemented everything in this section correctly, you can run python grade.py
to see if you have passed the tests related to the extra credit auto-regressive inference implementation, assuming ALL your previous tests have passed. We have defined two tests that you should now pass: test_decoder_inference_outputs_extra_credit
, which initializes a Transformer
object, loads the model weights from a de-en neural machine translation checkpoint trained_de_en_state_dict.pt
, and invokes the inference
function on some German sentences (that are converted to discrete index sequences), which auto-regressively decodes the English translation (that are in the form of discrete index sequences). We take your decoded_list
returned by inference
to check against the decoded results pre-generated by our implementation. This test will also throw an error if the normalized edit distance (i.e., the error rate computed against the ground-truth English translations) does not match ours, which would indicate some major bugs. The second test, test_decoder_inference_cache_extra_credit,
works similarly, except that we are checking the decoder_layer_cache_list
returned by inference
function, which is a more careful examination of whether your auto-regressive inference has been implemented correctly.
Note: if you are getting a very small error that causes you to fail the test, please double-check that you have the correct torch version installed. Our outputs on the visible set are pre-generated using Python 3.10
and torch==2.0.1
, so at the very least you need to make sure that you are using torch==2.0.1
for this. Newer or older PyTorch versions are known to have slightly different implementations of the same internal modules, which can result in slightly different computed results. If in doubt, you can also submit to GradeScope for testing. The auto-grader on GradeScope will NOT have this issue, as the expected outputs are generated at submission time with the same platform and package versions that your implementation uses, but your code will be tested on the hidden set instead.
!python grade.py
..........
----------------------------------------------------------------------
Ran 10 tests in 3.391s

OK