How to Detect Facial Expression with Transfer Learning and PyTorch 🔥

Khem Sok
Apr 25, 2020


Model Output

Prerequisite: You should have a basic understanding of neural networks, loss functions, backpropagation, and Python.


What is Transfer Learning?

Transfer learning is a technique in which you take a neural network that was pre-trained on a related task and fine-tune it for your own task. Essentially, you start from an already built network with learned weights and biases and add your own twist to it.

The more memory and knowledge we have, the more we are able to learn

Why would you want to do this? Transfer learning is usually used when the dataset you are working with is very small. For example, your dataset may have only 100 samples; with so few samples, you would not be able to train a model that generalizes well (especially with image data). Transfer learning is great for cases like this. A pre-trained model that has already seen a huge amount of data gives you a great starting point to build your model from.

How does it work? Imagine for a second that you are training an image classification model for two dog breeds: golden retriever and husky. You only have 100 images of these two breeds and would like to use transfer learning. You download a neural network model that has been trained on millions of images. However, the model is a 100-class classifier that predicts something entirely different from what you want it to predict. How do we modify this model and make it our own? We get rid of the last output layer and add our own output layer in its place. That’s it! You might ask: the previous model was trained on an entirely different image dataset that does not contain these dog breeds, so how will it help us? Although it was trained on a different dataset, the pre-trained network has learned to recognize shapes, edges, and patterns, which are very useful for predicting the new classes.

Our Approach

We will use the FER dataset from Kaggle to train our image classifier. The dataset originally comes with 7 classes: Angry, Disgust, Fear, Happy, Sad, Surprise, and Neutral. However, I chose to train on only 4 classes: Angry, Happy, Sad, and Neutral. I made this decision because the provided dataset is unbalanced, and a lower number of classes should lead to a higher accuracy score.
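
As a rough illustration of narrowing the dataset to 4 classes, the sketch below filters samples and remaps the kept labels to 0..3, which is the range a 4-class loss such as `nn.CrossEntropyLoss` expects. The integer encoding in `FER_LABELS`, the order of `KEPT_CLASSES`, and the helper name `filter_and_remap` are assumptions for illustration, not details from the original project; check the encoding against your copy of the data.

```python
# Assumed FER label encoding (0-6); verify it against your copy of the dataset.
FER_LABELS = {0: "Angry", 1: "Disgust", 2: "Fear", 3: "Happy",
              4: "Sad", 5: "Surprise", 6: "Neutral"}

# The 4 classes kept in this article; the order here is an arbitrary choice.
KEPT_CLASSES = ["Angry", "Happy", "Sad", "Neutral"]
CLASS_TO_INDEX = {name: i for i, name in enumerate(KEPT_CLASSES)}


def filter_and_remap(samples):
    """Keep only the chosen classes and remap their labels to 0..3.

    `samples` is an iterable of (pixels, original_label) pairs.
    """
    kept = []
    for pixels, label in samples:
        name = FER_LABELS[label]
        if name in CLASS_TO_INDEX:
            kept.append((pixels, CLASS_TO_INDEX[name]))
    return kept


# Toy stand-ins for images: only the labels matter in this demo.
toy_samples = [("img_a", 0), ("img_b", 1), ("img_c", 3), ("img_d", 6), ("img_e", 5)]
filtered = filter_and_remap(toy_samples)
print(filtered)  # [('img_a', 0), ('img_c', 1), ('img_d', 3)]
```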

Our pretrained neural network of choice will be ResNet18. I chose this particular model because I believe a smaller number of layers will help our model generalize better, especially when it is trained for a high number of epochs. In addition, the training time won’t be as long.

We will train the model for 100 epochs. A high number was chosen because we are using a learning rate scheduler, which makes the learning rate (LR) smaller as we keep iterating through the epochs. This should help us avoid getting stuck in some local minima, which in turn should lead to a higher accuracy score.
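
To make the idea of a shrinking learning rate concrete, here is a minimal sketch using PyTorch's `StepLR` scheduler. The article does not say which scheduler or settings were used, so the scheduler type, the initial rate of 0.01, `step_size=30`, and `gamma=0.1` are all assumptions chosen for illustration.

```python
import torch

# A single dummy parameter stands in for the model's weights.
param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.SGD([param], lr=0.01)

# Assumed schedule: multiply the learning rate by 0.1 every 30 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

learning_rates = []
for epoch in range(100):
    # Record the learning rate in effect during this epoch.
    learning_rates.append(optimizer.param_groups[0]["lr"])
    # ... one full pass over the training data would run here ...
    optimizer.step()   # placeholder for the per-batch updates
    scheduler.step()   # advance the schedule once per epoch

for epoch in (0, 30, 60, 90):
    print(epoch, learning_rates[epoch])
```

With these assumed settings the rate drops by a factor of 10 at epochs 30, 60, and 90, so later epochs take much smaller steps than early ones.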

The ResNet18 model takes a 3-channel RGB image as input, but the FER dataset gives us 1-channel grayscale images. There are two different ways of tackling this problem.

  1. We change the first hidden layer of the ResNet18 model to accommodate a 1-channel image instead of a 3-channel image
  2. We transform the 1-channel grayscale image into a 3-channel RGB image

I decided to go with the latter because I believe it is a more elegant and easier solution, and because it won’t disturb the integrity of the pre-trained weights and biases in the subsequent hidden layers, as those layers were trained on 3-channel images.
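
One simple way to do this conversion on tensors is to copy the single grayscale channel three times. The sketch below is an illustration under assumptions, not the article's actual code: the helper name `gray_to_rgb`, the batch size, and the 48 x 48 image size are all made up for the demo.

```python
import torch


def gray_to_rgb(batch):
    """Turn a (N, 1, H, W) grayscale batch into a (N, 3, H, W) batch
    by copying the single channel three times."""
    if batch.dim() != 4 or batch.size(1) != 1:
        raise ValueError("expected a tensor of shape (N, 1, H, W)")
    return batch.repeat(1, 3, 1, 1)


# Random values stand in for a batch of 8 grayscale images (assumed 48 x 48).
gray = torch.rand(8, 1, 48, 48)
rgb = gray_to_rgb(gray)
print(rgb.shape)  # torch.Size([8, 3, 48, 48])
```

If you work with PIL images and torchvision transforms instead, `transforms.Grayscale(num_output_channels=3)` produces the same kind of 3-channel image.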

To adapt the ResNet18 model to our own use case, we just need to replace its final fully connected (output) layer so that it predicts 4 classes instead of the original set of classes.

Training Model with PyTorch 🔥

PyTorch provides a very straightforward framework for training your model.

  1. First, we load the inputs along with their labels.
  2. Next, we make a prediction with the model. [outputs = model(inputs)]
  3. Afterward, we determine how wrong the predictions are using a pre-defined loss function. [loss = criterion(outputs, labels)]
  4. Then, we calculate the gradients by backpropagating through the entire neural network. [loss.backward()]
  5. Then, we use the calculated gradients in combination with the learning rate of the optimizer to produce new weights and biases, which are then used on the next batch of data. [optimizer.step()]
  6. The process repeats over and over again until we have reached the end of our dataset and epochs.

As we iterate through thousands and thousands of inputs, we are constantly computing the gradients. Using the gradients and the learning rate at each step, we keep computing new weights and biases to predict new inputs. So over time, we are able to tweak our layers’ weights and biases to the point where the model can accurately predict the labels of its inputs. And that, my friend, is gradient descent. PyTorch allows us to easily build this out with its built-in modules.
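
The steps above can be sketched as a small training loop. To keep it runnable anywhere, the sketch uses a tiny linear model and random data in place of ResNet18 and the FER images; the sizes, learning rate, and the helper name `train_one_epoch` are assumptions for illustration. It also calls `optimizer.zero_grad()` before each backward pass, because PyTorch accumulates gradients across `backward()` calls unless they are cleared.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Hypothetical stand-in data: 64 random feature vectors with 4 classes.
inputs = torch.randn(64, 10)
labels = torch.randint(0, 4, (64,))
loader = DataLoader(TensorDataset(inputs, labels), batch_size=16, shuffle=True)

model = nn.Linear(10, 4)  # stand-in for the real network
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)


def train_one_epoch(model, loader, criterion, optimizer):
    """Run one pass over the data and return the average loss."""
    model.train()
    total_loss = 0.0
    for batch_inputs, batch_labels in loader:      # 1. load inputs and labels
        optimizer.zero_grad()                      # clear old gradients
        outputs = model(batch_inputs)              # 2. make a prediction
        loss = criterion(outputs, batch_labels)    # 3. measure the error
        loss.backward()                            # 4. compute gradients
        optimizer.step()                           # 5. update weights and biases
        total_loss += loss.item() * batch_inputs.size(0)
    return total_loss / len(loader.dataset)


# 6. Repeat for several epochs.
epoch_losses = [train_one_epoch(model, loader, criterion, optimizer) for _ in range(3)]
print(epoch_losses)
```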


Model Metrics
Kaggle Top Performers

Our model was able to do very well against the validation set. If we were to take our accuracy rate and compare it to the top performers on Kaggle, we would rank first! However, several factors are in play as to why the model did so well.

  1. The model was trained on 4 categories instead of 7. I believe this gave me an edge, as some categories of the given labeled dataset were unbalanced and would have led to inaccurate predictions.
  2. It was validated on a lower number of datapoints than the Kaggle test set. I split the given labeled dataset 80/20 and used the 20% portion as a validation set, which gave me 4210 datapoints to validate the model against, whereas the Kaggle competition was tested on 7179 datapoints.
  3. I believe the hyperparameter tuning of the model played an important role in getting a high accuracy score. The decision to go with a learning rate scheduler was crucial, as it helps the model get out of local minima and keep searching for gradients that lead to a lower loss.
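
For reference, an 80/20 split like the one in item 2 can be sketched with `torch.utils.data.random_split`. The dataset here is a hypothetical stand-in of 1000 random samples, and the seed is arbitrary; neither comes from the original project.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Hypothetical stand-in dataset: 1000 random grayscale images with 4 classes.
dataset = TensorDataset(torch.randn(1000, 1, 48, 48), torch.randint(0, 4, (1000,)))

n_val = len(dataset) // 5          # 20% for validation
n_train = len(dataset) - n_val     # remaining 80% for training

# A seeded generator makes the split reproducible between runs.
generator = torch.Generator().manual_seed(42)
train_set, val_set = random_split(dataset, [n_train, n_val], generator=generator)

print(len(train_set), len(val_set))  # 800 200
```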

Using the Model

To predict the facial expression in a given image, I first used the MTCNN library to detect whether the image had a face in it. The library gave me the coordinates of where the faces were located, which I then used to crop the image. The cropped image was then converted to a 224 x 224 RGB image (the input size for ResNet18) and passed to the model for prediction.


This project was a very fun one to build. Hopefully, you guys learned something interesting from the article. Let me know if you have any questions!

Have a nice day! 🎯