
Tuning Hyperparameters

January 22, 2020

Anyone who has attempted to tackle Machine Learning, or even read about Machine Learning (ML), has probably run into the term hyperparameters. It is usually referred to when discussing “hyperparameter optimization” or “hyperparameter tuning”. This step can be crucial when it comes to improving the performance of a model—but what exactly is a hyperparameter?

Parameter vs Hyperparameter

Let’s start with some definitions. Most of us have heard the term parameters in a variety of applications, most commonly in statistics or programming. In an ML model, the parameters are crucial to the training process, as they are what the model is trying to “learn” through the use of an optimization algorithm. Some examples would be weights in a neural network or the coefficients of a regression model.

Like a model parameter, a model hyperparameter is crucial to the training process. The difference is that a hyperparameter sits outside of the learning process and cannot be determined from the data alone. Hyperparameters must be chosen and tested by the data scientist or ML engineer building the model. Some examples of hyperparameters are (Jeremy Jordan):

  • The degree of polynomial features to use for a linear model
  • Maximum depth of a decision tree
  • Minimum number of samples required for a leaf node of a decision tree
  • Number of trees to include in a random forest
  • Number of neurons to include in a neural network layer
  • Number of layers in a neural network
  • Learning rate for gradient descent

As you can see, these values help define the specific structure of the chosen model, so optimizing them by hand is not a straightforward task.
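To make the idea concrete, here is a minimal sketch of how hyperparameters are set by hand before training begins (the randomForest and rpart packages are used, and the values and the built-in iris data are purely illustrative):

#A minimal sketch: hyperparameters are chosen before training starts
#(the values below are illustrative, not tuned)
library(randomForest)
library(rpart)

#Random forest: number of trees and number of variables tried at each split
rf_model <- randomForest(Species ~ ., data = iris, ntree = 500, mtry = 2)

#Decision tree: maximum depth and minimum samples required to attempt a split
tree_model <- rpart(Species ~ ., data = iris,
                    control = rpart.control(maxdepth = 4, minsplit = 20))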

Now that we know what a hyperparameter is, let’s see how to work with them.

Tuning Hyperparameters

Before we begin work on a model, we must split our data into a training set and a testing set. The purpose of these is clear: we use the training set to train the model, and we use the test set to measure the accuracy of the model on unseen data. If we used the same data for training and testing, we would risk overfitting the model to that specific set of data, which would give poor results when new data is introduced.

With the addition of hyperparameters, we also need a way to evaluate which values are best without introducing our test data too early. Depending on your strategy, this can be handled a few ways. If we are tuning the hyperparameters manually, we introduce a third split of the data: the validation set. The validation set is used to evaluate the model’s performance for different values and combinations of hyperparameters. Then, once the best hyperparameters are chosen and our model is tuned, we can test the model on the test set. If we are using any of the other strategies (listed below), most packages will perform resampling and cross-validation for you, so you do not have to create a third set of data yourself.
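As a rough sketch, a manual three-way split might look like the following (the 60/20/20 proportions, the seed, and the placeholder data frame df are arbitrary choices for illustration):

#A rough sketch of a 60/20/20 train/validation/test split
#(proportions and seed are arbitrary; `df` stands in for your dataset)
set.seed(42)
n <- nrow(df)
idx <- sample(seq_len(n))

train_idx <- idx[1:floor(0.6 * n)]
valid_idx <- idx[(floor(0.6 * n) + 1):floor(0.8 * n)]
test_idx  <- idx[(floor(0.8 * n) + 1):n]

train_set <- df[train_idx, ]
valid_set <- df[valid_idx, ]
test_set  <- df[test_idx, ]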

Now that we know how the hyperparameters are evaluated, we need to decide on a few more things before we begin tuning them: optimization function, set of hyperparameter values to test, and a method for choosing which set of hyperparameter values to evaluate.

The optimization function will determine the score of an evaluated model with certain hyperparameters. Depending on the function, the goal of hyperparameter tuning will be to find the hyperparameters that minimize or maximize this function when it is evaluated on the validation set.
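For example, a hand-rolled scoring function might train a model with one candidate hyperparameter value and return its accuracy on the validation set. The sketch below assumes the train_set and valid_set objects from the split above and borrows the diagnosis target from the example later in this article:

#Illustrative sketch: score one candidate value of mtry by validation accuracy
#(train_set, valid_set and the diagnosis target are assumptions carried over from above)
library(randomForest)

score_mtry <- function(mtry_value) {
  fit <- randomForest(diagnosis ~ ., data = train_set, mtry = mtry_value)
  preds <- predict(fit, newdata = valid_set)
  mean(preds == valid_set$diagnosis)   #accuracy: the value we want to maximize
}

score_mtry(4)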

In some cases, the ML engineer determines a set of values to test for each hyperparameter chosen, but some packages will choose the set of values on their own. Listed here are a few strategies for choosing which hyperparameter values to evaluate:

Manually– The ML engineer chooses and tests different values and combinations of hyperparameters by hand. This strategy can be very time-consuming for the engineer.

Grid Search– This strategy creates a grid of hyperparameters and will train a model for every combination of those hyperparameter values. Of course, this strategy can be very costly in time and resources, so it is not always the best option.

Random Search– This strategy is similar to Grid Search in that it sets up a grid of hyperparameter values, but it randomly chooses a set number of combinations to train the model with. This is less costly, but you run the risk of not finding the best hyperparameter values for your model.

Automated Hyperparameter Tuning– This strategy does not require a predefined set of values to test. Instead of guessing values, it uses algorithms to perform an educated search for good hyperparameters. Popular algorithms include Bayesian optimization and gradient descent.
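To see the difference between the middle two strategies, here is a small sketch of how Grid Search enumerates every combination of candidate values while Random Search samples only a few of them (the candidate mtry and ntree values are made up for illustration):

#Grid Search evaluates every combination; Random Search samples a handful
#(the candidate values below are made up for illustration)
grid <- expand.grid(mtry = c(2, 4, 8, 16), ntree = c(100, 500, 1000))
nrow(grid)                      #12 combinations to train in a full grid search

set.seed(7)
random_subset <- grid[sample(nrow(grid), 5), ]   #random search: try only 5 of them
random_subset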

In the following section, you will see an example of Grid and Random Search in action.

Example

For this example, we are going to be using the R language and its ML packages. With a random forest model, we will predict whether a tumor is benign or malignant (labelled in the data with B and M).

#First we will load the necessary libraries.
library(datasets)
library(rpart)
library(rpart.plot)
library(RCurl)
library(randomForest)
library(mlbench)
library(caret)

print('Done')

[1] “Done”

#Retrieve the BreastCancer dataset from the UCI Machine Learning Repository
dataURL <- getURL('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data')
col_names <- c('id_number', 'diagnosis', 'radius_mean', 
         'texture_mean', 'perimeter_mean', 'area_mean', 
         'smoothness_mean', 'compactness_mean', 
         'concavity_mean','concave_points_mean', 
         'symmetry_mean', 'fractal_dimension_mean',
         'radius_se', 'texture_se', 'perimeter_se', 
         'area_se', 'smoothness_se', 'compactness_se', 
         'concavity_se', 'concave_points_se', 
         'symmetry_se', 'fractal_dimension_se', 
         'radius_worst', 'texture_worst', 
         'perimeter_worst', 'area_worst', 
         'smoothness_worst', 'compactness_worst', 
         'concavity_worst', 'concave_points_worst', 
         'symmetry_worst', 'fractal_dimension_worst')
cancer <- data.frame(read.table(textConnection(dataURL), sep = ',', col.names = col_names))

Now that we have our data, let’s take a look at it. We can see below that we have a split of 357 benign tumors and 212 malignant, with 30 measured attributes to build our model on (the remaining two columns are id_number and the diagnosis).

summary(cancer["diagnosis"])
cat("Rows: ", nrow(cancer), " ")
cat("Columns: ",  ncol(cancer))
head(cancer)

diagnosis
B:357
M:212

Rows: 569 Columns: 32

A data.frame: 6 x 32

id_number diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave_points_mean radius_worst texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave_points_worst symmetry_worst fractal_dimension_worst
<int> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
842302 M 17.99 10.38 122.8 1001 0.1184 0.2776 0.3001 0.1471 25.38 17.33 184.6 2019 0.1622 0.6656 0.7119 0.2654 0.4601 0.1189
842517 M 20.57 17.77 132.9 1326 0.08474 0.07864 0.0869 0.07017 24.99 23.41 158.8 1956 0.1238 0.1866 0.2416 0.186 0.275 0.08902
84300903 M 19.69 21.25 130 1203 0.1096 0.1599 0.1974 0.1279 23.57 25.53 152.5 1709 0.1444 0.4245 0.4504 0.243 0.3613 0.08758
84348301 M 11.42 20.38 77.58 386.1 0.1425 0.2839 0.2414 0.1052 14.91 26.5 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.173
84358402 M 20.29 14.34 135.1 1297 0.1003 0.1328 0.198 0.1043 22.54 16.67 152.2 1575 0.1374 0.205 0.4 0.1625 0.2364 0.07678
843786 M 12.45 15.7 82.57 477.1 0.1278 0.17 0.1578 0.08089 15.47 23.75 103.4 741.6 0.1791 0.5249 0.5355 0.1741 0.3985 0.1244

Next, we will split our data into Training and Test sets. For this example we will be using an 80% training, 20% testing split.

#split data into 80% training, 20% testing
training_size <- floor(.8*nrow(cancer))

set.seed(33)
train_index <- sample(seq_len(nrow(cancer)), size = training_size)

train <- cancer[train_index,]
test <- cancer[-train_index,]

summary(train["diagnosis"])
summary(test["diagnosis"])

diagnosis
B:285
M:170

diagnosis
B:72
M:42

The split seems to retain the ratio of benign and malignant tumors from the full dataset. Now we can move on to creating our model. Portions of the following code were adapted from Jason Brownlee’s blog, Machine Learning Mastery.

We will run this first model to find our “base” accuracy with the model’s default parameters.

# Create model with default parameters

set.seed(1)
mtry <- sqrt(ncol(train))
tunegrid <- expand.grid(.mtry=mtry)
rf_default <- train(diagnosis~., data=train, seed = 1, method="rf", metric="Accuracy", tuneGrid = tunegrid)
print(rf_default)

Random Forest

455 samples
31 predictor
2 classes: ‘B’, ‘M’

No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 455, 455, 455, 455, 455, 455, …
Resampling results:

Accuracy Kappa
0.9459257 0.8841634

Tuning parameter ‘mtry’ was held constant at a value of 5.656854

We can see that our base accuracy with the default parameters is 94.59%. Now we will try to improve our accuracy by tuning the hyperparameters with a Random Search.

#Tune using the caret package's Random Search
set.seed(36)
control <- trainControl(method = 'repeatedcv', number = 10, repeats = 3, search = "random")

rf_random <- train(diagnosis~., data = train, seed = 36, method = "rf", metric = "Accuracy", tuneLength = 15, trControl = control)

print(rf_random)

Random Forest

455 samples
31 predictor
2 classes: ‘B’, ‘M’

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 409, 410, 409, 410, 410, 409, …
Resampling results across tuning parameters:

mtry Accuracy Kappa
1 0.9553140 0.9028225
2 0.9523833 0.8976601
3 0.9553462 0.9041486
9 0.9575362 0.9092779
12 0.9582931 0.9109936
13 0.9575845 0.9095088
14 0.9590177 0.9125985
17 0.9590338 0.9126287
19 0.9597907 0.9143132
21 0.9597585 0.9141539
24 0.9568438 0.9082064
25 0.9590499 0.9127880
30 0.9582770 0.9110931

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 19.

Using the Random Search, we increased our accuracy to 95.979%! Now we can try grid search. Remember, it takes longer to train because it has more combinations to try.

#Tune using Grid Search
set.seed(333)
control <- trainControl(method = 'repeatedcv', number = 10, repeats = 3, search = "grid")

rf_grid <- train(diagnosis~., data = train, method = "rf", seed = 333, metric = "Accuracy", tuneLength = 15, trControl = control)

print(rf_grid)

Random Forest

455 samples
31 predictor
2 classes: ‘B’, ‘M’

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 410, 410, 409, 409, 409, 410, …
Resampling results across tuning parameters:

mtry Accuracy Kappa
2 0.9552335 0.9040177
4 0.9567311 0.9075156
6 0.9552496 0.9044526
8 0.9559903 0.9061961
10 0.9604026 0.9156466
12 0.9589694 0.9124864
14 0.9589211 0.9124638
16 0.9552979 0.9046281
18 0.9582287 0.9108091
20 0.9567472 0.9077846
22 0.9552657 0.9046924
24 0.9560064 0.9062175
26 0.9538164 0.9015049
28 0.9575040 0.9092138
31 0.9567472 0.9075092

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 10.

By going through the whole grid, we have increased our accuracy ever so slightly to 96.04%. It is worth noting that this small increase may not be worth the extra training time of an exhaustive grid search; Random Search performed almost as well.
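Although it is not shown in the output above, the natural final step is to evaluate the tuned model on the held-out test set. A minimal sketch with caret might look like this (results will vary with the seed and package version):

#A minimal sketch of the final check on the held-out test set
#(output not shown; this uses the rf_grid model trained above)
test_preds <- predict(rf_grid, newdata = test)
confusionMatrix(test_preds, test$diagnosis)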

In this Breast Cancer example, we were able to increase the accuracy of our model by about 1.5% just by changing our hyperparameters, all without having to touch the data! Hyperparameter tuning is an important way to improve your model without changing the structure of the data. Grid Search and Random Search are common ways to tackle and understand this step in the ML process, but automated tuning is becoming increasingly popular in the field.

In this article you have learned the difference between a parameter and a hyperparameter, as well as a few strategies for optimizing, or tuning, hyperparameters. The Breast Cancer example gave us a look at Grid Search and Random Search in R, showing how easy it is to boost a model’s accuracy without needing to rework the data.

For further reading and resources, check out the reference below.

Data retrieved from:

O. L. Mangasarian, R. Setiono, and W.H. Wolberg: “Pattern recognition via linear programming: Theory and application to medical diagnosis”, in: “Large-scale numerical optimization”, Thomas F. Coleman and Yuying Li, editors, SIAM Publications, Philadelphia 1990, pp 22-30.