
HyperBand Algorithm

1. Intro

Goal: To tune hyperparameters in machine learning or deep learning using a bandit algorithm, minimizing the time spent

Methods it is commonly used with

  • SGD
  • Tree ensembles

Idea: Trying a large number of random configurations works, but it takes a lot of time. So Hyperband:

  1. Run each random configuration for just an iteration or two at first
  2. Evaluate how they perform
  3. Use the earlier results to select candidates for longer runs

e.g., the following schedule with a maximum of 81 iterations:

Runs    Iterations
81      1
27      3
9       9
3       27

It starts with 81 runs of one iteration each. Then the best 27 configurations get three iterations each, then the best nine get nine iterations, and so on.
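To make the schedule concrete, here is a minimal sketch of how the table above can be generated. It covers only a single successive-halving bracket, and the names max_iter, eta, and successive_halving_schedule are illustrative rather than taken from any library.

```python
def successive_halving_schedule(max_iter=81, eta=3):
    """Configurations kept and iterations allotted at each rung of one bracket."""
    # number of halving rounds: largest s with eta**s <= max_iter (here s = 4)
    s = 0
    while eta ** (s + 1) <= max_iter:
        s += 1
    n = eta ** s  # initial number of random configurations (here 81)
    schedule = []
    for i in range(s + 1):
        n_configs = n // eta ** i   # configurations surviving to this rung
        n_iters = eta ** i          # iterations each of them is trained for
        schedule.append((n_configs, n_iters))
    return schedule

print(successive_halving_schedule())
# [(81, 1), (27, 3), (9, 9), (3, 27), (1, 81)]
```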

2. Learning Rate: Tune or Not

“Low learning rate with many iterations” works well for XGBoost

If we do not tune the learning rate, Hyperband makes good sense, since the number of iterations can then serve as the resource it allocates

3. Implementation

We just need two functions, as sketched below:

  1. one that returns a random configuration
  2. another that trains that configuration for a given number of iterations and returns a loss value
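A minimal sketch of these two functions, assuming a scikit-learn GradientBoostingClassifier on the Breast Cancer Wisconsin dataset used in the experiments below; the function names get_params and try_params, the search ranges, and the validation split are illustrative choices, not part of any Hyperband library.

```python
import random

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

def get_params():
    """Return one random hyperparameter configuration."""
    return {
        "max_depth": random.randint(2, 8),
        "subsample": random.uniform(0.5, 1.0),
        "learning_rate": 0.1,  # kept fixed, see Section 2
    }

def try_params(n_iterations, params):
    """Train the configuration for a given number of boosting iterations
    and return its validation log loss."""
    model = GradientBoostingClassifier(n_estimators=int(n_iterations), **params)
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_val)[:, 1]
    return log_loss(y_val, proba)
```

With these in place, the Hyperband loop just calls get_params to sample configurations and try_params with increasing iteration budgets, keeping the best-performing fraction at each rung.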

4. Experiment

To evaluate the performance of Hyperband, we ran several experiments. At a high level, we want to answer two major questions:

  1. How does Hyperband perform with different machine learning algorithms?
  2. How does Hyperband perform compared to other hyperparameter tuning algorithms?

4.1. Setting

In our experiments, we work on a binary classification problem. For the dataset, we choose the Breast Cancer Wisconsin (Diagnostic) dataset, whose features are computed from a digitized image of a fine-needle aspirate (FNA) of a breast mass and describe characteristics of the cell nuclei present in the image.

According to Algorithm 1, we set $R=81$ and $\eta=3$, which means the maximum number of iterations per configuration is $81$ and the configuration downsampling rate is $3$
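These two values also determine how many brackets Hyperband runs. Assuming the notation of the original Hyperband paper (Algorithm 1),

$$s_{\max} = \lfloor \log_\eta R \rfloor = \lfloor \log_3 81 \rfloor = 4, \qquad B = (s_{\max} + 1)\,R = 5 \times 81 = 405,$$

so there are five brackets, each with a total budget of $405$ iterations; the most aggressive bracket starts with $81$ configurations at one iteration each, exactly the schedule shown in Section 1.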

To evaluate the performance of Hyperband across different machine learning algorithms, we choose gradient boosting, random forest, extremely randomized trees, and linear SGD. To compare the different methods, we use log loss as our metric
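For reference, the binary log loss reported below is

$$\text{log loss} = -\frac{1}{N} \sum_{i=1}^{N} \big[\, y_i \log p_i + (1 - y_i) \log(1 - p_i) \,\big],$$

where $p_i$ is the predicted probability that sample $i$ belongs to the positive class; lower is better.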

From the figure below, we can observe that after a few iterations, Hyperband is able to optimize the parameters of these machine learning algorithms and make good predictions with low log loss.

[Figure 1: log loss of each algorithm over Hyperband iterations]

To compare the performance of different parameter tuning methods, we choose five candidates:

  1. Grid search: explores the whole hyperparameter space, evaluating every combination the space defines
  2. Randomized search: in contrast to grid search, a fixed number of parameter settings is sampled from the given parameter space
  3. Halving grid search: grid search with a successive-halving strategy; all candidates are first evaluated with a small amount of resources, and the best candidates are iteratively selected for more
  4. Halving randomized search: randomized search with the same successive-halving strategy
  5. Hyperband

Grid-based search algorithms are time-consuming, since the overall search space is quite large. In contrast, randomized search algorithms are efficient, but their output is unstable: the results depend heavily on the randomly sampled initial configurations.
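For reference, here is a minimal sketch of how the four scikit-learn baselines can be set up on the same dataset, assuming a GradientBoostingClassifier and a small illustrative parameter grid; the halving searches are still marked experimental in scikit-learn and have to be enabled explicitly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import (
    GridSearchCV,
    HalvingGridSearchCV,
    HalvingRandomSearchCV,
    RandomizedSearchCV,
)

X, y = load_breast_cancer(return_X_y=True)

# Illustrative search space; the real experiments may use a larger one.
param_grid = {
    "n_estimators": [27, 81, 243],
    "max_depth": [2, 4, 8],
    "subsample": [0.5, 0.75, 1.0],
}

estimator = GradientBoostingClassifier(learning_rate=0.1)

searches = {
    "grid": GridSearchCV(estimator, param_grid, scoring="neg_log_loss", cv=3),
    "random": RandomizedSearchCV(estimator, param_grid, n_iter=10,
                                 scoring="neg_log_loss", cv=3, random_state=0),
    "halving_grid": HalvingGridSearchCV(estimator, param_grid, factor=3,
                                        scoring="neg_log_loss", cv=3),
    "halving_random": HalvingRandomSearchCV(estimator, param_grid, factor=3,
                                            scoring="neg_log_loss", cv=3,
                                            random_state=0),
}

for name, search in searches.items():
    search.fit(X, y)
    print(name, search.best_params_, -search.best_score_)
```

Scoring with neg_log_loss keeps the metric consistent with the Hyperband experiments above.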