Kernel selection in SVM for high-dimensional & sparse binary data
One of the biggest challenges in using support vector machines (SVMs) for non-linear classification problems is
how to select an appropriate kernel class. The goal of the project is to compare the generalization bounds
of the SVM classifier under different kernel classes in the case where all features are binary. A generalization
bound measures a hypothesis class's predictive performance in the PAC learning framework - that is, with high probability,
we select a function that will have low generalization error.
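Concretely, for a hypothesis class \mathcal{H}, sample size n and confidence level 1 - \delta, a generalization bound has the form sketched below (the notation is mine: R denotes the population risk and \widehat{R}_n the empirical risk):

\Pr\Big[\forall h \in \mathcal{H}:\; R(h) \le \widehat{R}_n(h) + \varepsilon(\mathcal{H}, n, \delta)\Big] \ge 1 - \delta,

so a smaller complexity term \varepsilon(\mathcal{H}, n, \delta) gives a tighter guarantee on the generalization error of the selected function.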
Experiment
Mushroom data: the response variable is binary (edible or poisonous), and all features are categorical;
they are encoded into binary variables using one-hot encoding (112 features in total).
A 70%/30% train-test split is used.
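Below is a minimal sketch of the data preparation, assuming the UCI Mushroom dataset stored in a local CSV file (mushrooms.csv) with the label in a column named class; the file name and column names are assumptions for illustration.

import pandas as pd
from sklearn.model_selection import train_test_split

# Load the mushroom data (file name is an assumption for illustration).
df = pd.read_csv("mushrooms.csv")

# Binary response: 1 = poisonous, 0 = edible (column name assumed).
y = (df["class"] == "p").astype(int)

# One-hot encode the categorical features into binary variables.
X = pd.get_dummies(df.drop(columns=["class"]))

# 70%/30% train-test split, stratified on the label.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0, stratify=y
)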
- The optimal hypothesis is obtained using the Empirical Risk Minimization (ERM) principle with the
hinge loss function
- The kernel hyper-parameter (which controls the spread of the kernel) is tuned using
grid search (see the sketch after this list)
- Observed ranking: polynomial > sigmoid > Gaussian, based on the generalization error,
which estimates the population risk of the classifier
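A minimal sketch of the ERM and tuning step with scikit-learn follows; the grids and the choice of C are illustrative, and SVC is used because it minimizes the regularized hinge loss.

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Candidate kernel classes with a coarse hyper-parameter grid (values are illustrative).
param_grid = [
    {"kernel": ["poly"], "gamma": [0.01, 0.1, 1.0], "degree": [2, 3]},
    {"kernel": ["sigmoid"], "gamma": [0.01, 0.1, 1.0]},
    {"kernel": ["rbf"], "gamma": [0.01, 0.1, 1.0]},  # Gaussian kernel
]

# Soft-margin SVC = ERM with the hinge loss plus an L2 penalty.
grid = GridSearchCV(SVC(C=1.0), param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

# Hold-out error as an estimate of the generalization error of the selected hypothesis.
test_error = 1.0 - grid.score(X_test, y_test)
print(grid.best_params_, round(test_error, 4))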
Theoretical analysis
Tools from statistical learning theory are used to gain insight into these results.
- Surrogate loss and consistency: if we can minimize the hinge loss (a surrogate for the 0-1 loss),
then we can also minimize the 0-1 loss
- Rademacher complexity: derive an upper bound on the Rademacher complexity of each kernel class
- Generalization bound: bound the generalization error of each kernel class via its Rademacher complexity (see the sketch after this list)
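For reference, the two bounds used here can be sketched in their standard textbook form (the notation is mine: \ell is a [0,1]-bounded loss, S = (x_1, \dots, x_n) the training sample, and B a norm bound on the kernel class). With probability at least 1 - \delta over an i.i.d. sample of size n,

\forall h \in \mathcal{H}:\quad R(h) \le \widehat{R}_n(h) + 2\,\mathfrak{R}_n(\ell \circ \mathcal{H}) + \sqrt{\frac{\ln(1/\delta)}{2n}},

and for the kernel class \mathcal{H}_B = \{\, x \mapsto \langle w, \phi(x) \rangle : \|w\|_{\mathcal{H}_K} \le B \,\} the empirical Rademacher complexity satisfies

\widehat{\mathfrak{R}}_S(\mathcal{H}_B) \le \frac{B}{n}\sqrt{\sum_{i=1}^{n} K(x_i, x_i)}.

For binary features, K(x_i, x_i) can be evaluated directly from the feature vectors and the kernel hyper-parameter, which makes the resulting bounds easy to compare across the three kernel classes.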
Summary of analysis results
In practice it is very difficult to test the performance of many kernel classes with a fine grid search when the data size
is large. Using the results above, one can first identify a rough hyper-parameter range and then choose an
appropriate kernel class directly (see the short sketch after the list below). This saves considerable time compared with first obtaining the optimal hypothesis from
each kernel class and then selecting the best kernel.
- Gaussian kernels tend to give higher generalization errors than the polynomial and sigmoid kernels
when each kernel's optimal hyper-parameter is chosen.
- When the feature dimension is small relative to 1/gamma (d*gamma << 1),
a polynomial kernel tends to give lower generalization errors.
- When the feature dimension is large relative to 1/gamma (d*gamma > 1),
a sigmoid kernel tends to give lower generalization errors.
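As a usage note, the pre-selection rule can be written down directly; the helper below is a hypothetical sketch that simply encodes the d*gamma rules of thumb above (the threshold of 1 collapses the << 1 and > 1 conditions into a single cut-off).

def preselect_kernel(d, gamma):
    # Illustrative helper: encode the d*gamma heuristic from the bullets above.
    # d*gamma << 1 -> polynomial kernel; d*gamma > 1 -> sigmoid kernel.
    return "poly" if d * gamma < 1.0 else "sigmoid"

# Example: the mushroom data has d = 112 one-hot features.
for gamma in [0.001, 0.01, 0.1]:
    print(gamma, preselect_kernel(112, gamma))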
Last updated on Nov 1, 2019