Paper: Coresets for Classification – Simplified and Strengthened

We show how to sample a small subset of points from a larger dataset, such that if we solve logistic regression, hinge loss regression (i.e., soft margin SVM), or a number of other problems used to train linear classifiers on the sampled dataset, then we obtain a near-optimal solution for the full dataset. This ‘coreset’ guarantee requires sampling the subset of points according to a carefully chosen distribution, which reflects each point’s importance. We use a distribution based on the ℓ_1 Lewis weights, which are closely related to the statistical leverage scores. This allows us to significantly improve the state of the art for the problem.
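To make the sampling scheme concrete, below is a minimal sketch of Lewis-weight sampling, not the paper’s exact algorithm. It approximates the ℓ_1 Lewis weights of the data matrix with the standard fixed-point iteration w_i ← (a_i^T (A^T W^{-1} A)^{-1} a_i)^{1/2} (as in Cohen–Peng style schemes), then samples rows with probability proportional to those weights and attaches importance weights so that the reweighted loss on the sample is an unbiased estimate of the full loss. Function names, the iteration count, and the coreset size m are illustrative assumptions.

```python
import numpy as np

def l1_lewis_weights(A, num_iters=20):
    """Approximate l_1 Lewis weights of the rows of A via the fixed-point
    iteration w_i <- (a_i^T (A^T W^{-1} A)^{-1} a_i)^{1/2}."""
    n, d = A.shape
    w = np.ones(n)
    for _ in range(num_iters):
        # Weighted Gram matrix A^T W^{-1} A (each row a_i scaled by 1/w_i).
        M = A.T @ (A / w[:, None])
        M_inv = np.linalg.inv(M)
        # Generalized leverage scores tau_i = a_i^T M^{-1} a_i.
        tau = np.einsum('ij,jk,ik->i', A, M_inv, A)
        w = np.sqrt(np.maximum(tau, 0.0))
    return w

def lewis_weight_coreset(A, y, m, seed=None):
    """Sample m points with probability proportional to their l_1 Lewis
    weights; return sampled points, labels, and importance weights."""
    rng = np.random.default_rng(seed)
    w = l1_lewis_weights(A)
    p = w / w.sum()
    idx = rng.choice(len(A), size=m, replace=True, p=p)
    # Weight 1/(m * p_i) keeps the sampled loss unbiased for the full loss.
    sample_weights = 1.0 / (m * p[idx])
    return A[idx], y[idx], sample_weights
```

In practice one would then fit the classifier on the sampled points with per-point weights (e.g., via a solver that accepts sample weights), and the coreset guarantee is that the resulting model is near-optimal for the full dataset.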