Comments

Published in Journal of the American Statistical Association, 97, 136-148.

Abstract

Since high breakdown estimators are impractical to compute exactly in large samples, approximate algorithms are used instead. Such an algorithm generally produces an estimator with a lower consistency rate and breakdown value than the exact theoretical estimator. This discrepancy grows with the sample size, implying that huge computations are needed for good approximations in large, high-dimensional samples.
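To make that growth concrete, here is a minimal sketch of the standard success-probability calculation for random elemental subsets. It assumes a contamination fraction eps and the usual approximation that a random p-subset is outlier-free with probability (1 - eps)^p; the tolerance delta and the function name are ours, for illustration only.

```python
from math import ceil, log

def elemental_starts(p, eps, delta=0.01):
    """Approximate number of random p-subsets needed so that, with
    probability at least 1 - delta, at least one subset is free of
    outliers when a fraction eps of the cases are contaminated."""
    p_clean = (1.0 - eps) ** p          # chance a single subset is clean
    return ceil(log(delta) / log(1.0 - p_clean))

# At 40% contamination the required number of starts explodes with p:
for p in (2, 5, 10, 20):
    print(p, elemental_starts(p, eps=0.40))   # 11, 57, 760, 125957
```

The calculation already understates the difficulty, since it counts only clean subsets, not subsets that also yield a good fit.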

The workhorse of high breakdown estimation (HBE) has been the "elemental set," or "basic resampling," algorithm. This turns out to be completely ineffective in high dimensions with high levels of contamination. However, enriching it with a "concentration" step turns it into a method that can handle even high levels of contamination, provided the regression outliers are located on random cases. It remains ineffective if the regression outliers are concentrated on high-leverage cases. We focus on the multiple regression problem, but several of the broad conclusions, notably the inadequacy of a fixed number of elemental starts, are relevant to multivariate location and dispersion estimation as well.
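As a sketch of the enrichment described above (in the style of least trimmed squares, not necessarily the authors' exact implementation), each random elemental start can be refined by concentration steps: repeatedly refit OLS on the h cases with the smallest squared residuals. All names and tuning defaults below are illustrative.

```python
import numpy as np

def lts_concentration(X, y, h, n_starts=500, n_csteps=10, seed=None):
    """Elemental-start LTS sketch: draw random p-subsets, fit an exact
    plane through each, then 'concentrate' by refitting OLS on the h
    best-fitting cases. X should carry an explicit intercept column;
    h = (n + p + 1) // 2 gives the maximum breakdown value."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    best_beta, best_crit = None, np.inf
    for _ in range(n_starts):
        idx = rng.choice(n, size=p, replace=False)    # elemental subset
        try:
            beta = np.linalg.solve(X[idx], y[idx])    # exact fit through p cases
        except np.linalg.LinAlgError:
            continue                                  # singular subset; skip
        for _ in range(n_csteps):                     # concentration steps
            r2 = (y - X @ beta) ** 2
            keep = np.argsort(r2)[:h]                 # h smallest residuals
            new_beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
            if np.allclose(new_beta, beta):
                break
            beta = new_beta
        crit = np.sort((y - X @ beta) ** 2)[:h].sum() # LTS objective
        if crit < best_crit:
            best_beta, best_crit = beta, crit
    return best_beta, best_crit
```

Each concentration step cannot increase the trimmed sum of squares, so a handful of steps per start suffices; the weakness noted above remains, since a fixed number of starts rarely lands a clean subset among concentrated high-leverage outliers.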

We introduce a new algorithm, the "X-cluster" method, for large high-dimensional multiple regression data sets that are beyond the reach of standard resampling methods. This algorithm departs sharply from current HBE algorithms in that, even at a constant percentage of contamination, it becomes more effective as the sample grows, making a compelling case for using it in the large-sample situations that current methods serve poorly. A multi-pronged analysis, using both traditional OLS and L1 methods along with newer resistant techniques, will often detect departures from the multiple regression model that cannot be detected by any single estimator.
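One way to carry out such a multi-pronged analysis is sketched below; the synthetic data and the particular estimator choices (OLS and an L1 fit via statsmodels, Theil-Sen as a stand-in resistant estimator) are ours, purely for demonstration, and are not the X-cluster method itself.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import TheilSenRegressor

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=(n, 3))
y = x @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=n)
y[:20] += 10.0                      # plant a cluster of regression outliers

X = sm.add_constant(x)
preds = {
    "OLS":       sm.OLS(y, X).fit().predict(X),
    "L1":        sm.QuantReg(y, X).fit(q=0.5).predict(X),
    "Theil-Sen": TheilSenRegressor(random_state=0).fit(x, y).predict(x),
}

# Cases flagged only by the resistant fits hint at outliers that OLS masks.
for name, yhat in preds.items():
    r = y - yhat
    sigma_hat = np.median(np.abs(r)) / 0.6745      # MAD scale estimate
    print(name, "flags", int(np.sum(np.abs(r) > 2.5 * sigma_hat)), "cases")
```

Disagreement among the fits, rather than any single set of residuals, is the diagnostic signal.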
