About this Event
230 S Bouquet St, Pittsburgh, PA 15213
Statistics PhD Defense: Robust Estimation and Inference under Huber’s Contamination Model
Huber’s contamination model is widely used for analyzing distributional robustness when the true data distribution deviates from the assumed model. Specifically, it assumes that a small fraction of the observed data is drawn from an arbitrary unknown distribution. In this dissertation, we study robust regression and robust density estimation under Huber’s contamination model.
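For readers unfamiliar with the model, its standard textbook form can be written as follows. The symbols (P_0 for the assumed distribution, Q for the contaminating distribution, ε for the contamination proportion) are our notation and are not taken from the dissertation.

```latex
% Huber's contamination model: each observation comes from the assumed
% distribution P_0 with probability 1 - \varepsilon, and from an arbitrary
% unknown distribution Q with probability \varepsilon.
X_1, \dots, X_n \overset{\text{i.i.d.}}{\sim} P_{\varepsilon}
  = (1 - \varepsilon)\, P_0 + \varepsilon\, Q,
\qquad 0 \le \varepsilon < 1 .
```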
In the regression setting, we assume that the noise has a heavy-tailed distribution and that a small fraction of it may be arbitrarily contaminated, under an increasing-dimension regime. We show that robust M-estimators (such as the Huber estimator) of all coefficients other than the intercept can still achieve the minimax convergence rate even if the uncontaminated noise distribution is asymmetric. We develop a multiplier bootstrap technique to construct confidence intervals for linear functionals of the coefficients. When the contamination proportion is relatively large, we further provide a bias correction procedure to alleviate the bias due to contamination. The robust estimation and inference framework can be extended to a distributed context, where the data are spread across multiple machines and communication between machines is often constrained by limited bandwidth or privacy concerns. Specifically, we demonstrate that a communication-efficient M-estimator can attain the centralized minimax rate (as if one had access to the entire data set) with distributed contaminated data. Moreover, based on this communication-efficient M-estimator, we propose a distributed multiplier bootstrap method that runs only on the master machine and generates confidence intervals with optimal widths. A comprehensive simulation study demonstrates the effectiveness of the proposed procedures.
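As a rough illustration of the regression setting described above, the following Python sketch fits a Huber M-estimator by iteratively reweighted least squares on simulated data with heavy-tailed, one-sidedly contaminated noise. It is not the dissertation's procedure: the function names `huber_irls` and `simulate`, the fixed threshold `tau=1.345`, the t-distributed noise, the outlier value 50, and the assumption that the noise scale is about 1 (so no scale estimate is needed) are all our own choices for the example.

```python
import numpy as np

def huber_irls(X, y, tau=1.345, n_iter=200, tol=1e-8):
    """Huber M-estimator via iteratively reweighted least squares.

    Returns the coefficient vector with the intercept first.
    tau is a fixed robustification threshold (an assumption of this sketch).
    """
    n = X.shape[0]
    Xd = np.column_stack([np.ones(n), X])          # add intercept column
    beta = np.linalg.lstsq(Xd, y, rcond=None)[0]   # start from ordinary least squares
    for _ in range(n_iter):
        r = y - Xd @ beta
        # Huber weights: 1 for |r| <= tau, tau / |r| for larger residuals
        w = np.minimum(1.0, tau / np.maximum(np.abs(r), 1e-12))
        sw = np.sqrt(w)
        beta_new = np.linalg.lstsq(Xd * sw[:, None], y * sw, rcond=None)[0]
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta

def simulate(n=2000, d=5, eps=0.05, seed=0):
    """Simulate y = X @ beta + noise with true intercept 0 (hypothetical setup)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    beta = np.arange(1, d + 1, dtype=float)   # hypothetical true slopes
    noise = rng.standard_t(df=3, size=n)      # heavy-tailed noise
    contaminated = rng.random(n) < eps        # roughly an eps fraction of rows
    noise[contaminated] = 50.0                # one-sided gross outliers
    return X, X @ beta + noise, beta

if __name__ == "__main__":
    X, y, beta = simulate()
    est = huber_irls(X, y)
    print("estimated intercept:", est[0])
    print("largest slope error:", np.max(np.abs(est[1:] - beta)))
```

In this toy setup the contamination is independent of the covariates, so it mainly shifts the intercept while the slope estimates stay close to the truth, which is in the spirit of the abstract's statement about coefficients other than the intercept.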
In the density estimation setting, we aim to robustly estimate a multivariate density function on R^d under L_p loss from contaminated data. To investigate the effect of contamination on the optimal estimation of the density, we first establish the minimax rate under the assumption that the density belongs to an anisotropic Nikol’skii class. We then develop a data-driven bandwidth selection procedure for kernel estimators, which can be viewed as a robust generalization of the Goldenshluger-Lepski method. We show that the proposed bandwidth selection rule yields an estimator that is minimax adaptive to either the smoothness parameter or the contamination proportion. When both are unknown, we prove that no method can be minimax-rate adaptive. Extensions to smooth contamination cases are also discussed.
Committee Chair and Advisor: Dr. Zhao Ren
Please let us know if you require an accommodation in order to participate in this event. Accommodations may include live captioning, ASL interpreters, and/or captioned media and accessible documents from recorded events. We recommend submitting requests at least 5 days in advance.