Data Diversity

The big data revolution and advances in machine learning have transformed decision making, advertising, medicine, and even election campaigns. Yet data is an imperfect medium, often tainted by skews and biases, and learning systems and analysis software absorb and amplify these biases. As a result, discrimination shows up in many data-driven applications, such as advertisements, hotel bookings, image search, and vendor services. Since data skew is often a cause of algorithmic bias, the ability to retrieve balanced, diverse datasets can help mitigate the problem at its source. Diversification also has usability implications, as it allows us to produce representative samples of a dataset that are small enough for human consumption. Our research focuses on developing efficient, scalable methods for producing appropriately diverse subsets of given datasets, with the aim of alleviating biases in the underlying data and supporting user-facing data exploration systems.