Disparities in Natural Language Processing

Do language technologies equitably serve all groups of people? The way we speak and write varies across demographics and social communities, but natural language processing models can be quite brittle to this variation. If an NLP system, such as machine translation or opinion analysis, works well for some groups of people but not others, it impedes information access for the underserved groups and makes it harder for their voices to be heard, since media communication is now filtered through search and newsfeed relevance algorithms.

We are pursuing an interdisciplinary project to analyze disparities in language models' performance across social communities, focusing on African-American Vernacular English (also called African-American English, or AAE), a major dialect with marked differences from mainstream English. While it is used widely in oral and social media communication, it has very little presence in the well-edited texts that comprise traditional NLP corpora. We have constructed a corpus of informal AAE from publicly available social media posts and found that a variety of NLP systems perform worse on this text. We have also developed more equitable models for analysis tasks such as language identification and parsing. By bringing together sociolinguistics and computer science, this work seeks to support social scientific analysis goals and to use social science insights to inform the construction of more effective and fairer language technologies.
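
As a rough illustration of what "works worse on this text" can mean in practice, the sketch below computes a system's accuracy separately for each group of texts and reports the gap between the best and worst group. It is a minimal sketch and not the evaluation procedure used in the publications: the function names, the group labels, and the toy predictions are all hypothetical and exist only to show the arithmetic.

```python
from collections import defaultdict


def accuracy_by_group(examples):
    """Return {group: accuracy} for (group, gold, predicted) triples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, gold, predicted in examples:
        total[group] += 1
        if gold == predicted:
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}


def accuracy_gap(per_group):
    """Difference between the best and worst per-group accuracy."""
    return max(per_group.values()) - min(per_group.values())


# Hypothetical toy data for a language identification task, NOT real results.
# Each triple is (group label, gold language, predicted language).
toy = [
    ("aae", "en", "en"),
    ("aae", "en", "und"),  # an English post the system failed to recognize
    ("aae", "en", "en"),
    ("aae", "en", "en"),
    ("mainstream", "en", "en"),
    ("mainstream", "en", "en"),
    ("mainstream", "en", "en"),
    ("mainstream", "en", "en"),
]

per_group = accuracy_by_group(toy)
gap = accuracy_gap(per_group)
```

On this made-up data the system is right on 3 of 4 posts in one group and 4 of 4 in the other, so the gap is 0.25. A real comparison would also need matched sample sizes or confidence intervals before drawing conclusions from such a gap.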

Publications

Demographic Dialectal Variation in Social Media: A Case Study of African-American English. Su Lin Blodgett, Lisa Green, and Brendan O’Connor. Proceedings of EMNLP 2016.
Racial Disparity in Natural Language Processing: A Case Study of Social Media African-American English. Su Lin Blodgett and Brendan O’Connor. Fairness, Accountability, and Transparency in Machine Learning (FAT/ML) workshop at KDD 2017.
A Dataset and Classifier for Recognizing Social Media English. Su Lin Blodgett, Johnny Tian-Zheng Wei, and Brendan O’Connor. 3rd Workshop on Noisy User-generated Text (WNUT) at EMNLP 2017.
Twitter Universal Dependency Parsing for African-American and Mainstream American English. Su Lin Blodgett, Johnny Tian-Zheng Wei, and Brendan O’Connor. Proceedings of ACL 2018.