Disparities in Natural Language Processing

Do language technologies serve all groups of people equitably? The way we speak and write varies across demographics and social communities, but natural language processing models can be quite brittle to this variation. If an NLP system, such as machine translation or opinion analysis, works well for some groups of people but not others, it impedes information access and keeps authors’ voices from being heard, since media communication is now filtered through search and newsfeed relevance algorithms.

We are pursuing an interdisciplinary project to analyze disparities in language models across social communities, focusing on African-American Vernacular English (also called African-American English, AAE), a major dialect with marked differences from mainstream English. While it is widely used in oral and social media communication, it has very little presence in the well-edited texts that make up traditional NLP corpora. We have constructed a corpus of informal AAE from publicly available social media posts, found that a variety of NLP systems perform worse on this text, and developed more equitable models for analysis tasks such as language identification and parsing. By bringing together sociolinguistics and computer science, this work seeks to support social scientific analysis goals and to use social science insights to inform the construction of more effective and fairer language technologies.