In July 2020, the 80 Million Tiny Images dataset, created by professors from MIT and NYU, was taken down after nearly 15 years of being used to train machine learning systems to identify people and objects. The dataset, scraped from image search engines like Google and Flickr, contained racist and sexist labels and categories, causing models trained on it to describe people in offensive terms. "Large image datasets: A pyrrhic win for computer vision?", the analysis and audit paper that eventually led to the dataset's takedown, found that it contained many images labeled with stereotypical and offensive words. By the time it was taken down, 80 Million Tiny Images had already been cited more than 1,700 times and had been used to train countless more neural networks.

The 80 Million Tiny Images debacle is indicative of a much larger issue: bias in AI.

Biased Data, Biased Models

In an ideal world, AI systems would be impartial and free of prejudice, but unfortunately, AI systems are only as unbiased as the data they've been fed. Because most AI models are trained on datasets that reflect the biases of society as a whole, it is not uncommon to see certain groups of people represented more frequently, and in a better light, than others.
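To make that concrete, one simple way to see whether a dataset skews toward certain groups, before any model is trained on it, is to count how often each group appears. The snippet below is a minimal sketch of that idea, assuming a hypothetical list of records with a "group" field; the field name, threshold, and toy data are illustrative, not taken from any real dataset or library.

```python
from collections import Counter

def representation_report(records, group_key="group", min_share=0.10):
    """Report each group's share of a dataset and flag under-represented groups.

    `records` is assumed to be an iterable of dicts with a demographic
    `group_key` field; `min_share` is an illustrative threshold, not a standard.
    """
    counts = Counter(r[group_key] for r in records)
    total = sum(counts.values())
    report = {}
    for group, count in counts.most_common():
        share = count / total
        report[group] = {
            "count": count,
            "share": round(share, 3),
            "under_represented": share < min_share,
        }
    return report

# Hypothetical example: a toy dataset heavily skewed toward one group.
toy_records = [{"group": "A"}] * 80 + [{"group": "B"}] * 15 + [{"group": "C"}] * 5
for group, stats in representation_report(toy_records).items():
    print(group, stats)
```

A report like this won't catch subtler problems, such as offensive labels or stereotyped depictions, but it is a cheap first check before a skewed dataset is baked into a model.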

Bias is usually unconscious, reflecting a lack of exposure to other viewpoints, but unfortunately there are examples of intentional, structural bias as well. I don't think it's too much to claim that nearly everyone on earth, including the folks we trust to build objective, impartial AI systems, holds some unconscious biases, and they may pass those prejudices on to whatever datasets they are handling or labeling.

Courtney Thomas, a researcher at the University of Rochester, describes this as stubbing a toe and exclaiming a profanity in front of a child: whether it is intended or not, the child may internalize and repeat that profanity. These different forms of bias have real-life consequences, from predictive policing models that unfairly target people of color to generative art models that reduce marginalized groups to stereotypes. As more government agencies and private companies adopt AI to automate their processes, the chance of harm caused by either data or societal bias increases.

One side effect of training AI models with biased data is that it sometimes helps us notice things we might have otherwise overlooked, including just how much bias exists in data and how faithfully AI reflects the bias of real life. A clear example of the absence of diverse datasets is PortraitAI's art generator, which promises to turn selfies into Baroque and Renaissance paintings. Unfortunately, its dataset consists mostly of paintings of white Europeans from that era, meaning that selfies of people of color were lightened and given European features. In an apology published on its website, PortraitAI confirmed that its model was trained on portraits of people of European ethnicity. “We're planning to expand our dataset and fix this in the future. At the time of conceptualizing this AI, authors were not certain it would turn out to work at all. This is close to state of the art in AI at the moment. Sorry for the bias in the meanwhile,” the apology reads.

Bringing Diversity to Training Data

The lack of diverse datasets is not limited to PortraitAI alone. AI has a notable problem with the diversity of the datasets used to train ML models. Most datasets are compiled and labeled by a specific type of person, which means people who do not fall into that category (i.e., people who are not white, male, straight, American, or European, and the list goes on) can struggle to find themselves represented in these datasets. With the field of AI still disproportionately white and male, it makes sense that the datasets it produces look the same. According to a report published by the AI Now Institute, more than 80% of AI professors are men, and women make up only about 15% of AI researchers at Meta and 10% of AI researchers at Google. It is hard to prevent, or even notice, racial and gender bias in datasets when the affected party is not in the room.

To help solve this problem, researchers and technologists from marginalized communities and diverse backgrounds are creating resources and datasets to reduce bias in AI. One notable example, Diversity AI, is developing an open-access dataset so that developers can train their models on more diverse data. It has also created a range of guidelines and validation mechanisms to test AI systems for racial, gender, age, and ethnic bias, which could mean that bias in AI systems is caught early, before much harm is done. Diversity AI has also established a platform for thought leaders in AI to discuss racial and gender issues and strategies to decrease prejudice in AI. Other organizations, like Black in AI and Women in Machine Learning + Data Science, are supporting, mentoring, and funding people from marginalized groups who work in AI. By building community and supporting their work, they hope to tackle the lack of diversity in the field.
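The post doesn't detail how such validation mechanisms work, but a common building block for this kind of bias test is a group fairness metric such as demographic parity, which compares a model's positive-prediction rate across demographic groups. Below is a minimal, hypothetical sketch of that idea in Python; the function name, inputs, and toy data are assumptions for illustration, not Diversity AI's actual tooling.

```python
def demographic_parity_gap(predictions, groups):
    """Return the gap between the highest and lowest positive-prediction
    rates across demographic groups (0.0 means perfectly equal rates),
    along with the per-group rates.

    `predictions` is a list of 0/1 model outputs; `groups` is a parallel
    list of group labels. Both are hypothetical inputs for illustration.
    """
    totals = {}
    for pred, group in zip(predictions, groups):
        stats = totals.setdefault(group, [0, 0])  # [positives, count]
        stats[0] += pred
        stats[1] += 1
    rates = {g: pos / n for g, (pos, n) in totals.items()}
    return max(rates.values()) - min(rates.values()), rates

# Toy audit: a model that approves group "A" far more often than group "B".
preds  = [1, 1, 1, 0, 1, 0, 0, 0, 1, 0]
groups = ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"]
gap, rates = demographic_parity_gap(preds, groups)
print(rates)                      # {'A': 0.8, 'B': 0.2}
print(f"parity gap: {gap:.2f}")   # a large gap would flag the model for review
```

In a real audit, a gap above some agreed-upon threshold would send the model back for review, retraining on more balanced data, or both, before it ever reaches users.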

It doesn’t help that most of the internet is largely a product of European and US culture, which makes sense given that the creators of the internet as we know it are from the US. In fact, in 1997, about 80% of the World Wide Web’s content was in English. Yet even as the internet has expanded to users all over the world, its content has remained largely the same. For one, most of the internet is still in English (according to Web Technology Surveys, almost 60% of all websites are in English), automatically excluding users who speak one of the thousands of other languages spoken around the world. Because of this, and because most of the largest internet companies are based in the US, it is difficult to find data by and about cultures outside the Global North. Since the internet is mined for the data used to train AI models, there may simply not be enough information to create diverse datasets.

As more and more AI tools are developed, it is easy to get caught up in the novelty of AI and not fully consider its impact. It’s true that these are exciting times for AI, especially given its potential for social good: projects like Imago AI and ConvNetQuake are using AI to increase crop yields in poorer countries and to analyze seismograms and detect earthquakes. Institutions are embracing AI and integrating it into their systems, and while this is all well and good, the attention on AI right now also means there is no better time to have a discussion about diversity in datasets and in the field of AI more broadly. It is also important to think about the harm that AI tools may be causing and the ways it can be rectified. For starters, that looks like investing in the research and creation of diverse datasets, taking accountability for harm that has been done, and hiring a more diverse workforce.

If you have any feedback about this post, or anything else around Deepgram, we'd love to hear from you. Please let us know in our GitHub discussions.
