Kasia Chmielinski is the Co-Founder of the Data Nutrition Project and a technologist focused on building responsible data systems across industry, academia, government, and non-profit domains. Previously, they held positions at the United Nations (OCHA), US Digital Service (EOP / OMB), MIT Media Lab, McKinsey & Company, and Google. When not thinking about data, Kasia is usually cycling or birdwatching around the Northeastern US.
Fellow Follow-up, November 3, 2023
Kasia Chmielinski is part of the 2023-24 cohort of DCSL/CCSRE Technology & Racial Equity Practitioner Fellows, and is co-founder of The Data Nutrition Project.
Over the past twenty years, Kasia has worked in product management with companies including Google. They’ve also worked on large-scale products for kids and learning. Over the course of Kasia’s career, they have been keenly aware of how bias in data impacts bias in outcomes. Kasia is mostly concerned with how algorithms, as a process, contribute to those biases. The problem, Kasia notes, is that if we don’t collect demographic data, we don’t have the option to test for biases.
In 2018, Kasia was part of the Assembly Fellowship (Harvard/MIT) that inspired the creation of the Data Nutrition Project, a non-profit organization that believes “technology should help us move forward without mirroring societal biases.”
One probing question that motivates and guides Kasia’s work is: Would it be possible to build a nutrition label for data in the same way that food nutrition labels help us decide if we want to buy the product?
As Kasia points out, there is not a similar process for data. “We’re in the wild west of data,” they note.
Another research question that drives Kasia’s fellowship project: How do we think about the use of demographic data for algorithms, and what are the kinds of ways we could leverage something like a nutrition label to highlight the issues with using or not using demographic data?
Kasia is especially focused on the communities they are part of and most familiar with, in this case, the AAPI community.
Ultimately, Kasia would like to create a set of guidelines to approach the conundrum of balancing the collection of demographic data with privacy concerns.
I spoke with Kasia recently to get a sense of what they’ve been up to since beginning the Practitioner Fellowship. We talked about lessons learned and what the DCSL community can do to continue to support their work. We also discussed zombie data, a term Kasia details in the interview below.
Note: This interview has been edited down for length and clarity.
Tara: Since starting your DCSL/CCSRE fellowship, what have you been up to and can you talk about some wins and lessons learned so far?
Kasia: AI regulation has blown up in the last six months with the launch of ChatGPT and the EU regulation that’s coming out. So, I’ve been sucked into that conversation a lot, which has distracted me a little from the research, to be honest. But The Data Nutrition Project has been actively involved in conversations about transparency and accountability. So, I’m trying to figure out how to bring demographics and demographic data into current conversations.
Some of the issues I notice going on now include the idea that policymakers are still not talking about data and no one knows what ChatGPT is trained on. There’s also a lack of transparency around training datasets in general. People are so confused as to why training modules remain biased. One way to see this is to ask ChatGPT a question in another language. You’ll find it degenerates pretty quickly because it wasn’t trained on languages other than English and other major European languages. These are just some examples of where my attention is being pulled.
I’d say that the challenge for me has been the fact that the world is moving so quickly and so is my attention. I want to talk about demographic data used in algorithms while everyone else is talking about generative AI. So, I ask myself, do I approach my research by talking about generative AI, or is there a news hook to make it more relevant to the conversations that are happening now?
I think this question is a very important topic, but it seems to be missed because there are all these factions in the AI world that people are talking about. People focus on existential risks, like the Terminator coming in and killing us and ending humanity, as opposed to the real problems that we’re all confronting right now. These systems are already biased and discriminating, and we need to focus on the problem at hand.
Tara: When you figure out a way to balance our projects with the current discourse and the ever-changing world, please let me know! [Laughing]. I completely empathize with you. I get sucked into the current discourse as well. That said, how can the DCSL community continue to support your work currently and into the future?
Kasia: Certainly, when I have something to workshop, it would be great to hear candid feedback from folks. That’s always helpful. I also have other projects happening that I think could be useful to plug into the community.
Not so much related to the fellowship, but I’ve been thinking about ways to get the word out about my work and interests that aren’t research papers. So, I’ve been toying around with the idea of creating a podcast about data and the way data impacts society. I’ve also been working on some scripts and pilots. It would be great to have some input from the DCSL/CCSRE community on those more creative projects as well.
Lots of people have podcasts, so not just feedback, but also help thinking through production, distribution, and contacts in the industry. It’s always good to be part of a network of good, smart, and kind folks.
Tara: Is there anything else you’d like to share?
Kasia: I think it could be cool to see if folks have interest in what I’m calling zombie datasets. This isn’t for my project, but rather for the podcast I’m working on. I’m essentially looking for really egregious datasets that just won’t die, but keep getting used over and over again to the detriment of society. I have a collection of datasets, probably too many actually. I want to collect them all like Pokémon. [Laughing].
For example, there’s a dataset called the Pima Indians Diabetes Database. It ended up being the most important dataset for diabetes, where even the definition of diabetes came from the original dataset and study. People keep using it without understanding that these numbers represent women’s bodies. It’s one of those datasets that is so embedded in the culture of data curriculum that it will never die, even though it should.
So yeah, I’m basically proposing an open call for zombie datasets if anyone has any to discuss. I’m in the process of gathering stories now and doing pre-production for the podcast.
Tara: Would BMI numbers be considered a zombie dataset?
Kasia: Yes! BMI is a really good example of a metric that was built based on white guys within a certain age range who were also athletes. Like, really, how does that help us? That’s a really good example, yes.
Tara: Makes sense!
Fellow Follow-up is a new monthly feature in Signal/Noise: A Newsletter from the Stanford Digital Civil Society Lab aimed at showcasing the work of current and past DCSL/CCSRE Technology & Racial Equity Practitioner Fellows.