When you think of the word “superhero” what do you imagine? Language reflects and reinforces social norms; ungendering language is a vital part of interrogating sexism. However, there’s no dataset of gendered words. This talk is about data – where to get it and how to create it if it doesn’t exist. In this talk, LinkedIn software engineer Omayeli Arenyeka creates the dataset for The Gendered Project, showing how to view unavailable data as an opportunity rather than an obstacle to answering questions.
Gretchen DeKnikker: Hey, everybody. Welcome back. Our next session here is with Omayeli Arenyeka. Arenyeka, tell me I’m saying it right.
Omayeli Arenyeka: Arenyeka.
Gretchen DeKnikker: All right.
Gretchen DeKnikker: This is important to get people’s names right. So she is a software engineer at LinkedIn. She is also an artist and a poet from Nigeria. She submitted her talk to us through our speaker submissions on how she built a gendered dictionary and we thought it was so interesting that we invited her to come here and share it with you guys today, so …
Gretchen DeKnikker: Also, the videos will be available later. Don’t forget to tweet with hashtag, #GGXElevate. We’ve got the Q&A going in the bottom. And just after this session we will give away some more socks, so stay tuned.
Omayeli Arenyeka: That’s good?
Gretchen DeKnikker: Yep. I see you.
Omayeli Arenyeka: Okay.
Omayeli Arenyeka: I’m Yeli. Thank you so much for having me. My talk is about building a gender dictionary. But before we get into all the technical stuff I wanted to play a little word game. And it’s simple, you don’t have to do anything but think really hard.
Omayeli Arenyeka: So the game is, I say a word and you think of an image that’s associated with it.
Omayeli Arenyeka: Okay, here we go. Superhero. Ninja. Hacker. Rockstar.
Omayeli Arenyeka: And then now I want you to consider the images that came up, if they were of humans, whether those images were of a man or a woman or someone who doesn’t exist in those binaries. And this isn’t to shame anybody, it’s just an opportunity to reflect on biases because those biases we have they make their way into things that are supposed to be objective. So when given the option, translating from English to French, machine assisted language translation systems, like Google Translate, code the word nurse as feminine.
Omayeli Arenyeka: So in the Turkish language they use a gender-neutral pronoun that covers he, she, it. So when Google Translate goes from Turkish to English it has to decide whether the gender-neutral pronoun means he or she or it.
Omayeli Arenyeka: So this poem is written by Google Translate on the topic of gender, and is a result of translating Turkish sentences that use that gender neutral pronoun into English, so some of the lines are, he’s a teacher. He’s a soldier. She’s a teacher. He’s a doctor. She’s a nurse. She’s a nanny. He’s a painter. He’s an engineer. He’s a president. He’s an artist. He’s a lawyer.
Omayeli Arenyeka: And so, the algorithm is basing its translations on a huge corpus of human language, so it’s reflecting a gender bias that already exists in the English language.
Omayeli Arenyeka: Another example of the effect of gendered language was highlighted by the augmented writing platform, Textio. They found that the gendered language in your job posting can predict a higher … can predict the gender of the person that you hire.
Omayeli Arenyeka: So thinking about this and other ways that our everyday gendered language communicates ideas we might not mean to, I decided I wanted to create something to allow [inaudible 00:03:45] for gender language, and that’s what this talk is about, Building a Gendered Dictionary.
Omayeli Arenyeka: So specifically, I wanted to make an API and a tool where you could find all the gendered words, you could find the equivalent of a gendered word. To be clear, what a gendered word is, they’re words that apply to a certain gender. So, lady/gentleman, prince/princess, so some of them like lady and gentleman, prince and princess, they have equivalents and some of them don’t. Some of them, like actor, are not gendered in definition but might be gendered in practice.
Omayeli Arenyeka: So the first question that I had to ask was, where and how do I get this data? So there are some existing data sets of gendered words. One of those examples is from a team of Boston … a team of researchers from Boston University and Microsoft Research. They created a data set that’s part of their work into removing the sexist biases that exist in the corpora and data sets that train algorithms, like Google Translate. So they were trying to remove the bias from platforms like Google Translate.
Omayeli Arenyeka: But unfortunately, all the data sets I found, including that one, were not substantial enough. At most they had 1,000 words, and a lot of the words were false positives, so they weren’t actually gendered words. So I decided I would use these methods, API, static data, and web scraping to get the data.
Omayeli Arenyeka: So to start with, I had to determine what a gendered word was, so what would I tell the computer that a gendered word was? So to start, it was all the words in the dictionary that have at least one of these terms in their definition, so woman, female, girl, lady, man, male, boy. For example, businessman has the word man in its definition and archeress has the word female in its definition, so both of them would count as gendered words.
Omayeli Arenyeka: Then I started looking for some APIs. So I found one of the largest … the biggest online English dictionary by number of words, Wordnik. It has an online API and it has a free … it has a reverse dictionary feature, which means find all the words that have one of those terms in their definition. So you can see on this screenshot, the reverse dictionary of woman is all the words in the dictionary that have the word woman in their definition. So airwoman would have the word woman in its definition, so it would count as a reverse dictionary term.
Omayeli Arenyeka: So Wordnik has a client for interacting with the APIs, so I just used that to make a call to their reverse dictionary. You can see that happening in line seven. I have all the terms and then I make the call to the Wordnik API in line 10.
Omayeli Arenyeka: So I got about 400 words back, which was kind of confusing because the API said that there were over 3,000 words that were … that had the word woman in their definition. So I had to find another data set, so I stored the 400 words I got from the Wordnik reverse dictionary API and then moved on to the second way of getting data, static data sets.
Omayeli Arenyeka: So I looked on GitHub and I found a dictionary in JSON format and I read that in using Python. So Python has a JSON module that you can just import. So I loaded that in for filtering and I got all the definitions of the word, as you can see on line six.
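The slide with this code is not reproduced in the transcript. A minimal sketch of the loading step, assuming the JSON file is a single object that maps each word to its definition (the layout and the inline `SAMPLE_JSON` stand-in are assumptions, not the actual file from GitHub):

```python
import json

# Hypothetical stand-in for the dictionary file; assumes a {word: definition} layout.
SAMPLE_JSON = """
{
  "airwoman": "A woman who pilots or crews an aircraft.",
  "manatee": "A large aquatic plant-eating mammal.",
  "businessman": "A man who works in commerce."
}
"""

# With a real file this would be: with open(path) as f: dictionary = json.load(f)
dictionary = json.loads(SAMPLE_JSON)

# Walk every entry and read its definition, ready for filtering.
for word, definition in dictionary.items():
    print(word, "->", definition)
```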
Omayeli Arenyeka: So, like I said, if a word has one of these terms in its definition, then it’s a gendered word. So how do we check that? With Python you can say, “If string in definition.” So if woman in definition, or female in definition, or lady in the definition, but then you have this long list of conditions. So instead of doing that we can use RegEx. So, for example, my name is Omayeli, but a lot of people often misspell it, so I could use RegEx to create one pattern that matches my name and all the misspelling of my name.
Omayeli Arenyeka: So I created a RegEx pattern for all of these terms, so I could search them in definitions and see if the word was a gendered word. So in RegEx the pipe symbol represents or, so this is saying match woman or female or girl. And then, if you find any of these strings … if you find any of these words in the string, you can see pattern.search(definition), it’s searching the definition for one of those patterns. And if you find it in the definition, then we do something. But the issue with that is that it wasn’t looking for whole words, so sub-strings also count. You can see on the right the words that are matched, human, manhole, so these are not gendered words. They have the word man in them but it’s just a part of the word and not the full word.
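The pattern itself is on a slide the transcript does not show. A rough sketch of the idea, with hypothetical sample definitions, could look like this. It uses the pipe for alternation and, because there are no word boundaries yet, it also reproduces the substring problem described above:

```python
import re

# Alternation: the pipe means "or". No word boundaries yet.
pattern = re.compile(r"woman|female|girl|lady|man|male|boy")

# Hypothetical definitions, invented for illustration.
definitions = {
    "airwoman": "a woman who flies aircraft",
    "sewer": "an underground conduit reached through a manhole",
    "anthropology": "the study of human societies",
    "chair": "a seat with a back",
}

# One search replaces a long chain of `"woman" in definition or ...` checks.
matched = [word for word, definition in definitions.items()
           if pattern.search(definition)]

print(matched)  # ['airwoman', 'sewer', 'anthropology']
```

Only airwoman is a true hit here; sewer and anthropology match because "man" appears inside "manhole" and "human".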
Omayeli Arenyeka: So I had to use word boundaries, so word boundary allows you to perform a “whole words only” search. So now it’s looking for whole words and not just part of a word. So you can see on the right, it no longer matches manhole and manatee. It only matches man and boy at the bottom.
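As a small illustration of the difference, assuming the same seven gender terms, the sketch below runs a pattern without boundaries and one wrapped in `\b` over the same sample text (the sample text is made up):

```python
import re

terms = r"woman|female|girl|lady|man|male|boy"

substring_pattern = re.compile(terms)
# \b marks a word boundary, so the term must stand alone as a whole word.
whole_word_pattern = re.compile(r"\b(?:" + terms + r")\b")

text = "human manhole manatee man boy"

substring_hits = substring_pattern.findall(text)
whole_word_hits = whole_word_pattern.findall(text)

print(substring_hits)   # ['man', 'man', 'man', 'man', 'boy']
print(whole_word_hits)  # ['man', 'boy']
```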
Omayeli Arenyeka: But then I also want words like grandfather, so what do I do? So I use word characters: \w matches a word character, so anything from A to Z, zero to nine. So now it finds father and all the words that are combinations of father and another word in front.
Omayeli Arenyeka: So going step-by-step through the patterns, these parentheses are for grouping a pattern together as one. This character set says, “Match anything in this set.” So you don’t have to match all of them, but you just have to match one thing in that set. This is, like I said, matching a word character. This is matching a dash. And then this is saying it’s optional, so there can be something before the word but there doesn’t have to be.
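The exact pattern from the slide is not in the transcript, so the one below is an assumption that combines the pieces just described: a group, a character set containing a word character and a dash, and a marker that makes the prefix optional.

```python
import re

# (?:...)   groups the prefix as one unit
# [\w-]     character set: a word character or a dash
# +         one or more of those characters
# ?         the whole prefix group is optional
pattern = re.compile(r"\b(?:[\w-]+)?(?:father|mother)\b")

samples = [
    "his grandfather",
    "her step-father",
    "their father",
    "the joys of fatherhood",
]

found = []
for sample in samples:
    match = pattern.search(sample)
    found.append(match.group(0) if match else None)

print(found)  # ['grandfather', 'step-father', 'father', None]
```

The last sample does not match because the trailing `\b` requires the word to end right after "father".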
Omayeli Arenyeka: And these are the final RegEx patterns. They’re pretty long. So after I finalized the pattern I went through the dictionary and for each entry in the dictionary, if the definition contained one of those terms then I added it to the list of gendered words. So in line eight it’s checking if any of those terms are in the definition, then we add that to our list of gendered words.
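A compact sketch of that filtering loop, using a simplified whole-word pattern rather than the longer final patterns from the slide (the sample dictionary is hypothetical):

```python
import re

GENDER_TERMS = ["woman", "female", "girl", "lady", "man", "male", "boy"]

# Simplified stand-in for the final patterns: whole-word match, case-insensitive.
pattern = re.compile(r"\b(?:" + "|".join(GENDER_TERMS) + r")\b", re.IGNORECASE)

# Hypothetical entries, invented for illustration.
dictionary = {
    "businessman": "A man who works in commerce.",
    "archeress": "A female archer.",
    "manatee": "A large aquatic mammal.",
    "humanity": "Human beings collectively.",
}

gendered_words = []
for word, definition in dictionary.items():
    # Keep the word if any gender term appears in its definition.
    if pattern.search(definition):
        gendered_words.append(word)

print(gendered_words)  # ['businessman', 'archeress']
```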
Omayeli Arenyeka: So when I add that together, the words from Wordnik and Webster and some other files, it came to about 8,000, which is great. Much more than the 400 that I started with. But then when I went through the list there were words that did not belong there, words like lioness. So for my definition of what I wanted this gendered dictionary to be, it was a collection of gendered words for human beings, so not animals. So this was not a word that I wanted in my word set.
Omayeli Arenyeka: So instead … So I decided I would start to look for patterns in the incorrect words, so find … what were the common things in the definitions of the words that were not supposed to be in the set? So one of the patterns of incorrect words that I found was that in some of the incorrect words the definition included the gender term being used as the object of a preposition. So, for example, in the definition of waterfall it says, “An arrangement of a woman.” In the definition of Peter it says, “A common baptismal name for a man.” So it’s not a name that is describing a man. It’s a name for a man, so Peter shouldn’t be a gendered word. And you can see in the other definitions, “Short cape worn by woman,” or, “The position of a man.”
Omayeli Arenyeka: So how would I remove words that fit this category, and the category being the gendered word is being used as the object of a preposition? So first, I had to isolate the part of the string that I wanted to look at, and that was everything before the gendered word. So you can see, the highlighted portion is everything before the word … before and including the word man. So we can use … in Python, we can use the re module, which is for handling regular expressions. So the re.search method in line four will search through the text for any of those terms, any of our gender terms in line two. So in this case we have the string definition in line eight, so it’s looking for the word man and it will find the word man.
Omayeli Arenyeka: And then we get … When we get the location of where the word is, you get the end index. So in line nine you can see the search method in RegEx returns the index of where the word was found. So, from there we can get the end index and then we can use that to trim the string. So in line 11 you can see that the word, it’s now a common baptismal name for a man and it doesn’t include everything after.
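A minimal sketch of the trimming step. In Python, `re.search` returns a match object, and its `end()` method gives the index just past the matched word. The function name and the sample sentence are illustrative, not taken from the slides.

```python
import re

GENDER_PATTERN = re.compile(r"\b(?:woman|female|girl|lady|man|male|boy)\b")

def trim_after_gender_term(definition):
    """Return the definition up to and including the first gender term."""
    match = GENDER_PATTERN.search(definition)
    if match is None:
        return definition  # nothing to trim
    return definition[:match.end()]

trimmed = trim_after_gender_term(
    "A common baptismal name for a man, used since antiquity."
)
print(trimmed)  # A common baptismal name for a man
```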
Omayeli Arenyeka: So after we trim it, we remove any punctuation. We use the string module in Python. The string module has a list of punctuation characters, so we use that to filter in line five and then we return the string without any of the punctuation. So if there was punctuation it would remove it. So if there was a string in line seven, the return string would be line nine, with no punctuation.
Omayeli Arenyeka: So now that we have this trimmed definition we can use NLTK to find where the preposition is in the string. NLTK stands for Natural Language Toolkit. It’s used for processing the English language. So the first thing we do is we tokenize it, so tokenization is the process of chopping up a string into different pieces that are called tokens, and then throwing away certain characters like punctuation. So you can see on the right we pass in … This is an online version of a tokenizer. We pass in a common baptismal name for a man, and then it breaks it up into different tokens.
Omayeli Arenyeka: After we tokenize we can use something called a part of speech tagger, so I load in the part of speech tagger in line … in two, it’s part of NLTK. In line five I tokenize the string. So you can see, in line six, it has … the definition is chopped up into different pieces. And then in line eight we use the tagger from NLTK, which gives every token a part of speech. So you can see, common is an adjective, baptismal is an adjective, man is a noun. And I know those because I looked up what the tags were in NLTK. So you can see, NN represents a noun, so man is a noun. JJ represents adjectives, so baptismal is an adjective. And then after that, first of all, we remove the a’s and the and’s and the the’s because we don’t really care about them, so we remove them in line four and then we get … In line nine we get the word before the gendered word. So we know that the gendered word is the last word in the sentence. It’s the last word in the string, so we get the word before that, and then we check to see if that’s a preposition.
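One way to write this filter, as a sketch: `string.punctuation` is a string of ASCII punctuation characters in Python’s standard `string` module. The function name and sample text are illustrative.

```python
import string

def strip_punctuation(text):
    """Drop every character that appears in string.punctuation."""
    return "".join(ch for ch in text if ch not in string.punctuation)

cleaned = strip_punctuation("A common, baptismal name; for a man")
print(cleaned)  # A common baptismal name for a man
```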
Omayeli Arenyeka: So in that case–this case, a baptismal name for a man, the word before man is for, which is a preposition so it returns false and says, “This is not a gendered word.”
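The check can be sketched without NLTK to keep the example self-contained. This version swaps the part of speech tagger for a small hand-written preposition set, which is an assumption made for illustration. With NLTK, the equivalent test would look for the `IN` tag in the output of `nltk.pos_tag`. The names and word lists below are hypothetical.

```python
# Illustrative subset of prepositions; a tagger would cover far more cases.
PREPOSITIONS = {"for", "of", "by", "to", "with", "on", "in", "from"}
# Small words dropped before looking at the word in front of the gender term.
SKIP_WORDS = {"a", "an", "and", "the"}

def is_gendered_definition(trimmed_definition):
    """Return False when the final word (the gender term) follows a preposition.

    Expects a definition already trimmed so the gender term is the last word.
    """
    tokens = [t for t in trimmed_definition.lower().split()
              if t not in SKIP_WORDS]
    if len(tokens) < 2:
        return True  # no preceding word to inspect
    return tokens[-2] not in PREPOSITIONS

print(is_gendered_definition("A common baptismal name for a man"))  # False
print(is_gendered_definition("An adult man"))                       # True
```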
Omayeli Arenyeka: Another pattern was that there were a lot of clothing items. So you can see skivvies, pajama, loose-fitting trousers, all of these are not gendered words. So I found a list of clothing items so I can remove any words that have one of these clothing items in the definition. Unfortunately, the website where I found the list, I had to apply for an API key. I did apply like six months ago and they didn’t get back to me, so I decided I would scrape their website. So web scraping is a tool for extracting information from websites that involves grabbing the HTML that makes up the website. And for doing this in Python there are two libraries I usually use, urllib.request and Beautiful Soup.
Omayeli Arenyeka: So the first thing you have to do is figure out how the data you want is structured in the DOM, which you can do using the inspector tab of your browser. So we see that the data I want is in a link. That’s the child of a span element with a class TD. So I open the URL of the page in line two. In line three I add it to Beautiful Soup. In line four, and then in line five I look for the specific elements. So I’m trying to find links that are the children of spans with class TD. And then I get all the texts for them and I have the list of clothing items. So I use that to filter the dataset to remove any clothing items that are disguising themselves as gendered words.
Omayeli Arenyeka: And it came down to about 4,000 words. The last thing I wanted to do was find gender opposites, so I wanted to match words with their opposites, king/queen, father/mother. I can use that … I can do that using something called Word2Vec. So Word2Vec is an algorithm that transforms words into vectors. So back in 2013, a handful of researchers at Google set loose a neural net on a large corpus of about three million words taken from Google News texts. So the goal was to look for patterns in the way words appear next to each other. So you can see in the graph, microwave is close to refrigerator and it’s far from the word grass. Grass is close to garden, is close to hose and sprinkler. So the Google team discovered that it could represent these patterns between words using vectors and vector space. So words with similar meanings would occupy similar parts of the vector space and the relationships between words could be captured by simple vector algebra.
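The talk uses urllib.request and Beautiful Soup against a live page. To stay self-contained, the sketch below substitutes Python’s built-in `html.parser` and an inline HTML sample, so it is a stand-in for the approach rather than the code on the slide. The markup, the lowercase `td` class name, and the sample definitions are all assumptions.

```python
from html.parser import HTMLParser

# Hypothetical markup standing in for the downloaded page.
SAMPLE_HTML = """
<div>
  <span class="td"><a href="/skirt">skirt</a></span>
  <span class="td"><a href="/trousers">trousers</a></span>
  <span class="other"><a href="/about">about</a></span>
</div>
"""

class ClothingLinkParser(HTMLParser):
    """Collect the text of <a> tags that are children of <span class="td">."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of (tag, class attribute) for open elements
        self.items = []

    def handle_starttag(self, tag, attrs):
        self.open_tags.append((tag, dict(attrs).get("class")))

    def handle_endtag(self, tag):
        # Assumes well-formed markup with no void elements, as in the sample.
        if self.open_tags:
            self.open_tags.pop()

    def handle_data(self, data):
        inside_link = (
            len(self.open_tags) >= 2
            and self.open_tags[-1][0] == "a"
            and self.open_tags[-2] == ("span", "td")
        )
        if inside_link and data.strip():
            self.items.append(data.strip())

parser = ClothingLinkParser()
parser.feed(SAMPLE_HTML)
clothing_items = parser.items
print(clothing_items)  # ['skirt', 'trousers']

# Use the scraped list to drop words whose definition mentions a clothing item.
definitions = {
    "culottes": "a woman's trousers cut to resemble a skirt",
    "queen": "a female monarch",
}
kept = [word for word, definition in definitions.items()
        if not any(item in definition for item in clothing_items)]
print(kept)  # ['queen']
```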
Omayeli Arenyeka: So these relationships are known as word embeddings and the dataset is called Word2Vec. It’s based on the idea that a word is characterized by the company it keeps, so a word is close to another word in space if they appear in the same context. For example, if we give the algorithm this text, since salt and seasoning appear within the same context, the model it creates will indicate that salt is conceptually closer to seasoning than, say, chair. And with that model, and with those word vectors we can do stuff like getting the similarity of words. You can see woman and rectangle are not very similar. Their similarity value is less than 0.1, whereas the similarity value for woman and wife is 0.8.
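Similarity between word vectors is typically measured as cosine similarity. The sketch below computes it by hand on tiny made-up vectors. Real Word2Vec embeddings have hundreds of dimensions, and these numbers are invented purely to show the calculation.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional vectors, not taken from any trained model.
toy_vectors = {
    "salt": [0.9, 0.1, 0.0],
    "seasoning": [0.8, 0.2, 0.1],
    "chair": [0.0, 0.1, 0.9],
}

salt_seasoning = cosine_similarity(toy_vectors["salt"], toy_vectors["seasoning"])
salt_chair = cosine_similarity(toy_vectors["salt"], toy_vectors["chair"])

# salt and seasoning point in nearly the same direction; salt and chair do not.
print(round(salt_seasoning, 2), round(salt_chair, 2))
```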
Omayeli Arenyeka: And word analogies, so you can do it for woman. You can do woman is to queen as man is to king.
Omayeli Arenyeka: In Python, if you wanna use models and the Word2Vec algorithm you can use a library called Gensim. So I mentioned earlier that they used … they set loose a neural net on a model of like three million words, so you can load that model of three million words into Python using Gensim. So it’s called Google News Vectors, so you can see in line two we have a model, Google News Vectors, and we load that in and then we have … we can call the model’s most_similar method in order to get the equivalent of a word. And this isn’t perfect, but I think it works for most of my cases.
Omayeli Arenyeka: So in line nine we pass in woman, wife. So positive, woman is to wife, and then we pass in man, and then the results we get if the score is greater than 0.6 then we say it’s an equivalent. And so, we have …
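The Gensim call needs the full Google News model, so here is a simplified sketch of the underlying arithmetic on made-up vectors instead. It computes wife minus woman plus man, picks the closest remaining word, and applies the 0.6 threshold mentioned above. The vectors and the `equivalent` helper are hypothetical, and Gensim’s own method differs in detail.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical toy vectors: dimensions loosely stand for gender, spouse, royalty.
vectors = {
    "woman":   [1.0, 0.0, 0.0],
    "man":     [-1.0, 0.0, 0.0],
    "wife":    [1.0, 1.0, 0.0],
    "husband": [-1.0, 1.0, 0.0],
    "queen":   [1.0, 0.0, 1.0],
    "king":    [-1.0, 0.0, 1.0],
}

def equivalent(word_a, word_b, word_c, vectors, threshold=0.6):
    """word_a is to word_b as word_c is to ... (None if nothing scores high enough)."""
    target = [b - a + c for a, b, c in
              zip(vectors[word_a], vectors[word_b], vectors[word_c])]
    # Score every word except the three inputs.
    scores = {word: cosine(target, vec) for word, vec in vectors.items()
              if word not in (word_a, word_b, word_c)}
    best = max(scores, key=scores.get)
    return best if scores[best] > threshold else None

print(equivalent("woman", "wife", "man", vectors))   # husband
print(equivalent("woman", "queen", "man", vectors))  # king
```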
Omayeli Arenyeka: And that was it. I got an initial word set using APIs and finding static datasets, and then I cleaned and filtered the dataset using RegEx, web scraping and NLTK, and then I used Word2Vec to find the equivalent for words that have them. And then I created a website and an API to house the data. So you can see, woman is to wife as man is to husband. So when navigating the site users can learn what words are specific to a gender, what words have gender equivalents, what words don’t, and where significant imbalances exist. And you can see what words have undergone semantic derogation, which is a process where [inaudible 00:20:25] take on more negative connotations. For example, the word mistress was more … was once the equivalent of the word master but over time it’s taken on a new meaning.
Omayeli Arenyeka: So in summary, the words that we don’t and do have matter. They reflect our biases and the ideas that we value [inaudible 00:20:42] risk reinforcing perpetrating those biases if we don’t [inaudible 00:20:42] the words we use and why.
Omayeli Arenyeka: Thank you.
Gretchen DeKnikker: Thank you so much, Omayeli. This was great. I wish you … Go back in and read the comments, because everyone was so excited about how you broke down this search methodology and the test you did at the beginning. Everyone was like, “Oh, I thought of a man, too.” So everyone really, really enjoyed it. Unfortunately, we don’t have time for Q&A and I know we’re missing those a little bit today, so don’t worry everybody. We have a list of the questions and we can go back and do more in-depth interviews with all of the speakers later. So your questions will get answered at some point. Thank you again.