After working on my research for the past year, I can conclude that there is indeed gender bias in Wikipedia. According to measures of shortest paths and betweenness centrality, categories that mention men in the title are 11 times more central to Wikipedia than categories that mention women. In other words, it is much easier for any user to stumble upon a men-related category by chance. At the same time, there are many more categories that explicitly mention “women” in the title, most likely because women are historically not the “norm” in many areas. Therefore, there is both an underrepresentation and an overrepresentation of women in Wikipedia, which are strong signals of bias.
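To make the "betweenness centrality" measure concrete, here is a minimal sketch of how it can be computed on a small directed graph. This is not the code used in the study: the toy graph and the function name are made up for illustration, and it uses only the Python standard library (NetworkX offers `betweenness_centrality` for the same purpose). The score of a node counts how many shortest paths between other pairs of nodes pass through it, so a higher score means a node is easier to pass through when moving around the network.

```python
from collections import deque

def betweenness_centrality(graph):
    """Unnormalized betweenness for a directed, unweighted graph
    given as {node: [successor, ...]} (Brandes' algorithm)."""
    nodes = set(graph) | {m for nbrs in graph.values() for m in nbrs}
    score = {n: 0.0 for n in nodes}
    for s in nodes:
        # Breadth-first search from s, counting shortest paths (sigma).
        stack = []
        preds = {n: [] for n in nodes}
        sigma = {n: 0 for n in nodes}
        dist = {n: -1 for n in nodes}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in graph.get(v, ()):
                if dist[w] < 0:              # first time w is reached
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:   # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating dependencies.
        delta = {n: 0.0 for n in nodes}
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    return score

# Hypothetical toy graph: every path from "a" or "b" to "c" or "d" crosses "hub".
TOY_GRAPH = {"a": ["hub"], "b": ["hub"], "hub": ["c", "d"]}
scores = betweenness_centrality(TOY_GRAPH)
print(scores["hub"])
```

In this toy graph, "hub" sits on all four shortest paths between the outer nodes, while the outer nodes sit on none, which is the kind of gap the centrality comparison above is measuring.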
Working on my research for the past year has taught me many things. By focusing my research on gender bias in Wikipedia using graph theory, I reinforced my prior understanding of network characteristics such as “shortest paths” and learned about new characteristics such as “betweenness centrality.” In addition to this, I was able to learn a bit about data science along the way, which is something that interested me but I wasn’t able to learn about in my regular classes. Working on my research definitely took my learning to a deeper level and I was able to discover that data science is something that really interests me.
Aside from gaining important knowledge, I feel that I achieved a lot by completing this research. First and foremost, I wrote and presented my honors thesis while being part of this program, which is one of my biggest academic accomplishments to date. Secondly, this research lays the groundwork for continuing and deepening our study, which may eventually lead to its publication.
Collaborating with my mentor, Professor Genc, was extremely helpful in completing this research and being able to draw the conclusions that we did. I received a lot of guidance from him in terms of which questions I should be striving to answer, what kind of network characteristics I should be looking at, and best methods for obtaining data. We met weekly to discuss progress and define next steps, and I believe this kind of consistent communication was crucial to completing our study.
I am so glad I got the chance to be part of a research study like this. I feel that it really pushed me and challenged me in the best ways possible. It was a truly amazing way to finish off my undergraduate studies.
After completing and presenting the research I had done through December as part of my Honors thesis, Professor Genc and I have been continuing the research to go a bit further and ultimately publish our work. More specifically, we have been exploring additional measures we can take and analyze in terms of studying gender bias in Wikipedia data.
Working on this research has allowed me to turn theory into practice. For example, the heart of this research is graph theory and analyzing different characteristics of a particular network to gain insight about the data. Graph theory was part of my Computer Science curriculum at Pace but having the chance to apply that theory in research has greatly reinforced my understanding of the topic. Additionally, I have learned a lot about a couple of libraries in Python, specifically NetworkX and Matplotlib.
There have been many successes with our research, but also some challenges. I would say the biggest success thus far has been completing and presenting the part of the research that was intended for my thesis. However, the process moved along quickly and there wasn’t much time between deadlines, so I was often concerned that I would not have time to get enough data or different kinds of measurements to draw conclusions. Gathering the data was a bit difficult as well. There was a lot of information to filter through, and I often had to gather the same data multiple times because I would realize I was either missing something or counting something that should not have been included.
I am looking forward to going deeper with our research, adding more information to my thesis, and editing the paper to prepare it for publishing.
At this point, I have completed the research necessary for my honors thesis and will be presenting my paper this week. We determined that there is gender bias in Wikipedia, and we found both overrepresentation and underrepresentation of women. There is overrepresentation in the sense that there are six times more categories mentioning women in the title than men. But there is underrepresentation in the sense that, in terms of shortest paths and measures of betweenness centrality, men are 11 times more central in Wikipedia than women.
Since I only had this one semester to complete my thesis, I was under a lot of time constraints and was simultaneously conducting research while writing pieces of my paper. Professor Genc and I met about once a week to touch base about the progress we were making, any new results, working through anything I was struggling with, and next steps in the research. The beginning of our research was a bit tough because it was hard to figure out where to begin. But once we started, the questions immediately began to flow more and more with every new bit of information that we uncovered and we were able to get an idea of what paths to take in our research.
Conducting this research taught me a lot about data science and I became much more familiar with Python as a programming language. I really enjoyed gathering and analyzing this data to the point where it oftentimes didn’t feel like work. I’ve gained a lot of interest in data science from doing this research and am fairly certain that I would like to pursue it in graduate school.
Although my honors thesis is complete, there is still much more research to be done and more things to uncover in terms of gender bias in Wikipedia. Therefore, Professor Genc and I will be continuing this research and extending it beyond my honors thesis.
My project revolves around detecting gender-biased data in Wikipedia. We are exploring the ontological structure of Wikipedia by using graph theory to determine if some kind of gender bias exists. Looking at bias in data is extremely important as the reliance on artificial intelligence for decision making grows. This is because artificial intelligence is trained with large data sets, and if the data set is biased, then the decisions that the artificial intelligence program makes will be biased as well. For example, recent studies have found AI-based decisions to show signs of racial bias in aiding criminal defense decisions, gender bias in supporting hiring decisions, and a mixture of biases in aiding policy making. Our goal is to potentially detect gender bias in Wikipedia’s data so that it may be minimized or even eliminated before it is used to train a program.
To do this, we will create a graph of Wikipedia where each page/category from Wikipedia will be a node and each link through Wikipedia’s categorical structure will be an edge. By looking at the way pages/categories are linked to each other, we can gauge the relatedness between them. For example, the Wikipedia “Housekeeping” page is related to the “Woman” page through 71 different categorical paths, while “Housekeeping” is related to “Man” through only 27 different paths, suggesting that the housekeeping concept is situated closer to women than men in the data.
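As a rough illustration of the idea of counting categorical paths between two pages, here is a small sketch. The links below are invented for the example and are not real Wikipedia data, the function names are hypothetical, and the sketch assumes the toy category graph has no cycles (the real category structure may contain them, which would need extra handling). It uses only the standard library; NetworkX has `all_simple_paths` for a similar job.

```python
from collections import defaultdict
from functools import lru_cache

# Hypothetical (child, parent) category links; NOT real Wikipedia data.
LINKS = [
    ("Housekeeping", "Home"),
    ("Housekeeping", "Domestic work"),
    ("Home", "Family"),
    ("Domestic work", "Family"),
    ("Domestic work", "Labor"),
    ("Family", "Woman"),
    ("Family", "Man"),
    ("Labor", "Man"),
]

def build_graph(links):
    """Each page/category is a node; each child -> parent link is an edge."""
    graph = defaultdict(list)
    for child, parent in links:
        graph[child].append(parent)
    return graph

def count_paths(graph, source, target):
    """Count distinct directed paths from source to target.
    Assumes the graph is acyclic, so memoized recursion terminates."""
    @lru_cache(maxsize=None)
    def walk(node):
        if node == target:
            return 1
        return sum(walk(parent) for parent in graph.get(node, ()))
    return walk(source)

graph = build_graph(LINKS)
to_woman = count_paths(graph, "Housekeeping", "Woman")
to_man = count_paths(graph, "Housekeeping", "Man")
print(to_woman, to_man)
```

The numbers from this toy graph mean nothing on their own; the point is only that comparing two path counts from the same starting page gives a simple measure of which target it is more closely tied to.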
Since the entirety of Wikipedia is an extremely large dataset, we will be using Python and some of its libraries such as NetworkX and Datashader to visualize and analyze our data. From our extremely large Wikipedia graph, we will then extract every page that mentions or is related to either men or women and create a smaller, much more manageable graph with only those nodes. We can look at only pages that mention or are related to men, only pages that mention or are related to women, and/or we can look at a combination of both. Next, we will look at shortest paths between specific, and possibly gendered, pages/categories to see how close each page is to the ultimate supercategory of “Woman” and the ultimate supercategory of “Man.”
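The two steps described here, pulling out a smaller graph and then measuring shortest paths, might look something like the sketch below. Everything in it is an assumption made for illustration: the toy graph, the helper names, and the choice to match whole words in titles (so that "men" does not accidentally match inside "women"). It uses only the standard library; in NetworkX the equivalent pieces would be `Graph.subgraph` and `shortest_path_length`.

```python
import re
from collections import deque

# Hypothetical toy category graph {node: [parent, ...]}; NOT real Wikipedia data.
TOY_GRAPH = {
    "Women scientists": ["Women by occupation", "Scientists"],
    "Women by occupation": ["Women"],
    "Scientists": ["People by occupation"],
    "People by occupation": ["People"],
    "Women": ["People"],
    "People": [],
}

def mentions(title, words):
    """True if the title contains any of the words as a whole word."""
    pattern = r"\b(" + "|".join(map(re.escape, words)) + r")\b"
    return re.search(pattern, title, flags=re.IGNORECASE) is not None

def induced_subgraph(graph, keep):
    """Keep only the chosen nodes and the edges between them."""
    return {n: [m for m in nbrs if m in keep]
            for n, nbrs in graph.items() if n in keep}

def shortest_path_length(graph, source, target):
    """Breadth-first search; returns the number of edges, or None if unreachable."""
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

women_nodes = {n for n in TOY_GRAPH if mentions(n, ["woman", "women"])}
women_graph = induced_subgraph(TOY_GRAPH, women_nodes)
hops = shortest_path_length(women_graph, "Women scientists", "Women")
print(sorted(women_nodes), hops)
```

Matching on whole words is one possible way to avoid counting something that should not be included; other filtering rules would work with the same overall structure.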
By working on this project, I expect to become very familiar with Python and some of its libraries such as NetworkX, Datashader, NumPy, and Matplotlib. I expect to learn a lot about graph theory and how it is used to study networks as well. Finally, the ultimate goal is to publish our findings from this project with my advisor, which will be a huge achievement.
 My Honors Thesis Proposal