At this point, I have completed the research necessary for my honors thesis and will be presenting my paper this week. We determined that there is a gender bias in Wikipedia, and we found both overrepresentation and underrepresentation of women. Women are overrepresented in the sense that there are six times more categories mentioning women in the title than men. But they are underrepresented in the sense that, in terms of shortest paths and measures of betweenness centrality, men are 11 times more central in Wikipedia than women.
Since I had only this one semester to complete my thesis, I faced tight time constraints and was conducting research while writing pieces of my paper. Professor Genc and I met about once a week to discuss our progress, any new results, anything I was struggling with, and next steps in the research. The beginning of our research was a bit tough because it was hard to figure out where to start. But once we got going, new questions arose with every bit of information we uncovered, and we were able to get an idea of which paths to take in our research.
Conducting this research taught me a lot about data science and I became much more familiar with Python as a programming language. I really enjoyed gathering and analyzing this data to the point where it oftentimes didn’t feel like work. I’ve gained a lot of interest in data science from doing this research and am fairly certain that I would like to pursue it in graduate school.
Although my honors thesis is complete, there is still much more research to be done and more things to uncover in terms of gender bias in Wikipedia. Therefore, Professor Genc and I will be continuing this research and extending it beyond my honors thesis.
My project revolves around detecting gender-biased data in Wikipedia. We are exploring the ontological structure of Wikipedia by using graph theory to determine whether some kind of gender bias exists. Looking at bias in data is extremely important as the reliance on artificial intelligence for decision making grows. This is because artificial intelligence is trained with large data sets, and if the data set is biased, then the decisions that the artificial intelligence program makes will be biased as well. For example, recent studies have found AI-based decisions to show signs of racial bias in aiding criminal defense decisions, gender bias in supporting hiring decisions, and a mixture of biases in aiding policy making. Our goal is to detect gender bias in Wikipedia’s data so that it may be minimized or even eliminated before it is used to train a program.
To do this, we will create a graph of Wikipedia where each page/category from Wikipedia will be a node and each link through Wikipedia’s categorical structure will be an edge. By looking at the way pages/categories are linked to each other, we can gauge the relatedness between them. For example, the Wikipedia “Housekeeping” page is related to the “Woman” page through 71 different categorical paths, while “Housekeeping” is related to “Man” through only 27 different paths, suggesting that the housekeeping concept is situated closer to women than men in the data.
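As a rough illustration of the path-counting idea, here is a minimal NetworkX sketch. The edge list is a made-up toy example, not real Wikipedia data, and the helper name count_category_paths is hypothetical. It assumes edges point from a page or subcategory up to its parent category, and it counts simple (non-repeating) paths, which is only one possible way to define a "categorical path."

```python
import networkx as nx

# Hypothetical toy edges (child -> parent category); NOT real Wikipedia data.
edges = [
    ("Housekeeping", "Home"),
    ("Housekeeping", "Domestic work"),
    ("Home", "Family"),
    ("Domestic work", "Family"),
    ("Domestic work", "Gender roles"),
    ("Family", "Woman"),
    ("Family", "Man"),
    ("Gender roles", "Woman"),
]

# Each page/category is a node; each category link is a directed edge.
graph = nx.DiGraph(edges)


def count_category_paths(g, source, target, cutoff=None):
    """Count simple paths from source to target (hypothetical helper).

    cutoff optionally limits path length, which matters on a large graph
    where enumerating every path can be very expensive.
    """
    return sum(1 for _ in nx.all_simple_paths(g, source, target, cutoff=cutoff))


paths_to_woman = count_category_paths(graph, "Housekeeping", "Woman")
paths_to_man = count_category_paths(graph, "Housekeeping", "Man")
print(paths_to_woman, paths_to_man)
```

On the full Wikipedia graph, enumerating all simple paths would likely be too slow without a length cutoff, so a real analysis would probably need a bounded or approximate version of this count.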
Since the entirety of Wikipedia is an extremely large dataset, we will be using Python and some of its libraries such as NetworkX and Datashader to visualize and analyze our data. From our extremely large Wikipedia graph, we will then extract every page that mentions or is related to either men or women and create a smaller, much more manageable graph with only those nodes. We can look at only pages that mention or are related to men, only pages that mention or are related to women, or a combination of both. Next, we will look at shortest paths between specific, and possibly gendered, pages/categories to see how close each page is to the ultimate supercategory of “Woman” and the ultimate supercategory of “Man.”
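The extraction and shortest-path steps could be sketched as follows. This is only one interpretation: it assumes "related to" means "can reach the supercategory by following category links," and the toy edges and the helper names gendered_subgraph and distance_to are hypothetical, not part of our actual pipeline.

```python
import networkx as nx

# Hypothetical toy edges (child -> parent category); NOT real Wikipedia data.
edges = [
    ("Housekeeping", "Domestic work"),
    ("Domestic work", "Woman"),
    ("Housekeeping", "Home"),
    ("Home", "Family"),
    ("Family", "Man"),
    ("Granite", "Rocks"),      # unrelated branch that should be filtered out
    ("Rocks", "Geology"),
]
graph = nx.DiGraph(edges)


def gendered_subgraph(g, targets=("Woman", "Man")):
    """Keep the targets plus every node that can reach at least one of them."""
    keep = set()
    for target in targets:
        if target in g:
            keep.add(target)
            # nx.ancestors returns all nodes with a directed path to target.
            keep |= nx.ancestors(g, target)
    return g.subgraph(keep).copy()


def distance_to(g, source, target):
    """Number of edges on a shortest path, or None if no path exists."""
    try:
        return nx.shortest_path_length(g, source, target)
    except (nx.NetworkXNoPath, nx.NodeNotFound):
        return None


sub = gendered_subgraph(graph)
women_only = gendered_subgraph(graph, targets=("Woman",))
print(distance_to(sub, "Housekeeping", "Woman"))
print(distance_to(sub, "Housekeeping", "Man"))
```

Passing a single target, as in women_only, gives the "only women" or "only men" view, while the default gives the combined graph.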
By working on this project, I expect to become very familiar with Python and some of its libraries such as NetworkX, Datashader, NumPy, and Matplotlib. I also expect to learn a lot about graph theory and how it is used to study networks. Finally, the ultimate goal is to publish our findings from this project with my advisor, which would be a huge achievement.
 My Honors Thesis Proposal