I recently spent some more time learning about Python as I am interested in its versatility and many applications. After learning some new content, including how to scrape HTML data off of a website, I asked my mom if she had any web-crawling tasks or ideas for me. She is currently designing her professional website and she said she wanted a social network graph representing how she is connected to her coauthors, and how they are connected to each other.
I figured this was as good a challenge as any, so I embarked on a project that took a lot longer than I thought it would. The basic tasks I needed to accomplish were as follows:
- Have Python go to her Google Scholar page and take down all of the relevant information
- Parse this data so I have a list of coauthors, and the papers that they have been on (that my mother has also been on)
- Feed this information into iGraph, a graph theory open-source module
- Display the information in the desired format
So the essential start of this project consisted of me looking at this screen:
Originally, I was going to take the author names shown in gray under the title of each paper, but sadly, as you can see from the bottom two papers, if there are too many coauthors then not all of them are listed on this summary page.
Therefore, I had to find all of the links for these papers and step into each one to collect the data. The content in these links looks like this:
So now I had access to the entire list of coauthors and I could scrape it using Beautiful Soup to get a list of coauthors.
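To give a rough idea of what pulling the author list out of one of those pages involves, here is a minimal sketch. It uses Python's built-in `html.parser` as a stand-in for Beautiful Soup so that it runs without any extra installs, and the markup in `SAMPLE_PAGE` (including the `field` and `value` class names) is made up for illustration; the real Google Scholar HTML is structured differently.

```python
from html.parser import HTMLParser

# Hypothetical markup standing in for a paper's detail page. The real
# Google Scholar HTML uses different tags and class names.
SAMPLE_PAGE = """
<div class="field">Authors</div>
<div class="value">Paul Benjamin Lowry, James Ma, Helen Young</div>
<div class="field">Publication date</div>
<div class="value">2015</div>
"""


class AuthorParser(HTMLParser):
    """Collect the names in the "value" div that follows the "Authors" field."""

    def __init__(self):
        super().__init__()
        self.current_class = None  # class of the div we are inside, if any
        self.last_field = None     # text of the most recent "field" div
        self.authors = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.current_class = dict(attrs).get("class")

    def handle_endtag(self, tag):
        if tag == "div":
            self.current_class = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return  # skip the whitespace between tags
        if self.current_class == "field":
            self.last_field = text
        elif self.current_class == "value" and self.last_field == "Authors":
            self.authors = [name.strip() for name in text.split(",")]


def extract_authors(html):
    """Return the list of author names found in the given HTML string."""
    parser = AuthorParser()
    parser.feed(html)
    return parser.authors


print(extract_authors(SAMPLE_PAGE))
```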
Within this process, the obvious bit of logic was “if a coauthor with this name already exists, don’t create a new coauthor, but just add a paper to the existing coauthor”. (Side note: I should mention that I had to create classes for the first time to accomplish this, since I created “Coauthor” and “Paper” objects, so that was fun.)
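A minimal sketch of that "reuse the coauthor if they already exist" logic might look like the following. The attribute names and the `record_paper` helper are illustrative assumptions, not necessarily what is in my repository.

```python
class Paper:
    """A single paper, identified here only by its title."""

    def __init__(self, title):
        self.title = title


class Coauthor:
    """A coauthor and the papers they share with the main author."""

    def __init__(self, name):
        self.name = name
        self.papers = []


def record_paper(coauthors, name, paper):
    """Attach `paper` to the coauthor called `name`, creating them if needed.

    `coauthors` is a dict mapping a name to its Coauthor object, so looking
    up an existing coauthor is a single dictionary check.
    """
    if name not in coauthors:
        coauthors[name] = Coauthor(name)
    coauthors[name].papers.append(paper)
    return coauthors[name]


coauthors = {}
record_paper(coauthors, "James Ma", Paper("First paper"))
record_paper(coauthors, "James Ma", Paper("Second paper"))
record_paper(coauthors, "Helen Young", Paper("Second paper"))
print({name: len(person.papers) for name, person in coauthors.items()})
```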
However, as is often true in programming, it wasn’t that easy. The biggest issue was that Google Scholar is inconsistent in their labeling of authors. For a given author “Paul Benjamin Lowry”, Google Scholar could show the name as “Paul Benjamin Lowry” or “PB Lowry”. This meant that I had to write code that would convert between initials and full names. This introduced the uncertainty where the computer could assume that a full name of “Paul Benjamin Lowry” is the same author as “PB Lowry”, but in reality, it could also belong to “Peter Brian Lowry”. I doubt this would be an issue in most scenarios that this code would be applied in, but it could cause issues, for example, if I was aiming to create a social network diagram for the entire academic world (Dream big right?).
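Here is a rough sketch of the initials matching, under the assumption that an abbreviated name is always "initials, then surname" (like “PB Lowry”). The function names are made up for this example, and it deliberately shows the ambiguity described above.

```python
def is_abbreviated(name):
    """Guess whether the given names are initials, as in "PB Lowry"."""
    parts = name.split()
    return len(parts) == 2 and parts[0].isupper()


def to_initials(name):
    """Turn "Paul Benjamin Lowry" into "PB Lowry"; leave "PB Lowry" alone."""
    if is_abbreviated(name):
        return name
    *given, surname = name.split()
    if not given:
        return surname
    return "".join(part[0].upper() for part in given) + " " + surname


def same_author(a, b):
    """Treat two names as one author if they agree once abbreviated.

    Initials are only compared when at least one name is abbreviated, so
    two different full names are never merged by this function.
    """
    if is_abbreviated(a) or is_abbreviated(b):
        return to_initials(a) == to_initials(b)
    return a == b


print(same_author("Paul Benjamin Lowry", "PB Lowry"))
print(same_author("Peter Brian Lowry", "PB Lowry"))  # the ambiguous case
```

Both calls report a match, which is exactly the Paul-versus-Peter problem: once a name is reduced to initials, there is no way to tell them apart.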
So that was pretty annoying but manageable. Much to my dismay, things were further complicated when people’s middle names would spontaneously appear and disappear depending on the paper. Thus, I also had to make another assumption that “Harry George Styles” and “Harry Styles” are the same person. This would create some ambiguity as to whether “Harry George Styles” and “Harry Prudence Styles” are the same person in my code. But again, hopefully this would only blow things up when creating really large-scale networks.
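The middle-name assumption boils down to comparing only the first and last words of each name. A minimal sketch, with hypothetical function names, assuming names are plain space-separated strings:

```python
def first_and_last(name):
    """Return the (first name, surname) pair, ignoring anything in between."""
    parts = name.split()
    return parts[0].casefold(), parts[-1].casefold()


def same_ignoring_middle(a, b):
    """Treat two names as the same person if first name and surname agree."""
    return first_and_last(a) == first_and_last(b)


print(same_ignoring_middle("Harry George Styles", "Harry Styles"))
print(same_ignoring_middle("Harry George Styles", "Harry Prudence Styles"))
```

The second call also reports a match, which is the George-versus-Prudence ambiguity mentioned above.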
And of course there were even more issues with just the way the authors’ names were being presented **sigh**. My grandparents thought it was a good idea to make my mother’s middle name not “Rene” but “René”. Moreover, Google Scholar had her down on different papers as “Michelle Rene Lowry” and “Michelle René Lowry” (Where’s the quality control, am I right?). So then I had to do some magic-wand waving involving Unicode transformations. Luckily for me, some poor chap had already created a module for converting between pretentious letters and their humbler counterparts.
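For the curious, the same effect can be had with Python's built-in `unicodedata` module. This is a stand-in sketch rather than the module I actually used, and `strip_accents` is a name made up for the example. The idea is to split each accented letter into a base letter plus a separate accent mark, then throw the accent marks away.

```python
import unicodedata


def strip_accents(text):
    """Replace accented letters with their plain counterparts."""
    # NFKD splits "é" into "e" followed by a combining acute accent.
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep everything except the combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("Michelle René Lowry"))
```

Running every name through a function like this before comparing them means “Rene” and “René” end up as the same string.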
Digression: At some point in this process, I got a weird error where all of the content I was taking from Google Scholar changed. All of a sudden there were no coauthors and no links. Puzzled, I printed out the raw content of the website I was reading. It basically had changed from the normal Google Scholar page to a page saying something like “We have noticed unusual amounts of activity on this webpage from your network. This page is to prevent robots from accessing the information on this page, which is against our ‘Terms of Service’. Please complete the RECAPTCHA to proceed.” Obviously, I couldn’t complete this with Python, so I was temporarily convinced I couldn’t complete the project. However, I realized that perhaps using a VPN would make Google Scholar think I was coming from a different location. It worked, and so for the rest of the project I used my VPN. Eventually, Google Scholar would catch on again, and then I would simply change locations. I felt very mischievous and hacker-esque. Anyway, back to the narrative.
So now I had officially spent a day or two just messing with the Python code to get a correct array of my mother’s coauthors. Once that was done, I needed to find a way to communicate with iGraph, which my mother had used with R previously to do social-networking graphs. There is a Python version, but the documentation is a bit confusing, and it seems to be run by one guy on the internet (Tamas, kudos to you). So I spent quite a bit of time trying to figure out how to give my data to the module.
After I consulted my mother again, she said I needed to create an “Edge List”. What this essentially means is a file where each line names two vertices of the graph that are supposed to be connected. So if “James Ma”, “Helen Young”, and “Bert Jones” all worked on a paper together, the edge list for their network would look like this:
JamesMa HelenYoung
HelenYoung BertJones
BertJones JamesMa
So I ended up creating a function that would loop through my list of coauthors, looking for coauthors who shared a paper. Whenever it found such a pair, it added them to a .txt file.
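One way to sketch that step is shown below. It is organized a little differently from my description (it starts from a hypothetical dictionary of paper titles mapped to author lists and uses `itertools.combinations` to pair people up), and the function names are made up for the example, but the file it writes has the same shape as the edge list above.

```python
from itertools import combinations


def build_edges(papers):
    """Return a sorted list of (author, author) pairs who share a paper.

    `papers` maps each paper title to the list of its authors' names.
    Spaces are removed from names so each one is a single token per line.
    """
    edges = set()
    for authors in papers.values():
        labels = sorted({name.replace(" ", "") for name in authors})
        # Every pair of authors on the same paper becomes one edge.
        edges.update(combinations(labels, 2))
    return sorted(edges)


def write_edge_list(edges, path):
    """Write one "left right" pair per line to a text file."""
    with open(path, "w", encoding="utf-8") as handle:
        for left, right in edges:
            handle.write(f"{left} {right}\n")


papers = {"Some paper": ["James Ma", "Helen Young", "Bert Jones"]}
for left, right in build_edges(papers):
    print(left, right)
```

Sorting the names within each paper means a pair always comes out in the same order, so two people who share several papers still produce only one edge in the set.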
After this was all done, I was able to pass it onto the iGraph module. Great! Hooray! I’m done, right? Right??
Nope. It turns out that making the graph show itself was another beast. iGraph was the module for creating the graph, but apparently, I had to use another module named “Cairo” to show the graph. It turns out that installing Cairo is harder than you would think, and it took me a good 30 minutes of aimlessly shooting different lines of code into my Terminal in fits of passion and rage to get it right. Eventually, I learned that I had to import “cairocffi” instead of “Cairo”, which would have been very helpful if it had been on the instruction page…
So with my fingers crossed, I ran my code again praying to the Python gods for a graph to be created. Much to my excitement, a graph popped up! After consulting with my mother, and making some visual changes, the finished product is below, with “Me” being my mother, and each of the nodes being her coauthors.
I have uploaded my code onto GitHub here. I probably wouldn’t have normally, but I feel that this was an emotional investment and so it deserves a repository.
If I have time, I want to make it so that it can view multiple pages of content on Google Scholar (my mom only had enough content to fit one page). This would allow me to do the same process for my dad, who has hundreds of papers, creating a much more daunting graph.
Furthermore, I would like to add a feature where it could go one step deeper. This would mean I would not only be doing this process for my mom or my dad, but for all of their coauthors as well, and then combining all of the information into one graph. Yeah. That would be crazy. I don’t know if I have enough zeal to do that, especially considering that running it once would probably take a few hours…
Anyway, I learned a ton through this project, just by going out and trying to accomplish a task. It was a lot of fun and I look forward to my next whimsical Python project.