Almost everyone in the computer science field knows about GSoC (Google Summer of Code), or has at least heard about it from a geeky friend. It was started in 2005 by the founders of Google, Larry Page and Sergey Brin. It’s an annual program for university students (over 18 years old) worldwide, in which Google pays a stipend to students who successfully complete a free and open-source software coding project over the summer with one of the host organizations registered with the GSoC program.
So, today we are going to write a web scraping program in Python that classifies the organizations registered with GSoC on the basis of the technologies they work with.
The first thing is to open your favorite code editor and create a new Python file. I am gonna name it gsoc-tech-classifier.py.
We will now import the required Python libraries for this script.
The requests module lets us send HTTP requests and gives us back a response object.
The openpyxl module is used to read and write the collected data in an Excel file.
bs4 (beautifulsoup4) is a module for pulling data out of HTML or XML files, which makes web scraping much easier.
The os module provides utilities for interacting with the operating system.
The argparse module is used to read the command-line arguments passed to a Python script.
Because the script is going to make repeated HTTP requests, the server may start blocking them. To reduce that risk we can use a VPN, or we can use the fake_useragent module to send a randomly chosen User-Agent header with our requests.
Now that we know what these libraries are used for, we are going to set up a UserAgent object that gives us a random user-agent string to send with our requests.
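A minimal sketch of this setup could look like the following. It assumes the fake_useragent package is installed (pip install fake-useragent). The fallback string and the headers name are my own additions for illustration, so that the snippet still runs when the package is not available.

```python
# Sketch only: build request headers with a random User-Agent string.
try:
    from fake_useragent import UserAgent

    user_agent = UserAgent().random  # a randomly picked browser string
except Exception:
    # Assumption: fall back to a fixed string if the package is missing
    # or cannot load its browser data.
    user_agent = "Mozilla/5.0 (X11; Linux x86_64) gsoc-tech-classifier"

# This dictionary can later be passed to requests.get(..., headers=headers)
headers = {"User-Agent": user_agent}
print(headers)
```

Keeping the string in a headers dictionary means every request in the script can reuse it.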
Next, we take the technology for which the scraping is to be done as a command-line argument from the user and save it in a variable.
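Here is one way this step might be written with argparse. The function name read_technology, the argument name technology and the help texts are assumptions of mine, not taken from the original script. Wrapping the parser in a function just makes it easy to try out without a real command line.

```python
import argparse


def read_technology(argv=None):
    """Return the technology name given on the command line."""
    parser = argparse.ArgumentParser(
        description="Classify GSoC organizations by technology."
    )
    parser.add_argument("technology", help="technology to look for, e.g. python")
    args = parser.parse_args(argv)  # argv=None means: read sys.argv[1:]
    return args.technology


# Passing a list stands in for running:
#   python gsoc-tech-classifier.py python
tech = read_technology(["python"])
print(tech)
```

In the real script you would call read_technology() without arguments so that it reads whatever the user typed.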
We will use the GSoC archive page as the target. I am going to use the URL of the 2020 archive, but you can easily change it to any other year.
Now we will make an HTTP request to that URL and save the result in a response object, which will hold all the data returned by the GSoC organizations page.
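As a rough sketch, the request could be made like this. The URL pattern is an assumption based on how the GSoC archive was laid out for the 2020 program and may no longer match the live site. The helper names archive_url and fetch are hypothetical.

```python
import requests

# Assumption: base URL and archive path as used by the 2020 archive.
GSOC_URL = "https://summerofcode.withgoogle.com"


def archive_url(year):
    """Build the organizations archive URL for a given year."""
    return f"{GSOC_URL}/archive/{year}/organizations/"


def fetch(url, headers):
    """Download a page and return the response object."""
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()  # stop early on 4xx/5xx answers
    return response


url = archive_url(2020)
print(url)
# The actual download needs network access, so it is left commented out:
# response = fetch(url, {"User-Agent": "some user-agent string"})
```

Changing the year is then a matter of calling archive_url with a different number.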
We will now parse the HTML in the response object using the bs4 module and store the result in soup. From it we find all the HTML tags containing the names of the organizations and store them in a list named organizations. Then we find all the links to the organizations and store them in all_org_Link.
We will also create two lists: tech_Status, which stores the status of the technology (available or not) for each organization, and org_Tech_URL, which stores the link of each organization combined with the GSoC URL.
The selectors passed as arguments to the soup.select() and soup.find_all() functions can be found by inspecting the webpage with your browser's developer tools (Ctrl+Shift+I).
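To show the idea without hitting the real site, the sketch below parses a small hand-written HTML sample instead of response.content. The class names organization-card__name and organization-card__link are assumptions for illustration. Check the actual page in the developer tools and adjust them.

```python
from bs4 import BeautifulSoup

# Stand-in for response.content; the markup and class names are assumed.
sample_html = """
<ul>
  <li><a class="organization-card__link" href="/archive/2020/organizations/1/">
    <h4 class="organization-card__name">Org One</h4></a></li>
  <li><a class="organization-card__link" href="/archive/2020/organizations/2/">
    <h4 class="organization-card__name">Org Two</h4></a></li>
</ul>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Tags that hold the organization names (CSS selector)
organizations = soup.select("h4.organization-card__name")
# Tags that hold the links to the organization pages
all_org_Link = soup.find_all("a", class_="organization-card__link")

tech_Status = []   # filled later: is the technology offered or not
org_Tech_URL = []  # filled later: full URL of each organization page

for org, link in zip(organizations, all_org_Link):
    print(org.get_text(strip=True), link["href"])
```

Both calls return a list of tags, so the name and the link of the same organization sit at the same index.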
Now we will visit each organization’s page and find the technologies it uses. The variable tech_index gives the position in the technology status list at which a true value is to be recorded.
A for loop runs over all the organization links. On each iteration the organization's link is saved in the variable comp_Link, concatenated with the base GSoC URL, and saved in the variable comp_url.
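The loop itself can be sketched in a few lines. The base URL and the two relative links are made-up stand-ins for the values that would be read from all_org_Link. The list all_comp_urls is only there so we can look at the result.

```python
GSOC_URL = "https://summerofcode.withgoogle.com"  # assumed base URL

# Hypothetical hrefs, as they might be read from the link tags
org_links = [
    "/archive/2020/organizations/1/",
    "/archive/2020/organizations/2/",
]

all_comp_urls = []
for comp_Link in org_links:
    comp_url = GSOC_URL + comp_Link  # relative link -> full URL
    all_comp_urls.append(comp_url)

print(all_comp_urls)
```

Plain string concatenation works here because the links start with a slash and the base URL does not end with one.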
This combined organization URL is appended to the list org_Tech_URL. Another HTTP request is made for each of these URLs, and another soup object is used to parse the response, find all the technologies, and store them in comp_Tech for each organization individually.
A nested for loop then goes over all the technologies an organization offers and checks whether the requested technology is among them.
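The technology check could look something like this. Again I am using a hand-written HTML sample in place of a downloaded organization page, and the class name organization__tag--technology is an assumption. The inner loop is wrapped in a helper called offers_technology (my own name), and it compares names case-insensitively, which is a choice of this sketch.

```python
from bs4 import BeautifulSoup

# Stand-in for one organization's page; the class name is assumed.
sample_org_html = """
<ul>
  <li class="organization__tag--technology">python</li>
  <li class="organization__tag--technology">javascript</li>
</ul>
"""


def offers_technology(page_html, wanted):
    """Return True if the wanted technology is listed on the page."""
    soup = BeautifulSoup(page_html, "html.parser")
    comp_Tech = soup.find_all("li", class_="organization__tag--technology")
    for tag in comp_Tech:
        # Exact match on the tag text, ignoring upper/lower case
        if tag.get_text(strip=True).lower() == wanted.lower():
            return True
    return False


tech_Status = []
for page in [sample_org_html]:  # one page per organization
    tech_Status.append("Yes" if offers_technology(page, "Python") else "No")

print(tech_Status)
```

In the full script the pages would come from the requests made for each URL in org_Tech_URL.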
To save all the required data (name of the organization, status for the given technology, link to the organization) in a spreadsheet, we will use the openpyxl library: we create a workbook and write the data into a sheet of this workbook.
First we write the column names to the spreadsheet. Then a loop over all the organizations writes their data to the sheet. Before saving, we make sure that the script does not fail because of an already existing file, and finally we save the spreadsheet to the local disk.
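A small sketch of the spreadsheet step is shown below. The three input lists hold made-up values in the shape the earlier steps would produce, and the column titles and the file name gsoc-orgs.xlsx are assumptions. The file is written to the temporary folder so the example does not touch your own files.

```python
import os
import tempfile

from openpyxl import Workbook, load_workbook

# Hypothetical results of the scraping steps
names = ["Org One", "Org Two"]
tech_Status = ["Yes", "No"]
org_Tech_URL = ["https://example.org/1/", "https://example.org/2/"]

wb = Workbook()
sheet = wb.active

# First row: column names
sheet.append(["Organization", "Technology offered", "Link"])

# One row per organization
for row in zip(names, tech_Status, org_Tech_URL):
    sheet.append(list(row))

out_path = os.path.join(tempfile.gettempdir(), "gsoc-orgs.xlsx")
if os.path.exists(out_path):
    os.remove(out_path)  # drop a file left over from an earlier run
wb.save(out_path)

# Read the file back to confirm that it was written
saved_sheet = load_workbook(out_path).active
print(saved_sheet.max_row, "rows written to", out_path)
```

Removing an old file first is just one way to handle the already-existing-file case. You could also pick a new file name for each run.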
That’s all, we’re done. Simple enough, right?😉
The code for this script is available on my GitHub account for reference.
Classifies organizations in association with Google Summer of Code based on specific technology
ScreenShot of Spreadsheet