Web scraping is a technique used to extract data from websites. It allows us to gather information from web pages and use it for various purposes, such as data analysis, research, or building applications.
In this article, we will explore a Python project called “GitHub Topics Scraper,” which leverages web scraping to extract information from the GitHub topics page and retrieve repository names and details for each topic.
GitHub is a widely popular platform for hosting and collaborating on code repositories. It offers a feature called “topics” that allows users to categorize repositories based on specific subjects or themes. The GitHub Topics Scraper project automates the process of scraping these topics and retrieving relevant repository information.
The GitHub Topics Scraper is implemented using Python and utilizes the following libraries:
- `requests`: used for making HTTP requests to retrieve the HTML content of web pages.
- `BeautifulSoup`: a powerful library for parsing HTML and extracting data from it.
- `pandas`: a versatile library for data manipulation and analysis, used here to organize the scraped data into a structured format.
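If these libraries aren't installed yet, they can be pulled in with pip (package names assumed from PyPI; note that `BeautifulSoup` ships as `beautifulsoup4`):

```shell
pip install requests beautifulsoup4 pandas
```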
Let’s dive into the code and understand how each component of the project works.
```python
import requests
from bs4 import BeautifulSoup
import pandas as pd
```
The above code snippet imports the three libraries the project depends on: `requests`, `BeautifulSoup`, and `pandas`.
```python
def topic_page_authentication(url):
    topics_url = url
    response = requests.get(topics_url)
    page_content = response.text
    doc = BeautifulSoup(page_content, 'html.parser')
    return doc
```
Defines a function called `topic_page_authentication` that takes a URL as an argument.
Here’s a breakdown of what the code does:
1. `topics_url = url`: assigns the provided URL to the variable `topics_url`. This URL represents the web page whose content we want to retrieve.
2. `response = requests.get(topics_url)`: uses the `requests.get()` function to send an HTTP GET request to `topics_url` and stores the response in the `response` variable. This request fetches the HTML content of the web page.
3. `page_content = response.text`: extracts the HTML content from the response object and assigns it to the `page_content` variable. The `response.text` attribute holds the text content of the response.
4. `doc = BeautifulSoup(page_content, 'html.parser')`: creates a BeautifulSoup object called `doc` by parsing `page_content` with the `'html.parser'` parser. This lets us navigate and extract information from the HTML structure of the page.
5. `return doc`: returns the BeautifulSoup object `doc`, so calling `topic_page_authentication` yields the parsed HTML content of the page.
The purpose of this function is to retrieve and parse the HTML content of the web page at the provided URL. Despite its name, it performs no actual authentication: it uses the `requests` library to send an HTTP GET request, reads the response content, and parses it with BeautifulSoup into a navigable object representing the HTML structure.
Please note that this function only handles fetching and parsing; it doesn't perform any specific scraping or data extraction tasks.
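To see what the function returns without hitting the network, here is a minimal offline sketch that parses a hard-coded HTML snippet the same way (the snippet itself is made up):

```python
from bs4 import BeautifulSoup

# Stand-in for the HTML that requests.get(url).text would return
page_content = '<html><body><p class="f3">python</p></body></html>'

# The same parsing step used inside topic_page_authentication
doc = BeautifulSoup(page_content, 'html.parser')
print(doc.find('p').text)  # -> python
```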
```python
def topicSraper(doc):
    # Extract titles
    title_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class': title_class})

    # Extract descriptions
    description_class = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags = doc.find_all('p', {'class': description_class})

    # Extract links
    link_class = 'no-underline flex-1 d-flex flex-column'
    topic_link_tags = doc.find_all('a', {'class': link_class})

    # Collect all the topic names
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)

    # Collect the description text of each topic
    topic_description = []
    for tag in topic_desc_tags:
        topic_description.append(tag.text.strip())

    # Build the full URL of each topic
    topic_urls = []
    base_url = "https://github.com"
    for tags in topic_link_tags:
        topic_urls.append(base_url + tags['href'])

    topics_dict = {
        'Title': topic_titles,
        'Description': topic_description,
        'URL': topic_urls
    }

    topics_df = pd.DataFrame(topics_dict)
    return topics_df
```
Defines a function called `topicSraper` that takes a BeautifulSoup object (`doc`) as an argument.
Here’s a breakdown of what the code does:
1. `title_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'`: defines the CSS class name (`title_class`) of the HTML element that contains the topic titles on the web page.
2. `topic_title_tags = doc.find_all('p', {'class': title_class})`: uses the `find_all()` method of the BeautifulSoup object to find all `<p>` elements with the specified CSS class, returning a list of BeautifulSoup Tag objects representing the topic title tags.
3. `description_class = 'f5 color-fg-muted mb-0 mt-1'`: defines the CSS class name (`description_class`) of the element that contains the topic descriptions.
4. `topic_desc_tags = doc.find_all('p', {'class': description_class})`: finds all `<p>` elements with that class, returning the topic description tags.
5. `link_class = 'no-underline flex-1 d-flex flex-column'`: defines the CSS class name (`link_class`) of the element that contains the topic links.
6. `topic_link_tags = doc.find_all('a', {'class': link_class})`: finds all `<a>` elements with that class, returning the topic link tags.
7. `topic_titles = []`: initializes an empty list to store the extracted topic titles.
8. `for tag in topic_title_tags: ...`: iterates over the `topic_title_tags` list and appends the text content of each tag to the `topic_titles` list.
9. `topic_description = []`: initializes an empty list to store the extracted topic descriptions.
10. `for tag in topic_desc_tags: ...`: iterates over the `topic_desc_tags` list and appends the stripped text content of each tag to the `topic_description` list.
11. `topic_urls = []`: initializes an empty list to store the extracted topic URLs.
12. `base_url = "https://github.com"`: defines the base URL of the website.
13. `for tags in topic_link_tags: ...`: iterates over the `topic_link_tags` list and appends the concatenated URL (base URL + `href` attribute) of each tag to the `topic_urls` list.
14. `topics_dict = {...}`: builds a dictionary (`topics_dict`) holding the extracted data: topic titles, descriptions, and URLs.
15. `topics_df = pd.DataFrame(topics_dict)`: converts the `topics_dict` dictionary into a pandas DataFrame, where each key becomes a column.
16. `return topics_df`: returns the pandas DataFrame containing the extracted data.
The purpose of this function is to scrape and extract information from the provided BeautifulSoup object (`doc`). It retrieves the topic titles, descriptions, and URLs from specific HTML elements on the web page and stores them in a pandas DataFrame for further analysis or processing.
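As a quick offline check of the selector logic, here is a sketch that runs the same three extraction steps over a hand-written fragment reusing the page's class names (the fragment and its values are invented; the live page is of course much larger):

```python
import pandas as pd
from bs4 import BeautifulSoup

# Invented fragment with the same class attributes topicSraper targets
html = '''
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-fg-muted mb-0 mt-1"> 3D modeling and rendering. </p>
<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d"></a>
'''
doc = BeautifulSoup(html, 'html.parser')

# The same three selectors used in topicSraper
titles = [t.text for t in doc.find_all('p', {'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})]
descriptions = [t.text.strip() for t in doc.find_all('p', {'class': 'f5 color-fg-muted mb-0 mt-1'})]
urls = ['https://github.com' + a['href'] for a in doc.find_all('a', {'class': 'no-underline flex-1 d-flex flex-column'})]

df = pd.DataFrame({'Title': titles, 'Description': descriptions, 'URL': urls})
print(df.iloc[0]['URL'])  # -> https://github.com/topics/3d
```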
```python
def topic_url_extractor(dataframe):
    url_lst = []
    for i in range(len(dataframe)):
        topic_url = dataframe['URL'][i]
        url_lst.append(topic_url)
    return url_lst
```
Defines a function called `topic_url_extractor` that takes a pandas DataFrame (`dataframe`) as an argument.
Here’s a breakdown of what the code does:
1. `url_lst = []`: initializes an empty list (`url_lst`) to store the extracted URLs.
2. `for i in range(len(dataframe)): ...`: iterates over the row indices of the DataFrame.
3. `topic_url = dataframe['URL'][i]`: retrieves the value of the 'URL' column for the current row index (`i`).
4. `url_lst.append(topic_url)`: appends the retrieved URL to the `url_lst` list.
5. `return url_lst`: returns the `url_lst` list containing the extracted URLs.
The purpose of this function is to extract the URLs from the ‘URL’ column of the provided DataFrame.
It iterates over each row of the DataFrame, retrieves the URL value for each row, and adds it to a list. Finally, the function returns the list of extracted URLs.
This function can be useful when you want to extract the URLs from a DataFrame for further processing or analysis, such as visiting each URL or performing additional web scraping on the individual web pages.
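For reference, pandas can produce the same list in a single call; a small sketch with made-up rows:

```python
import pandas as pd

# Hypothetical two-row frame shaped like topicSraper's output
df = pd.DataFrame({'Title': ['3D', 'Ajax'],
                   'URL': ['https://github.com/topics/3d',
                           'https://github.com/topics/ajax']})

urls = df['URL'].tolist()  # same result as topic_url_extractor(df)
print(urls)
```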
```python
def parse_star_count(stars_str):
    stars_str = stars_str.strip()[6:]
    if stars_str[-1] == 'k':
        stars_str = float(stars_str[:-1]) * 1000
    return int(stars_str)
```
Defines a function called `parse_star_count` that takes a string (`stars_str`) as an argument.
Here’s a breakdown of what the code does:
1. `stars_str = stars_str.strip()[6:]`: removes leading and trailing whitespace from the `stars_str` string using the `strip()` method, then slices off the first six characters (the leading label) and assigns the result back to `stars_str`. The purpose of this line is to remove unwanted characters from the string.
2. `if stars_str[-1] == 'k': ...`: checks whether the last character of `stars_str` is 'k', indicating that the star count is in thousands.
3. `stars_str = float(stars_str[:-1]) * 1000`: converts the numeric part of the string (excluding the 'k') to a float and multiplies it by 1000 to get the actual star count.
4. `return int(stars_str)`: converts `stars_str` to an integer and returns it.
The purpose of this function is to parse and convert the star count from a string representation to an integer value. It handles cases where the star count is in thousands (‘k’) by multiplying the numeric part of the string by 1000. The function returns the parsed star count as an integer.
This function can be useful when you have star counts represented as strings, such as ‘1.2k’ for 1,200 stars, and you need to convert them to numerical values for further analysis or processing.
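A quick sanity check of the function (the exact label text GitHub renders is an assumption here; the function blindly drops the first six characters, whatever they are):

```python
def parse_star_count(stars_str):
    # Drop whitespace, then the six-character leading label (e.g. 'Star  ')
    stars_str = stars_str.strip()[6:]
    if stars_str[-1] == 'k':
        stars_str = float(stars_str[:-1]) * 1000
    return int(stars_str)

print(parse_star_count('Star  1.2k'))  # -> 1200
print(parse_star_count('Star  856'))   # -> 856
```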
```python
def get_repo_info(h3_tags, star_tag):
    base_url = 'https://github.com'
    a_tags = h3_tags.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url
```
Defines a function called `get_repo_info` that takes two arguments: `h3_tags` and `star_tag`.
Here’s a breakdown of what the code does:
1. `base_url = 'https://github.com'`: defines the base URL of the GitHub website.
2. `a_tags = h3_tags.find_all('a')`: uses the `find_all()` method of the `h3_tags` object to find all `<a>` elements within it, returning a list of BeautifulSoup Tag objects representing the anchor tags.
3. `username = a_tags[0].text.strip()`: extracts the text content of the first anchor tag (`a_tags[0]`) and assigns it to the `username` variable, removing any leading or trailing whitespace with `strip()`.
4. `repo_name = a_tags[1].text.strip()`: extracts the text content of the second anchor tag (`a_tags[1]`) and assigns it to the `repo_name` variable, again stripping surrounding whitespace.
5. `repo_url = base_url + a_tags[1]['href']`: retrieves the `href` attribute from the second anchor tag and concatenates it with `base_url` to form the complete URL of the repository, assigned to `repo_url`.
6. `stars = parse_star_count(star_tag.text.strip())`: extracts the text content of the `star_tag` object, strips surrounding whitespace, and passes it to the `parse_star_count` function, which returns the parsed star count as an integer assigned to `stars`.
7. `return username, repo_name, stars, repo_url`: returns a tuple containing the extracted information: `username`, `repo_name`, `stars`, and `repo_url`.
The purpose of this function is to extract information about a GitHub repository from the provided `h3_tags` and `star_tag` objects. It retrieves the username, repository name, star count, and repository URL by navigating specific elements in the HTML structure, and returns this information as a tuple.
This function can be useful when you want to extract repository information from a web page that contains a list of repositories, such as when scraping GitHub topics.
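Here is an offline sketch exercising `get_repo_info` on invented markup shaped like one repository entry (the repository names, the `star` class, and the 'Star  1.2k' label text are all assumptions for illustration):

```python
from bs4 import BeautifulSoup

def parse_star_count(stars_str):
    # Drop whitespace and the six-character leading label
    stars_str = stars_str.strip()[6:]
    if stars_str[-1] == 'k':
        stars_str = float(stars_str[:-1]) * 1000
    return int(stars_str)

def get_repo_info(h3_tags, star_tag):
    base_url = 'https://github.com'
    a_tags = h3_tags.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url

# Invented markup shaped like one repository entry on a topic page
html = '''
<h3><a href="/mrdoob">mrdoob</a><a href="/mrdoob/three.js">three.js</a></h3>
<a class="star">Star  1.2k</a>
'''
doc = BeautifulSoup(html, 'html.parser')
info = get_repo_info(doc.find('h3'), doc.find('a', {'class': 'star'}))
print(info)  # -> ('mrdoob', 'three.js', 1200, 'https://github.com/mrdoob/three.js')
```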
```python
def topic_information_scraper(topic_url):
    # page authentication
    topic_doc = topic_page_authentication(topic_url)

    # extract names
    h3_class = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3', {'class': h3_class})

    # get star tags
    star_class = 'tooltipped tooltipped-s btn-sm btn BtnGroup-item color-bg-default'
    star_tags = topic_doc.find_all('a', {'class': star_class})

    # get information about the topic
    topic_repos_dict = {
        'username': [],
        'repo_name': [],
        'stars': [],
        'repo_url': []
    }

    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])

    return pd.DataFrame(topic_repos_dict)
```
Defines a function called `topic_information_scraper` that takes a `topic_url` as an argument.
Here’s a breakdown of what the code does:
1. `topic_doc = topic_page_authentication(topic_url)`: calls the `topic_page_authentication` function to fetch and parse the HTML content of the `topic_url`. The parsed HTML is assigned to the `topic_doc` variable.
2. `h3_class = 'f3 color-fg-muted text-normal lh-condensed'`: defines the CSS class name (`h3_class`) of the HTML element that contains the repository names on the topic page.
3. `repo_tags = topic_doc.find_all('h3', {'class': h3_class})`: uses the `find_all()` method of the `topic_doc` object to find all `<h3>` elements with that class, returning a list of Tag objects representing the repository name tags.
4. `star_class = 'tooltipped tooltipped-s btn-sm btn BtnGroup-item color-bg-default'`: defines the CSS class name (`star_class`) of the element that contains the star counts.
5. `star_tags = topic_doc.find_all('a', {'class': star_class})`: finds all `<a>` elements with that class, returning the star count tags.
6. `topic_repos_dict = {...}`: creates a dictionary (`topic_repos_dict`) that will store the extracted repository information: username, repository name, star count, and repository URL.
7. `for i in range(len(repo_tags)): ...`: iterates over the indices of the `repo_tags` list, assuming it has the same length as the `star_tags` list.
8. `repo_info = get_repo_info(repo_tags[i], star_tags[i])`: calls the `get_repo_info` function with the current repository name tag (`repo_tags[i]`) and star count tag (`star_tags[i]`); the returned tuple is assigned to `repo_info`.
9. `topic_repos_dict['username'].append(repo_info[0])`: appends the extracted username from `repo_info` to the 'username' list in `topic_repos_dict`.
10. `topic_repos_dict['repo_name'].append(repo_info[1])`: appends the extracted repository name from `repo_info` to the 'repo_name' list.
11. `topic_repos_dict['stars'].append(repo_info[2])`: appends the extracted star count from `repo_info` to the 'stars' list.
12. `topic_repos_dict['repo_url'].append(repo_info[3])`: appends the extracted repository URL from `repo_info` to the 'repo_url' list.
13. `return pd.DataFrame(topic_repos_dict)`: converts the `topic_repos_dict` dictionary into a pandas DataFrame, where each key becomes a column. The resulting DataFrame contains the extracted repository information.
The purpose of this function is to scrape and extract information about the repositories within a specific topic on GitHub. It authenticates and retrieves the HTML content of the topic page, then extracts the repository names and star counts using specific CSS class names.
It calls the `get_repo_info` function for each repository to retrieve the username, repository name, star count, and repository URL. The extracted information is stored in a dictionary and then converted into a pandas DataFrame, which the function returns.
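One caveat: the index-based loop assumes `repo_tags` and `star_tags` always have the same length; if one selector misses, `star_tags[i]` raises an `IndexError`. Pairing the lists with `zip()` (sketched here with placeholder strings instead of real tags) simply stops at the shorter list instead:

```python
# Placeholder stand-ins for the tag lists the scraper builds
repo_tags = ['h3-a', 'h3-b', 'h3-c']
star_tags = ['star-a', 'star-b']  # one short, as can happen if a class changes

# zip() pairs elements positionally and stops at the shorter list
pairs = list(zip(repo_tags, star_tags))
print(pairs)  # -> [('h3-a', 'star-a'), ('h3-b', 'star-b')]
```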
```python
if __name__ == "__main__":
    url = 'https://github.com/topics'
    topic_dataframe = topicSraper(topic_page_authentication(url))
    topic_dataframe.to_csv('GitHubtopics.csv', index=None)

    # Make other CSV files according to the topics
    url = topic_url_extractor(topic_dataframe)
    name = topic_dataframe['Title']
    for i in range(len(topic_dataframe)):
        new_df = topic_information_scraper(url[i])
        new_df.to_csv(f'GitHubTopic_CSV-Files/{name[i]}.csv', index=None)
```
The code snippet demonstrates the main execution flow of the script.
Here’s a breakdown of what the code does:
1. `if __name__ == "__main__":`: checks whether the script is being run directly (not imported as a module).
2. `url = 'https://github.com/topics'`: defines the URL of the GitHub topics page.
3. `topic_dataframe = topicSraper(topic_page_authentication(url))`: fetches and parses the topic page's HTML with `topic_page_authentication`, then passes the parsed HTML (`doc`) to the `topicSraper` function. The resulting DataFrame is assigned to `topic_dataframe`.
4. `topic_dataframe.to_csv('GitHubtopics.csv', index=None)`: exports the `topic_dataframe` DataFrame to a CSV file named 'GitHubtopics.csv'. The `index=None` argument ensures that row indices are not written to the file.
5. `url = topic_url_extractor(topic_dataframe)`: calls the `topic_url_extractor` function with `topic_dataframe`, retrieving a list of URLs (`url`) extracted from the DataFrame.
6. `name = topic_dataframe['Title']`: retrieves the 'Title' column from `topic_dataframe` and assigns it to the `name` variable.
7. `for i in range(len(topic_dataframe)): ...`: iterates over the row indices of the `topic_dataframe` DataFrame.
8. `new_df = topic_information_scraper(url[i])`: calls the `topic_information_scraper` function with the topic URL (`url[i]`), retrieving repository information for that topic into the `new_df` DataFrame.
9. `new_df.to_csv(f'GitHubTopic_CSV-Files/{name[i]}.csv', index=None)`: exports the `new_df` DataFrame to a CSV file whose name is built with an f-string from the topic name (`name[i]`). Again, `index=None` keeps row indices out of the file.
The purpose of this script is to scrape and extract information from the GitHub topics page and create CSV files containing the extracted data. It first scrapes the main topics page, saves the extracted information in ‘GitHubtopics.csv’, and then proceeds to scrape individual topic pages using the extracted URLs.
For each topic, it creates a new CSV file named after the topic and saves the repository information in it.
This script can be executed directly to perform the scraping and generate the desired CSV files.
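One practical caveat: pandas' `to_csv` will not create the 'GitHubTopic_CSV-Files' directory, so the per-topic loop fails if it doesn't already exist. A one-line guard before the loop avoids that:

```python
import os

# Ensure the output directory exists before writing the per-topic CSVs;
# exist_ok=True makes this a no-op if the folder is already there.
os.makedirs('GitHubTopic_CSV-Files', exist_ok=True)
```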
```python
url = 'https://github.com/topics'
topic_dataframe = topicSraper(topic_page_authentication(url))
topic_dataframe.to_csv('GitHubtopics.csv', index=None)
```
Once this code runs, it generates a CSV file named 'GitHubtopics.csv' covering all the topic names, their descriptions, and their URLs.
```python
url = topic_url_extractor(topic_dataframe)
name = topic_dataframe['Title']
for i in range(len(topic_dataframe)):
    new_df = topic_information_scraper(url[i])
    new_df.to_csv(f'GitHubTopic_CSV-Files/{name[i]}.csv', index=None)
```
Then this code creates topic-specific CSV files based on the topics saved in the earlier 'GitHubtopics.csv' file. Those CSV files are saved in a directory called 'GitHubTopic_CSV-Files', each named after its topic.
Each topic CSV stores information about the repositories under that topic: the username, repository name, star count, and repository URL.
Note: The class names and tags on the website may change, so before running this Python script, check the current tags against the page.
Access the full script here: https://github.com/PrajjwalSule21/GitHub-Topic-Scraper/blob/main/RepoScraper.py