I am completely new to Jupyter Notebook, Python, web scraping and so on. I looked at different answers, but no one seems to have the same problem (and I am not good at taking "a similar" approach and changing it a bit so I can use it for my purpose).
I want to create a data grid with all existing HTML tags. As a source I am using the MDN docs. It works fine to get all the tags with Beautiful Soup, but I struggle to go any further with this data.
from bs4 import BeautifulSoup
import requests
url = "https://developer.mozilla.org/en-US/docs/Web/HTML/Element"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
# Walk down the sidebar navigation to the list item that holds the element links
get_nav_tag = soup.find("nav", class_="sidebar-inner")
get_second_div = get_nav_tag.find_all("div")[2]
get_ol = get_second_div.find("ol")
get_li = get_second_div.find_all("li", class_="toggle")[3]
# Each tag name is wrapped in a <code> element
tag_list = get_li.find_all("code")
print("There are currently", len(tag_list), "tags.")
for tags in tag_list:
    print(tags.text)
The list is already sorted.
import pandas as pd
tag_data = []
for tag in tag_list:
    tag_data.append({"Tags": tag.text})
df = pd.DataFrame(tag_data)
df
The output is a DataFrame with a single column, "Tags", holding one tag per row.
How do I create a dataframe with one column for each first letter, where the tags are listed under the column of the letter they start with?
Like:
   A          B       C
1  <a>        <b>     <caption>
2  <abbr>     <body>  <code>
3  <article>  ...     ...
4  ...        ...     ...
How do I separate this list into several lists, one for each element's first letter? I guess I will need that for further work as well, such as creating graphs, e.g. a bar chart showing how many tags start with "a", "b" and so on.
Thank you!
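The code below should do the job. It takes the character after the opening "<" as each tag's first letter, then builds one column per letter.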
df['first_letter'] = df.Tags.str[1]
tag_matrix = pd.DataFrame()
for letter in df.first_letter.unique():
    # Create a pandas series whose name matches the first letter of the tag and contains tags starting with the letter
    matching_tags = pd.Series(df[df.first_letter==letter].reset_index(drop=True).Tags, name=letter)
    # Append the series to the tag_matrix
    tag_matrix = pd.concat([tag_matrix, matching_tags], axis=1)
tag_matrix
Here's a sample of the output:
Note that you might want to do some additional cleaning, such as dropping duplicate tags or converting to lower case.
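For the other part of the question (splitting the list into one list per first letter and counting them for a bar chart), here is a minimal sketch using only the standard library. The sample_tags list is a made-up stand-in for [tag.text for tag in tag_list], and group_by_first_letter is a hypothetical helper name, not something from pandas or Beautiful Soup.

```python
from collections import defaultdict

# Hypothetical sample standing in for [tag.text for tag in tag_list]
sample_tags = ["<a>", "<abbr>", "<article>", "<b>", "<body>", "<caption>", "<code>"]


def group_by_first_letter(tags):
    """Map each first letter (the character after '<') to the list of its tags."""
    groups = defaultdict(list)
    for tag in tags:
        # Drop the angle brackets, then take the first character in lower case
        letter = tag.strip("<>")[0].lower()
        groups[letter].append(tag)
    return dict(groups)


groups = group_by_first_letter(sample_tags)

# Number of tags per first letter, ready to feed into a bar chart
counts = {letter: len(tags) for letter, tags in groups.items()}

print(groups)
print(counts)
```

If you prefer to stay in pandas, pd.DataFrame({k: pd.Series(v) for k, v in groups.items()}) should give the same letter-per-column grid, and pd.Series(counts).plot.bar() should draw the bar chart, assuming matplotlib is installed.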