
How to store the elements of a list of HTML tags fetched with BeautifulSoup in a pandas DataFrame, split into alphabetical columns?

I am completely new to Jupyter Notebook, Python, web scraping and so on. I looked at different answers, but no one seems to have the same problem (and I am not good at taking "a similar" approach and adapting it for my purpose).

I want to create a data grid with all existing HTML tags. As the source I am using the MDN docs. Getting all the tags with Beautiful Soup works fine, but I struggle to go any further with this data.

Here is the code for fetching the data with Beautiful Soup:

from bs4 import BeautifulSoup
import requests

url = "https://developer.mozilla.org/en-US/docs/Web/HTML/Element"

response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

get_nav_tag = soup.find("nav", class_="sidebar-inner")
get_second_div = get_nav_tag.find_all("div")[2]
get_ol = get_second_div.find("ol")
get_li = get_second_div.find_all("li", class_="toggle")[3]

tag_list = get_li.find_all("code")

print("There are currently", len(tag_list), "tags.")

for tags in tag_list:
    print(tags.text)

[Image: output of the web scraping]

The list is already sorted.

Now I work with pandas to create a dataframe:

import pandas as pd

tag_data = []
for tag in tag_list:
    tag_data.append({"Tags": tag.text})

df = pd.DataFrame(tag_data)
df

The output looks like this:

[Image: pandas output]

QUESTION

How do I create a dataframe with one column per first letter, with the matching elements listed under each column?

Like:

  A         B      C 
1 <a>      <b>    <caption>
2 <abbr>   <body> <code>
3 <article> ..    ...
4 ...      ...    ...

How do I separate this list into several lists according to each element's first letter? I guess I will need that for further steps as well, such as creating graphs, e.g. a bar chart showing how many tags start with "a", "b", etc.

Thank you!

The code below should do the job. Since every tag starts with "<", the first letter is the character at index 1.

df['first_letter'] = df.Tags.str[1]
tag_matrix = pd.DataFrame()

for letter in df.first_letter.unique():
    # Create a pandas series whose name matches the first letter of the tag and contains tags starting with the letter
    matching_tags = pd.Series(df[df.first_letter==letter].reset_index(drop=True).Tags, name=letter)
    # Append the series to the tag_matrix
    tag_matrix = pd.concat([tag_matrix, matching_tags], axis=1)    

tag_matrix

Here's a sample of the output: [image]

Note that you might want to do some additional cleaning, such as dropping duplicate tags or converting to lower case.
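
As a rough illustration of that cleaning step, here is a minimal sketch. The sample data is made up and only stands in for the scraped tags; the column name "Tags" matches the question, and cleaned_tags is a name introduced just for this example.

```python
import pandas as pd

# Hypothetical sample data standing in for the scraped tag list
df = pd.DataFrame({"Tags": ["<a>", "<A>", "<abbr>", "<b>", "<body>", "<b>"]})

# Convert everything to lower case so "<A>" and "<a>" count as the same tag
df["Tags"] = df["Tags"].str.lower()

# Drop repeated tags and renumber the rows from 0
df = df.drop_duplicates(subset="Tags").reset_index(drop=True)

cleaned_tags = df["Tags"].tolist()
print(cleaned_tags)
```

Doing this before building the matrix keeps each column free of repeated entries.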

You can use the pivot and concat methods to achieve this:

df["letter"] = df["Tags"].str[1].str.upper()
df = df.pivot(columns="letter", values="Tags")
df = pd.concat([df[c].dropna().reset_index(drop=True) for c in df.columns], axis=1)

This gives the following output: [image]
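
For the bar chart part of the question (how many tags start with each letter), one possible approach is to count the first letters directly instead of going through the wide table. This is only a sketch on made-up sample data; letter_counts is a name introduced for the example.

```python
import pandas as pd

# Hypothetical sample data standing in for the scraped tag list
df = pd.DataFrame(
    {"Tags": ["<a>", "<abbr>", "<article>", "<b>", "<body>", "<caption>"]}
)

# Character at index 1 is the first letter, because index 0 is "<"
first_letters = df["Tags"].str[1].str.upper()

# Count the tags per letter and sort the result alphabetically
letter_counts = first_letters.value_counts().sort_index()
print(letter_counts)

# With matplotlib installed, letter_counts.plot.bar() should draw the chart
```

The counts come back as a Series indexed by letter, which is the shape the pandas plotting helpers expect.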
