Python Web Scraping: Output to CSV

Question

I'm making some progress with web scraping, but I still need help with a few operations:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'http://fcf.cat/equip/1920/1i/sant-ildefons-ue-b'

soup = BeautifulSoup(requests.get(url).content, 'html.parser')

out = []

for tr in soup.select('.col-md-4 tbody tr'):
    pass  # unfinished: the names and the table header still need to be extracted here

I know there are three tables under the class col-md-4. I want to generate a CSV in which each row has three values: first name, last name, and the header name of the table that row came from.

first name, last name, table header

Any help would be appreciated.

Answer 1

This is what I have done on my own:

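import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'http://fcf.cat/equip/1920/1i/sant-ildefons-ue-b'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# Name the CSV after the last path segment of the URL
filename = url.rsplit('/', 1)[1] + '.csv'

tables = soup.select('.col-md-4 table')
rows = []

# One row per table: the header text followed by every name in that table
for table in tables:
    t = table.get_text(strip=True, separator='|').split('|')
    rows.append(t)

# Build and write the DataFrame once, after all tables have been collected
df = pd.DataFrame(rows)
print(df)
df.to_csv(filename)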

Thanks,

Answer 2

This might work:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'http://fcf.cat/equip/1920/1i/sant-ildefons-ue-b'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
tables = soup.select('.col-md-4 table')
rows = []

for table in tables:
    # All non-empty text fragments in the table, whitespace-stripped
    cleaned = list(table.stripped_strings)
    # The first fragment is taken as the table header, the rest as names
    header, names = cleaned[0], cleaned[1:]
    # 'Surname, Name' -> ['Surname', 'Name', header]
    data = [name.split(', ') + [header] for name in names]
    rows.extend(data)

result = pd.DataFrame.from_records(rows, columns=['surname', 'name', 'table'])
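
The snippet above stops at the DataFrame and never writes the file. With pandas, `result.to_csv(filename, index=False)` should do it. As a rough sketch of the same step without pandas, here is a version using the standard-library `csv` module, which is a swapped-in technique and not what this answer uses. The helper names `csv_filename_from_url` and `rows_to_csv_text` and the sample rows are hypothetical, made up for illustration; the file name is taken from the URL in the way Answer 1 does it.

```python
import csv
import io


def csv_filename_from_url(url):
    """Use the last path segment of the URL as the CSV file name."""
    # rstrip('/') guards against a trailing slash leaving an empty segment
    return url.rstrip('/').rsplit('/', 1)[-1] + '.csv'


def rows_to_csv_text(rows, columns=('surname', 'name', 'table')):
    """Serialise [surname, name, table] rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    writer.writerow(columns)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample rows standing in for the scraped data
sample_rows = [
    ['Garcia Lopez', 'Joan', 'Players'],
    ['Martinez Ruiz', 'Pau', 'Staff'],
]

url = 'http://fcf.cat/equip/1920/1i/sant-ildefons-ue-b'
print(csv_filename_from_url(url))
print(rows_to_csv_text(sample_rows))
```

To save the result, the text can be written with `open(filename, 'w', newline='', encoding='utf-8')`. The `csv` writer quotes any field that itself contains a comma, so a surname or header with a comma in it should not break the column layout.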

Answer 3

You need to first iterate through each table you want to scrape and, for each table, get its header and its rows of data. For each row, parse out the first name and last name, and keep the table's header alongside them.

Here's a verbose working example:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'http://fcf.cat/equip/1920/1i/sant-ildefons-ue-b'

soup = BeautifulSoup(requests.get(url).content, 'html.parser')

out = []

# Iterate through each of the three tables
for table in soup.select(".col-md-4 table"):

    # Grab the header and rows from the table
    header = table.select("thead th")[0].text.strip()
    rows = [s.text.strip() for s in table.select("tbody tr")]

    t = []  # This list will contain the rows of data for this table

    # Iterate through rows in this table
    for row in rows:

        # Split by comma (last_name, first_name)
        split = row.split(",")

        last_name = split[0].strip()
        first_name = split[1].strip()

        # Create the row of data
        t.append([first_name, last_name, header])

    # Convert list of rows to a DataFrame
    df = pd.DataFrame(t, columns=["first_name", "last_name", "table_name"])

    # Append to list of DataFrames
    out.append(df)

# Write to CSVs...
out[0].to_csv("first_table.csv", index=False)  # etc...

Whenever you're web scraping, I highly recommend calling strip() on all of the text you parse so that you don't end up with superfluous whitespace in your data.
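
As a small illustration of the strip() advice, here is a minimal sketch of a name-splitting helper. It assumes each cell looks like "Surname, Name", as in the example above; the function name `split_name` and the sample strings are hypothetical. It uses `str.partition` instead of `split(",")`, so a cell without a comma yields an empty first name where indexing `split[1]` would raise an IndexError.

```python
from typing import Tuple


def split_name(raw: str) -> Tuple[str, str]:
    """Return (first_name, last_name) from a 'Surname, Name' cell."""
    # partition() never raises: with no comma, `first` is an empty string
    last, _, first = raw.partition(',')
    # strip() removes the padding and newlines that scraped text often carries
    return first.strip(), last.strip()


# Hypothetical cells standing in for scraped table rows
print(split_name('  GARCIA LOPEZ,  JOAN \n'))
print(split_name('RONALDINHO'))
```

Whether an empty first name is the right fallback depends on the data; skipping or logging such rows would be an equally reasonable choice.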

I hope this helps!

3 answers

Answer 1
1 accepted 2020-06-01 12:19:06

Answer 2
1 2020-06-01 13:17:35

Answer 3
1 2020-06-01 20:36:58
