
Beautiful Soup to Scrape Data from Static Webpages

I am trying to scrape values from a table on multiple static webpages. It is the verb conjugation data for Korean verbs here: https://koreanverb.app/

My Python script uses Beautiful Soup. The goal is to grab all conjugations from multiple URL inputs and output the data to a CSV file.

Conjugations are stored on the page in a table with class "table-responsive", under table rows with class "conjugation-row". There are multiple "conjugation-row" table rows on each page. My script is somehow only grabbing the first table row with class "conjugation-row".

Why isn't the for loop grabbing all the tr elements with class "conjugation-row"? I would appreciate a solution that grabs all tr with class "conjugation-row". I tried using job_elements = results.find("tr", class_="conjugation-row"), but I get the following error:

AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?

Furthermore, when I do get the data and output it to a CSV file, the data is in separate rows as expected, but it leaves empty spaces: it places the data rows for the second URL at the index after all data rows for the first URL. See example output here:

(screenshot of the example CSV output)

See code here:

import requests
from bs4 import BeautifulSoup
import pandas as pd
import csv

# create csv file
outfile = open("scrape.csv","w",newline='')
writer = csv.writer(outfile)

## define first URL to grab conjugation names
url1 = 'https://koreanverb.app/?search=%ED%95%98%EB%8B%A4'

# define dataframe columns
df = pd.DataFrame(columns=['conjugation name'])

# get URL content
response = requests.get(url1)
soup = BeautifulSoup(response.content, 'html.parser')
    
# get table with all verb conjugations
results = soup.find("div", class_="table-responsive")


##### GET CONJUGATIONS AND APPEND TO CSV

# define URLs
urls = ['https://koreanverb.app/?search=%ED%95%98%EB%8B%A4', 
        'https://koreanverb.app/?search=%EB%A8%B9%EB%8B%A4',
        'https://koreanverb.app/?search=%EB%A7%88%EC%8B%9C%EB%8B%A4']

# loop to get data
for url in urls:
    response = requests.get(url)
    soup2 = BeautifulSoup(response.content, 'html.parser')
    
    # get table with all verb conjugations
    results2 = soup2.find("div", class_="table-responsive")
    
    # get dictionary form of verb/adjective
    verb_results = soup2.find('dl', class_='dl-horizontal')
    verb_title = verb_results.find('dd')
    verb_title_text = verb_title.text

    job_elements = results2.find_all("tr", class_="conjugation-row")
    for job_element in job_elements:
        conjugation_name = job_element.find("td", class_="conjugation-name")
        conjugation_korean = conjugation_name.find_next_sibling("td")
        conjugation_name_text = conjugation_name.text
        conjugation_korean_text = conjugation_korean.text
        data_column = pd.DataFrame({    'conjugation name': [conjugation_name_text],
                                        verb_title_text: [conjugation_korean_text],

        })
        #data_column = pd.DataFrame({verb_title_text: [conjugation_korean_text]})        
        df = df.append(data_column, ignore_index = True)
        
# save to csv
df.to_csv('scrape.csv')
outfile.close()
print('Verb Conjugations Collected and Appended to CSV, one per column')

Get all the job_elements using find_all(), since find() only returns the first occurrence, and iterate over them in a for loop like below.

job_elements = results.find_all("tr", class_="conjugation-row")
for job_element in job_elements:
    conjugation_name = job_element.find("td", class_="conjugation-name")
    conjugation_korean = conjugation_name.find_next_sibling("td")
    conjugation_name_text = conjugation_name.text
    conjugation_korean_text = conjugation_korean.text

    # append element to data
    # (DataFrame.append was removed in pandas 2.0; pd.concat is the current equivalent)
    df2 = pd.DataFrame([[conjugation_name_text, conjugation_korean_text]], columns=['conjugation_name', 'conjugation_korean'])
    df = pd.concat([df, df2], ignore_index=True)
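As for the gaps in your CSV: every data_column you build uses verb_title_text as a column name, and that name differs for each URL, so pandas aligns each verb's values into its own column and leaves the other columns blank on those rows. A sketch that builds one column per verb from the start is shown after the next answer.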

The error is raised where you try to use find() on a variable of type list (a ResultSet).
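To see the difference, here is a minimal reproduction, with inline HTML shaped like the page's assumed structure:

from bs4 import BeautifulSoup

# stripped-down stand-in for the real page's conjugation table (assumed structure)
html = '''
<div class="table-responsive"><table>
  <tr class="conjugation-row"><td class="conjugation-name">declarative present informal low</td><td>해</td></tr>
  <tr class="conjugation-row"><td class="conjugation-name">declarative present informal high</td><td>해요</td></tr>
</table></div>
'''
soup = BeautifulSoup(html, 'html.parser')

rows = soup.find_all('tr', class_='conjugation-row')  # ResultSet: list-like, has no .find()
# rows.find('td')  # -> AttributeError: ResultSet object has no attribute 'find'
for row in rows:  # call find() on each individual Tag instead
    print(row.find('td', class_='conjugation-name').text)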

Using find_all helps to get the correct td elements; then you can use find_next to get the following unclassified td. Also, I don't think pandas is really necessary for this triviality.

import requests
from bs4 import BeautifulSoup as BS

urls = ['https://koreanverb.app/?search=%ED%95%98%EB%8B%A4',
        'https://koreanverb.app/?search=%EB%A8%B9%EB%8B%A4',
        'https://koreanverb.app/?search=%EB%A7%88%EC%8B%9C%EB%8B%A4']
CSV = 'scrape.csv'

# write the header row, truncating any previous file
with open(CSV, 'w') as csv:
    print('conjugation_name, conjugation_korean', file=csv)

# reuse one connection for all requests
with requests.Session() as session:
    for url in urls:
        r = session.get(url)
        r.raise_for_status()  # fail loudly on a bad response
        soup = BS(r.text, 'lxml')
        # each conjugation name is a classified td; the Korean form is the td right after it
        td = soup.find_all('td', class_='conjugation-name')
        with open(CSV, 'a') as csv:
            for _td in td:
                print(f'{_td.text}, {_td.find_next().text}', file=csv)
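If you do want the layout you were aiming for with pandas, one column of Korean forms per verb sharing a single conjugation-name index, here is a minimal sketch; it assumes every page lists the same conjugation names, in the same order and without duplicates:

import requests
import pandas as pd
from bs4 import BeautifulSoup

urls = ['https://koreanverb.app/?search=%ED%95%98%EB%8B%A4',
        'https://koreanverb.app/?search=%EB%A8%B9%EB%8B%A4',
        'https://koreanverb.app/?search=%EB%A7%88%EC%8B%9C%EB%8B%A4']

columns = []
for url in urls:
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    # the dictionary form of the verb becomes the column header
    verb = soup.find('dl', class_='dl-horizontal').find('dd').text
    rows = soup.find('div', class_='table-responsive').find_all('tr', class_='conjugation-row')
    names = [r.find('td', class_='conjugation-name').text for r in rows]
    korean = [r.find('td', class_='conjugation-name').find_next_sibling('td').text for r in rows]
    # one Series per verb, indexed by conjugation name
    columns.append(pd.Series(korean, index=names, name=verb))

# align the Series side by side on the shared index; no empty rows between verbs
df = pd.concat(columns, axis=1)
df.to_csv('scrape.csv', index_label='conjugation name')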
