
Issue using BeautifulSoup and reading target URLs from a CSV

When I scrape using a single URL for the url variable, everything works as expected, but I get no results when trying to read the links from a CSV. Any help is appreciated.

Information about the CSV:

  • One column, with a header called "Links"
  • 300 rows of links, with no spaces, commas, semicolons, or other characters before/after the links
  • One link per row (an illustrative sample follows below)
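
For reference, a urls.csv in that shape would presumably look like this (the links here are hypothetical placeholders):

    Links
    https://apps.example.com/app/id111111111
    https://apps.example.com/app/id222222222
    https://apps.example.com/app/id333333333

Here is my code: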
    import requests  # required to make request
    from bs4 import BeautifulSoup  # required to parse html
    import pandas as pd
    import csv
    
    with open("urls.csv") as infile:
        reader = csv.DictReader(infile)
        for link in reader:
            res = requests.get(link['Links'])
            #print(res.url)
    url = res
    
    page = requests.get(url)
    
    soup = BeautifulSoup(page.text, 'html.parser')
    
    email_elm0 = soup.find_all(class_= "app-support-list__item")[0].text.strip()
    email_elm1 = soup.find_all(class_= "app-support-list__item")[1].text.strip()
    email_elm2 = soup.find_all(class_= "app-support-list__item")[2].text.strip()
    email_elm3 = soup.find_all(class_= "app-support-list__item")[3].text.strip()
    
    final_email_elm = (email_elm0,email_elm1,email_elm2,email_elm3)
    
    
    print(final_email_elm)
    
    df = pd.DataFrame(final_email_elm)
    
    #getting an output in csv format for the dataframe we created
    #df.to_csv('draft_part2_scrape.csv')

The problem is in this part of the code:

with open("urls.csv") as infile:
    reader = csv.DictReader(infile)
    for link in reader:
        res = requests.get(link['Links'])
...

Each iteration of the loop overwrites res, so once the loop finishes, res holds only the result for the last link. As a result, this program scrapes only that last link.
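
The pattern is easy to reproduce in isolation; a minimal sketch (with placeholder strings instead of URLs) shows that only the value from the final iteration survives the loop:

res = None
for link in ["a", "b", "c"]:
    res = link  # each iteration overwrites the previous value of res
print(res)  # prints "c" -- everything before the last item is lost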

To fix this, store all the links in a list and iterate over that list, scraping each link in turn. You can collect each scraped result in its own dataframe and concatenate them at the end to write a single file:

import requests  # required to make request
from bs4 import BeautifulSoup  # required to parse html
import pandas as pd
import csv

links = []
with open("urls.csv") as infile:
    reader = csv.DictReader(infile)
    for link in reader:
        links.append(link['Links'])
        

dfs = []
for url in links:
    page = requests.get(url)

    soup = BeautifulSoup(page.text, 'html.parser')

    # query the support-list items once and reuse the result
    items = soup.find_all(class_="app-support-list__item")
    email_elm0 = items[0].text.strip()
    email_elm1 = items[1].text.strip()
    email_elm2 = items[2].text.strip()
    email_elm3 = items[3].text.strip()

    final_email_elm = (email_elm0, email_elm1, email_elm2, email_elm3)
    print(final_email_elm)

    dfs.append(pd.DataFrame(final_email_elm))


# write the combined dataframe out as a single CSV file
df = pd.concat(dfs)
df.to_csv('draft_part2_scrape.csv')
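
As a side note, since pandas is already imported, the csv module is not strictly needed; a minimal sketch of loading the link list with pandas, assuming the same "Links" header:

import pandas as pd

# read urls.csv and pull the "Links" column out as a plain Python list
links = pd.read_csv("urls.csv")["Links"].tolist()

This replaces the explicit reader loop with a single call, but otherwise feeds the same list into the scraping loop above.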
