
Reading list of URLs from .csv for scraping with Python, BeautifulSoup, Pandas

This was part of another question (see "Reading URLs from .csv and appending scrape results below previous with Python, BeautifulSoup, Pandas"), which was generously answered by @HedgeHog and contributed to by @QHarr.

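In the code below, I have simply pasted 3 example source URLs into the code. But I have a long list of URLs to scrape (1000+), stored in the first column of a .csv file (let's call it "urllist.csv"). I would rather read them from that file.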

I think I know the basic structure of "with open", but I am having trouble linking it to the rest of the code. Your help would be much appreciated.

import requests
from bs4 import BeautifulSoup
import pandas as pd

urls = ['https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/',
        'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/',
        'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/']
data = []

for url in urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    toc = soup.find("div", id="toc")


    def get_drivers():
        data.append({
            'url': url,
            'type': 'driver',
            'list': [x.get_text(strip=True) for x in toc.select('li:-soup-contains-own("Market drivers") li')]
        })


    get_drivers()


    def get_challenges():
        data.append({
            'url': url,
            'type': 'challenges',
            'list': [x.get_text(strip=True) for x in toc.select('li:-soup-contains-own("Market challenges") ul li') if
                     'Table Impact of drivers and challenges' not in x.get_text(strip=True)]
        })


    get_challenges()

pd.concat([pd.DataFrame(data)[['url', 'type']], pd.DataFrame(pd.DataFrame(data).list.tolist())],
          axis=1).to_csv('output.csv')

Since you are using pandas, read_csv will do the trick for you: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
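As a rough illustration of the read_csv route, the sketch below loads the first column of the file into a plain list. The helper name load_urls and its has_header flag are made up for this example, and it assumes the URLs sit in the first column, as described in the question.

```python
import pandas as pd


def load_urls(csv_path, has_header=False):
    """Return the first column of a CSV file as a list of URL strings.

    load_urls and has_header are illustrative names, not part of pandas.
    """
    # header=None treats the first row as data; header=0 treats it as column names.
    df = pd.read_csv(csv_path, header=0 if has_header else None, usecols=[0])
    # Drop empty cells and strip stray whitespace around each URL.
    return df.iloc[:, 0].dropna().astype(str).str.strip().tolist()
```

With this in place, urls = load_urls('urllist.csv') could replace the hard-coded list, and the existing "for url in urls:" loop would stay as it is. Pass has_header=True if the file starts with a column name rather than a URL.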

If you want to write it yourself, you can use the built-in csv library:

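import csv

# Assumes a header row with a column named "url"; use 'urllist.csv' for the question's file.
with open('urls.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        print(row["url"])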
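The DictReader snippet looks up a column named "url", so it needs a header row. If the file only holds the URLs in its first column with no header, as the question suggests, a plain csv.reader may be a closer fit. The following is a minimal sketch under that assumption; read_first_column is a hypothetical helper name.

```python
import csv


def read_first_column(csv_path):
    """Collect the non-empty values from the first column of a CSV file.

    read_first_column is an illustrative name, not a standard-library function.
    """
    urls = []
    with open(csv_path, newline='') as csvfile:
        for row in csv.reader(csvfile):
            # Skip blank lines and rows whose first cell is empty.
            if row and row[0].strip():
                urls.append(row[0].strip())
    return urls
```

To link this to the rest of the code, urls = read_first_column('urllist.csv') can stand in for the hard-coded list, leaving the scraping loop unchanged. If the file does have a header row after all, urls[1:] drops it.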

