Reading a list of URLs from .csv for scraping with Python, BeautifulSoup, Pandas

Question

This was part of another question (see "Reading URLs from .csv and appending scrape results below previous with Python, BeautifulSoup, Pandas"), which was generously answered by @HedgeHog and contributed to by @QHarr.

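In the code below I have simply pasted three example source URLs into the code. However, I have a long list of URLs to scrape (1000+), stored in the first column of a .csv file (let's call it 'urllist.csv'). I would rather read them from that file.

I think I know the basic structure of 'with open', but I am having trouble linking it to the rest of the code. Any help would be much appreciated.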

import requests
from bs4 import BeautifulSoup
import pandas as pd

urls = ['https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/',
        'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/',
        'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/']
data = []

for url in urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    toc = soup.find("div", id="toc")


    def get_drivers():
        data.append({
            'url': url,
            'type': 'driver',
            'list': [x.get_text(strip=True) for x in toc.select('li:-soup-contains-own("Market drivers") li')]
        })


    get_drivers()


    def get_challenges():
        data.append({
            'url': url,
            'type': 'challenges',
            'list': [x.get_text(strip=True) for x in toc.select('li:-soup-contains-own("Market challenges") ul li') if
                     'Table Impact of drivers and challenges' not in x.get_text(strip=True)]
        })


    get_challenges()

pd.concat([pd.DataFrame(data)[['url', 'type']], pd.DataFrame(pd.DataFrame(data).list.tolist())],
          axis=1).to_csv('output.csv')

Answer 1

Since you are already using pandas, read_csv will do the trick for you: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html

If you want to write it yourself, you can use the built-in csv library:
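As a rough sketch of the read_csv route, something like the following could replace the hard-coded urls list. The helper name load_urls and the has_header flag are made up for illustration, and it assumes the URLs sit in the first column of the file, as described in the question.

```python
import pandas as pd


def load_urls(csv_path, has_header=False):
    """Return the first column of a CSV file as a list of URL strings.

    load_urls and has_header are illustrative names, not part of pandas.
    """
    # header=None tells pandas the first row is data rather than column names;
    # usecols=[0] keeps only the first column.
    df = pd.read_csv(csv_path, header=0 if has_header else None, usecols=[0])
    # Drop empty cells and strip stray whitespace around each URL.
    return df.iloc[:, 0].dropna().astype(str).str.strip().tolist()
```

With that in place, urls = load_urls('urllist.csv') would stand in for the literal list, and the for url in urls: loop from the question could stay as it is.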

import csv

# Assumes the file has a header row with a column named "url".
with open('urls.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        print(row["url"])
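If the file has no header row, or you just want whatever is in the first column, csv.reader may be a simpler fit than DictReader. Here is a minimal sketch; the function name read_first_column and the skip_header flag are hypothetical, chosen only for this example.

```python
import csv


def read_first_column(csv_path, skip_header=False):
    """Collect the non-empty values of the first column of a CSV file."""
    urls = []
    with open(csv_path, newline='') as csvfile:
        reader = csv.reader(csvfile)
        if skip_header:
            next(reader, None)  # discard the header row, if the file has one
        for row in reader:
            # Skip blank lines and rows whose first cell is empty.
            if row and row[0].strip():
                urls.append(row[0].strip())
    return urls
```

To link this to the rest of the code, you could set urls = read_first_column('urllist.csv') before the loop and leave everything else unchanged.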
