Reading list of URLs from .csv for scraping with Python, BeautifulSoup, Pandas

This was part of another question (see "Reading URLs from .csv and appending scrape results below previous with Python, BeautifulSoup, Pandas"), which was generously answered by @HedgeHog and contributed to by @QHarr.

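In the code below, I have simply pasted 3 example source URLs into the code. But I have a long list of URLs to scrape (1000+), stored in the first column of a .csv file (let's call it "urllist.csv"). I would rather read from that file.

I think I know the basic structure of "with open", but I am having trouble linking it to the rest of the code. Your help would be much appreciated.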

import requests
from bs4 import BeautifulSoup
import pandas as pd

urls = ['https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/',
        'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/',
        'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/']
data = []

for url in urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    toc = soup.find("div", id="toc")


    def get_drivers():
        data.append({
            'url': url,
            'type': 'driver',
            'list': [x.get_text(strip=True) for x in toc.select('li:-soup-contains-own("Market drivers") li')]
        })


    get_drivers()


    def get_challenges():
        data.append({
            'url': url,
            'type': 'challenges',
            'list': [x.get_text(strip=True) for x in toc.select('li:-soup-contains-own("Market challenges") ul li') if
                     'Table Impact of drivers and challenges' not in x.get_text(strip=True)]
        })


    get_challenges()

pd.concat([pd.DataFrame(data)[['url', 'type']], pd.DataFrame(pd.DataFrame(data).list.tolist())],
          axis=1).to_csv('output.csv')

Since you are already using pandas, read_csv will do the job for you: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html

If you want to write it yourself, you can use the built-in csv library:
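A minimal sketch of the read_csv route, assuming urllist.csv has no header row and keeps the URLs in its first column. The question does not say whether a header exists, so header=None is an assumption (remove it and select the column by name if there is one). The helper name load_urls is hypothetical.

```python
import pandas as pd


def load_urls(path):
    """Return the first column of a CSV file as a list of URL strings."""
    # header=None: assumption that the file has no header row.
    # usecols=[0]: read only the first column, where the URLs are stored.
    df = pd.read_csv(path, header=None, usecols=[0])
    # Drop empty cells and strip stray whitespace around each URL.
    return df[0].dropna().astype(str).str.strip().tolist()
```

Replacing the hard-coded list with urls = load_urls('urllist.csv') should leave the rest of the question's loop unchanged, since it only needs an iterable of URL strings.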

import csv

# Assumes the file has a header row with a column named "url".
with open('urls.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        print(row["url"])
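The DictReader snippet looks rows up by the header name "url". If the file has no header row, one possible variation is to read the first column by position with csv.reader and collect the values into a list. This is a sketch under that assumption, and read_first_column is a hypothetical helper name.

```python
import csv


def read_first_column(path):
    """Collect the non-empty values of the first column of a CSV file."""
    urls = []
    with open(path, newline='') as csvfile:
        for row in csv.reader(csvfile):
            # Skip blank lines and rows whose first cell is empty.
            if row and row[0].strip():
                urls.append(row[0].strip())
    return urls
```

The returned list can stand in for the hard-coded urls list in the question, for example urls = read_first_column('urllist.csv'), followed by the existing for url in urls: loop.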


