Reading list of URLs from .csv for scraping with Python, BeautifulSoup, Pandas
This was part of another question (see "Reading URLs from .csv and appending scrape results below previous with Python, BeautifulSoup, Pandas"), which was generously answered by @HedgeHog and contributed to by @QHarr.
In the code below, I'm just pasting 3 example source URLs into the code. But I have a long list of URLs (1000+) to scrape, and they are stored in the single first column of a .csv file (let's call it 'urllist.csv'). I would prefer to read them from that file.
I think I know the basic structure of `with open`, but I'm having problems linking that to the rest of the code. Your help will be highly appreciated.
import requests
from bs4 import BeautifulSoup
import pandas as pd

urls = ['https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/',
        'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/',
        'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/']

data = []

for url in urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    toc = soup.find("div", id="toc")

    def get_drivers():
        data.append({
            'url': url,
            'type': 'driver',
            'list': [x.get_text(strip=True) for x in toc.select('li:-soup-contains-own("Market drivers") li')]
        })
    get_drivers()

    def get_challenges():
        data.append({
            'url': url,
            'type': 'challenges',
            'list': [x.get_text(strip=True) for x in toc.select('li:-soup-contains-own("Market challenges") ul li') if
                     'Table Impact of drivers and challenges' not in x.get_text(strip=True)]
        })
    get_challenges()

pd.concat([pd.DataFrame(data)[['url', 'type']], pd.DataFrame(pd.DataFrame(data).list.tolist())],
          axis=1).to_csv('output.csv')
Since you're already using pandas, read_csv will do the trick for you: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
If you want to write it on your own, you could use the built-in csv library:
import csv

with open('urls.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        print(row["url"])