
Reading list of URLs from .csv for scraping with Python, BeautifulSoup, Pandas

This was part of another question (see Reading URLs from .csv and appending scrape results below previous with Python, BeautifulSoup, Pandas) which was generously answered by @HedgeHog and contributed to by @QHarr.

In the code below, I'm just pasting 3 example source URLs into the code. But I have a long list of URLs (1000+) to scrape, and they are stored in the first column of a .csv file (let's call it 'urllist.csv'). I would prefer to read from that file.

I think I know the basic structure of `with open`, but I'm having trouble linking it to the rest of the code. Your help will be highly appreciated.

import requests
from bs4 import BeautifulSoup
import pandas as pd

urls = ['https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/',
        'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/',
        'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/']
data = []

for url in urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    toc = soup.find("div", id="toc")

    # collect the "Market drivers" bullet points
    data.append({
        'url': url,
        'type': 'driver',
        'list': [x.get_text(strip=True) for x in
                 toc.select('li:-soup-contains-own("Market drivers") li')]
    })

    # collect the "Market challenges" bullet points, skipping the table caption
    data.append({
        'url': url,
        'type': 'challenges',
        'list': [x.get_text(strip=True) for x in
                 toc.select('li:-soup-contains-own("Market challenges") ul li')
                 if 'Table Impact of drivers and challenges' not in x.get_text(strip=True)]
    })

df = pd.DataFrame(data)
pd.concat([df[['url', 'type']], pd.DataFrame(df.list.tolist())],
          axis=1).to_csv('output.csv')

Since you're using pandas, `read_csv` will do the trick for you: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
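Wiring that into the code above could look like the following minimal sketch. The `example.com` URLs and the file written at the top are placeholders standing in for the real 'urllist.csv', and `header=None` assumes the file has no header row (just one URL per line in the first column):

```python
import pandas as pd

# Placeholder: create a small sample file for demonstration; in practice
# 'urllist.csv' already exists with one URL per row in the first column.
with open('urllist.csv', 'w') as f:
    f.write('https://example.com/a\nhttps://example.com/b\n')

# header=None stops pandas from treating the first URL as a column name;
# column 0 is the first column of the file.
urls = pd.read_csv('urllist.csv', header=None)[0].tolist()
print(urls)
```

The resulting `urls` list can then replace the hard-coded one, and the rest of the scraping loop runs unchanged.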

If you want to write it on your own, you could use the built-in `csv` library:

import csv

with open('urls.csv', newline='') as csvfile:
    # DictReader assumes the file has a header row with a 'url' column
    reader = csv.DictReader(csvfile)
    for row in reader:
        print(row["url"])


