
Reading URLs from .csv and appending scrape results below previous with Python, BeautifulSoup, Pandas

I got this code to almost work, despite much ignorance. Please help on the home run!

  • Problem 1: INPUT:

I have a long list of URLs (1000+) to read from, and they are in a single column in a .csv. I would prefer to read from that file than to paste them into the code, like below.
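For Problem 1, the URL list can be loaded from the single-column .csv instead of being hard-coded. A minimal sketch, assuming the file is called `urls.csv` and has no header row (both the filename and the no-header assumption are hypothetical; the example writes a tiny stand-in file so it runs on its own):

```python
import pandas as pd

# Hypothetical stand-in for the real file: in practice this would be
# the existing .csv with 1000+ URLs in a single column.
with open('urls.csv', 'w') as f:
    f.write('https://example.com/a/\nhttps://example.com/b/\n')

# header=None treats the first line as data; use header=0 instead if
# the file starts with a column title. dropna() skips blank lines.
urls = pd.read_csv('urls.csv', header=None)[0].dropna().tolist()
```

The resulting `urls` list can then be used directly by the `for url in urls:` loop below.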

  • Problem 2: OUTPUT:

The source files actually have 3 drivers and 3 challenges each. In a separate python file, the below code finds, prints and saves all 3, but not when I'm using this dataframe below (see below - it only saves 2).

  • Problem 3: OUTPUT:

I want the output (both files) to have URLs in column 0, and then drivers (or challenges) in the following columns. But what I've written here (probably the 'drop') makes them not only drop one row but also move across 2 columns.

At the end I'm showing both the inputs and the current &amp; desired output. Sorry for the long question. I'll be very grateful for any help!

import requests
from bs4 import BeautifulSoup
import pandas as pd

urls = ['https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/', 'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/', 'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/']
dataframes = []
dataframes2 = []

for url in urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    toc = soup.find("div", id="toc")

    def get_drivers():
        data = []
        for x in toc.select('li:-soup-contains-own("Market drivers") li'):
            data.append(x.get_text(strip=True))
        df = pd.DataFrame(data, columns=[url])
        dataframes.append(pd.DataFrame(df).drop(0, axis=0))
        df2 = pd.concat(dataframes)
        tdata = df2.T
        tdata.to_csv(f'detail-dr.csv', header=True)

    get_drivers()


    def get_challenges():
        data = []
        for y in toc.select('li:-soup-contains-own("Market challenges") li'):
            data.append(y.get_text(strip=True).replace('Table Impact of drivers and challenges', ''))
        df = pd.DataFrame(data, columns=[url])
        dataframes2.append(pd.DataFrame(df).drop(0, axis=0))
        df2 = pd.concat(dataframes2)
        tdata = df2.T
        tdata.to_csv(f'detail-ch.csv', header=True)

    get_challenges()

The inputs look like this in each URL. They are just lists:

Market drivers

  • Growing investment in fabs
  • Miniaturization of electronic products
  • Increasing demand for IoT devices

Market challenges

  • Rapid technological changes in semiconductor industry
  • Volatility in semiconductor industry
  • Impact of technology chasm Table Impact of drivers and challenges

My desired output for drivers is:

0 1 2 3
http/.../Global-Induction-Hobs-30196623/ Product innovations and new designs Increasing demand for convenient home appliances with changes in lifestyle patterns Growing adoption of energy-efficient appliances
http/.../Global-Human-Capital-Management-30196628/ Demand for automated recruitment processes Increasing demand for unified solutions for all HR functions Increasing workforce diversity
http/.../Global-Probe-Card-30196643/ Growing investment in fabs Miniaturization of electronic products Increasing demand for IoT devices

But instead I get:

0 1 2 3 4 5 6
http/.../Global-Induction-Hobs-30196623/ Increasing demand for convenient home appliances with changes in lifestyle patterns Growing adoption of energy-efficient appliances
http/.../Global-Human-Capital-Management-30196628/ Increasing demand for unified solutions for all HR functions Increasing workforce diversity
http/.../Global-Probe-Card-30196643/ Miniaturization of electronic products Increasing demand for IoT devices

Store your data in a list of dicts and create a data frame from it. Then split each list of drivers / challenges into single columns and concat them to the final data frame.
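The pattern can be seen offline with made-up data (the `'d1'`/`'c1'` values below are illustrative stand-ins for the scraped bullet texts, not real results):

```python
import pandas as pd

# One dict per page/type pair; the scraped bullets stay together
# as a list under the 'list' key.
data = [
    {'url': 'http://example.com/a/', 'type': 'driver', 'list': ['d1', 'd2', 'd3']},
    {'url': 'http://example.com/a/', 'type': 'challenges', 'list': ['c1', 'c2', 'c3']},
]

df = pd.DataFrame(data)
# Expand each list into numbered columns (0, 1, 2) and place them
# next to the url/type columns, aligned on the row index.
wide = pd.concat([df[['url', 'type']], pd.DataFrame(df['list'].tolist())], axis=1)
```

Because `pd.DataFrame(df['list'].tolist())` builds its columns from list positions, every scraped item keeps its own column and nothing needs to be dropped afterwards.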

Example

import requests
from bs4 import BeautifulSoup
import pandas as pd

urls = ['https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/', 'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/', 'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/']
data = []

for url in urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    toc = soup.find("div", id="toc")

    def get_drivers():
        data.append({
            'url':url,
            'type':'driver',
            'list':[x.get_text(strip=True) for x in toc.select('li:-soup-contains-own("Market drivers") li')]
        })

    get_drivers()


    def get_challenges():
        data.append({
            'url':url,
            'type':'challenges',
            'list':[x.get_text(strip=True) for x in toc.select('li:-soup-contains-own("Market challenges") li') if x.get_text(strip=True) != 'Table Impact of drivers and challenges' ]
        })

    get_challenges()

    
df = pd.DataFrame(data)
result = pd.concat([df[['url', 'type']], pd.DataFrame(df['list'].tolist())], axis=1)

Output

url type 0 1 2
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/ driver Product innovations and new designs Increasing demand for convenient home appliances with changes in lifestyle patterns Growing adoption of energy-efficient appliances
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/ challenges High cost limiting the adoption in the mass segment Health hazards related to induction hobs Limitation of using only flat - surface utensils and induction-specific cookwareTable Impact of drivers and challenges
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/ driver Demand for automated recruitment processes Increasing demand for unified solutions for all HR functions Increasing workforce diversity
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/ challenges Threat from open-source software High implementation and maintenance cost Threat to data securityTable Impact of drivers and challenges
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/ driver Growing investment in fabs Miniaturization of electronic products Increasing demand for IoT devices
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/ challenges Rapid technological changes in semiconductor industry Volatility in semiconductor industry Impact of technology chasmTable Impact of drivers and challenges
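To get the two output files the question asks for (URL in column 0, one row per page), the combined frame can be filtered on `type` and written out separately. A sketch, assuming the final frame built by the answer's `pd.concat(...)` expression has been assigned to a variable named `result` (a hypothetical name; the example builds an illustrative frame in that shape rather than scraping):

```python
import pandas as pd

# Hypothetical combined frame in the shape the answer's concat
# produces (values are illustrative, not scraped).
result = pd.DataFrame({
    'url': ['http://example.com/a/', 'http://example.com/a/'],
    'type': ['driver', 'challenges'],
    0: ['d1', 'c1'],
    1: ['d2', 'c2'],
    2: ['d3', 'c3'],
})

# One file per type; dropping 'type' leaves the URL as column 0,
# followed by the drivers (or challenges) in the next columns.
result[result['type'] == 'driver'].drop(columns='type').to_csv('detail-dr.csv', index=False)
result[result['type'] == 'challenges'].drop(columns='type').to_csv('detail-ch.csv', index=False)
```

Writing the files once, after the loop, also avoids rewriting the growing CSV on every iteration as the original code does.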
