
Reading URLs from .csv and appending scrape results below previous with Python, BeautifulSoup, Pandas

I got this code to almost work, despite much ignorance. Please help on the home run!

  • Problem 1: INPUT:

I have a long list of URLs (1000+) to read from, and they are in a single column in a .csv. I would prefer to read from that file than to paste them into the code, like below.
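For Problem 1, the URL list can be loaded from the single-column .csv instead of being hard-coded. A minimal sketch, assuming the file is called `urls.csv` and has no header row (both the filename and the no-header assumption are hypothetical; the example writes a tiny stand-in file so it runs on its own):

```python
import pandas as pd

# Hypothetical stand-in for the real file: in practice this would be
# the existing .csv with 1000+ URLs in a single column.
with open('urls.csv', 'w') as f:
    f.write('https://example.com/a/\nhttps://example.com/b/\n')

# header=None treats the first line as data; use header=0 instead if
# the file starts with a column title. dropna() skips blank lines.
urls = pd.read_csv('urls.csv', header=None)[0].dropna().tolist()
```

The resulting `urls` list can then be used directly by the `for url in urls:` loop below.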

  • Problem 2: OUTPUT:

The source files actually have 3 drivers and 3 challenges each. In a separate python file, the below code finds, prints and saves all 3, but not when I'm using this dataframe below (see below - it only saves 2).

  • Problem 3: OUTPUT:

I want the output (both files) to have URLs in column 0, and then drivers (or challenges) in the following columns. But what I've written here (probably the 'drop') makes them not only drop one row but also move across 2 columns.

At the end I'm showing both the inputs and the current &amp; desired output. Sorry for the long question. I'll be very grateful for any help!

import requests
from bs4 import BeautifulSoup
import pandas as pd

urls = ['https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/', 'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/', 'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/']
dataframes = []
dataframes2 = []

for url in urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    toc = soup.find("div", id="toc")

    def get_drivers():
        data = []
        for x in toc.select('li:-soup-contains-own("Market drivers") li'):
            data.append(x.get_text(strip=True))
        df = pd.DataFrame(data, columns=[url])
        dataframes.append(pd.DataFrame(df).drop(0, axis=0))
        df2 = pd.concat(dataframes)
        tdata = df2.T
        tdata.to_csv(f'detail-dr.csv', header=True)

    get_drivers()


    def get_challenges():
        data = []
        for y in toc.select('li:-soup-contains-own("Market challenges") li'):
            data.append(y.get_text(strip=True).replace('Table Impact of drivers and challenges', ''))
        df = pd.DataFrame(data, columns=[url])
        dataframes2.append(pd.DataFrame(df).drop(0, axis=0))
        df2 = pd.concat(dataframes2)
        tdata = df2.T
        tdata.to_csv(f'detail-ch.csv', header=True)

    get_challenges()

The inputs look like this in each URL. They are just lists:

Market drivers

  • Growing investment in fabs
  • Miniaturization of electronic products
  • Increasing demand for IoT devices

Market challenges

  • Rapid technological changes in semiconductor industry
  • Volatility in semiconductor industry
  • Impact of technology chasm Table Impact of drivers and challenges

My desired output for drivers is:

0 1 2 3
http/.../Global-Induction-Hobs-30196623/ Product innovations and new designs Increasing demand for convenient home appliances with changes in lifestyle patterns Growing adoption of energy-efficient appliances
http/.../Global-Human-Capital-Management-30196628/ Demand for automated recruitment processes Increasing demand for unified solutions for all HR functions Increasing workforce diversity
http/.../Global-Probe-Card-30196643/ Growing investment in fabs Miniaturization of electronic products Increasing demand for IoT devices

But instead I get:

0 1 2 3 4 5 6
http/.../Global-Induction-Hobs-30196623/ Increasing demand for convenient home appliances with changes in lifestyle patterns Growing adoption of energy-efficient appliances
http/.../Global-Human-Capital-Management-30196628/ Increasing demand for unified solutions for all HR functions Increasing workforce diversity
http/.../Global-Probe-Card-30196643/ Miniaturization of electronic products Increasing demand for IoT devices

Store your data in a list of dicts and create a data frame from it. Then split each list of drivers / challenges into single columns and concat them to the final data frame.
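The pattern can be seen offline with made-up data (the `'d1'`/`'c1'` values below are illustrative stand-ins for the scraped bullet texts, not real results):

```python
import pandas as pd

# One dict per page/type pair; the scraped bullets stay together
# as a list under the 'list' key.
data = [
    {'url': 'http://example.com/a/', 'type': 'driver', 'list': ['d1', 'd2', 'd3']},
    {'url': 'http://example.com/a/', 'type': 'challenges', 'list': ['c1', 'c2', 'c3']},
]

df = pd.DataFrame(data)
# Expand each list into numbered columns (0, 1, 2) and place them
# next to the url/type columns, aligned on the row index.
wide = pd.concat([df[['url', 'type']], pd.DataFrame(df['list'].tolist())], axis=1)
```

Because `pd.DataFrame(df['list'].tolist())` builds its columns from list positions, every scraped item keeps its own column and nothing needs to be dropped afterwards.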

Example

import requests
from bs4 import BeautifulSoup
import pandas as pd

urls = ['https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/', 'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/', 'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/']
data = []

for url in urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    toc = soup.find("div", id="toc")

    def get_drivers():
        data.append({
            'url':url,
            'type':'driver',
            'list':[x.get_text(strip=True) for x in toc.select('li:-soup-contains-own("Market drivers") li')]
        })

    get_drivers()


    def get_challenges():
        data.append({
            'url':url,
            'type':'challenges',
            'list':[x.get_text(strip=True) for x in toc.select('li:-soup-contains-own("Market challenges") li') if x.get_text(strip=True) != 'Table Impact of drivers and challenges' ]
        })

    get_challenges()

    
df = pd.DataFrame(data)
result = pd.concat([df[['url', 'type']], pd.DataFrame(df['list'].tolist())], axis=1)

Output

url type 0 1 2
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/ driver Product innovations and new designs Increasing demand for convenient home appliances with changes in lifestyle patterns Growing adoption of energy-efficient appliances
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/ challenges High cost limiting the adoption in the mass segment Health hazards related to induction hobs Limitation of using only flat - surface utensils and induction-specific cookwareTable Impact of drivers and challenges
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/ driver Demand for automated recruitment processes Increasing demand for unified solutions for all HR functions Increasing workforce diversity
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/ challenges Threat from open-source software High implementation and maintenance cost Threat to data securityTable Impact of drivers and challenges
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/ driver Growing investment in fabs Miniaturization of electronic products Increasing demand for IoT devices
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/ challenges Rapid technological changes in semiconductor industry Volatility in semiconductor industry Impact of technology chasmTable Impact of drivers and challenges
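To get the two output files the question asks for (URL in column 0, one row per page), the combined frame can be filtered on `type` and written out separately. A sketch, assuming the final frame built by the answer's `pd.concat(...)` expression has been assigned to a variable named `result` (a hypothetical name; the example builds an illustrative frame in that shape rather than scraping):

```python
import pandas as pd

# Hypothetical combined frame in the shape the answer's concat
# produces (values are illustrative, not scraped).
result = pd.DataFrame({
    'url': ['http://example.com/a/', 'http://example.com/a/'],
    'type': ['driver', 'challenges'],
    0: ['d1', 'c1'],
    1: ['d2', 'c2'],
    2: ['d3', 'c3'],
})

# One file per type; dropping 'type' leaves the URL as column 0,
# followed by the drivers (or challenges) in the next columns.
result[result['type'] == 'driver'].drop(columns='type').to_csv('detail-dr.csv', index=False)
result[result['type'] == 'challenges'].drop(columns='type').to_csv('detail-ch.csv', index=False)
```

Writing the files once, after the loop, also avoids rewriting the growing CSV on every iteration as the original code does.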
