Reading URLs from .csv and appending scrape results below previous with Python, BeautifulSoup, Pandas
I got this code to almost work, despite much ignorance. Please help on the home run!
I have a long list of URLs (1000+) to read from, and they are in a single column in a .csv file. I would prefer to read them from that file rather than paste them into the code, as below.
The source pages actually have 3 drivers and 3 challenges each. In a separate Python file, the code below finds, prints and saves all 3, but not when I use the dataframe approach below (it only saves 2).
I want the output (both files) to have URLs in column 0, and then drivers (or challenges) in the following columns. But what I've written here (probably the 'drop') makes them not only drop one row but also move across 2 columns.
At the end I'm showing both the inputs and the current and desired output. Sorry for the long question. I'll be very grateful for any help!
import requests
from bs4 import BeautifulSoup
import pandas as pd

urls = ['https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/', 'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/', 'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/']
dataframes = []
dataframes2 = []

for url in urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    toc = soup.find("div", id="toc")

    def get_drivers():
        data = []
        for x in toc.select('li:-soup-contains-own("Market drivers") li'):
            data.append(x.get_text(strip=True))
        df = pd.DataFrame(data, columns=[url])
        dataframes.append(pd.DataFrame(df).drop(0, axis=0))
        df2 = pd.concat(dataframes)
        tdata = df2.T
        tdata.to_csv(f'detail-dr.csv', header=True)

    get_drivers()

    def get_challenges():
        data = []
        for y in toc.select('li:-soup-contains-own("Market challenges") li'):
            data.append(y.get_text(strip=True).replace('Table Impact of drivers and challenges', ''))
        df = pd.DataFrame(data, columns=[url])
        dataframes2.append(pd.DataFrame(df).drop(0, axis=0))
        df2 = pd.concat(dataframes2)
        tdata = df2.T
        tdata.to_csv(f'detail-ch.csv', header=True)

    get_challenges()
The inputs look like this in each URL. They are just lists:
Market drivers
Market challenges
My desired output for drivers is:

| 0 | 1 | 2 | 3 |
|---|---|---|---|
| http/.../Global-Induction-Hobs-30196623/ | Product innovations and new designs | Increasing demand for convenient home appliances with changes in lifestyle patterns | Growing adoption of energy-efficient appliances |
| http/.../Global-Human-Capital-Management-30196628/ | Demand for automated recruitment processes | Increasing demand for unified solutions for all HR functions | Increasing workforce diversity |
| http/.../Global-Probe-Card-30196643/ | Growing investment in fabs | Miniaturization of electronic products | Increasing demand for IoT devices |
But instead I get:

| 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| http/.../Global-Induction-Hobs-30196623/ | Increasing demand for convenient home appliances with changes in lifestyle patterns | Growing adoption of energy-efficient appliances | | | | |
| http/.../Global-Human-Capital-Management-30196628/ | | | Increasing demand for unified solutions for all HR functions | Increasing workforce diversity | | |
| http/.../Global-Probe-Card-30196643/ | | | | | Miniaturization of electronic products | Increasing demand for IoT devices |
Store your data in a list of dicts and create a data frame from it. Then split the list of drivers / challenges into single columns and concat them to the final data frame.
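The split step described above can be tried offline before any scraping. This is a minimal sketch, assuming made-up records: the `example.com` URLs and the item strings are placeholders, not scraped data.

```python
import pandas as pd

# Hypothetical records in the shape described above: one dict per URL and type.
records = [
    {"url": "https://example.com/a", "type": "driver", "list": ["d1", "d2", "d3"]},
    {"url": "https://example.com/a", "type": "challenges", "list": ["c1", "c2"]},
]

# Keep the scalar fields as they are.
meta = pd.DataFrame(records)[["url", "type"]]

# Each inner list becomes one row; shorter lists are padded with missing values.
items = pd.DataFrame([r["list"] for r in records])

# Place the item columns next to the url/type columns.
wide = pd.concat([meta, items], axis=1)
print(wide)
```

Because every record stays on its own row, the items of different URLs share the same columns and cannot drift sideways.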
import requests
from bs4 import BeautifulSoup
import pandas as pd

urls = ['https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/', 'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/', 'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/']
data = []

for url in urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    toc = soup.find("div", id="toc")

    def get_drivers():
        data.append({
            'url':url,
            'type':'driver',
            'list':[x.get_text(strip=True) for x in toc.select('li:-soup-contains-own("Market drivers") li')]
        })

    get_drivers()

    def get_challenges():
        data.append({
            'url':url,
            'type':'challenges',
            'list':[x.get_text(strip=True) for x in toc.select('li:-soup-contains-own("Market challenges") li') if x.get_text(strip=True) != 'Table Impact of drivers and challenges' ]
        })

    get_challenges()

pd.concat([pd.DataFrame(data)[['url','type']], pd.DataFrame(pd.DataFrame(data).list.tolist())],axis = 1)
| url | type | 0 | 1 | 2 |
|---|---|---|---|---|
| https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/ | driver | Product innovations and new designs | Increasing demand for convenient home appliances with changes in lifestyle patterns | Growing adoption of energy-efficient appliances |
| https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/ | challenges | High cost limiting the adoption in the mass segment | Health hazards related to induction hobs | Limitation of using only flat - surface utensils and induction-specific cookwareTable Impact of drivers and challenges |
| https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/ | driver | Demand for automated recruitment processes | Increasing demand for unified solutions for all HR functions | Increasing workforce diversity |
| https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/ | challenges | Threat from open-source software | High implementation and maintenance cost | Threat to data securityTable Impact of drivers and challenges |
| https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/ | driver | Growing investment in fabs | Miniaturization of electronic products | Increasing demand for IoT devices |
| https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/ | challenges | Rapid technological changes in semiconductor industry | Volatility in semiconductor industry | Impact of technology chasmTable Impact of drivers and challenges |
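
Two loose ends remain relative to the question. The last challenge in each row still ends with "Table Impact of drivers and challenges", because a `!=` comparison only drops items that equal that string exactly. The question also asked for URLs read from a .csv and for one output file per type with the URL in column 0. The following is a rough sketch of one way to cover these, assuming `data` has the shape built in the loop above and that the URL file has a single column with no header row. The helper names `load_urls`, `clean` and `to_wide` are hypothetical, and the sample `data` is made up so the sketch runs without network access.

```python
import pandas as pd

SUFFIX = "Table Impact of drivers and challenges"

def load_urls(path):
    """Read URLs from a single-column CSV (assumed to have no header row)."""
    return pd.read_csv(path, header=None)[0].dropna().tolist()

def clean(items):
    """Strip the trailing table caption and drop items left empty."""
    stripped = (s.replace(SUFFIX, "").strip() for s in items)
    return [s for s in stripped if s]

def to_wide(data, kind):
    """One row per URL: URL in the first column, then one item per column."""
    rows = [[d["url"]] + clean(d["list"]) for d in data if d["type"] == kind]
    return pd.DataFrame(rows)

# Made-up stand-in for the scraped `data` list (placeholder URL and items).
data = [
    {"url": "https://example.com/a", "type": "driver",
     "list": ["d1", "d2", "d3"]},
    {"url": "https://example.com/a", "type": "challenges",
     "list": ["c1", "c2", "c3" + SUFFIX]},
]

drivers = to_wide(data, "driver")
challenges = to_wide(data, "challenges")

# index=False keeps the URL as the first column of each file.
drivers.to_csv("detail-dr.csv", index=False)
challenges.to_csv("detail-ch.csv", index=False)
```

With the real scraper, something like `urls = load_urls("urls.csv")` could replace the hard-coded list; the file name here is a placeholder.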