Reading list of URLs from .csv for scraping with Python, BeautifulSoup, Pandas
This was part of another question (see "Reading URLs from .csv and appending scrape results below previous with Python, BeautifulSoup, Pandas"), which was generously answered by @HedgeHog and contributed to by @QHarr.
In the code below, I'm just pasting 3 example source URLs into the code. But I have a long list of URLs (1000+) to scrape, and they are stored in the single first column of a .csv file (let's call it 'urllist.csv'). I would prefer to read them from that file.
I think I know the basic structure of `with open`, but I'm having problems linking that to the rest of the code. Your help will be highly appreciated.
import requests
from bs4 import BeautifulSoup
import pandas as pd

urls = ['https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/',
        'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/',
        'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/']

data = []

for url in urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    toc = soup.find("div", id="toc")

    def get_drivers():
        data.append({
            'url': url,
            'type': 'driver',
            'list': [x.get_text(strip=True) for x in toc.select('li:-soup-contains-own("Market drivers") li')]
        })
    get_drivers()

    def get_challenges():
        data.append({
            'url': url,
            'type': 'challenges',
            'list': [x.get_text(strip=True) for x in toc.select('li:-soup-contains-own("Market challenges") ul li') if
                     'Table Impact of drivers and challenges' not in x.get_text(strip=True)]
        })
    get_challenges()

pd.concat([pd.DataFrame(data)[['url', 'type']], pd.DataFrame(pd.DataFrame(data).list.tolist())],
          axis=1).to_csv('output.csv')
Since you're already using pandas, read_csv will do the trick for you: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
If you want to write it on your own, you could use the built-in csv library:
import csv

with open('urls.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        print(row["url"])