Getting URL from text file using BeautifulSoup
How do I get URLs from a .txt file for BeautifulSoup? I'm new to web scraping. I want to scrape multiple pages, and I need to pull those pages from a txt file.
import pandas as pd
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
chrome_driver_path = r'C:\chromedriver_win32\chromedriver.exe'
driver = webdriver.Chrome(executable_path=chrome_driver_path)
urls = r'C:\chromedriver_win32\asin.txt'
url = ('https://www.amazon.com/dp/'+urls)
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'lxml')
stock = soup.find(id='availability').get_text()
stok_kontrol = pd.DataFrame( { 'Url': [url], 'Stok Durumu': [stock] })
stok_kontrol.to_csv('stok-kontrol.csv', encoding='utf-8-sig')
print(stok_kontrol)
This notepad file contains Amazon ASIN numbers. The file is at:
C:\chromedriver_win32\asin.txt
Its contents are:
B00004SU18
B07L9178GQ
B01M35N6CZ
If I understand the question correctly, you just need to pass the ASIN numbers into the URL to tell BeautifulSoup what to scrape. That is just a simple file operation: loop through the file to get the numbers and pass each one to BeautifulSoup to scrape.
urls = r'C:\chromedriver_win32\asin.txt'
rows = []
with open(urls, 'r') as f:
    for line in f:
        # Strip the trailing newline, or the URL will be malformed
        url = 'https://www.amazon.com/dp/' + line.strip()
        driver.get(url)
        soup = BeautifulSoup(driver.page_source, 'lxml')
        stock = soup.find(id='availability').get_text()
        rows.append({'Url': url, 'Stok Durumu': stock})

# Build the DataFrame and write the CSV once, after the loop,
# so earlier rows are not overwritten on each iteration
stok_kontrol = pd.DataFrame(rows)
stok_kontrol.to_csv('stok-kontrol.csv', encoding='utf-8-sig')
print(stok_kontrol)
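Note that iterating over a file yields each line with its trailing newline attached, which is why the ASIN has to be stripped before being joined into the URL. A quick illustration, using an in-memory string in place of the real asin.txt:

```python
import io

# Stand-in for open(r'C:\chromedriver_win32\asin.txt') -- one ASIN per line
f = io.StringIO("B00004SU18\nB07L9178GQ\n")

urls = []
for line in f:
    # Each line from a file keeps its trailing '\n'; strip it before building the URL
    urls.append('https://www.amazon.com/dp/' + line.strip())

print(urls)
# ['https://www.amazon.com/dp/B00004SU18', 'https://www.amazon.com/dp/B07L9178GQ']
```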
This will get the product URL and whether the product is in stock, print that information to the console, and then save it to the file 'stok-kontrol.csv'.
Tested on: Python 3.7.4
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
import re

chrome_driver_path = r'C:\chromedriver_win32\chromedriver.exe'
driver = webdriver.Chrome(executable_path=chrome_driver_path)

# Gets whether the products in the array are in stock, from www.amazon.com
# Returns a list of dictionaries with keys ['asin', 'instock', 'url']
def IsProductsInStock(array_of_ASINs):
    results = []
    for asin in array_of_ASINs:
        url = 'https://www.amazon.com/dp/' + str(asin)
        driver.get(url)
        soup = BeautifulSoup(driver.page_source, 'lxml')
        stock = soup.find(id='availability').get_text().strip()
        isInStock = False
        if 'In Stock' in stock:
            # 'In Stock' is the text of the 'availability' element
            isInStock = True
        else:
            # If not, extract the number from it, if any, and see if it's in stock
            tmp = re.search(re.compile('[0-9]+'), stock)
            if tmp is not None and int(tmp[0]) > 0:
                isInStock = True
        results.append({"asin": asin, "instock": isInStock, "url": url})
    return results

# Saves the product information to 'toFile'
# Returns a pandas.core.frame.DataFrame object, with the product info ['url', 'instock'] as columns
# inStockDict MUST be either a dictionary, or a 'list' of dictionaries with ['asin', 'instock', 'url'] keys
def SaveProductInStockInformation(inStockDict, toFile):
    if isinstance(inStockDict, dict):
        stok_kontrol = pd.DataFrame({'Url': [inStockDict['url']], 'Stok Durumu': [inStockDict['instock']]})
    elif isinstance(inStockDict, list):
        stocksSimple = []
        for stock in inStockDict:
            stocksSimple.append([stock['url'], stock['instock']])
        stok_kontrol = pd.DataFrame(stocksSimple, columns=['Url', 'Stok Durumu'])
    else:
        raise Exception("inStockDict parm must be either a dictionary, or a 'list' of dictionaries with ['asin', 'instock', 'url'] keys!")
    stok_kontrol.to_csv(toFile, encoding='utf-8-sig')
    return stok_kontrol

# Get ASINs from the file
with open(r'C:\chromedriver_win32\asin.txt', 'r') as f:
    urls = f.read().split()

# Get a list of dictionaries containing all the product information
stocks = IsProductsInStock(urls)

# Save and print the ['url', 'instock'] information
print(SaveProductInStockInformation(stocks, 'stok-kontrol.csv'))

# Remove if you need to use the driver later on in the program
driver.close()
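The in-stock check above can be pulled out into a pure function and exercised without a browser. Here is a sketch of just that branch logic; the function name `parse_availability` is my own, not from the answer:

```python
import re

def parse_availability(stock_text):
    """Return True if Amazon's availability text indicates the product is in stock."""
    stock_text = stock_text.strip()
    if 'In Stock' in stock_text:
        return True
    # Otherwise look for a quantity, e.g. "Only 3 left in stock."
    match = re.search(r'[0-9]+', stock_text)
    return match is not None and int(match[0]) > 0

print(parse_availability('In Stock.'))              # True
print(parse_availability('Only 3 left in stock.'))  # True
print(parse_availability('Currently unavailable.')) # False
```

Separating the parsing from the Selenium calls also makes it easy to adjust when Amazon changes its availability wording.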
Result (file 'stok-kontrol.csv'):
,Url,Stok Durumu
0,https://www.amazon.com/dp/B00004SU18,True
1,https://www.amazon.com/dp/B07L9178GQ,True
2,https://www.amazon.com/dp/B01M35N6CZ,True
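The unnamed first column in the CSV above is the DataFrame's index. When reading the file back, it can be restored with `index_col=0`; a sketch using an in-memory copy of the CSV in place of the real file:

```python
import io
import pandas as pd

# In-memory stand-in for the 'stok-kontrol.csv' file shown above
csv_text = (",Url,Stok Durumu\n"
            "0,https://www.amazon.com/dp/B00004SU18,True\n"
            "1,https://www.amazon.com/dp/B07L9178GQ,True\n")

# index_col=0 treats the unnamed leading column as the row index
df = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(df.columns.tolist())        # ['Url', 'Stok Durumu']
print(df['Stok Durumu'].tolist()) # [True, True]
```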