
Getting URL from text file using BeautifulSoup

How can I get the URLs for BeautifulSoup from a .txt file? I'm new to web scraping. I want to scrape multiple pages, and I need to pull those pages from a txt file.

import pandas as pd
import requests
from bs4 import BeautifulSoup
from selenium import webdriver

chrome_driver_path = r'C:\chromedriver_win32\chromedriver.exe'
driver = webdriver.Chrome(executable_path=chrome_driver_path)


urls = r'C:\chromedriver_win32\asin.txt'
url = ('https://www.amazon.com/dp/'+urls)
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'lxml')

stock = soup.find(id='availability').get_text()

stok_kontrol = pd.DataFrame(  {  'Url': [url], 'Stok Durumu': [stock] })
stok_kontrol.to_csv('stok-kontrol.csv', encoding='utf-8-sig')


print(stok_kontrol)

This text file holds the Amazon ASIN numbers.

C:\chromedriver_win32\asin.txt

The file contains:

B00004SU18

B07L9178GQ

B01M35N6CZ

If I understand the question correctly, you just need to pass the ASIN numbers into the URL to tell BeautifulSoup what to scrape. That is just a simple file operation: loop through the file to get the numbers and pass each one to BeautifulSoup to scrape.

urls = r'C:\chromedriver_win32\asin.txt'
results = []
with open(urls, 'r') as f:
    for line in f:
        asin = line.strip()  # drop the trailing newline so the URL is valid
        if not asin:
            continue
        url = 'https://www.amazon.com/dp/' + asin
        driver.get(url)
        soup = BeautifulSoup(driver.page_source, 'lxml')
        stock = soup.find(id='availability').get_text().strip()
        results.append({'Url': url, 'Stok Durumu': stock})

# Build one DataFrame from all rows so every product ends up in the CSV,
# instead of overwriting the file on each pass through the loop.
stok_kontrol = pd.DataFrame(results)
stok_kontrol.to_csv('stok-kontrol.csv', encoding='utf-8-sig')

print(stok_kontrol)

This gets the product URL and whether the product is in stock, prints that information to the console, and then saves it to the file 'stok-kontrol.csv'.

Tested on: Python 3.7.4

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
import re

chrome_driver_path = r'C:\chromedriver_win32\chromedriver.exe'
driver = webdriver.Chrome(executable_path=chrome_driver_path)

# Gets whether the products in the array are in stock, from www.amazon.com
# Returns an Array of Dictionaries, with keys ['asin','instock','url']
def IsProductsInStock(array_of_ASINs):
    results = []
    for asin in array_of_ASINs:
        url = 'https://www.amazon.com/dp/'+str(asin)
        driver.get(url)
        soup = BeautifulSoup(driver.page_source, 'lxml')

        stock = soup.find(id='availability').get_text().strip()

        isInStock = False
        if('In Stock' in stock): 
            # If 'In Stock' is the text of 'availability' element
            isInStock=True
        else: 
            # If Not, extract the number from it, if any, and see if it's in stock.
            tmp = re.search(re.compile('[0-9]+'), stock)
            if( tmp is not None and int(tmp[0]) > 0):
                isInStock = True

        results.append({"asin": asin, "instock": isInStock, "url": url})
    return results

# Saves the product information to 'toFile'
# Returns a pandas.core.frame.DataFrame object, with the product info ['url', 'instock'] as columns
# inStockDict MUST be either a dictionary, or a 'list' of dictionaries with ['asin','instock','url'] keys
def SaveProductInStockInformation(inStockDict, toFile):
    if(isinstance(inStockDict, dict)):
        stok_kontrol = pd.DataFrame(  {  'Url': [inStockDict['url']], 'Stok Durumu': [inStockDict['instock']]  } )
    elif(isinstance(inStockDict, list)):
        stocksSimple = []
        for stock in inStockDict:
            stocksSimple.append([stock['url'], stock['instock']])
        stok_kontrol = pd.DataFrame(stocksSimple, columns=['Url', 'Stok Durumu'])
    else:
        raise Exception("inStockDict param must be either a dictionary, or a 'list' of dictionaries with ['asin','instock','url'] keys!")

    stok_kontrol.to_csv(toFile, encoding='utf-8-sig')
    return stok_kontrol

# Get ASINs From File
f = open(r'C:\chromedriver_win32\asin.txt','r')
urls = f.read().split()
f.close()

# Get a list of Dictionaries containing all the products information
stocks = IsProductsInStock(urls)

# Save and Print the ['url', 'instock'] information
print( SaveProductInStockInformation(stocks, 'stok-kontrol.csv') )


# Remove if you need to use the driver later on in the program
driver.close() 
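
One assumption in the listing above is that every product page actually contains an element with id 'availability'. When Amazon serves a captcha page or the listing has been removed, soup.find(id='availability') returns None and the chained .get_text() raises AttributeError. A minimal, defensive variant of that lookup might look like the sketch below; the helper name get_availability_text is mine, not part of the original answer.

def get_availability_text(page_source):
    # Hypothetical helper, not from the original answer: returns the availability
    # text if the element exists, or an empty string when the page has no
    # 'availability' element (captcha, removed listing, etc.).
    soup = BeautifulSoup(page_source, 'lxml')
    availability = soup.find(id='availability')
    if availability is None:
        return ''
    return availability.get_text().strip()

Inside IsProductsInStock you could then call stock = get_availability_text(driver.page_source) instead of chaining find() and get_text() directly.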

Result (the file 'stok-kontrol.csv'):

,Url,Stok Durumu
0,https://www.amazon.com/dp/B00004SU18,True
1,https://www.amazon.com/dp/B07L9178GQ,True
2,https://www.amazon.com/dp/B01M35N6CZ,True
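
If you later want to work with the saved report, for example to list only the products that are out of stock, the CSV can be read back with pandas. This is just a small sketch based on the file layout shown above; the variable names are mine.

import pandas as pd

# Read the report back; the first, unnamed column is the DataFrame index.
kontrol = pd.read_csv('stok-kontrol.csv', index_col=0)

# 'Stok Durumu' holds True/False values, so ~ selects the out-of-stock rows.
out_of_stock = kontrol[~kontrol['Stok Durumu']]
print(out_of_stock['Url'].tolist())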
