Web scraping non-table HTML content elements into a pandas table

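I need to scrape a website that has a paragraph-like "table", and I want to put it into a pandas table in Python. Here is the link to the website: (website link)

I need to get the name, price, and description from each page and put them all into a DataFrame. The problem is that I can scrape each of these individually, but I can't get them into a proper DataFrame.

Here is what I have done so far: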

I get the product links first because I need to scrape multiple pages:
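import requests
from bs4 import BeautifulSoup as soup  # "soup" is presumably this alias

# HEADER (a dict of request headers) is defined elsewhere and not shown here
baseURL = 'https://www.civivi.com'
product_links = []
for x in range(1, 3):  # listing pages 1 and 2
    # Pass headers by keyword: the second positional argument of requests.get is params
    HTML = requests.get(f'https://www.civivi.com/collections/all-products/price-range_-70?page={x}', headers=HEADER)
    #HTML.status_code
    Booti = soup(HTML.content, "lxml")
    knife_items = Booti.find_all('div', class_="product-list product-list--collection product-list--with-sidebar")

    for items in knife_items:
        for links in items.find_all('a', attrs={'class': 'product-item__image-wrapper product-item__image-wrapper--with-secondary'}, href=True):
            product_links.append(baseURL + links['href'])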

Then I scrape the individual web pages here:

Name = []
Price = []
Specific = []
for links in product_links:
    #testlinks = "https://www.civivi.com/collections/all-products/products/civivi-rustic-gent-lockback-knife-c914-g10-d2"
    # Pass headers by keyword: the second positional argument of requests.get is params
    HTML2 = requests.get(links, headers=HEADER)
    Booti2 = soup(HTML2.content, "html.parser")
    try:
        for N in Booti2.find_all('h1', {'class': "product-meta__title heading h1"}):
            Name.append(N.text.replace('\n', '').strip())
        for P in Booti2.find_all('span', {'class': "price"}):
            Price.append(P.text.replace('\n', '').strip())
        Contents = Booti2.find('div', class_="rte text--pull")
        for S in Contents.find_all('span'):
            Specific.append(S.text)
    except AttributeError:
        # Contents is None when the page has no description block
        continue

So I need to get all the information in this format:

| Name             | Price | Model Number | Model Name | Overall Length |
|------------------|-------|--------------|------------|----------------|
| Product Name 1   | $$    | XXXX         | ABC        | XX"/XXcm       |
| Product Name 2   | $$    | XXXX         | ABC        | XX"/XXcm       |
| Product Name 3   | $$    | XXXX         | ABC        | XX"/XXcm       |
| Product Name 4   | $$    | XXXX         | ABC        | XX"/XXcm       |

...and so on, with the rest of the columns from the web page. Any help would be greatly appreciated! Thank you so much!

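One option is to call find('p') on the element with class "rte text--pull", then call get_text with a newline ( \n ) as the separator argument. Then use the regular expressions below (or split the text variable, look for the keywords, and remove them from the string) to keep only the information you need. Each product becomes one entry in the list rows, and pd.DataFrame(rows) turns that list into a DataFrame.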

import re  # regex is used to pull the knife model and length out of the text
rows = []  # one list per product; each becomes a DataFrame row

for links in product_links:
    HTML2 = requests.get(links)
    Booti2 = soup(HTML2.content, "html.parser")
    try:
        name = Booti2.find('h1', {'class': "product-meta__title heading h1"}).get_text()
        price = Booti2.find('span', {'class': "price"}).get_text()
        Contents = Booti2.find('div', class_="rte text--pull")
        # Join the text nodes of the first <p> with newlines so each spec sits on its own line
        text = Contents.find('p').get_text(separator='\n')
        # Each pattern captures everything between the label and the next newline
        model_num = re.search('Model Number: (.+?)\n', text).group(1)
        model_name = re.search('Model Name: (.+?)\n', text).group(1)
        overall_len = re.search('Overall Length: (.+?)\n', text).group(1)
        rows.append([name, price, model_num, model_name, overall_len])
    except AttributeError:
        # find() or re.search() returned None: skip pages missing any of these fields
        continue
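
The "split the text" alternative mentioned above could look roughly like the sketch below. It is only an illustration: parse_specs is a hypothetical helper, and sample_text is a hand-written stand-in for what get_text(separator='\n') might return, assuming each spec sits on its own "Label: value" line (the values are taken from the first row of the output shown in this answer).

```python
def parse_specs(text):
    """Split 'Label: value' lines into a dict (hypothetical helper)."""
    specs = {}
    for line in text.splitlines():
        # partition splits at the first colon only
        key, sep, value = line.partition(':')
        # keep the line only if it has a colon and a non-empty value
        if sep and value.strip():
            specs[key.strip()] = value.strip()
    return specs


# Assumed stand-in for the text of the description paragraph
sample_text = 'Model Number: C20076-1\nModel Name: Altus\nOverall Length: 7.12" / 180.8mm'
specs = parse_specs(sample_text)
print(specs)
```

Unlike the regular expressions, which require a newline after each value, this also picks up a spec on the last line of the text, and it collects every labelled line rather than only three named ones.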

If you haven't already, import pandas as pd:

import pandas as pd
df = pd.DataFrame(rows, columns=['name', 'price', 'model_num', 'model_name', 'overall_len'])
print(df)
                            name    price    model_num              model_name      overall_len
0   CIVIVI Altus Button Lock a...      $85     C20076-1                   Altus  7.12" / 180.8mm
1   CIVIVI Altus Button Lock a...      $90     C20076-3                   Altus  7.12" / 180.8mm
2   CIVIVI Altus Button Lock a...     $107   C20076-DS1                   Altus  7.12" / 180.8mm
3   CIVIVI Teton Tickler Fixed...  $258.50     C20072-1           Teton Tickler   10.16" / 258mm
4   CIVIVI Nox Flipper Knife G...   $76.50       C2110C                     NOx  6.80" / 172.7mm
...
...
40  CIVIVI Ortis Flipper Knife...     $105    C2013DS-1                   Ortis    7.48" / 190mm
41  CIVIVI Dogma Flipper Knife...   $79.50       C2014A                   Dogma   7.7" / 195.7mm
42  CIVIVI Dogma Flipper Knife...   $79.50       C2014B                   Dogma   7.7" / 195.7mm
43  CIVIVI Appalachian Drifter...      $98       C2015A     Appalachian Drifter   6.8" / 172.7mm
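
To pick up the rest of the columns from the page, as the question asks, one possible variation is to collect a dict per product instead of a list. This is a sketch under assumptions: the product names and the records list are made up for illustration, and the spec labels are assumed to match those used above. pd.DataFrame accepts a list of dicts and fills keys missing from a record with NaN.

```python
import pandas as pd

# Hypothetical records: one dict per product, keyed by spec label
records = [
    {'name': 'Knife A', 'price': '$85', 'Model Number': 'C20076-1',
     'Overall Length': '7.12" / 180.8mm'},
    # This product lacks an 'Overall Length' entry on purpose
    {'name': 'Knife B', 'price': '$90', 'Model Number': 'C20076-3'},
]

# Columns are the union of the dict keys; missing values become NaN
df_all = pd.DataFrame(records)
print(df_all)
```

With this layout a page that is missing one spec still yields a row, rather than being skipped entirely.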
