Web scraping non-table HTML content elements into a pandas table

Question

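I need to scrape a website that has a paragraph-like "table", and I want to put it into a pandas table in Python. This is the website link: website link

I need to get the name, price, and description from each page and put them all into a DataFrame. The problem is that I can scrape each of these individually, but I can't get them into a proper DataFrame.

Here is what I have done so far: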

I get the product links first because I need to scrape multiple pages:
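import requests
from bs4 import BeautifulSoup as soup

# HEADER was not shown in the original post; a browser-like User-Agent is assumed here.
HEADER = {'User-Agent': 'Mozilla/5.0'}
baseURL = 'https://www.civivi.com'
product_links = []
for x in range(1, 3):
    # headers must be passed by keyword; the second positional argument of requests.get is params
    HTML = requests.get(f'https://www.civivi.com/collections/all-products/price-range_-70?page={x}', headers=HEADER)
    #HTML.status_code
    Booti = soup(HTML.content, "lxml")
    knife_items = Booti.find_all('div', class_="product-list product-list--collection product-list--with-sidebar")

    for items in knife_items:
        for links in items.find_all('a', attrs={'class': 'product-item__image-wrapper product-item__image-wrapper--with-secondary'}, href=True):
            product_links.append(baseURL + links['href'])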

Then I scrape the individual web pages here:

Name = []
Price = []
Specific = []
for links in product_links:
    #testlinks = "https://www.civivi.com/collections/all-products/products/civivi-rustic-gent-lockback-knife-c914-g10-d2"
    HTML2 = requests.get(links, headers=HEADER)
    Booti2 = soup(HTML2.content, "html.parser")
    try:
        for N in Booti2.find_all('h1', {'class': "product-meta__title heading h1"}):
            Name.append(N.text.replace('\n', '').strip())
        for P in Booti2.find_all('span', {'class': "price"}):
            Price.append(P.text.replace('\n', '').strip())
        Contents = Booti2.find('div', class_="rte text--pull")
        # Every span is appended separately, so Specific does not line up with Name and Price
        for S in Contents.find_all('span'):
            Specific.append(S.text)

    except AttributeError:
        # Contents is None when the page has no description block
        continue

So I need to get all the information in this format:

| Name             | Price | Model Number | Model Name | Overall Length |
|------------------|-------|--------------|------------|----------------|
| Product Name 1   | $$    | XXXX         | ABC        | XX"/XXcm       |
| Product Name 2   | $$    | XXXX         | ABC        | XX"/XXcm       |
| Product Name 3   | $$    | XXXX         | ABC        | XX"/XXcm       |
| Product Name 4   | $$    | XXXX         | ABC        | XX"/XXcm       |

...and so on, with the rest of the columns from the web page. Any help would be greatly appreciated! Thank you so much!

Answer 1

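One option is to call find('p') on the element with class "rte text--pull", then call get_text with a separator ('\n') as the argument. After that, use the regular expressions below (or split the text variable, look for the keywords, and strip them from the string) to get only the information you need. With the list rows, you can create the DataFrame using pd.DataFrame(rows).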

import re # import regex to get knife model and length
rows = [] # create list to hold dataframe rows

# requests, soup and product_links come from the question's code
for links in product_links:
    HTML2 = requests.get(links)
    Booti2 = soup(HTML2.content, "html.parser")
    try:
        name = Booti2.find('h1', {'class': "product-meta__title heading h1"}).get_text()
        price = Booti2.find('span', {'class': "price"}).get_text()
        Contents = Booti2.find('div', class_="rte text--pull")
        # one line of text per field, e.g. "Model Number: ..."
        text = Contents.find('p').get_text(separator='\n')
        model_num = re.search(r'Model Number: (.+?)\n', text).group(1)
        model_name = re.search(r'Model Name: (.+?)\n', text).group(1)
        overall_len = re.search(r'Overall Length: (.+?)\n', text).group(1)
        rows.append([name, price, model_num, model_name, overall_len])
    except AttributeError:
        # find() or re.search() returned None: skip pages missing an element or field
        continue

If you haven't already, import pandas as pd.

import pandas as pd
df = pd.DataFrame(rows, columns=['name', 'price', 'model_num', 'model_name', 'overall_len'])
print(df)

                            name    price    model_num              model_name      overall_len
0   CIVIVI Altus Button Lock a...      $85     C20076-1                   Altus  7.12" / 180.8mm
1   CIVIVI Altus Button Lock a...      $90     C20076-3                   Altus  7.12" / 180.8mm
2   CIVIVI Altus Button Lock a...     $107   C20076-DS1                   Altus  7.12" / 180.8mm
3   CIVIVI Teton Tickler Fixed...  $258.50     C20072-1           Teton Tickler   10.16" / 258mm
4   CIVIVI Nox Flipper Knife G...   $76.50       C2110C                     NOx  6.80" / 172.7mm
...
...
40  CIVIVI Ortis Flipper Knife...     $105    C2013DS-1                   Ortis    7.48" / 190mm
41  CIVIVI Dogma Flipper Knife...   $79.50       C2014A                   Dogma   7.7" / 195.7mm
42  CIVIVI Dogma Flipper Knife...   $79.50       C2014B                   Dogma   7.7" / 195.7mm
43  CIVIVI Appalachian Drifter...      $98       C2015A     Appalachian Drifter   6.8" / 172.7mm
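
The answer also mentions splitting the text variable instead of using regular expressions. A minimal sketch of that alternative follows, which may help with "the rest of the columns" from the question because it collects every "Key: Value" line rather than three named ones. SAMPLE_TEXT, parse_specs and build_row are hypothetical names introduced for illustration, and the sample string is only an assumption about what get_text(separator='\n') might return; the real page text may differ.

```python
# Hypothetical sample of what get_text(separator='\n') might return for
# one product page; the real page text may differ.
SAMPLE_TEXT = 'Model Number: C2014A\nModel Name: Dogma\nOverall Length: 7.7" / 195.7mm\nBlade Material'


def parse_specs(text):
    """Collect every 'Key: Value' line of the text into a dict."""
    specs = {}
    for line in text.split("\n"):
        key, sep, value = line.partition(":")  # split on the first colon only
        key, value = key.strip(), value.strip()
        if sep and key and value:  # ignore lines without a colon or with an empty side
            specs[key] = value
    return specs


def build_row(name, price, text):
    """Merge the fixed fields with whatever spec fields the text contains."""
    return {"name": name, "price": price, **parse_specs(text)}


row = build_row("Example knife", "$79.50", SAMPLE_TEXT)
print(row)
```

Appending one such dict per product to rows and then calling pd.DataFrame(rows) gives one column per key; a product that lacks a field gets NaN in that column instead of being skipped.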
