Web scraping non-table HTML content into a pandas table
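I need to scrape a website that has a paragraph-like "table", and I want to put it into a pandas table in Python. Here is the link to the website: (website link)
I need to get the name, price, and description from each page and put them all into a DataFrame. The problem is that I can scrape each of these individually, but I can't get them into a proper DataFrame.
Here is what I have done so far: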
I get the product links first because I need to scrape multiple pages:
import requests
from bs4 import BeautifulSoup as soup  # assumed alias: the code calls BeautifulSoup as `soup`

baseURL = 'https://www.civivi.com'
product_links = []
# HEADER is assumed to be a dict of request headers defined elsewhere
for x in range(1, 3):
    # pass the headers by keyword; the second positional argument of requests.get is `params`
    HTML = requests.get(f'https://www.civivi.com/collections/all-products/price-range_-70?page={x}', headers=HEADER)
    #HTML.status_code
    Booti = soup(HTML.content, "lxml")
    knife_items = Booti.find_all('div', class_="product-list product-list--collection product-list--with-sidebar")
    for items in knife_items:
        for links in items.findAll('a', attrs={'class': 'product-item__image-wrapper product-item__image-wrapper--with-secondary'}, href=True):
            product_links.append(baseURL + links['href'])
Then I scrape the individual web pages here:
Name = []
Price = []
Specific = []
for links in product_links:
    #testlinks = "https://www.civivi.com/collections/all-products/products/civivi-rustic-gent-lockback-knife-c914-g10-d2"
    # pass the headers by keyword; the second positional argument of requests.get is `params`
    HTML2 = requests.get(links, headers=HEADER)
    Booti2 = soup(HTML2.content, "html.parser")
    try:
        for N in Booti2.findAll('h1', {'class': "product-meta__title heading h1"}):
            Name.append(N.text.replace('\n', '').strip())
        for P in Booti2.findAll('span', {'class': "price"}):
            Price.append(P.text.replace('\n', '').strip())
        Contents = Booti2.find('div', class_="rte text--pull")
        for S in Contents.find_all('span'):
            Specific.append(S.text)
    except AttributeError:
        # Contents is None when a page has no "rte text--pull" block
        continue
So I need to get all the information in this format:
| Name           | Price | Model Number | Model Name | Overall Length |
|----------------|-------|--------------|------------|----------------|
| Product Name 1 | $$    | XXXX         | ABC        | XX"/XXcm       |
| Product Name 2 | $$    | XXXX         | ABC        | XX"/XXcm       |
| Product Name 3 | $$    | XXXX         | ABC        | XX"/XXcm       |
| Product Name 4 | $$    | XXXX         | ABC        | XX"/XXcm       |
...and so on, with the rest of the columns from the web page. Any help would be greatly appreciated! Thank you so much!
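One option is to call find('p') on the element with class "rte text--pull", then call get_text with a separator (\n) as the argument. Then use the regular expressions below (or split the text variable, look for the keywords, and strip them from the string) to keep only the information you need. With the list rows, you can create the DataFrame using pd.DataFrame(rows).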
import re  # regular expressions, used to pull out the knife model and length
rows = []  # list that holds one entry per DataFrame row
for links in product_links:
    HTML2 = requests.get(links)
    Booti2 = soup(HTML2.content, "html.parser")
    try:
        name = Booti2.find('h1', {'class': "product-meta__title heading h1"}).get_text()
        price = Booti2.find('span', {'class': "price"}).get_text()
        Contents = Booti2.find('div', class_="rte text--pull")
        # one spec per line, e.g. "Model Number: ..."
        text = Contents.find('p').get_text(separator='\n')
        model_num = re.search(r'Model Number: (.+?)\n', text).group(1)
        model_name = re.search(r'Model Name: (.+?)\n', text).group(1)
        overall_len = re.search(r'Overall Length: (.+?)\n', text).group(1)
        rows.append([name, price, model_num, model_name, overall_len])
    except AttributeError:
        # find() or re.search() returned None: skip pages missing an expected element or field
        continue
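One caveat with the patterns above: each one requires a trailing \n, so a field that happens to be the last line of the paragraph would not match, and the whole page would be skipped. The following is a minimal sketch of a more forgiving lookup, not part of the original answer. The helper name extract_field and the sample text are hypothetical, and the real page text may be laid out differently.

```python
import re

def extract_field(label, text):
    """Return the value after 'label:' on its line, or None if the label is absent."""
    # '.' does not match a newline by default, so '(.+)' stops at the end of
    # the line and also matches when the field is the last line of the text.
    match = re.search(rf'{re.escape(label)}:[ \t]*(.+)', text)
    return match.group(1).strip() if match else None

# Hypothetical sample of what get_text(separator='\n') might return.
sample_text = 'Model Number: C20076-1\nModel Name: Altus\nOverall Length: 7.12" / 180.8mm'

model_num = extract_field('Model Number', sample_text)
overall_len = extract_field('Overall Length', sample_text)
blade_len = extract_field('Blade Length', sample_text)  # not in the text, so None
```

Because a missing field comes back as None instead of raising, a row could still be appended with a gap in one column instead of being dropped entirely.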
If you haven't already, import pandas as pd.
import pandas as pd
df = pd.DataFrame(rows, columns=['name', 'price', 'model_num', 'model_name', 'overall_len'])
print(df)
    name                           price    model_num   model_name           overall_len
0   CIVIVI Altus Button Lock a...  $85      C20076-1    Altus                7.12" / 180.8mm
1   CIVIVI Altus Button Lock a...  $90      C20076-3    Altus                7.12" / 180.8mm
2   CIVIVI Altus Button Lock a...  $107     C20076-DS1  Altus                7.12" / 180.8mm
3   CIVIVI Teton Tickler Fixed...  $258.50  C20072-1    Teton Tickler        10.16" / 258mm
4   CIVIVI Nox Flipper Knife G...  $76.50   C2110C      NOx                  6.80" / 172.7mm
...
40  CIVIVI Ortis Flipper Knife...  $105     C2013DS-1   Ortis                7.48" / 190mm
41  CIVIVI Dogma Flipper Knife...  $79.50   C2014A      Dogma                7.7" / 195.7mm
42  CIVIVI Dogma Flipper Knife...  $79.50   C2014B      Dogma                7.7" / 195.7mm
43  CIVIVI Appalachian Drifter...  $98      C2015A      Appalachian Drifter  6.8" / 172.7mm
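The question also asks for the rest of the columns from the page. One way to get them without writing a regex per field is the splitting approach mentioned earlier: treat every "Label: value" line as a column. The sketch below is an illustration under assumptions, not the answer's code. The function parse_specs, the sample text, and the "Example Field" entry are hypothetical, and it assumes each spec sits on its own line in the text returned by get_text(separator='\n').

```python
def parse_specs(text):
    """Turn 'Label: value' lines into a dict; lines without a value are skipped."""
    specs = {}
    for line in text.splitlines():
        # split at the first colon only, so values may contain colons themselves
        label, sep, value = line.partition(':')
        if sep and value.strip():
            specs[label.strip()] = value.strip()
    return specs

# Hypothetical sample text; "Example Field" stands in for any extra spec line.
sample_text = (
    'Model Number: C20076-1\n'
    'Model Name: Altus\n'
    'Overall Length: 7.12" / 180.8mm\n'
    'Example Field: example value'
)

# One record per product page: the fixed fields first, then every parsed spec.
record = {'name': 'Example knife', 'price': '$85', **parse_specs(sample_text)}
records = [record]
```

A list of such dicts can be passed straight to pd.DataFrame(records). pandas uses the union of the keys as columns and fills in NaN where a product lacks a field.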