How to extract paragraph text in python using lxml from html file?
How to extract description from HTML paragraph using Python
I want to extract the HTML paragraph from the HTML source, but my code is also picking up the colour and ID data.
import requests
from bs4 import BeautifulSoup
url = "https://www.nike.com/gb/t/air-max-viva-shoe-ZQTSV8/DB5268-003"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
description = soup.find(
    'div', {'class': 'description-preview body-2 css-1pbvugb'}).text
print(description)
Just chain .find("p") after it.
description = soup.find('div', {'class':'description-preview body-2 css-1pbvugb'}).find("p").text
It looks like you want the text of the next <p>:
description = soup.find('div', {'class':'description-preview body-2 css-1pbvugb'}).find_next('p').text
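The difference between .find("p") and .find_next("p") can be seen on a small stand-in document. This is only a sketch: the HTML below is a hypothetical simplification of the product page, not its real markup, and html.parser is used so that it runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the product page markup.
html = """
<div class="description-preview body-2 css-1pbvugb">
  <p>Shoe description text.</p>
  <ul>
    <li>Colour Shown: Black</li>
    <li>Style: DB5268-003</li>
  </ul>
</div>
<p>Unrelated paragraph.</p>
"""

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", {"class": "description-preview body-2 css-1pbvugb"})

# .text on the div collects every descendant string, list items included.
whole = div.get_text(" ", strip=True)

# .find("p") searches only inside the div.
first_p = div.find("p").get_text(strip=True)

# .find_next("p") walks forward in document order, starting with the div's
# own descendants, so here it lands on the same paragraph.
next_p = div.find_next("p").get_text(strip=True)

# The two differ when the div contains no <p>: find() returns None,
# while find_next() keeps going and returns a paragraph outside the div.
empty = BeautifulSoup('<div class="d"></div><p>Outside.</p>', "html.parser").find("div")
inside = empty.find("p")
outside = empty.find_next("p").get_text()

print(whole)
print(first_p)
print(inside, outside)
```

On the assumption that the description is always the first paragraph inside the div, .find("p") is the stricter choice, since it cannot pick up text from elsewhere on the page.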
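If this is the only value you need from the page, you don't need a real parser in this case, since a parser loads the whole document into memory.
You can time the regex approach against the bs4 parser to compare them.
Here is a quick capture with regex: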
import re
import requests
r = requests.get(
    'https://www.nike.com/gb/t/air-max-viva-shoe-ZQTSV8/DB5268-003')
match = re.search(r'descriptionPreview\":\"(.+?)\"', r.text).group(1)
print(match)
Output:
Designed with every woman in mind, the mixed material upper of the Nike Air Max Viva
features a plush collar, detailed patterning and intricate stitching. The new lacing
system uses 2 separate laces constructed from heavy-duty tech chord, letting you find the perfect fit. Mixing comfort with style, it combines Nike Air with a lifted foam
heel for and unbelievable ride that looks as good as it feels.
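A slightly more defensive variant of the regex capture is sketched below. It assumes that the page embeds the description as a JSON string under the descriptionPreview key, as the pattern above implies. The page_source fragment and the extract_description helper are hypothetical and exist only for illustration. The pattern tolerates escaped quotes inside the value, and the function returns None instead of raising when the key is missing.

```python
import json
import re

# Hypothetical fragment standing in for the JSON embedded in the page source.
page_source = (
    '<script>{"descriptionPreview":"A \\"plush\\" collar \\u0026 foam heel.",'
    '"title":"Shoe"}</script>'
)

# The captured group is a JSON string body: a run of characters that are
# neither a quote nor a backslash, or a backslash followed by any character.
PATTERN = re.compile(r'"descriptionPreview":"((?:[^"\\]|\\.)*)"')


def extract_description(source):
    """Return the decoded description, or None if the key is not found."""
    match = PATTERN.search(source)
    if match is None:
        return None
    # Re-wrap the raw body in quotes so json.loads decodes the escapes.
    return json.loads('"' + match.group(1) + '"')


print(extract_description(page_source))
print(extract_description("<html></html>"))
```

With the lazy (.+?) pattern, a value containing an escaped quote would be cut off at that quote, which is why the character class is used here.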
If you want to use bs4, here is a short version:
soup = BeautifulSoup(r.text, 'lxml')
print(soup.select_one('.description-preview').p.string)
Note: I used the lxml parser because, according to the bs4 documentation, it is the fastest parser.
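One way to compare the operation time of the two approaches is with timeit on a local string, so that network latency does not affect the measurement. The sketch below uses a hypothetical page built from filler markup. The with_regex and with_bs4 helpers are illustrative names, and html.parser is used so that the code runs without lxml; pass "lxml" instead to time the parser used in this answer.

```python
import re
import timeit

from bs4 import BeautifulSoup

# Hypothetical page: filler markup, the target paragraph, and embedded JSON.
html = (
    "<html><body>"
    + "<div class='filler'><p>filler</p></div>" * 200
    + "<div class='description-preview body-2'><p>Target text.</p></div>"
    + '<script>{"descriptionPreview":"Target text."}</script>'
    + "</body></html>"
)


def with_regex():
    # Scans the raw text without building a document tree.
    return re.search(r'descriptionPreview":"(.+?)"', html).group(1)


def with_bs4():
    # Parses the whole document into a tree, then selects the paragraph.
    soup = BeautifulSoup(html, "html.parser")
    return soup.select_one(".description-preview").p.string


# Time 20 runs of each approach; absolute numbers depend on the machine.
regex_seconds = timeit.timeit(with_regex, number=20)
bs4_seconds = timeit.timeit(with_bs4, number=20)
print(f"regex: {regex_seconds:.4f}s  bs4: {bs4_seconds:.4f}s")
```

The numbers only indicate relative cost on one input; the regex is faster to run but breaks as soon as the embedded JSON changes shape.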