Unable to scrape description of webpage in Scrapy
I want to scrape data with Scrapy; this is the link
I use this code to get the description:
'description': response.css('#tab-description p::text').extract(),
But the response is
'description': [' ', 'None ', ' ', 'Unisex ', ' ', '12-15 Years ', ' ', 'Grownups ', ' ', 'Paper ', ' ', 'Landscape ', ' ', 'SMW783 ']
It ignores the <strong> and <br> tags, so the labels inside <strong> are missing.
I need output like this:
<p> <strong>Brand Name: </strong>None <br> <strong>Gender: </strong>Unisex <br> <strong>Age Range: </strong>12-15 Years <br> <strong>Age Range: </strong>Grownups <br> <strong>Material: </strong>Paper <br> <strong>Style: </strong>Landscape <br> <strong>Model Number: </strong>SMW783 </p>
You can try using XPath.
I tested this and it seems to work:
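for element in response.xpath("//div[@id='tab-description']/p"):
    # Text nodes that are direct children of <p> (the values)
    values = element.xpath("./text()").getall()
    # Text inside each <strong> child (the labels)
    labels = element.xpath("./strong/text()").getall()
    # Drop whitespace-only text nodes
    values = [v for v in values if v.strip()]
    # Note: a repeated label (e.g. "Age Range: ") keeps only its last value
    data = dict(zip(labels, values))
    print(data)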
Output: {'Brand Name: ': 'None ', 'Gender: ': 'Unisex ', 'Age Range: ': 'Grownups ', 'Material: ': 'Paper ', 'Style: ': 'Landscape ', 'Model Number: ': 'SMW783 '}
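Note that the output above contains only one 'Age Range: ' entry, because a dict cannot hold the same key twice and '12-15 Years ' gets overwritten. If both values matter, one option is to collect the values for each label in a list. The following is only a sketch: pair_labels is a hypothetical helper name, and the labels and values lists are typed in by hand from the output shown in the question rather than scraped.

```python
from collections import defaultdict


def pair_labels(labels, values):
    """Group each value under its cleaned-up label, keeping duplicates."""
    grouped = defaultdict(list)
    for label, value in zip(labels, values):
        # "Age Range: " -> "Age Range"
        key = label.strip().rstrip(":")
        grouped[key].append(value.strip())
    return dict(grouped)


# Sample data copied by hand from the question (an assumption, not scraped here)
labels = ["Brand Name: ", "Gender: ", "Age Range: ", "Age Range: ",
          "Material: ", "Style: ", "Model Number: "]
raw_values = [" ", "None ", " ", "Unisex ", " ", "12-15 Years ", " ",
              "Grownups ", " ", "Paper ", " ", "Landscape ", " ", "SMW783 "]

# Drop whitespace-only text nodes, as in the answer above
values = [v for v in raw_values if v.strip()]

data = pair_labels(labels, values)
print(data)
```

With this shape, data["Age Range"] holds both values instead of only the last one.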
a = response.xpath("//div[@id='tab-description']/p").get()
print(a)
Output:
<p> <strong>Brand Name: </strong>None <br> <strong>Gender: </strong>Unisex <br> <strong>Age Range: </strong>12-15 Years <br> <strong>Age Range: </strong>Grownups <br> <strong>Material: </strong>Paper <br> <strong>Style: </strong>Landscape <br> <strong>Model Number: </strong>SMW783 </p>
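When /text() in XPath or ::text in CSS does not produce the result I want, I use another library, html2text, which converts an HTML fragment to plain text.
Install it: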
pip3 install html2text
Example:
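from html2text import HTML2Text

h = HTML2Text()
h.ignore_links = True     # keep link text, drop the link markup
h.ignore_images = True    # skip images
h.ignore_emphasis = True  # no bold/italic markers around <strong>/<em> text

# Inside the Scrapy spider's parse() method
yield {
    'description': h.handle(response.css('#tab-description p').get()).strip(),
    # ... other fields
}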
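If adding a dependency is not an option, roughly the same idea can be sketched with the standard library's html.parser. This is not html2text and does not reproduce its output format; FragmentToText and fragment_to_text are hypothetical names, and the sample fragment is a shortened copy of the markup from the question.

```python
from html.parser import HTMLParser


class FragmentToText(HTMLParser):
    """Collect the text of an HTML fragment, turning <br> into newlines."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Treat each <br> as a line break; other tags are simply dropped
        if tag == "br":
            self.parts.append("\n")

    def handle_data(self, data):
        # Text inside <strong> and directly inside <p> both end up here
        self.parts.append(data)

    def text(self):
        lines = "".join(self.parts).split("\n")
        # Collapse runs of whitespace in each line and drop empty lines
        cleaned = [" ".join(line.split()) for line in lines]
        return "\n".join(line for line in cleaned if line)


def fragment_to_text(fragment):
    parser = FragmentToText()
    parser.feed(fragment)
    parser.close()
    return parser.text()


# Shortened sample of the markup shown in the question
sample = (
    "<p> <strong>Brand Name: </strong>None <br> "
    "<strong>Gender: </strong>Unisex <br> "
    "<strong>Model Number: </strong>SMW783 </p>"
)
print(fragment_to_text(sample))
```

In a spider, the string returned by response.css('#tab-description p').get() would be passed to fragment_to_text in place of sample.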
I tried this and it works:
import scrapy
from bs4 import BeautifulSoup


class Google(scrapy.Spider):
    name = 'google'
    start_urls = ['https://bbdealz.com/product/funny-sports-game-2m-3m-4m-5m-6m-diameter-outdoor-rainbow-umbrella-parachute-toy-jump-sack-ballute-play-game-mat-toy-kids-gift/',]

    def parse(self, response):
        # Parse the raw response body with BeautifulSoup instead of Scrapy selectors
        soup = BeautifulSoup(response.body, 'html.parser')
        yield {
            'title': response.css('h1::text').get(),
            # str() turns the bs4 Tag into its HTML markup so the item can be exported
            'description': str(soup.select_one('#tab-description').select_one('p')),
        }
Regards