Unable to scrape a web page's description in Scrapy

Question

I want to scrape data with Scrapy. This is the link:

https://bbdealz.com/product/1000pcs-jigsaw-puzzle-7550cm-with-storage-bag-wooden-paper-puzzles-educational-toys-for-children-bedroom-decoration-stickers/

I use this code to get the description:

'description': response.css('#tab-description p::text').extract(),

But the response is:

'description': ['    ', 'None  ', '  ', 'Unisex  ', '  ', '12-15 Years  ', '  ', 'Grownups  ', '  ', 'Paper  ', '  ', 'Landscape  ', '  ', 'SMW783   ']

It ignores the <strong> and <br> tags.

I need output like this:

<p>    <strong>Brand Name: </strong>None  <br>  <strong>Gender: </strong>Unisex  <br>  <strong>Age Range: </strong>12-15 Years  <br>  <strong>Age Range: </strong>Grownups  <br>  <strong>Material: </strong>Paper  <br>  <strong>Style: </strong>Landscape  <br>  <strong>Model Number: </strong>SMW783   </p>

Answer 1

You can try using XPath.

I tested this and it seems to work:

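for element in response.xpath("//div[@id='tab-description']/p"):
    # Text nodes that are direct children of <p> hold the values
    values = element.xpath("./text()").getall()
    # Text inside each <strong> holds the labels
    labels = element.xpath("./strong/text()").getall()
    # Drop whitespace-only text nodes so the values line up with the labels
    values = [v for v in values if v.strip()]
    data = dict(zip(labels, values))
    print(data)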

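Output: {'Brand Name: ': 'None ', 'Gender: ': 'Unisex ', 'Age Range: ': 'Grownups ', 'Material: ': 'Paper ', 'Style: ': 'Landscape ', 'Model Number: ': 'SMW783 '}

Note that 'Age Range: ' appears twice in the page, so the dictionary keeps only the last value ('Grownups ') and '12-15 Years' is lost.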

To get just the HTML:

a = response.xpath("//div[@id='tab-description']/p").get()
print(a)

Output:

<p>    <strong>Brand Name: </strong>None  <br>  <strong>Gender: </strong>Unisex  <br>  <strong>Age Range: </strong>12-15 Years  <br>  <strong>Age Range: </strong>Grownups  <br>  <strong>Material: </strong>Paper  <br>  <strong>Style: </strong>Landscape  <br>  <strong>Model Number: </strong>SMW783   </p>

Answer 2

When /text() in XPath or ::text in CSS does not produce the result I need, I use another library.

Install it:

pip3 install html2text

Example:

from html2text import HTML2Text
h = HTML2Text()
h.ignore_links = True
h.ignore_images = True
h.ignore_emphasis = True

#Inside the scrapy project
'description': h.handle(response.css('#tab-description p').get()).strip()

yield ....

Answer 3

I tried this and it works:

import scrapy
from bs4 import BeautifulSoup

class Google(scrapy.Spider):
    name = 'google'
    start_urls = ['https://bbdealz.com/product/funny-sports-game-2m-3m-4m-5m-6m-diameter-outdoor-rainbow-umbrella-parachute-toy-jump-sack-ballute-play-game-mat-toy-kids-gift/',]

    def parse(self, response):
        soup = BeautifulSoup(response.body, 'html.parser')
        # First <p> inside the description tab, or None if the page has no such element
        description = soup.select_one('#tab-description p')
        yield {
            'title': response.css('h1::text').get(),
            # str() serializes the Tag to HTML, keeping the <strong> and <br> markup
            'description': str(description) if description is not None else None,
        }
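For reference, BeautifulSoup offers a few ways to turn the selected tag into a string, depending on whether the wrapping <p> or the markup is wanted. This is a small sketch on a shortened, made-up snippet rather than the live page. Note that BeautifulSoup writes void elements such as <br> as <br/> when serializing.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the product description markup
HTML = (
    '<div id="tab-description"><p><strong>Gender: </strong>Unisex<br>'
    '<strong>Material: </strong>Paper</p></div>'
)
soup = BeautifulSoup(HTML, "html.parser")
p = soup.select_one("#tab-description p")

outer_html = str(p)                   # the <p> element including its own tags
inner_html = p.decode_contents()      # only what is inside the <p>
plain = p.get_text(" ", strip=True)   # text only, pieces joined by a space

print(outer_html)
print(inner_html)
print(plain)
```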

Regards
