Scrapy: How to export JSON from a script

I built a web crawler with Scrapy, but I am having trouble with the phone number because it sits inside a script tag. The script is:

<script data-n-head="true" type="application/ld+json">{"@context":"http://schema.org","@type":"LocalBusiness","name":"Clínica Dental Reina Victoria 23","description":".TU CLÍNICA DENTAL DE REFERENCIA EN MADRID","logo":"https://estaticos.qdq.com/CMS/directory/logos/c/l/clinica-dental-reina-victoria.png","image":"https://estaticos.qdq.com/coverphotos/098/535/ed1c5ffcf38241f8b83a1808af51a615.jpg","url":"https://www.clinicadental-reinavictoria.es/","hasMap":"https://www.google.com/maps/search/?api=1&query=40.4469174,-3.7087934","telephone":"+34915340309","address":{"@type":"PostalAddress","streetAddress":"Av. Reina Victoria 23","addressLocality":"MADRID","addressRegion":"Madrid","postalCode":"28003"}}</script>

This script differs from page to page, but only the phone number changes.

I extract the script with XPath:

data = response.xpath('/html/head/script[3]').extract()
decoded = json.loads(data.telephone("utf-8"))
ml_item['datos'] = decoded['telephone']

I think I need a custom pipeline to extract the phone number.

In pipelines.py I added the JsonWriter line:

ITEM_PIPELINES = {'mercado.pipelines.MercadoPipeline': 500,
                    'mercado.pipelines.MercadoImagenesPipeline': 600,
                    'mercado.pipelines.JsonWriterPipeline': 800, }

But I still need to add some code to pipelines.py to define JsonWriterPipeline. The console returns this error:

raise NameError("Module '%s' doesn't define any object named '%s'" % (module, name))
NameError: Module 'mercado.pipelines' doesn't define any object named 'JsonWriterPipeline'

I save all the numbers in a CSV file, together with other information such as the name, website, etc.
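
The NameError above says that mercado/pipelines.py contains no class named JsonWriterPipeline, so listing it in ITEM_PIPELINES (a setting Scrapy normally reads from settings.py) cannot work until the class exists. Below is a minimal sketch of such a class, modelled on the usual "write items to a JSON Lines file" pattern. The output file name items.jl and the path constructor argument are assumptions for illustration, and the exact hook signatures Scrapy expects may vary between versions.

```python
import json


class JsonWriterPipeline:
    """Write each scraped item as one JSON object per line (JSON Lines)."""

    def __init__(self, path="items.jl"):
        # "items.jl" is an assumed output file name, not one from the question.
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called when the spider starts: open the output file.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called when the spider finishes: close the output file.
        if self.file is not None:
            self.file.close()

    def process_item(self, item, spider):
        # dict(item) accepts plain dicts and mapping-like item objects.
        line = json.dumps(dict(item), ensure_ascii=False)
        self.file.write(line + "\n")
        return item  # hand the item on to any later pipeline
```

Note that a pipeline like this only writes items out. It does not extract the phone number, which still has to be parsed from the script tag in the spider.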

If you have already scraped the content inside the script tag, this is simple:

import re

script = '{"@context":"http://schema.org","@type":"LocalBusiness","name":"Clínica Dental Reina Victoria 23","description":".TU CLÍNICA DENTAL DE REFERENCIA EN MADRID","logo":"https://estaticos.qdq.com/CMS/directory/logos/c/l/clinica-dental-reina-victoria.png","image":"https://estaticos.qdq.com/coverphotos/098/535/ed1c5ffcf38241f8b83a1808af51a615.jpg","url":"https://www.clinicadental-reinavictoria.es/","hasMap":"https://www.google.com/maps/search/?api=1&query=40.4469174,-3.7087934","telephone":"+34915340309","address":{"@type":"PostalAddress","streetAddress":"Av. Reina Victoria 23","addressLocality":"MADRID","addressRegion":"Madrid","postalCode":"28003"}}'

phone_number = re.search(r'"telephone":"(.*?)","address"', script).group(1)

print(phone_number)

The simplest and quickest option, and the one I prefer, is to parse the JSON directly:

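import json

# Select the JSON-LD script by its content, then parse its text as JSON.
data = json.loads(response.css('script:contains("LocalBusiness")::text').get())
phone_number = data['telephone']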
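
To try the JSON approach without running a crawl, here is a rough standalone sketch. It swaps Scrapy's selectors for the standard library's html.parser, so it is not the code from either answer. The names JsonLdParser and extract_telephone are hypothetical, and the sample HTML is a shortened version of the script tag from the question.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> tag."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self.blocks.append("")

    def handle_data(self, data):
        # Script content may arrive in several chunks, so append.
        if self._in_json_ld:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_json_ld = False


def extract_telephone(html):
    """Return the first "telephone" value found in JSON-LD, or None."""
    parser = JsonLdParser()
    parser.feed(html)
    parser.close()
    for block in parser.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip script tags that do not hold valid JSON
        if isinstance(data, dict) and "telephone" in data:
            return data["telephone"]
    return None


# Shortened version of the script tag from the question.
sample = (
    '<html><head><script data-n-head="true" type="application/ld+json">'
    '{"@type":"LocalBusiness","telephone":"+34915340309",'
    '"address":{"@type":"PostalAddress","postalCode":"28003"}}'
    '</script></head><body></body></html>'
)
print(extract_telephone(sample))
```

Matching on the type attribute instead of the script's position (script[3]) should keep working if a page adds or reorders script tags, and parsing the JSON avoids relying on "address" following "telephone" as the regex does.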
