How to use JSON containing URL and ID in Scrapy and structure the results?
I'm using Scrapy to scrape data from 9,000+ URLs contained in a JSON file, each with a matching ID. Here is my JSON object type:
[{
    "objectID": 10500,
    "gm_url": "https://reddit.com/1"
},
{
    "objectID": 10501,
    "gm_url": "https://reddit.com/2"
}]
I'd like to have my results in a JSON file containing the scraped data together with the matching URL and ID:
[{
    "objectID": 10500,
    "gm_url": "https://reddit.com",
    "results": [
        {
            "model": "",
            "price": "",
            "auction": "",
            "date": "",
            "auction_url": "",
            "img": ""
        },
        {
            "model": "",
            "price": "",
            "auction": "",
            "date": "",
            "auction_url": "",
            "img": ""
        },
        {
            "model": "",
            "price": "",
            "auction": "",
            "date": "",
            "auction_url": "",
            "img": ""
        }
    ]
}]
Here is my code right now in Scrapy (which is kind of messy):
import json
import scrapy

with open('/home/bolgi/Workspace/Dev/python_workspace/gm_spider/Json/db_urls_glenmarch_results_scrapy_reduced.json', encoding='utf-8') as data_file:
    data = json.load(data_file)

for item in data:
    objectId = item['objectID']
    gmUrl = item['gm_url']

    class GlenMarchSpider(scrapy.Spider):
        name = 'glenmarch'

        def start_requests(self):
            start_urls = gmUrl
            for url in start_urls:
                yield scrapy.Request(url, callback=self.parse)

        def parse(self, response):
            for caritem in response.css("div.car-item-border"):
                yield {
                    "url": response.url,
                    "model": caritem.css("div.make::text").get(),
                    "price": caritem.css("div.price::text").get(),
                    "auction": caritem.css("div.auctionHouse::text").get(),
                    "date": caritem.css("div.date::text").get(),
                    "auction_url": caritem.css("div.view-auction a::attr(href)").get(),
                    "img": caritem.css("img.img-responsive::attr(src)").get()
                }
I don't know how to structure the code or how to use the JSON file; I'm new to Python and it's a bit difficult for me.
You should never declare a class inside a for loop. I suggest the following structure:
import json
import scrapy

class GlenMarchSpider(scrapy.Spider):
    name = 'glenmarch'

    def __init__(self):
        with open('/home/bolgi/Workspace/Dev/python_workspace/gm_spider/Json/db_urls_glenmarch_results_scrapy_reduced.json', encoding='utf-8') as data_file:
            self.data = json.load(data_file)

    def start_requests(self):
        for item in self.data:
            request = scrapy.Request(item['gm_url'], callback=self.parse)
            request.meta['item'] = item
            yield request

    def parse(self, response):
        item = response.meta['item']
        item['results'] = []
        for caritem in response.css("div.car-item-border"):
            item['results'].append({
                "model": caritem.css("div.make::text").get(),
                "price": caritem.css("div.price::text").get(),
                "auction": caritem.css("div.auctionHouse::text").get(),
                "date": caritem.css("div.date::text").get(),
                "auction_url": caritem.css("div.view-auction a::attr(href)").get(),
                "img": caritem.css("img.img-responsive::attr(src)").get()
            })
        yield item
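As a side note, the meta-passing above boils down to a plain merge: each input record gets a results list attached, keyed by the URL it was scraped from. Here is a Scrapy-free sketch of that idea, with made-up sample data standing in for real scraped rows:

```python
import json

# Input records shaped like the question's JSON (sample values for illustration).
records = [
    {"objectID": 10500, "gm_url": "https://reddit.com/1"},
    {"objectID": 10501, "gm_url": "https://reddit.com/2"},
]

# Hypothetical scraped rows, keyed by the URL they came from.
scraped = {
    "https://reddit.com/1": [{"model": "A", "price": "100"}],
    "https://reddit.com/2": [],
}

# Attach each URL's rows to its record; this is what parse() does per response.
for record in records:
    record["results"] = scraped.get(record["gm_url"], [])

print(json.dumps(records, indent=2))
```

Scrapy's request.meta just carries the record across the request/response round trip so the merge can happen inside the callback.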
Then you can call your spider (and save the output in a new JSON file):
$ scrapy crawl glenmarch -o myjson.json -t json
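Once the crawl finishes you can sanity-check the exported file with the standard json module. The snippet below fabricates a one-item myjson.json so it runs standalone; in practice the file comes from the crawl command above:

```python
import json

# Fabricated sample standing in for real crawl output (illustration only).
sample = [{
    "objectID": 10500,
    "gm_url": "https://reddit.com/1",
    "results": [{"model": "A", "price": "100"}],
}]
with open("myjson.json", "w", encoding="utf-8") as f:
    json.dump(sample, f, indent=2)

# Check that every exported item kept its ID, URL and results list.
with open("myjson.json", encoding="utf-8") as f:
    items = json.load(f)

assert all({"objectID", "gm_url", "results"} <= item.keys() for item in items)
```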
If there are things in the code that you do not understand, do not hesitate to ask for clarification! :)
The scrapy.Spider class also has a start_urls list attribute, empty by default, to which you can append all the URLs:
import scrapy
import json

class GlenMarchSpider(scrapy.Spider):
    name = 'glenmarch'
    start_urls = []

    with open('/home/bolgi/Workspace/Dev/python_workspace/gm_spider/Json/db_urls_glenmarch_results_scrapy_reduced.json', encoding='utf-8') as json_file:
        data = json.load(json_file)
        for item in data:
            objectId = item['objectID']
            gmUrl = item['gm_url']
            start_urls.append(gmUrl)

    def parse(self, response):
        for caritem in response.css("div.car-item-border"):
            yield {
                "url": response.url,
                "model": caritem.css("div.make::text").get(),
                "price": caritem.css("div.price::text").get(),
                "auction": caritem.css("div.auctionHouse::text").get(),
                "date": caritem.css("div.date::text").get(),
                "auction_url": caritem.css("div.view-auction a::attr(href)").get(),
                "img": caritem.css("img.img-responsive::attr(src)").get()
            }
And you can run the spider that way too:
scrapy runspider quotes_spider.py -o glenmarch.json
For more details, please check out the official documentation, or feel free to ask: https://scrapy.readthedocs.io/en/latest/intro/overview.html