Scrapy + Python, Error in Finding links from a website

I am trying to find the URLs of all the events of this page:

https://www.eventshigh.com/delhi/food?src=exp

But I can see the URL only in JSON format:

{
  "@context": "http://schema.org",
  "@type": "Event",
  "name": "DANDIYA NIGHT 2018",
  "image": "https://storage.googleapis.com/ehimages/2018/9/4/img_b719545523ac467c4ad206c3a6e76b65_1536053337882_resized_1000.jpg",
  "url": "https://www.eventshigh.com/detail/Delhi/5b30d4b8462a552a5ce4a5ebcbefcf47-dandiya-night-2018",
  "eventStatus": "EventScheduled",
  "startDate": "2018-10-14T18:30:00+05:30",
  "doorTime": "2018-10-14T18:30:00+05:30",
  "endDate": "2018-10-14T22:30:00+05:30",
  "description": "Dress code : TRADITIONAL (mandatory)\u00A0 \r\n Dandiya sticks will be available at the venue ( paid)\u00A0 \r\n Lip smacking food, professional dandiya Dj , media coverage , lucky draw \u00A0, Dandiya Garba Raas , Shopping and Games .\u00A0 \r\n \u00A0 \r\n Winners\u00A0 \r\n \u00A0 \r\n Best dress ( all",
  "location": {
    "@type": "Place",
    "name": "K And L Community Hall (senior Citizen Complex )",
    "address": "80 TO 49, Pocket K, Sarita Vihar, New Delhi, Delhi 110076, India"
  },

Here it is:

"url":"https://www.eventshigh.com/detail/Delhi/5b30d4b8462a552a5ce4a5ebcbefcf47-dandiya-night-2018"

But I cannot find any other HTML/XML tag which contains the links. Also, I cannot find the corresponding JSON file which contains the links. Could you please help me to scrape the links of all the events of this page:

https://www.eventshigh.com/delhi/food?src=exp

Gathering information from a JavaScript-powered page, like this one, may look daunting at first; but it can often be more productive in fact, as all the information is in one place instead of scattered across a lot of expensive HTTP-request lookups.

So when a page gives you JSON data like this, you can thank them by being nice to the server and use it! :)
With a little time invested into the "source-view analysis" you have already done, this will also be more efficient than trying to get the information through an (expensive) Selenium/Splash/etc. render pipeline.

The tool that is invaluable to get there is XPath. Sometimes a little additional help from our friend regex may be required.
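(As a purely hypothetical illustration, not needed for this page: if the JSON were assigned to a JavaScript variable such as var events = [...]; instead of sitting in its own node, Parsel selectors let you chain a regex onto an XPath match:)

>>> response.xpath('//script[contains(text(), "var events")]/text()').re_first(r'(?s)var events\s*=\s*(\[.*?\]);')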
Assuming you have successfully fetched the page and have a Scrapy response object (or you have a Parsel.Selector() over an otherwise gathered response body), you will be able to access the xpath() method as response.xpath or selector.xpath:

>>> response.status
200
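
If you are working outside of Scrapy, a minimal sketch of the Parsel route mentioned above could look like this (assuming the requests library for fetching the body; any HTTP client will do):

>>> import requests
>>> from parsel import Selector
>>> body = requests.get('https://www.eventshigh.com/delhi/food?src=exp').text
>>> selector = Selector(text=body)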

You have determined the data exists as plain text (JSON), so we need to drill down to where it hides, to ultimately extract the raw JSON content. After that, converting it to a Python dict for further use will be trivial. In this case it's inside a container node <script type="application/ld+json">. Our XPath for that could look like this:

>>> response.xpath('//script[@type="application/ld+json"]')
[<Selector xpath='//script[@type="application/ld+json"]' data='<script type="application/ld+json">\n{\n  '>,
 <Selector xpath='//script[@type="application/ld+json"]' data='<script type="application/ld+json">\n{\n  '>,
 <Selector xpath='//script[@type="application/ld+json"]' data='<script type="application/ld+json">\n    '>]

This will find every "script" node in the HTML page which has a "type" attribute with the value "application/ld+json". Apparently that is not specific enough, since we find three nodes (Selector-wrapped, in our returned list).

From your analysis we know that our JSON must contain "@type":"Event", so let our XPath do a little substring search for that:

>>> response.xpath("""//script[@type="application/ld+json"]/self::node()[contains(text(), '"@type":"Event"')]""")
[<Selector xpath='//script[@type="application/ld+json"]/self::node()[contains(text(), \'"@type":"Event"\')]' data='<script type="application/ld+json">\n    '>]

Here we added a second qualifier which says our script node must contain the given text.
(The 'self::node()' shows some XPath axes magic to reference back to our current script node at this point, instead of its descendants. We will simplify this, though.)
Now our return list contains a single node/Selector. As we see from the data= string, if we were to extract() this, we would now get some string like <script type="application/ld+json">[...]</script>. Since we care about the content of the node, but not the node itself, we have one more step to go:

>>> response.xpath("""//script[@type="application/ld+json"][contains(text(), '"@type":"Event"')]/text()""")
[<Selector xpath='//script[@type="application/ld+json"][contains(text(), \'"@type":"Event"\')]/text()' data='\n        [\n          \n            \n     '>]

And this returns (a SelectorList of) our target text(). As you may see, we could also do away with the self-reference. Now, xpath() always returns a SelectorList, but we have a little helper for this: response.xpath().extract_first() will grab the list's first element, checking that it exists, before processing it. We can put this result into a data variable, after which it's simple to json.loads() it into a Python dictionary and look up our values:

>>> import json
>>> data = response.xpath("""//script[@type="application/ld+json"][contains(text(), '"@type":"Event"')]/text()""").extract_first()
>>> events = json.loads(data)
>>> [item['url'] for item in events]
['<url>',
 '<url>',
 '<url>',
 '<url>']

Now you can turn them into scrapy.Request(url)s, and you'll know how to continue from there.
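
Pulling everything together, a minimal spider sketch could look like the following (the spider name and the parse_event callback are illustrative assumptions, not taken from the question):

import json
import scrapy

class EventsSpider(scrapy.Spider):
    # Illustrative name; adjust to your project.
    name = 'events_high'
    start_urls = ['https://www.eventshigh.com/delhi/food?src=exp']

    def parse(self, response):
        # Same XPath as above: the ld+json block that contains the events.
        data = response.xpath(
            '//script[@type="application/ld+json"]'
            '[contains(text(), \'"@type":"Event"\')]/text()'
        ).extract_first()
        if data:  # guard in case the page layout changes
            for item in json.loads(data):
                yield scrapy.Request(item['url'], callback=self.parse_event)

    def parse_event(self, response):
        # Extract whatever per-event fields you need here.
        pass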

As always, crawl responsibly and keep the 'net a nice place to be. I do not endorse any unlawful behavior.
Assessing one's rights or gaining permission to access the specified target resource is one's own responsibility.
