Scrapy + Python, Error in Finding links from a website

I am trying to find the URLs of all the events of this page:

https://www.eventshigh.com/delhi/food?src=exp

But I can see the URL only in JSON format:

{
  "@context": "http://schema.org",
  "@type": "Event",
  "name": "DANDIYA NIGHT 2018",
  "image": "https://storage.googleapis.com/ehimages/2018/9/4/img_b719545523ac467c4ad206c3a6e76b65_1536053337882_resized_1000.jpg",
  "url": "https://www.eventshigh.com/detail/Delhi/5b30d4b8462a552a5ce4a5ebcbefcf47-dandiya-night-2018",
  "eventStatus": "EventScheduled",
  "startDate": "2018-10-14T18:30:00+05:30",
  "doorTime": "2018-10-14T18:30:00+05:30",
  "endDate": "2018-10-14T22:30:00+05:30",
  "description": "Dress code : TRADITIONAL (mandatory)\u00A0 \r\n Dandiya sticks will be available at the venue ( paid)\u00A0 \r\n Lip smacking food, professional dandiya Dj , media coverage , lucky draw \u00A0, Dandiya Garba Raas , Shopping and Games .\u00A0 \r\n \u00A0 \r\n Winners\u00A0 \r\n \u00A0 \r\n Best dress ( all",
  "location": {
    "@type": "Place",
    "name": "K And L Community Hall (senior Citizen Complex )",
    "address": "80 TO 49, Pocket K, Sarita Vihar, New Delhi, Delhi 110076, India"
  },

Here it is:

"url":"https://www.eventshigh.com/detail/Delhi/5b30d4b8462a552a5ce4a5ebcbefcf47-dandiya-night-2018"

But I cannot find any other HTML/XML tag which contains the links. Also, I cannot find the corresponding JSON file which contains the links. Could you please help me to scrape the links of all the events of this page:

https://www.eventshigh.com/delhi/food?src=exp

Gathering information from a JavaScript-powered page, like this one, may look daunting at first; but it can often be more productive in fact, as all the information is in one place instead of scattered across a lot of expensive HTTP-request lookups.

So when a page gives you JSON data like this, you can thank them by being nice to the server and use it! :)
With a little time invested into the "source-view analysis" you have already done, this will also be more efficient than trying to get the information through an (expensive) Selenium/Splash/etc. render pipeline.

The tool that is invaluable to get there is XPath. Sometimes a little additional help from our friend regex may be required.
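(As a purely hypothetical illustration, not needed for this page: if the JSON were assigned to a JavaScript variable such as var events = [...]; instead of sitting in its own node, Parsel selectors let you chain a regex onto an XPath match:)

>>> response.xpath('//script[contains(text(), "var events")]/text()').re_first(r'(?s)var events\s*=\s*(\[.*?\]);')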
Assuming you have successfully fetched the page and have a Scrapy response object (or you have a Parsel.Selector() over an otherwise gathered response body), you will be able to access the xpath() method as response.xpath or selector.xpath:

>>> response.status
200
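
If you are working outside of Scrapy, a minimal sketch of the Parsel route mentioned above could look like this (assuming the requests library for fetching the body; any HTTP client will do):

>>> import requests
>>> from parsel import Selector
>>> body = requests.get('https://www.eventshigh.com/delhi/food?src=exp').text
>>> selector = Selector(text=body)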

You have determined the data exists as plain text (JSON), so we need to drill down to where it hides, to ultimately extract the raw JSON content. After that, converting it to a Python dict for further use will be trivial. In this case it's inside a container node <script type="application/ld+json">. Our XPath for that could look like this:

>>> response.xpath('//script[@type="application/ld+json"]')
[<Selector xpath='//script[@type="application/ld+json"]' data='<script type="application/ld+json">\n{\n  '>,
 <Selector xpath='//script[@type="application/ld+json"]' data='<script type="application/ld+json">\n{\n  '>,
 <Selector xpath='//script[@type="application/ld+json"]' data='<script type="application/ld+json">\n    '>]

This will find every "script" node in the HTML page which has a "type" attribute with the value "application/ld+json". Apparently that is not specific enough, since we find three nodes (Selector-wrapped, in our returned list).

From your analysis we know that our JSON must contain "@type":"Event", so let our XPath do a little substring search for that:

>>> response.xpath("""//script[@type="application/ld+json"]/self::node()[contains(text(), '"@type":"Event"')]""")
[<Selector xpath='//script[@type="application/ld+json"]/self::node()[contains(text(), \'"@type":"Event"\')]' data='<script type="application/ld+json">\n    '>]

Here we added a second qualifier which says our script node must contain the given text.
(The 'self::node()' shows some XPath axes magic to reference back to our current script node at this point, instead of its descendants. We will simplify this, though.)
Now our return list contains a single node/Selector. As we see from the data= string, if we were to extract() this, we would now get some string like <script type="application/ld+json">[...]</script>. Since we care about the content of the node, but not the node itself, we have one more step to go:

>>> response.xpath("""//script[@type="application/ld+json"][contains(text(), '"@type":"Event"')]/text()""")
[<Selector xpath='//script[@type="application/ld+json"][contains(text(), \'"@type":"Event"\')]/text()' data='\n        [\n          \n            \n     '>]

And this returns (a SelectorList of) our target text(). As you may see, we could also do away with the self-reference. Now, xpath() always returns a SelectorList, but we have a little helper for this: response.xpath().extract_first() will grab the list's first element, checking that it exists, before processing it. We can put this result into a data variable, after which it's simple to json.loads() it into a Python dictionary and look up our values:

>>> import json
>>> data = response.xpath("""//script[@type="application/ld+json"][contains(text(), '"@type":"Event"')]/text()""").extract_first()
>>> events = json.loads(data)
>>> [item['url'] for item in events]
['<url>',
 '<url>',
 '<url>',
 '<url>']

Now you can turn them into scrapy.Request(url)s, and you'll know how to continue from there.
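
Pulling everything together, a minimal spider sketch could look like the following (the spider name and the parse_event callback are illustrative assumptions, not taken from the question):

import json
import scrapy

class EventsSpider(scrapy.Spider):
    # Illustrative name; adjust to your project.
    name = 'events_high'
    start_urls = ['https://www.eventshigh.com/delhi/food?src=exp']

    def parse(self, response):
        # Same XPath as above: the ld+json block that contains the events.
        data = response.xpath(
            '//script[@type="application/ld+json"]'
            '[contains(text(), \'"@type":"Event"\')]/text()'
        ).extract_first()
        if data:  # guard in case the page layout changes
            for item in json.loads(data):
                yield scrapy.Request(item['url'], callback=self.parse_event)

    def parse_event(self, response):
        # Extract whatever per-event fields you need here.
        pass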

As always, crawl responsibly and keep the 'net a nice place to be. I do not endorse any unlawful behavior.
Assessing one's rights or gaining permission to access the specified target resource is one's own responsibility.
