
Scrapy + Python, Error in Finding links from a website

I am trying to find the URLs of all the events on this page:

https://www.eventshigh.com/delhi/food?src=exp

But I can see the URLs only inside a JSON block:

{
  "@context":"http://schema.org",
  "@type":"Event",
  "name":"DANDIYA NIGHT 2018",
  "image":"https://storage.googleapis.com/ehimages/2018/9/4/img_b719545523ac467c4ad206c3a6e76b65_1536053337882_resized_1000.jpg",
  "url":"https://www.eventshigh.com/detail/Delhi/5b30d4b8462a552a5ce4a5ebcbefcf47-dandiya-night-2018",
  "eventStatus":"EventScheduled",
  "startDate":"2018-10-14T18:30:00+05:30",
  "doorTime":"2018-10-14T18:30:00+05:30",
  "endDate":"2018-10-14T22:30:00+05:30",
  "description":"Dress code : TRADITIONAL (mandatory)\u00A0 \r\n Dandiya sticks will be available at the venue ( paid)\u00A0 \r\n Lip smacking food, professional dandiya Dj , media coverage , lucky draw \u00A0, Dandiya Garba Raas , Shopping and Games .\u00A0 \r\n \u00A0 \r\n Winners\u00A0 \r\n \u00A0 \r\n Best dress ( all",
  "location":{
    "@type":"Place",
    "name":"K And L Community Hall (senior Citizen Complex )",
    "address":"80 TO 49, Pocket K, Sarita Vihar, New Delhi, Delhi 110076, India"
  },

Here it is:

"url":"https://www.eventshigh.com/detail/Delhi/5b30d4b8462a552a5ce4a5ebcbefcf47-dandiya-night-2018"

But I cannot find any other HTML/XML tag that contains the links, nor can I find a separate JSON file that holds them. Could you please help me scrape the links to all the events on this page:

https://www.eventshigh.com/delhi/food?src=exp

Gathering information from a JavaScript-powered page like this one may look daunting at first, but it can often be more productive, since all the information is in one place instead of scattered across a lot of expensive HTTP-request lookups.

So when a page gives you JSON data like this, you can thank the server by being nice to it and using that data! :)
With a little time invested in the "source-view analysis" you have already done, this will also be more efficient than trying to get the information through an (expensive) Selenium/Splash/etc. render pipeline.

The invaluable tool for getting there is XPath. Sometimes a little additional help from our friend regex may be required.
Assuming you have successfully fetched the page and have a Scrapy response object (or a parsel.Selector() over an otherwise gathered response body), you will be able to access the xpath() method as response.xpath or selector.xpath:

>>> response.status
200
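
If you are not inside a running Scrapy shell or spider, you can build an equivalent selector yourself. Here is a minimal sketch using requests and parsel (the URL is the one from the question; everything else is just illustration):

import requests
from parsel import Selector

# Sketch: fetch the page ourselves and wrap the body in a parsel Selector,
# which exposes the same .xpath() interface as a Scrapy response.
resp = requests.get('https://www.eventshigh.com/delhi/food?src=exp')
selector = Selector(text=resp.text)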

You have determined that the data exists as plain text (JSON), so we need to drill down to where it hides in order to extract the raw JSON content. After that, converting it to a Python dict for further use is trivial. In this case the data sits inside a container node <script type="application/ld+json">. Our XPath for that could look like this:

>>> response.xpath('//script[@type="application/ld+json"]')
[<Selector xpath='//script[@type="application/ld+json"]' data='<script type="application/ld+json">\n{\n  '>,
 <Selector xpath='//script[@type="application/ld+json"]' data='<script type="application/ld+json">\n{\n  '>,
 <Selector xpath='//script[@type="application/ld+json"]' data='<script type="application/ld+json">\n    '>]

This finds every script node in the HTML page that has a type attribute with the value "application/ld+json". Apparently that is not specific enough, since we find three nodes (Selector-wrapped in our returned list).

From your analysis we know that our JSON must contain "@type":"Event", so let our XPath do a little substring search for it:

>>> response.xpath("""//script[@type="application/ld+json"]/self::node()[contains(text(), '"@type":"Event"')]""")
[<Selector xpath='//script[@type="application/ld+json"]/self::node()[contains(text(), \'"@type":"Event"\')]' data='<script type="application/ld+json">\n    '>]

Here we added a second qualifier which says our script node must contain the given text.
(The self::node() shows some XPath axes magic to refer back to our current script node at this point, rather than to its descendants. We will simplify this, though.)
Now our return list contains a single node/Selector. As we can see from the data= string, if we were to extract() this, we would get a string like <script type="application/ld+json">[...]</script>. Since we care about the content of the node, not the node itself, we have one more step to go:

>>> response.xpath("""//script[@type="application/ld+json"][contains(text(), '"@type":"Event"')]/text()""")
[<Selector xpath='//script[@type="application/ld+json"][contains(text(), \'"@type":"Event"\')]/text()' data='\n        [\n          \n            \n     '>]

And this returns (a SelectorList of) our target text(). As you can see, we could also do away with the self-reference. Now, xpath() always returns a SelectorList, but we have a little helper for this: response.xpath().extract_first() grabs the list's first element (checking that it exists) before processing it. We can put this result into a data variable, after which it is simple to json.loads(data) into a Python dictionary and look up our values:
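
Putting that together, the data variable comes straight from the expression we just built (continuing the interactive session):

>>> data = response.xpath("""//script[@type="application/ld+json"][contains(text(), '"@type":"Event"')]/text()""").extract_first()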

>>> events = json.loads(data)
>>> [item['url'] for item in events]
['<url>',
 '<url>',
 '<url>',
 '<url>']

Now you can turn them into scrapy.Request(url)s, and you'll know how to continue from there.
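
For instance, the whole flow could live in a single spider. Here is a minimal sketch; the spider name and the parse_event callback are illustrative assumptions, not something prescribed by the site:

import json

import scrapy


class EventsSpider(scrapy.Spider):
    # Minimal sketch: spider and callback names are assumptions for illustration.
    name = 'eventshigh'
    start_urls = ['https://www.eventshigh.com/delhi/food?src=exp']

    def parse(self, response):
        # Locate the JSON-LD block that carries the "@type":"Event" entries.
        data = response.xpath(
            """//script[@type="application/ld+json"][contains(text(), '"@type":"Event"')]/text()"""
        ).extract_first()
        if not data:
            return
        events = json.loads(data)
        for item in events:
            # Follow each event's detail page.
            yield scrapy.Request(item['url'], callback=self.parse_event)

    def parse_event(self, response):
        # Placeholder: extract whatever fields you need from the detail page.
        yield {'url': response.url}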

As always, crawl responsibly and keep the 'net a nice place to be. I do not endorse any unlawful behavior.
Assessing one's rights, or gaining permission to access a given target resource, is one's own responsibility.
