
python scrapy - scraping from onclick popup dialog

I'm trying to scrape all the links to videos and English transcripts from this site using Scrapy and Python.

I got the spider to scrape all the video URLs from all pages (NB: I am useless at programming), but I can't figure out how to scrape the transcripts. The transcript dialog only pops up after clicking a button, and the links to the transcripts are found in that popup. All the other tutorials I've read address POST requests, but this seems to be an AJAX GET request, so I'm completely clueless about what to do. I've also seen posts that mention payloads and form control, but I have no idea what they are for this site.

Relevant HTML from the page before the button click:

  <span class="transcription make-cursor" onclick="showTranscriptionDialog('17394')"> <img class="video-doclet-icons" src="images/transcript4.png" title="Download Transcription, Tercüme'yi indir, تحميل النص" alt="Transcription" data-pin-nopin="true"></span> 

Relevant HTML after the click (from the dialog popup):

  <span class="ui-corner-all" id="transcription-language-list17394" style="background-color: rgb(245, 243, 229); color: rgb(51, 51, 51);"> <a class="transcription-language-list" target="_blank" href="http://saltanat-transcriptions.s3.amazonaws.com/english/2017-08-08_en_NothingMeansEverything_SB.pdf" onmouseover="transcriptionLanguageMouseOver(17394)" onmouseout="transcriptionLanguageMouseOut(17394)" style="color: rgb(51, 51, 51);"> English </a></span> 

My current spider code (not working):

    import scrapy


    class SuhbaSpider(scrapy.Spider):
        name = "suhbas"
        start_urls = [
            "http://saltanat.org/videos.php?topic=SheikhBahauddin&gopage={numb}".format(numb=numb)
            for numb in range(1, 23)
        ]

        def parse(self, response):
            # All video download links on the listing page
            yield {
                'video': response.xpath("//span[@class='download make-cursor']/a/@href").extract(),
            }
            # Attempt to cut the video id out of the onclick attribute
            videoid = response.xpath("substring(//span[@class='media-info make-cursor']/@onclick, 22, 5)").extract()
            for p in videoid:
                url = "http://saltanat.org/ajax_transcription.php?vid=" + p
                yield scrapy.Request(url, callback=self.parse_transcript)

        def parse_transcript(self, response):
            # Links in the transcript dialog whose href contains "english"
            yield {
                'transcript': response.xpath("//a[contains(@href,'english')]/@href").extract(),
            }

Any help would be appreciated, thanks!

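OK, after playing around with the code I got a working solution. The problem was the "substring" call: it shouldn't have been put in the "response.xpath" line. (In XPath 1.0, a string function such as substring() applied to a node-set uses only the first node, so that expression most likely returned a single id per page instead of one per video.) I used Python slicing to get the same substring instead, as shown below.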

Non-working part:

    videoid = response.xpath("substring(//span[@class='media-info make-cursor']/@onclick, 22, 5)").extract()
    for p in videoid:
        url = "http://saltanat.org/ajax_transcription.php?vid=" + p

Replaced with this working part:

    fullvideoid = response.xpath("//span[@class='media-info make-cursor']/@onclick").extract()
    for videoid in fullvideoid:
        url = "http://saltanat.org/ajax_transcription.php?vid=" + videoid[21:-2]
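The fixed slice [21:-2] depends on the exact length of the JavaScript function name in the onclick attribute. As a possible alternative, here is a minimal sketch that pulls the id out with a regular expression and builds the GET URL from it. The helper names (extract_video_id, transcript_url, transcript_urls) are made up for illustration, and the sketch assumes the id is always a run of digits inside the parentheses, as in the showTranscriptionDialog('17394') example from the question.

```python
import re
from urllib.parse import urlencode

# Endpoint and "vid" parameter name are taken from the spider in the question.
AJAX_URL = "http://saltanat.org/ajax_transcription.php"

# Assumption: the id is a run of digits, optionally quoted, inside parentheses,
# e.g. showTranscriptionDialog('17394').
_ID_PATTERN = re.compile(r"\(\s*['\"]?(\d+)['\"]?\s*\)")


def extract_video_id(onclick):
    """Return the numeric id found in an onclick value, or None if there is none."""
    match = _ID_PATTERN.search(onclick or "")
    return match.group(1) if match else None


def transcript_url(video_id):
    """Build the GET URL that serves the transcript dialog for one video."""
    return AJAX_URL + "?" + urlencode({"vid": video_id})


def transcript_urls(onclick_values):
    """Yield one transcript URL per onclick value that contains an id."""
    for onclick in onclick_values:
        video_id = extract_video_id(onclick)
        if video_id is not None:
            yield transcript_url(video_id)
```

Inside the spider's parse method, the list returned by response.xpath("//span[@class='media-info make-cursor']/@onclick").extract() could be passed to transcript_urls, yielding a scrapy.Request with callback=self.parse_transcript for each URL it produces.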


 