
How to scrape JavaScript-generated data with Scrapy 1.4.0?

Sorry for my English. I'm a beginner with Scrapy and need some guidance. I have a problem scraping a site. This is my spider:

import scrapy
from bs4 import BeautifulSoup as bs

class SomeSiteSpider(scrapy.Spider):
    name = 'somesite'

    def start_requests(self):
        urls = [
            'http://somesite.ru/proxies/'
        ]

        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        token = response.css('input[name="xf0"]::attr(value)').extract_first()
        data = {
            'xpp': '4',
            'xf1': '4',
            'xf0': token,
            'xf2': '0',
            'xf4': '0'
        }
        yield scrapy.FormRequest(url='http://somesite.ru/proxies/', formdata=data, callback=self.parse_proxy, method='POST')

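    def parse_proxy(self, response):
        page = bs(response.body, "html.parser")
        # select_one() returns the first matching tag (or None), not a list,
        # so there is no need to re-parse str(list) to pretty-print it
        table = page.select_one('td[align="center"] > table[cellspacing="1"]')
        if table is not None:
            print(table.prettify())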

I need to parse this:

<font class="spy14">
  "200.200.200.200"
  <script type="text/javascript"></script>
  <font class="spy2">:</font>
  "8080"
</font>

But my spider outputs:

<font class="spy14">
    200.200.200.200
    <script type="text/javascript">
     document.write("<font class=spy2>:<\/font>"+(l2k1o5^f6l2)+(j0s9i9^e5z6)+(i9w3m3^s9p6)+(g7u1q7^u1j0)+(h8x4r8^n4s9))
    </script>
</font>

The site does not make any AJAX requests.


Scrapy doesn't execute JavaScript out of the box. To get this done, you need to integrate a headless browser or rendering service such as PhantomJS or Splash into Scrapy. You can also use Selenium to render the JavaScript in a real browser instance, though that is more complex to set up.

To get started, I would recommend Splash. It is well documented and integrates well with Scrapy, as it is built by the Scrapy developers. A good starting point is https://github.com/scrapy-plugins/scrapy-splash
