
Can't scrape names from next pages using requests

I'm trying to parse names spread across multiple pages of a website using a Python script. With my current attempt I can get the names from its landing page. However, I can't figure out how to fetch the names from the next pages as well using requests and BeautifulSoup.

website link

My attempt so far:

import requests
from bs4 import BeautifulSoup

url = "https://proximity.niceic.com/mainform.aspx?PostCode=YO95"

with requests.Session() as s:
    r = s.get(url)
    soup = BeautifulSoup(r.text,"lxml")
    for elem in soup.select("table#gvContractors tr:has([id*='_lblName'])"):
        name = elem.select_one("span[id*='_lblName']").get_text(strip=True)
        print(name)

I've tried modifying my script to get only the content from the second page, to make sure it works when a next-page button is involved, but unfortunately it still fetches data from the first page:

import requests
from bs4 import BeautifulSoup

url = "https://proximity.niceic.com/mainform.aspx?PostCode=YO95"

with requests.Session() as s:
    r = s.get(url)
    soup = BeautifulSoup(r.text,"lxml")
    payload = {i['name']:i.get('value','') for i in soup.select('input[name]')}
    payload['__EVENTARGUMENT'] = 'Page$Next'
    payload.pop('btnClose')
    payload.pop('btnMapClose')
    res = s.post(url,data=payload,headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36',
        'X-Requested-With':'XMLHttpRequest',
        'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'Referer': 'https://proximity.niceic.com/mainform.aspx?PostCode=YO95',
        })
    sauce = BeautifulSoup(res.text,"lxml")
    for elem in sauce.select("table#gvContractors tr:has([id*='_lblName'])"):
        name = elem.select_one("span[id*='_lblName']").get_text(strip=True)
        print(name)

Navigating to the next page is performed via a POST request that carries a __VIEWSTATE cursor.

Here is how you can do it with requests:

  1. Make a GET request to the first page;

  2. Parse the required data and the __VIEWSTATE cursor;

  3. Prepare a POST request for the next page with the received cursor;

  4. Run it, then parse all the data and the new cursor for the next page.

I won't reproduce the whole crawler's code here, but a minimal sketch of that loop follows.
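To make those steps concrete, here is a minimal sketch of that loop with requests and BeautifulSoup. The stop condition (break when no result rows come back) is an assumption, and, as the update below explains, this particular site also needs browser-like headers and an __ASYNCPOST field before it hands out valid tokens:

import requests
from bs4 import BeautifulSoup

url = "https://proximity.niceic.com/mainform.aspx?PostCode=YO95"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36'}

with requests.Session() as s:
    # 1. make a GET request to the first page
    soup = BeautifulSoup(s.get(url, headers=headers).text, "lxml")
    while True:
        # 2. parse the required data ...
        rows = soup.select("table#gvContractors tr:has([id*='_lblName'])")
        if not rows:  # assumption: no result rows means we ran past the last page
            break
        for elem in rows:
            print(elem.select_one("span[id*='_lblName']").get_text(strip=True))
        # ... and the __VIEWSTATE cursor along with the other hidden fields
        payload = {i['name']: i.get('value', '') for i in soup.select('input[name]')}
        payload.pop('btnClose', None)     # drop the buttons so a button click
        payload.pop('btnMapClose', None)  # is not posted back by mistake
        # 3. prepare the POST request for the next page with the received cursor
        payload['__EVENTTARGET'] = 'gvContractors'
        payload['__EVENTARGUMENT'] = 'Page$Next'
        # 4. run it; the response carries the cursor for the page after that
        soup = BeautifulSoup(s.post(url, data=payload, headers=headers).text, "lxml")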

==== Added ====

You almost had it, but there are two important things you missed.

  1. It is necessary to send headers with the first GET request. If no headers are sent, we get broken tokens (easy to detect visually: they don't end with ==).

  2. We need to add __ASYNCPOST to the payload we send. (Interestingly, it is not a boolean True; it is the string 'true'.)

Here's the code. I replaced bs4 with lxml (I don't like bs4; it is very slow). We know exactly which data we need to send, so let's parse only a few inputs.

import re
import requests
from lxml import etree


def get_nextpage_tokens(response_body):
    """ Parse tokens from the XMLHttpRequest response and build the payload for the next page """
    try:
        payload = dict()
        payload['ToolkitScriptManager1'] = 'UpdatePanel1|gvContractors'
        payload['__EVENTTARGET'] = 'gvContractors'
        payload['__EVENTARGUMENT'] = 'Page$Next'
        payload['__VIEWSTATEENCRYPTED'] = ''
        # The async response body is |-delimited, so each hidden-field value
        # follows its name as the next pipe-separated segment.
        payload['__VIEWSTATE'] = re.search(r'__VIEWSTATE\|([^\|]+)', response_body).group(1)
        payload['__VIEWSTATEGENERATOR'] = re.search(r'__VIEWSTATEGENERATOR\|([^\|]+)', response_body).group(1)
        payload['__EVENTVALIDATION'] = re.search(r'__EVENTVALIDATION\|([^\|]+)', response_body).group(1)
        payload['__ASYNCPOST'] = 'true'
        return payload
    except AttributeError:  # re.search found no match: the response carries no tokens
        return None


if __name__ == '__main__':
    url = "https://proximity.niceic.com/mainform.aspx?PostCode=YO95"

    headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36',
            'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
            'Referer': 'https://proximity.niceic.com/mainform.aspx?PostCode=YO95',
            }

    with requests.Session() as s:
        page_num = 1
        r = s.get(url, headers=headers)
        parser = etree.HTMLParser()
        tree = etree.fromstring(r.text, parser)

        # Creating payload
        payload = dict()
        payload['ToolkitScriptManager1'] = 'UpdatePanel1|gvContractors'
        payload['__EVENTTARGET'] = 'gvContractors'
        payload['__EVENTARGUMENT'] = 'Page$Next'
        payload['__VIEWSTATE'] = tree.xpath("//input[@name='__VIEWSTATE']/@value")[0]
        payload['__VIEWSTATEENCRYPTED'] = ''
        payload['__VIEWSTATEGENERATOR'] = tree.xpath("//input[@name='__VIEWSTATEGENERATOR']/@value")[0]
        payload['__EVENTVALIDATION'] = tree.xpath("//input[@name='__EVENTVALIDATION']/@value")[0]
        payload['__ASYNCPOST'] = 'true'
        headers['X-Requested-With'] = 'XMLHttpRequest'

        while True:
            page_num += 1
            res = s.post(url, data=payload, headers=headers)

            print(f'page {page_num} data: {res.text}')  # FIXME: Parse data

            payload = get_nextpage_tokens(res.text)  # Creating payload for next page
            if not payload:
                # Break if we got no tokens - maybe it was last page (it must be checked)
                break

Important

The response is not well-formed HTML, so you have to deal with that: cut the table out, or something similar. Good luck!
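For example, one way to deal with it (a sketch: it assumes the grid keeps its gvContractors id inside the re-rendered panel markup and contains no nested tables) is to cut the table out of the pipe-delimited body with a regular expression and parse only that fragment:

import re
from lxml import etree


def parse_names(response_body):
    """ Cut the results table out of the pipe-delimited async response and parse the names """
    # Assumption: the grid is rendered with id="gvContractors" and has no nested tables
    match = re.search(r'<table[^>]*id="gvContractors".*?</table>', response_body, re.S)
    if not match:
        return []
    tree = etree.fromstring(match.group(0), etree.HTMLParser())
    return [t.strip() for t in tree.xpath("//span[contains(@id, '_lblName')]/text()")]

Something like this could replace the FIXME line in the loop above, e.g. print(parse_names(res.text)).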
