Web-scraping from pages with the same link

I am trying to scrape some information from this website: https://www.nordnet.se/marknaden/aktiekurser?sortField=name&sortOrder=asc&exchangeCountry=SE&exchangeList=se%3Alargecapstockholmsek .

What I want to do is grab the sector information for each company, which is provided under the "Om bolaget"-tab in the company-specific pages. More specifically the information I want to get is in the "Sektor" and "Branch" fields. The links to the company specific pages can easily be obtained with requests and BeautifulSoup in python.

When making a GET request to these links, the response sometimes contains the wanted information in the form "sector: ..." and "sector_group: ...", but not always. One example where it works is Latour https://www.nordnet.se/marknaden/aktiekurser/16099736-latour-investmentab-b , and one example where it doesn't work is EQT https://www.nordnet.se/marknaden/aktiekurser/17117956-eqt .

Note that I see that an XHR-request (POST-request) is being made when pressing "Om bolaget", but I am not sure how to exploit it.

The code I use to grab the sector information from a company-specific page is provided below:

import json
import re

import requests
from bs4 import BeautifulSoup

def get_sector(url):

    sector, sector_group = None, None
    resp = requests.get(url)
    soup = BeautifulSoup(resp.text, 'html.parser')
    tags = soup.find_all('script')
    for tag in tags:
        content = tag.get_text()
        content = content.replace('\\', '')
        if '__initialState__' not in content:
            continue
        try:
            sector = re.findall(r'"sector":"\w+"', content)[0]
            sector = json.loads('{' + sector + '}')
            sector = sector['sector']
        except IndexError:
            print(url)
            print('Sector not found')

        try:
            sector_group = re.findall(r'"sector_group":"\w+"', content)[0]
            sector_group = json.loads('{' + sector_group + '}')
            sector_group = sector_group['sector_group']
        except IndexError:
            print('Sector Group not found')

        break

    return sector, sector_group

Any input would be much appreciated.

To get the "Om bolaget" batch, you have to get the ntag from the response headers of https://www.nordnet.se/api/2/login/anonymous. You can fetch it once and reuse it in later requests. The best way to do that is with requests.session(). In data, 17117956 and 16099736 should be variables:

import requests

headers = {
    'Connection': 'keep-alive',
    'Content-Length': '0',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
    'Origin': 'https://www.nordnet.se',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36',
    'ntag': 'NO_NTAG_RECEIVED_YET',
    'content-type': 'application/x-www-form-urlencoded',
    'accept': 'application/json',
    'client-id': 'NEXT',
    'DNT': '1',
    'Sec-Fetch-Site': 'same-origin',
    'Sec-Fetch-Mode': 'cors',
    'Referer': 'https://www.nordnet.se/se',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'ru,en-US;q=0.9,en;q=0.8,tr;q=0.7',
}

with requests.session() as s:
    r = s.post('https://www.nordnet.se/api/2/login/anonymous', headers=headers)

    headers['ntag'] = r.headers['ntag']
    headers['content-type'] = 'application/json'
    headers['accept'] = 'application/json'

    for company_id in ['17117956', '16099736']:
        data = '{"batch":"[{\\"relative_url\\":\\"company_data/keyfigures/' + company_id + '\\",\\"method\\":\\"GET\\"},{\\"relative_url\\":\\"company_data/yearlyfinancial/' + company_id + '\\",\\"method\\":\\"GET\\"},{\\"relative_url\\":\\"company_data/summary/' + company_id + '\\",\\"method\\":\\"GET\\"}]"}'
        r = s.post('https://www.nordnet.se/api/2/batch', headers=headers, data=data)
        print(r.text)
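
The hand-escaped data string above is easy to get wrong, so here is a minimal sketch of building the same body with json.dumps and of taking the company id from the page URL instead of hard-coding it. The helper names (company_id_from_url, build_batch_payload, find_key) are my own and not part of any Nordnet API. The layout of the batch response is not shown here, so find_key simply walks whatever JSON comes back. That "sector" and "sector_group" appear in the response is an assumption based on the key names seen in the page state.

```python
import json
import re


def company_id_from_url(url):
    """Return the numeric id from a company page URL, or None if absent."""
    match = re.search(r'/aktiekurser/(\d+)-', url)
    return match.group(1) if match else None


def build_batch_payload(company_id):
    """Build the body for the batch endpoint.

    The outer object has a single key, "batch", whose value is itself a
    JSON-encoded string, mirroring the hand-written string above.
    """
    endpoints = ['keyfigures', 'yearlyfinancial', 'summary']
    batch = [
        {'relative_url': 'company_data/{}/{}'.format(name, company_id),
         'method': 'GET'}
        for name in endpoints
    ]
    compact = (',', ':')  # no spaces, same as the hand-written string
    inner = json.dumps(batch, separators=compact)
    return json.dumps({'batch': inner}, separators=compact)


def find_key(obj, key):
    """Recursively yield every value stored under `key` in nested JSON data.

    Assumption: the fields of interest sit somewhere in the decoded
    response as ordinary dict keys.
    """
    if isinstance(obj, dict):
        for k, v in obj.items():
            if k == key:
                yield v
            yield from find_key(v, key)
    elif isinstance(obj, list):
        for item in obj:
            yield from find_key(item, key)
```

Inside the loop you could then write data = build_batch_payload(company_id) and, once the response is decoded with r.json(), try next(find_key(decoded, 'sector'), None). If the batch response wraps each sub-response body as a string, that string would need its own json.loads first.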
