
Python - Web scraping with Beautiful Soup

I am currently trying to reproduce a web scraping example with Beautiful Soup. However, I find it pretty unintuitive, which of course might also be due to my lack of experience. If anyone could help me with an example, I'd appreciate it; I cannot find much relevant information online. I would like to extract the first value (Dornum) from the following website: http://flow.gassco.no/

I only got this far:

import requests

page = requests.get("http://flow.gassco.no/")

from bs4 import BeautifulSoup
soup = BeautifulSoup(page.text, 'html.parser')  # pass the HTML text, not the Response object

Thank you in advance!

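You should first learn how to use urllib (urllib2 in Python 2).

Some websites block crawlers.

You can set a browser-like User-Agent header on the request. Something like this, where req is a urllib2.Request object (urllib.request.Request in Python 3):

req.add_header('User-Agent','Mozilla/5.0 (Linux; Android 4.4.2; Nexus 4 Build/KOT49H) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.114 Mobile Safari/537.36')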

This makes the website think you are a browser, not a robot.
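For reference, here is a minimal sketch of the header approach in Python 3, where the urllib2 functionality lives in urllib.request. The helper names build_request and fetch_html and the shortened User-Agent string are made up for illustration; any browser-like string should serve the same purpose.

```python
import urllib.request

# Assumption: an illustrative browser-like User-Agent string.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64)"


def build_request(url, user_agent=USER_AGENT):
    """Create a Request object that carries a browser-like User-Agent header."""
    req = urllib.request.Request(url)
    req.add_header("User-Agent", user_agent)
    return req


def fetch_html(url):
    """Download the page and decode it to text (performs a network call)."""
    with urllib.request.urlopen(build_request(url), timeout=10) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


# Usage (not executed here because it needs network access):
# html = fetch_html("http://flow.gassco.no/")
```

Note that a User-Agent header alone may not be enough if the site requires further steps, such as accepting terms, before it shows the data.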

Another way is to use the requests module. You can pass the User-Agent like this:

import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 4.4.2; Nexus 4 Build/KOT49H) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.114 Mobile Safari/537.36'
}

page = requests.get("http://flow.gassco.no/", headers=headers)

soup = BeautifulSoup(page.text, 'html.parser')

EDIT: To make this version work directly, you can use a session as a workaround. The requests must carry a cookie identifying a session in which the Terms and Conditions have already been accepted.

Run this code:

import requests
from bs4 import BeautifulSoup

url = "http://flow.gassco.no"
s = requests.Session()
r = s.get(url)
# The form's "action" attribute is the URL "tail" that indicates acceptance of the Terms
action = BeautifulSoup(r.content, 'html.parser').find('form').get('action')
s.get(url + action)  # accept the Terms within this session
page = s.get(url).content
soup = BeautifulSoup(page, 'html.parser')
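Once the soup holds the real page, the value next to a label such as Dornum can be looked up. The sketch below is only an illustration: the sample HTML and the helper value_for_label are hypothetical, and it assumes the label and its value sit in adjacent table cells. The actual markup of flow.gassco.no may differ, so inspect the page source and adapt the tag names.

```python
from bs4 import BeautifulSoup

# Hypothetical markup and made-up numbers, used only to demonstrate the lookup.
SAMPLE_HTML = """
<table>
  <tr><td>Dornum</td><td>12.3</td></tr>
  <tr><td>Emden</td><td>45.6</td></tr>
</table>
"""


def value_for_label(soup, label):
    """Return the text of the cell following the cell whose text is `label`, or None."""
    cell = soup.find("td", string=lambda s: s is not None and s.strip() == label)
    if cell is None:
        return None
    sibling = cell.find_next_sibling("td")
    return sibling.get_text(strip=True) if sibling is not None else None


sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
dornum_value = value_for_label(sample_soup, "Dornum")
print(dornum_value)
```

With the session code above, the same helper could be called as value_for_label(soup, "Dornum"), provided the real page follows a similar cell layout.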

