Python: Web scraping with Beautiful Soup

Question

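I'm currently trying to reproduce a web scraping example with Beautiful Soup. However, I have to say I find it rather unintuitive, which may of course also be down to my lack of experience. I'd appreciate it if someone could help me with an example, since I couldn't find much relevant information online. I would like to extract the first value (Dornum) from the following website: http://flow.gassco.no/

This is as far as I got: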

import requests
from bs4 import BeautifulSoup

page = requests.get("http://flow.gassco.no/")

# BeautifulSoup expects markup (str or bytes), not the Response object itself
soup = BeautifulSoup(page.text, 'html.parser')

Thanks in advance!

Answer 1

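You first need to learn how to use urllib and urllib2 (in Python 3 both live under urllib, with urllib.request taking over from urllib2).

Some websites block spiders.

Something like:

import urllib.request  # Python 3; urllib2 is the Python 2 name
req = urllib.request.Request("http://flow.gassco.no/", headers={'User-Agent': 'Mozilla/5.0 (Linux; Android 4.4.2; Nexus 4 Build/KOT49H) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.114 Mobile Safari/537.36'})

This makes the site think you are a browser rather than a bot.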
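
A fuller sketch of the same idea for Python 3, using only the standard library. The helper names build_request and fetch_html are made up for illustration, the User-Agent string is simply the one quoted above, and whether this particular site actually requires the header is an assumption.

```python
import urllib.request

USER_AGENT = (
    "Mozilla/5.0 (Linux; Android 4.4.2; Nexus 4 Build/KOT49H) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/34.0.1847.114 Mobile Safari/537.36"
)


def build_request(url, user_agent=USER_AGENT):
    """Return a Request that carries a browser-like User-Agent header."""
    req = urllib.request.Request(url)
    req.add_header("User-Agent", user_agent)
    return req


def fetch_html(url, timeout=10):
    """Download the page and decode it (hypothetical helper).

    Falls back to UTF-8 when the server does not declare a charset.
    """
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```

Calling fetch_html("http://flow.gassco.no/") performs the actual request; the returned string can then be handed to BeautifulSoup.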

Answer 2

Another option is to stick with the requests module you are already using. You can pass a user-agent like this:

import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 4.4.2; Nexus 4 Build/KOT49H) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.114 Mobile Safari/537.36'
}

page = requests.get("http://flow.gassco.no/", headers=headers)

soup = BeautifulSoup(page.text, 'html.parser')
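
Once you have the soup, pulling out a value comes down to locating the right tag. The real markup of flow.gassco.no is not shown in this thread, so the HTML below is an invented stand-in (a table whose first row is Dornum, with made-up numbers), and value_for_label is a hypothetical helper. Inspect the actual page source and adjust the tag names accordingly.

```python
from bs4 import BeautifulSoup

# Invented stand-in for the real page; the actual structure may differ.
SAMPLE_HTML = """
<table>
  <tr><td>Dornum</td><td>12.3</td></tr>
  <tr><td>Emden</td><td>45.6</td></tr>
</table>
"""


def value_for_label(soup, label):
    """Return the text of the cell that follows the cell equal to `label`."""
    for row in soup.find_all("tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) >= 2 and cells[0] == label:
            return cells[1]
    return None  # label not found


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(value_for_label(soup, "Dornum"))
```

If the page fills in its values with JavaScript after loading, the numbers will not be in the downloaded HTML at all, and this approach would need a different data source.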

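Edit: To get this version working in a straightforward way, you can use a workaround based on browser sessions. Along with requests.get you need to send a cookie that tells the site a session number for which the Terms and Conditions have already been accepted.

Run this code: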

import requests
from bs4 import BeautifulSoup

url = "http://flow.gassco.no"
s = requests.Session()
r = s.get(url)
# The form's action is the "tail" of the URL that signals acceptance of the Terms
action = BeautifulSoup(r.content, 'html.parser').find('form').get('action')
s.get(url+action)
page = s.get(url).content
soup = BeautifulSoup(page, 'html.parser')
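
One fragile spot in the code above is url+action: plain concatenation only works if the form's action happens to line up with the slashes of the base URL. A possible variant resolves the action with urllib.parse.urljoin instead. The function name accept_terms_url is hypothetical, and the real value of the action attribute is not known here.

```python
from urllib.parse import urljoin


def accept_terms_url(base_url, action):
    """Resolve a form's action attribute against the page URL."""
    return urljoin(base_url, action)
```

In the session code, s.get(accept_terms_url(r.url, action)) would then take the place of s.get(url+action).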

2 answers

Answer 1
1 2017-08-24 13:14:17

Answer 2
1 accepted 2017-08-24 13:31:20
