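I checked the site you are scraping, and the script appears to be embedded directly in the HTML page, so I don't think you need webdriver; requests and BeautifulSoup should be enough.
Use requests to fetch the HTML: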
res = requests.get(url, headers=headers, params=params)
Then parse the HTML text with BeautifulSoup to get the script tags, and find the one that contains var imgInfoData:
soup = BeautifulSoup(res.text, "html5lib")
scripts = soup.find_all('script', attrs={'type': 'text/javascript'})
for script in scripts:
    if "var imgInfoData" in script.text:  # this is the script that defines imgInfoData
        # drop the "var imgInfoData =" prefix, then the trailing ";"
        return script.text.replace("var imgInfoData =", "").strip()[:-1]
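Simply strip the leading "var imgInfoData =" and the trailing ";" from the script text to get the string value. Alternatively, you can use a regex to extract the JSON string from the text.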
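The regex alternative could look roughly like the sketch below. The helper name extract_img_info, the pattern, and the sample string are my own assumptions for illustration; the pattern also assumes the assignment ends with a ";" at the end of a line, which may not hold for the real page.

```python
import re

# Captures whatever sits between "var imgInfoData =" and the first ";"
# that is followed only by whitespace up to the end of a line.
# Assumption: the assignment is terminated by ";" at a line end.
IMG_INFO_PATTERN = re.compile(
    r"var\s+imgInfoData\s*=\s*(.+?);\s*$",
    re.DOTALL | re.MULTILINE,
)

def extract_img_info(script_text):
    """Return the raw value assigned to imgInfoData, or None if not found."""
    match = IMG_INFO_PATTERN.search(script_text)
    return match.group(1).strip() if match else None

# Made-up sample standing in for the real script contents
sample = 'var imgInfoData = {"images": [{"id": 1}]};\nvar other = 1;'
print(extract_img_info(sample))
```

Unlike the replace-and-slice approach, this leaves any other statements in the same script tag out of the result.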
Full code:
import requests
from bs4 import BeautifulSoup


def getimgInfoData():
    url = "https://land.naver.com/info/complexGallery.nhn"
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
    params = {"newComplex": "Y",
              "startImage": "Y",
              "rletNo": "102235"}
    res = requests.get(url, headers=headers, params=params)
    # the "html5lib" parser requires the html5lib package to be installed
    soup = BeautifulSoup(res.text, "html5lib")
    scripts = soup.find_all('script', attrs={'type': 'text/javascript'})
    for script in scripts:
        if "var imgInfoData" in script.text:  # this is the script that defines imgInfoData
            # drop the "var imgInfoData =" prefix, then the trailing ";"
            return script.text.replace("var imgInfoData =", "").strip()[:-1]
    return None  # no script tag contained imgInfoData


print(getimgInfoData())
Then, if needed, convert the string returned by getimgInfoData() to JSON, for example with json.loads.
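A minimal sketch of that conversion, assuming the extracted value is strict JSON (I have not verified this for the real page; a JavaScript literal with unquoted keys or single quotes would fail to parse). The helper name parse_img_info and the sample fields imgNo and imgUrl are hypothetical.

```python
import json

def parse_img_info(raw):
    """Parse the string from getimgInfoData() into Python objects.

    Returns None when there is nothing to parse or the text is not valid JSON.
    """
    if raw is None:
        return None
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # The value is probably a JavaScript literal rather than strict JSON
        return None

# Made-up sample; the real field names may differ
sample = '[{"imgNo": 1, "imgUrl": "a.jpg"}]'
data = parse_img_info(sample)
print(data[0]["imgUrl"])
```

With the full code above, the call would be parse_img_info(getimgInfoData()).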