
Using Selenium with Python, how can I get a var from HTML where it is declared in a JS <script> element?

I want to get a var declared inside a JS script in the HTML, but it has no id or element that I can select. How can I get this data?

Because there is no element address, only a variable name, I don't know how to do it.

Website HTML:

<script type="text/javascript">
var imgInfoData = 'data which i want to crawl'

</script>

My Python code:

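import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# set url
HOMEPAGE = "https://land.naver.com/info/complexGallery.nhn?newComplex=Y&startImage=Y&rletNo=102235"

# open web
driver = webdriver.Firefox()
driver.wait = WebDriverWait(driver, 2)
driver.get(HOMEPAGE)

# try to get text from html
time.sleep(1)
WebDriverWait(driver, 3).until(EC.presence_of_element_located((By.XPATH, '//script["var"]'))).text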

I checked the site you are scraping, and the script is already present in the HTML page itself, so I don't think you need webdriver; you can just use requests and BeautifulSoup.

Get the HTML data using requests:

res = requests.get(url, headers=headers, params=params)

Then parse the HTML text with BeautifulSoup to get the script tags and find the tag that contains var imgInfoData:

soup = BeautifulSoup(res.text, "html5lib")
scripts = soup.find_all('script', attrs={'type': 'text/javascript'})
img_info_data = None
for script in scripts:
    if "var imgInfoData" in script.text:  # script that declares imgInfoData
        # drop the declaration prefix and the trailing ";"
        img_info_data = script.text.replace("var imgInfoData =", "").strip()[:-1]
        break

Just remove the leading "var imgInfoData =" and the trailing ";" from the text to get the string value, or use a regex to extract the JSON string from the text.
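The regex route could look roughly like the sketch below. It assumes the value assigned to imgInfoData sits on a single line and ends at an optional semicolon; the helper name extract_js_var and the sample HTML are made up for illustration and are not taken from the real page.

```python
import re


def extract_js_var(html, name):
    """Return the raw text assigned to `var <name> = ...` in html, or None."""
    # Capture everything after "=" up to an optional ";" at the end of that line.
    pattern = re.compile(
        r"var\s+" + re.escape(name) + r"\s*=\s*(.+?)\s*;?\s*$",
        re.MULTILINE,
    )
    match = pattern.search(html)
    return match.group(1) if match else None


# Made-up sample that mirrors the structure shown in the question.
sample_html = """
<script type="text/javascript">
var imgInfoData = {"images": ["a.jpg", "b.jpg"]};
</script>
"""

raw_value = extract_js_var(sample_html, "imgInfoData")
print(raw_value)
```

A value that spans several lines would need a different pattern, so treat this as a starting point rather than a general JavaScript parser.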

Full code:

import requests
from bs4 import BeautifulSoup

def getimgInfoData():
    url = "https://land.naver.com/info/complexGallery.nhn"
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
    params = {"newComplex":"Y",
              "startImage":"Y",
              "rletNo":"102235"}
    res = requests.get(url, headers=headers, params=params)

    soup = BeautifulSoup(res.text, "html5lib")
    scripts = soup.find_all('script', attrs={'type': 'text/javascript'})
    for script in scripts:
        if "var imgInfoData" in script.text:  # script that declares imgInfoData
            # drop the declaration prefix and the trailing ";"
            return script.text.replace("var imgInfoData =", "").strip()[:-1]
    return None

print(getimgInfoData())

Then, if you want, parse the result of getimgInfoData() as JSON.
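As a rough illustration of that last step, the sketch below assumes the extracted text is strict JSON. The real page may use a JavaScript literal that is not valid JSON (for example a single-quoted string), in which case parsing fails and the helper returns None. The helper name parse_img_info and the sample string are invented for this example.

```python
import json


def parse_img_info(raw):
    """Parse the extracted text as JSON; return None if it is missing or invalid."""
    if raw is None:  # getimgInfoData() returns None when the variable is not found
        return None
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # e.g. a single-quoted JavaScript string, which strict JSON rejects
        return None


# Invented sample standing in for the output of getimgInfoData().
sample = '{"images": ["a.jpg", "b.jpg"]}'
data = parse_img_info(sample)
print(data["images"])
```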

