How to get data from inspect element of a webpage using Python

I'd like to get the data from inspect element using Python. I'm able to download the source code using BeautifulSoup, but now I need the text from inspect element of a webpage. I'd truly appreciate it if you could advise me how to do it.

Edit: By inspect element I mean the option called "Inspect element" that Google Chrome offers on right click, which shows the code related to each element of that particular page. I'd like to extract that code, or just its text strings.

If you want to automatically fetch a web page from Python in a way that runs JavaScript, you should look into Selenium. It can automatically drive a web browser (even a headless web browser such as PhantomJS, so you don't have to have a window open).

In order to get the HTML, you'll need to evaluate some JavaScript. Simple sample code, alter to suit:

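from selenium import webdriver

# Note: webdriver.PhantomJS exists only in older Selenium releases (it was removed in Selenium 4)
driver = webdriver.PhantomJS()
driver.get("http://google.com")

# The page source as reported by the driver
# (depending on the driver, this may already reflect changes made by JavaScript)
html1 = driver.page_source

# The serialized DOM after on-load JavaScript has run
html2 = driver.execute_script("return document.documentElement.innerHTML;")

driver.quit()  # shut down the browser process when finished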
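
PhantomJS is no longer maintained, and Selenium 4.x no longer includes a PhantomJS driver, so on a current setup the same idea is usually expressed with headless Chrome or Firefox. Below is a minimal sketch, assuming Selenium 4 and a local Chrome install. The names `chrome_factory` and `fetch_rendered_html` are hypothetical helpers made up for this example, not part of Selenium.

```python
def chrome_factory():
    """Create a headless Chrome driver (assumes Selenium 4 and Chrome are installed)."""
    # Imported lazily so the helper below can be used with any driver-like object.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")  # run without opening a window
    return webdriver.Chrome(options=options)


def fetch_rendered_html(url, driver_factory=chrome_factory):
    """Load `url` in a browser and return the DOM serialized after scripts have run."""
    driver = driver_factory()
    try:
        driver.get(url)
        return driver.execute_script("return document.documentElement.outerHTML;")
    finally:
        driver.quit()  # always shut the browser down, even if loading fails


# Example call (needs Chrome and network access):
# html = fetch_rendered_html("http://google.com")
```

`driver.get` returns once the page's load event has fired; content that a page fetches later may need an explicit wait before the DOM is read.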

Note 1: If you want a specific element or elements, you have a couple of options: parse the HTML in Python, or write more specific JavaScript that returns what you want.

Note 2: If you actually need specific information from Chrome's tools that is not just dynamically generated HTML, you'll need a way to hook into Chrome itself. No way around that.
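
For the first option in Note 1 (parsing the HTML in Python), here is a minimal sketch that pulls out just the text strings using only the standard library's `html.parser`. The class name `TextExtractor`, the function `extract_text`, and the sample markup are made up for illustration. In practice you would pass in the HTML string obtained from the browser.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.texts = []
        self._skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.texts.append(text)


def extract_text(html):
    """Return the non-empty text strings found in `html`, in document order."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.texts


sample = (
    "<html><head><title>Demo</title><script>var x = 1;</script></head>"
    "<body><h1>Hello</h1><p>Some <b>bold</b> text</p></body></html>"
)
print(extract_text(sample))
```

This keeps every text node, including the page title. A parser such as BeautifulSoup is more convenient once you need to select particular elements.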

Inspect element shows the HTML of the page. For a page whose content is not changed by JavaScript, that is the same HTML you get by fetching the page with urllib.

You can do something like this:

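from urllib.request import urlopen  # Python 3; in Python 2 this was urllib.urlopen
from bs4 import BeautifulSoup

URL = "http://example.com"  # placeholder: the page to fetch
TAG_NAME = "p"              # placeholder: the tag whose text you want

html = urlopen(URL).read()

# Name the parser explicitly so results do not depend on which parsers are installed
soup = BeautifulSoup(html, "html.parser")

# find_all returns a list of elements, so call get_text() on each one
for element in soup.find_all(TAG_NAME):
    print(element.get_text())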

I would like to add to the answer from Jason S. I wasn't able to start PhantomJS on OS X:

>>> driver = webdriver.PhantomJS()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/selenium/webdriver/phantomjs/webdriver.py", line 50, in __init__
    self.service.start()
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/selenium/webdriver/phantomjs/service.py", line 74, in start
    raise WebDriverException("Unable to start phantomjs with ghostdriver.", e)
selenium.common.exceptions.WebDriverException: Message: Unable to start phantomjs with ghostdriver.

I resolved this by following another answer: download the PhantomJS executable and pass its path to the driver explicitly.

driver = webdriver.PhantomJS("phantomjs-2.0.0-macosx/bin/phantomjs")

BeautifulSoup can be used to parse the HTML document and extract anything you want from it. It is not designed for downloading, so fetch the page with something else first. You can then find the elements you want by their class and id.
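
As a rough illustration of looking elements up by id and class with BeautifulSoup, here is a minimal sketch. The HTML snippet and the id and class names in it (`main`, `title`, `note`) are made up for the example. In practice the string would be the page you downloaded separately.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for a downloaded page
html = """
<div id="main">
  <h1 class="title">Example page</h1>
  <p class="note">First note</p>
  <p class="note">Second note</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Look up a single element by id, then search inside it by tag and class
main = soup.find(id="main")
heading = main.find("h1", class_="title").get_text(strip=True)

# find_all returns every match; take the text of each one
notes = [p.get_text(strip=True) for p in soup.find_all("p", class_="note")]

# The same lookup written as a CSS selector
notes_css = [p.get_text(strip=True) for p in soup.select("#main p.note")]

print(heading)
print(notes)
```

Note that `class_` has a trailing underscore because `class` is a reserved word in Python.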
