Trying to get text out of a div class using Selenium in Python
HTML div class that contains the data I wish to print:
<div class="gs_a">LR Binford - American antiquity, 1980 - cambridge.org </div>
This is my code so far:
from selenium import webdriver

def Author (SearchVar):
    driver = webdriver.Chrome("/Users/tutau/Downloads/chromedriver")
    driver.get ("https://scholar.google.com/")
    SearchBox = driver.find_element_by_id ("gs_hdr_tsi")
    SearchBox.send_keys(SearchVar)
    SearchBox.submit()
    At = driver.find_elements_by_css_selector ('#gs_res_ccl_mid > div:nth-child(1) > div.gs_ri > div.gs_a')
    print (At)

Author("dog")
All that comes out when I print is:
selenium.webdriver.remote.webelement.WebElement (session="9aa956e2bd51f510dd626f6937b01c0e", element="0.6506218589189958-1")
not the text. I am new to Selenium. Help is appreciated.
Intro
First, I recommend CSS-selecting your target on Selenium's page_source using a faster parser.
import lxml.html  # CSS selection additionally needs the cssselect package
# put this below SearchBox.submit()
CSS_SELECTOR = '#gs_res_ccl_mid > :nth-child(1) > .gs_ri > .gs_a'  # CSS selector for the target
source = driver.page_source                     # full HTML of the current page
At_raw = lxml.html.document_fromstring(source)  # parse it into an lxml tree
At = At_raw.cssselect(CSS_SELECTOR)             # list of matching elements
Solution 1
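Then, you need to extract the text with text_content(). Note that cssselect() returns a list, so pick the first match. On Python 2 you also need to encode the result as UTF-8 before printing; on Python 3 the returned str prints directly.
At = At[0].text_content().strip()  # text of the first matching element
print(At)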
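The parse-then-extract idea can be tried offline, without a browser. The sketch below is not the lxml approach from this answer: it swaps in the standard library's html.parser so it runs without extra packages, and the names ClassTextExtractor and texts_by_class are made up for illustration. The source string is only a stand-in for driver.page_source.

```python
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collect the text of every <div> whose class list contains target_class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0      # > 0 while we are inside a matching <div>
        self.results = []   # one string per matching <div>

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:                       # nested <div> inside a match
            self.depth += 1
        elif self.target_class in classes:   # start of a new match
            self.depth = 1
            self.results.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data


def texts_by_class(html, target_class):
    """Return the stripped text of each <div> carrying target_class."""
    parser = ClassTextExtractor(target_class)
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.results]


# Stand-in for driver.page_source, using the div from the question.
source = '<div class="gs_a">LR Binford - American antiquity, 1980 - cambridge.org </div>'
print(texts_by_class(source, "gs_a"))
```

This only matches on the class name, so it is cruder than the full CSS selector used above; it is meant to show that the text lives in the parsed page source, not in the element object's printed form.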
Solution 2
In case At contains more than one line and non-ASCII characters, you can also strip those:
import re
At = [re.sub(r'[^\x00-\x7F]+', '', l).strip()        # remove non-ASCII characters
      for element in At                               # each matched element
      for l in element.text_content().splitlines()    # each line of its text
      if l.strip()]                                   # only keep lines that contain characters
print(At)
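The cleanup step can be tried in isolation on a plain string. This is a minimal sketch using re.sub, since str.replace does not interpret regular expressions; clean_lines and the sample string are made up for illustration.

```python
import re

# Matches any run of characters outside the 7-bit ASCII range.
NON_ASCII = re.compile(r'[^\x00-\x7F]+')


def clean_lines(text):
    """Split text into lines, strip non-ASCII characters, drop empty lines."""
    cleaned = []
    for line in text.splitlines():
        line = NON_ASCII.sub('', line).strip()
        if line:  # skip lines that are empty after cleaning
            cleaned.append(line)
    return cleaned


# Hypothetical multi-line text with an ellipsis character and a blank line.
sample = "\u2026, RJ Marles, TL Dog\n\n  Drug Safety, 2008 - Springer  \n"
print(clean_lines(sample))
```

Here the emptiness check runs after the non-ASCII characters are removed, so a line consisting only of such characters is dropped as well.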
It seems you were almost there. Going by the HTML and the code trials you have shared, your locator is already finding the desired element.
Once the following line of code gets executed:
At = driver.find_elements_by_css_selector ('#gs_res_ccl_mid > div:nth-child(1) > div.gs_ri > div.gs_a')
At now refers to the desired element (a single element in your list). In your next step, as you invoked print (At), that WebElement is printed as follows:
selenium.webdriver.remote.webelement.WebElement (session="9aa956e2bd51f510dd626f6937b01c0e", element="0.6506218589189958-1")
Now, as per your question, if you want to extract the text LR Binford - American antiquity, 1980 - cambridge.org, you have to invoke either of these through the element:
text: Gets the text of the element.
get_attribute(attributeName): Gets the given attribute or property of the element.
So you need to change the line of code from:
print (At)
To either of the following (At is a list, so index into it or loop over it first):
Using text:
print(At[0].text)
Using get_attribute(attributeName):
print(At[0].get_attribute("innerHTML"))
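The difference between printing the object and printing its text can be seen without a browser. In this sketch FakeElement is a hypothetical stand-in for Selenium's WebElement, not the real class; it only mimics the two things that matter here, a text attribute and an uninformative repr.

```python
class FakeElement:
    """Hypothetical stand-in for a Selenium WebElement (not the real class)."""

    def __init__(self, text):
        self.text = text

    def __repr__(self):
        return '<FakeElement (element="0.65-1")>'


# find_elements_* returns a list, even when only one element matches.
At = [FakeElement("LR Binford - American antiquity, 1980 - cambridge.org")]

print(At)           # prints the list of object reprs, not the text
print(At[0].text)   # prints the text of the first match
for item in At:     # or loop over every match
    print(item.text)
```

The list itself has no text attribute, which is why the element has to be picked out (or looped over) before reading its text.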
Your own code with minor adjustments:
# -*- coding: UTF-8 -*-
from selenium import webdriver

def Author (SearchVar):
    options = webdriver.ChromeOptions()
    options.add_argument("start-maximized")
    options.add_argument('disable-infobars')
    driver=webdriver.Chrome(chrome_options=options, executable_path=r'C:\\Utility\\BrowserDrivers\\chromedriver.exe')
    driver.get ("https://scholar.google.com/")
    SearchBox = driver.find_element_by_name("q")
    SearchBox.send_keys(SearchVar)
    SearchBox.submit()
    At = driver.find_elements_by_css_selector ('#gs_res_ccl_mid > div:nth-child(1) > div.gs_ri > div.gs_a')
    for item in At:
        print(item.text)

Author("dog")
Console Output:
…, RJ Marles, LS Pellicore, GI Giancaspro, TL Dog - Drug Safety, 2008 - Springer
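The code above uses the Selenium 3 style find_elements_by_css_selector helper, which was removed in later Selenium 4 releases. A hedged sketch of the same lookup with the By-based locator API follows; result_texts, StubElement and StubDriver are names made up here, and the stub classes only stand in for a real driver so the sketch runs without a browser.

```python
try:
    from selenium.webdriver.common.by import By
    CSS = By.CSS_SELECTOR
except ImportError:
    # By.CSS_SELECTOR is the plain string "css selector"; this fallback only
    # lets the sketch run where Selenium is not installed.
    CSS = "css selector"

SELECTOR = '#gs_res_ccl_mid > div:nth-child(1) > div.gs_ri > div.gs_a'


def result_texts(driver, selector=SELECTOR):
    """Return the text of every element matching the CSS selector."""
    return [element.text for element in driver.find_elements(CSS, selector)]


class StubElement:
    """Stand-in for a WebElement: only carries a text attribute."""

    def __init__(self, text):
        self.text = text


class StubDriver:
    """Stand-in for a WebDriver: returns one canned element."""

    def find_elements(self, by, value):
        return [StubElement("LR Binford - American antiquity, 1980 - cambridge.org")]


print(result_texts(StubDriver()))
```

With a real driver, the call would be print(result_texts(driver)) placed after SearchBox.submit().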
You are printing the element itself. Print its text instead: since find_elements_by_css_selector returns a list, use At[0].text (or loop over At and print each item's text) instead of At.