Trying to get text out of a div class using Selenium in Python

The HTML div class that contains the data I wish to print:

<div class="gs_a">LR Binford&nbsp;- American antiquity, 1980 - cambridge.org </div>

This is my code so far:

from selenium import webdriver

def Author (SearchVar):
    driver = webdriver.Chrome("/Users/tutau/Downloads/chromedriver")
    driver.get ("https://scholar.google.com/")
    SearchBox = driver.find_element_by_id ("gs_hdr_tsi")
    SearchBox.send_keys(SearchVar)
    SearchBox.submit()
    At = driver.find_elements_by_css_selector ('#gs_res_ccl_mid > div:nth-child(1) > div.gs_ri > div.gs_a')
    print (At)

Author("dog")

All that comes out when I print is:

selenium.webdriver.remote.webelement.WebElement (session="9aa956e2bd51f510dd626f6937b01c0e", element="0.6506218589189958-1")

not the text. I am new to Selenium. Help is appreciated.

Intro

First, I recommend running the CSS selection on Selenium's page_source with a faster parser such as lxml.

import lxml.html  # .cssselect() below also needs the separate cssselect package

# put this below SearchBox.submit()

CSS_SELECTOR = '#gs_res_ccl_mid > :nth-child(1) > .gs_ri > .gs_a' # Define css
source = driver.page_source                                       # Get all html
At_raw = lxml.html.document_fromstring(source)                    # Convert
At = At_raw.cssselect(CSS_SELECTOR)                               # Select by CSS (returns a list)

Solution 1

Then extract the text with text_content(). cssselect() returns a list of matches, so take the first one. On Python 3 the result is already a str and can be printed directly (on Python 2 you would additionally call .encode('utf-8') before printing).

At = At[0].text_content() # Get text of the first match
print(At)

Solution 2

If At matches text that spans several lines or contains non-ASCII characters, you can strip those as well. Note that str.replace() does not understand regular expressions, so re.sub() is used here:

import re

At = [re.sub(r'[^\x00-\x7F]+', '', l).strip()       # remove non-ASCII characters
      for element in At                             # every matched element
      for l in element.text_content().splitlines()  # every line of its text
      if l.strip()]                                 # only keep lines that contain characters
print(At)
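
The line cleanup can also be tried without a browser or lxml. The following is a minimal sketch using only the standard library; the helper name clean_lines and the second line of the sample string are made up for illustration. It assumes that the &nbsp; from the question's HTML arrives as the character '\xa0' in the extracted text, and replaces it with a plain space first so that neighbouring words are not glued together when non-ASCII characters are removed.

```python
import re

# Matches one or more characters outside the 7-bit ASCII range.
NON_ASCII = re.compile(r'[^\x00-\x7F]+')


def clean_lines(text):
    """Return the non-blank lines of text with non-ASCII characters removed."""
    # Turn non-breaking spaces into plain spaces before stripping non-ASCII.
    text = text.replace('\xa0', ' ')
    lines = []
    for line in text.splitlines():
        line = NON_ASCII.sub('', line).strip()
        if line:  # skip lines that are empty after cleaning
            lines.append(line)
    return lines


# Sample input: the first line mirrors the question's div, the rest is invented.
sample = 'LR Binford\xa0- American antiquity, 1980 - cambridge.org \n\n  caf\u00e9 notes  '
cleaned = clean_lines(sample)
print(cleaned)
```

Dropping non-ASCII characters is lossy (the "é" in the invented second line simply disappears), so this is only worth doing when the output really has to be plain ASCII.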

It seems you were almost there. Given the HTML and the code you have shared, the output you are seeing is exactly what that code is expected to produce.

Explanation

Once the following line of code gets executed:

At = driver.find_elements_by_css_selector ('#gs_res_ccl_mid > div:nth-child(1) > div.gs_ri > div.gs_a')

At now refers to a list of WebElements containing the desired element (a single element in your case). In your next step, print (At) prints the WebElement object itself rather than its text, which looks as follows:

selenium.webdriver.remote.webelement.WebElement (session="9aa956e2bd51f510dd626f6937b01c0e", element="0.6506218589189958-1")
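
The difference between printing an object and printing its text can be reproduced without a browser. The sketch below uses a hypothetical stand-in class, StubElement, that models only the .text attribute of a WebElement; it is not part of Selenium.

```python
class StubElement:
    """Hypothetical stand-in for a Selenium WebElement; only .text is modelled."""

    def __init__(self, text):
        self.text = text


# The find_elements_* calls return a list, even when only one node matches.
found = [StubElement("LR Binford - American antiquity, 1980 - cambridge.org")]

print(found)          # a list of object representations, not the visible text
print(found[0].text)  # the text of the first match

# Collect the text of every match in one go.
texts = [element.text for element in found]
```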

Solution

Now, as per your question, to extract the text LR Binford - American antiquity, 1980 - cambridge.org you have to read it from the element instead of printing the element itself.

So you need to change the line of code from:

print (At)

To one of the following:

  • Using text (At is a list, so index the first match):

     print(At[0].text) 
  • Using get_attribute(attributeName):

     print(At[0].get_attribute("innerHTML")) 
  • Your own code with minor adjustments:

     # -*- coding: UTF-8 -*-
     from selenium import webdriver

     def Author (SearchVar):
         options = webdriver.ChromeOptions()
         options.add_argument("start-maximized")
         options.add_argument('disable-infobars')
         driver=webdriver.Chrome(chrome_options=options, executable_path=r'C:\\Utility\\BrowserDrivers\\chromedriver.exe')
         driver.get ("https://scholar.google.com/")
         SearchBox = driver.find_element_by_name("q")
         SearchBox.send_keys(SearchVar)
         SearchBox.submit()
         At = driver.find_elements_by_css_selector ('#gs_res_ccl_mid > div:nth-child(1) > div.gs_ri > div.gs_a')
         for item in At:
             print(item.text)

     Author("dog")
  • Console Output:

     …, RJ Marles, LS Pellicore, GI Giancaspro, TL Dog - Drug Safety, 2008 - Springer 
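
In current Selenium 4 releases the find_element_by_* and find_elements_by_* helpers used in this thread are no longer available; locators are passed to find_element() and find_elements() instead. The following is a rough sketch of the same flow in that style, not a drop-in replacement. The function name author_lines is made up, and the locator strategies are given as plain strings ("name", "css selector") so the snippet has no import-time dependency; with Selenium installed you would normally write By.NAME and By.CSS_SELECTOR from selenium.webdriver.common.by, which hold the same string values.

```python
SCHOLAR_URL = "https://scholar.google.com/"
RESULT_SELECTOR = "#gs_res_ccl_mid > div:nth-child(1) > div.gs_ri > div.gs_a"


def author_lines(driver, search_term):
    """Search Google Scholar and return the text of each matching div.gs_a."""
    driver.get(SCHOLAR_URL)
    # "name" and "css selector" are the string values of By.NAME and By.CSS_SELECTOR.
    search_box = driver.find_element("name", "q")
    search_box.send_keys(search_term)
    search_box.submit()
    matches = driver.find_elements("css selector", RESULT_SELECTOR)
    # Read .text from every element instead of printing the elements themselves.
    return [match.text for match in matches]


# Usage (assumes Selenium 4 and a Chrome driver that Selenium can locate):
#   from selenium import webdriver
#   driver = webdriver.Chrome()
#   try:
#       print(author_lines(driver, "dog"))
#   finally:
#       driver.quit()
```

Returning the texts rather than printing them inside the function keeps the browser handling separate from whatever you do with the result.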

You are printing the element itself. Print its text instead: since At is a list, use At[0].text (or item.text for each item in At) rather than At.
