Getting all visible text from a webpage using Selenium

I've been googling this all day without finding the answer, so apologies in advance if this has already been answered.

I'm trying to get all visible text from a large number of different websites. The reason is that I want to process the text to eventually categorize the websites.

After a couple of days of research, I decided that Selenium was my best bet. I've found a way to grab all the text with Selenium; unfortunately, the same text is being grabbed multiple times:

from selenium import webdriver
import codecs

filen = codecs.open('output.txt', encoding='utf-8', mode='w+')

driver = webdriver.Firefox()

driver.get("http://www.examplepage.com")

allelements = driver.find_elements_by_xpath("//*")

ferdigtxt = []

for i in allelements:
    if i.text in ferdigtxt:
        pass
    else:
        ferdigtxt.append(i.text)
        filen.writelines(i.text)

filen.close()

driver.quit()

The if condition inside the for loop is an attempt at eliminating the problem of fetching the same text multiple times; it does not, however, work as planned, and only helps on some webpages. (It also makes the script a lot slower.)

I'm guessing the reason for my problem is that, when asking for the inner text of an element, I also get the inner text of the elements nested inside the element in question.

Is there any way around this? Is there some sort of master element whose inner text I could grab? Or a completely different way that would enable me to reach my goal? Any help would be greatly appreciated, as I'm out of ideas for this one.

Edit: the reason I used Selenium and not Mechanize and Beautiful Soup is that I wanted JavaScript-rendered text.
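
One thing worth noting about the "master element" idea: in Selenium, an element's .text property returns only the rendered, visible text, so asking the <body> element for its text once may avoid the nested-duplication problem entirely. A minimal sketch, using the same old-style find_element_by_* API as the code above (newer Selenium versions spell it driver.find_element(By.TAG_NAME, "body") instead):

from selenium import webdriver
import codecs

driver = webdriver.Firefox()
driver.get("http://www.examplepage.com")

# .text contains only the text that is actually rendered on the page,
# and reading it once from <body> avoids re-reading nested elements
visible_text = driver.find_element_by_tag_name("body").text

with codecs.open('output.txt', encoding='utf-8', mode='w+') as filen:
    filen.write(visible_text)

driver.quit()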

Using lxml, you might try something like this:

import contextlib
import selenium.webdriver as webdriver
import lxml.html as LH
import lxml.html.clean as clean

url="http://www.yahoo.com"
ignore_tags=('script','noscript','style')
with contextlib.closing(webdriver.Firefox()) as browser:
    browser.get(url) # Load page
    content=browser.page_source
    cleaner=clean.Cleaner()
    content=cleaner.clean_html(content)    
    with open('/tmp/source.html','w') as f:
        f.write(content.encode('utf-8'))
    doc=LH.fromstring(content)
    with open('/tmp/result.txt','w') as f:
        for elt in doc.iterdescendants():
            if elt.tag in ignore_tags: continue
            text=elt.text or ''
            tail=elt.tail or ''
            words=' '.join((text,tail)).strip()
            if words:
                words=words.encode('utf-8')
                f.write(words+'\n') 

This seems to get almost all of the text on www.yahoo.com, except for text in images and some text that changes with time (done with javascript and refresh perhaps).
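
The loop reads both elt.text and elt.tail because of lxml's text model: .text is the text that appears before an element's first child, and .tail is the text that follows the element's closing tag inside its parent. A tiny illustration (the markup here is invented just for the example):

import lxml.html as LH

frag = LH.fromstring('<div>before <b>bold</b> after</div>')
b = frag.find('b')
print(frag.text)  # 'before ' -> text before the first child of <div>
print(b.text)     # 'bold'    -> text inside <b>
print(b.tail)     # ' after'  -> text after </b>, stored on the <b> element itself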

Here's a variation on @unutbu's answer:

#!/usr/bin/env python
import sys
from contextlib import closing

import lxml.html as html # pip install 'lxml>=2.3.1'
from lxml.html.clean        import Cleaner
from selenium.webdriver     import Firefox         # pip install selenium
from werkzeug.contrib.cache import FileSystemCache # pip install werkzeug

cache = FileSystemCache('.cachedir', threshold=100000)

url = sys.argv[1] if len(sys.argv) > 1 else "https://stackoverflow.com/q/7947579"


# get page
page_source = cache.get(url)
if page_source is None:
    # use firefox to get page with javascript generated content
    with closing(Firefox()) as browser:
        browser.get(url)
        page_source = browser.page_source
    cache.set(url, page_source, timeout=60*60*24*7) # week in seconds


# extract text
root = html.document_fromstring(page_source)
# remove flash, images, <script>,<style>, etc
Cleaner(kill_tags=['noscript'], style=True)(root) # lxml >= 2.3.1
print root.text_content() # extract text

I've separated your task into two parts:

  • get the page (including elements generated by javascript)
  • extract the text

The code is connected only through the cache. You can fetch pages in one process and extract the text in another process, or defer the extraction and do it later using a different algorithm.
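
For example, a second script could reuse the cached page source later without starting Firefox at all. A minimal sketch, assuming the .cachedir created above already holds the page; note that newer Werkzeug releases moved FileSystemCache into the separate cachelib package with the same interface:

from werkzeug.contrib.cache import FileSystemCache  # or: from cachelib import FileSystemCache
import lxml.html as html

cache = FileSystemCache('.cachedir', threshold=100000)

url = "https://stackoverflow.com/q/7947579"
page_source = cache.get(url)  # None if the fetch step has not stored this URL yet
if page_source is not None:
    root = html.document_fromstring(page_source)
    print(root.text_content())  # plain-text extraction, no browser needed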
