I need to extract pure text from an arbitrary web page at runtime, on the server side. I use Google App Engine and a Python port of Readability. There are a number of those.
I use Yuri's version, as it is the most recent and seems to be in active development. I managed to make it run on Google App Engine using Python 2.7. Now the "problem" is that it returns HTML, whereas I need pure text.
The advice in this Stack Overflow article about link extraction is to use BeautifulSoup. I will, if there is no other choice. BeautifulSoup would be yet another dependency, as I use the lxml-based version.
My questions:
You can use html2text. It is a nifty tool.
Here is a link on how to use it with python readability tool - together they are called read2text.
http://brettterpstra.com/scripting-readability-markdownify-for-clipping-web-pages/
Hope this helps :)
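Not to let it linger, here is my current solution:
from bs4 import BeautifulSoup
# Name the parser explicitly; "lxml" is assumed here because the lxml-based Readability port already depends on it
soup = BeautifulSoup(html, "lxml")
text = soup.get_text()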
First, extract the HTML content with readability:
from readability.readability import Document
html_snippet = Document(html).summary()
Then, use a library to remove the HTML tags. There are caveats: 1) you probably need spaces, so "<p>some text<br>other text" shouldn't become "some textother text", and you might need list items converted into " - ". 2) "&#39;" should be displayed as "'", and "&gt;" should be displayed as ">". This is called HTML entity replacement (see below).
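I usually use a library called bleach to clean out unnecessary tags and attributes. Pass strip=True, because by default bleach escapes disallowed tags instead of removing them:
import bleach
cleaned_text = bleach.clean(html_snippet, tags=[], strip=True)
or, to keep a few inline tags:
cleaned_text = bleach.clean(html_snippet, tags=['i', 'b'], strip=True)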
If you want to remove all tags and get better text formatting, use an html2text-style library, or implement a custom formatting procedure yourself.
But I think you get the rough idea now.
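A custom formatting procedure of that kind could be sketched with the standard library alone. This is only an illustration, not part of readability or bleach: the names TextExtractor, BLOCK_TAGS and html_to_text are made up for the example, the set of tags treated as line breaks is an arbitrary choice, and the code is written for Python 3 (on Python 2.7 the parser lives in the HTMLParser module and entity handling differs).

```python
from html.parser import HTMLParser

# Tags that produce a line break in the output (an illustrative choice).
BLOCK_TAGS = {"p", "br", "div", "ul", "ol", "h1", "h2", "h3"}


class TextExtractor(HTMLParser):  # hypothetical helper, not a library class
    def __init__(self):
        # convert_charrefs=True makes the parser decode entities
        # such as &#39; and &gt; before handle_data is called.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.parts.append("\n - ")  # list items become " - " bullets
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html_snippet):
    parser = TextExtractor()
    parser.feed(html_snippet)
    parser.close()  # flush any text still buffered by the parser
    text = "".join(parser.parts)
    # Drop empty lines and trailing whitespace, keep leading bullet indent.
    lines = [line.rstrip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line.strip())


print(html_to_text("<p>some text<br>other text</p>"))
```

It would be called on the output of Document(html).summary(). Tags outside BLOCK_TAGS are simply dropped, so inline markup such as <b> or <i> leaves the surrounding text untouched.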
For simple text formatting with bleach: for example, if you want paragraphs as "\n" and list items as "\n - ", then:
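# Keep only the structural tags; strip=True removes the rest instead of escaping them
norm_html = bleach.clean(html_snippet, tags=['p', 'br', 'li'], strip=True)
replaced_html = norm_html.replace('<p>', '\n').replace('</p>', '\n')
replaced_html = replaced_html.replace('<br>', '\n').replace('<li>', '\n - ')
# Remove whatever tags are left (for example the closing </li>)
cleaned_text = bleach.clean(replaced_html, tags=[], strip=True)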
For a regexp that only strips HTML tags and does entity replacement ("&gt;" should be ">" and so on), you can take a look at https://stackoverflow.com/a/7778368/217895
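As a rough illustration of the regexp approach (this is not the code from the linked answer), tags can be stripped with a regular expression and entities decoded with html.unescape from the Python 3 standard library. The names TAG_RE and strip_tags are made up for this sketch, and the pattern is deliberately naive: it does not handle comments, script or style contents, or a ">" inside an attribute value.

```python
import html
import re

# Naive tag pattern: "<", then anything up to the next ">".
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(html_snippet):  # hypothetical helper name
    # Replace each tag with a space so adjacent words do not run together.
    without_tags = TAG_RE.sub(" ", html_snippet)
    # Decode entities only after the tags are gone, so an escaped "&lt;b&gt;"
    # is kept as literal text instead of being treated as a tag.
    unescaped = html.unescape(without_tags)
    # Collapse all runs of whitespace into single spaces.
    return " ".join(unescaped.split())


print(strip_tags("<p>some text<br>other text</p>"))
```

Because every tag becomes a space, paragraph and list structure is lost; this only suits cases where a single line of text is enough.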