How to copy all the text from a URL (like [Ctrl+A][Ctrl+C] in a web browser) in Python?

Question

I know there is an easy way to copy the full source of a URL, but that's not my task. I need to save only the text (exactly what a web browser user would copy) to a *.txt file.

Is it unavoidable to parse the HTML source for this, or is there a better way?

Answer 1

Parsing is required. I don't know whether there's a library method for this. A simple regex:

from re import sub
text = sub(r"<[^>]+>", " ", html)

This needs many improvements, but it's a starting point.
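
A rough sketch of a few of those improvements, assuming Python 3 and only the standard library: drop script and style blocks (their contents are not visible text), drop comments, decode entities, and collapse whitespace. The name strip_tags is made up for illustration, and a regex will still misfire on malformed or unusual markup.

```python
import re
from html import unescape


def strip_tags(html):
    """Crude HTML-to-text conversion; strip_tags is an illustrative name."""
    # Remove <script> and <style> elements together with their contents.
    html = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", html)
    # Remove HTML comments.
    html = re.sub(r"(?s)<!--.*?-->", " ", html)
    # Replace every remaining tag with a space (the regex from the answer).
    text = re.sub(r"<[^>]+>", " ", html)
    # Decode entities such as &amp; and &nbsp;.
    text = unescape(text)
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()


print(strip_tags("<p>Hello &amp; <b>bye</b></p><script>var a = '<x>';</script>"))
# prints: Hello & bye
```

This still knows nothing about CSS, so text hidden by a stylesheet comes through.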

Answer 2

I think it is impossible if you don't parse at all. I guess you could use HTMLParser http://docs.python.org/2/library/htmlparser.html and keep only the text data between the tags, but you will most likely get a lot more than you want.

Getting exactly what [Ctrl+C] gives you is very hard without full parsing, because of things like style="display: none;", which hides the text. Handling that in general means fully processing the HTML, JavaScript and CSS of both the document and its resource files.
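
To make that concrete, here is a minimal sketch using the Python 3 standard library parser (html.parser.HTMLParser). It keeps only text data, skips script and style contents, and skips elements hidden through an inline style attribute. The names VisibleTextParser and visible_text are made up for illustration. It assumes reasonably well-formed markup (non-void tags are closed) and does not look at external stylesheets or anything JavaScript does, so it only approximates what a browser would copy.

```python
import re
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
# Inline styles that hide an element (only the style attribute is checked).
HIDDEN_STYLE = re.compile(r"display\s*:\s*none|visibility\s*:\s*hidden", re.I)


class VisibleTextParser(HTMLParser):
    """Collects text data, ignoring script/style and inline-hidden elements."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.hidden_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = dict(attrs).get("style") or ""
        if (self.hidden_depth or tag in ("script", "style")
                or HIDDEN_STYLE.search(style)):
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.hidden_depth:
            self.hidden_depth -= 1

    def handle_data(self, data):
        if not self.hidden_depth and data.strip():
            self.parts.append(data.strip())


def visible_text(html):
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)
```

For example, visible_text('<div>Shown<span style="display: none">Hidden</span></div>') returns 'Shown'.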

Answer 3

With Python, the BeautifulSoup module is great for parsing HTML and well worth a look. To get the text from a web page, it's just a case of:

#!/usr/bin/env python3
# Fetch a page and print its text with BeautifulSoup (Python 3).
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'http://python.org'
html = urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly

# You can refine this further if needed, e.g. soup.body.div.get_text()
text = soup.body.get_text()

print(text)
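
Since the question asks for a *.txt file, here is a hedged sketch of one way to finish the job, assuming Python 3 with the bs4 package installed. The helper names page_text and save_page_text are hypothetical. Script and style elements are removed first because get_text() would otherwise include their contents.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def page_text(html):
    """Return the text of an HTML document, one text node per line."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop elements whose contents a browser never shows as page text.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # strip=True trims each string and skips whitespace-only ones.
    return soup.get_text(separator="\n", strip=True)


def save_page_text(url, path):
    """Download url and write its text to path as UTF-8."""
    html = urlopen(url).read()
    with open(path, "w", encoding="utf-8") as f:
        f.write(page_text(html))
```

Calling save_page_text('http://python.org', 'python_org.txt') would write the result to a file (the output filename is only an example). Text hidden by CSS or generated by JavaScript is still not handled.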

3 answers

solution1
1 2013-05-07 18:08:53

solution2
1 ACCEPTED 2013-05-07 18:12:25

solution3
1 2013-05-07 18:23:19
