简体   繁体   中英

How to copy all the text from url (like [Ctrl+A][Ctrl+C] with webbrowser) in python?

I know there is the easy way to copy all the source of url, but it's not my task. I need exactly save just all the text (just like webbrowser user copy it) to the *.txt file.

Is it unavoidable to parse source code html for it, or there is a better way?

Parsing is required. Don't know if there's a library method. A simple regex:

text = sub(r"<[^>]+>", " ", html)

this requires many improvements, but it's a starting point.

I think it is impossible if you don't parse at all. I guess you could use HtmlParser http://docs.python.org/2/library/htmlparser.html and just keep the data tags, but you will most likely get many other elements than you want.

To get exactly the same as [Ctrl-C] would be very difficult to avoid parsing because of things like the style="display: hidden;" which would hide the text, which again will result in full parsing of html, javascript and css of both the document and resource files.

With python, the BeautifulSoup module is great for parsing HTML, and well worth a look. To get the text from a webpage, it's just a case of:

#!/usr/env python
#
import urllib2
from bs4 import BeautifulSoup

url  = 'http://python.org'
html = urllib2.urlopen(url).read()
soup = BeautifulSoup(html)

# you can refine this even further if needed... ie. soup.body.div.get_text()
text = soup.body.get_text() 

print text

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM