Remove html formatting “&gt;” from text file using Python csv.reader

Question

I have a text file with ; used as the delimiter. The problem is that it has some HTML text formatting in it, such as &gt;. Obviously the ; in this causes problems. The text file is large and I don't have a list of these HTML strings; there are many different examples, such as &amp;. How can I remove all of them using Python? The file is a list of names, addresses, phone numbers and a few more fields. I am looking for the crap.html.remove(textfile) module.

Answer 1

The quickest way is probably to use the undocumented but so far stable unescape method in HTMLParser (a Python 2 module):

import HTMLParser
s = HTMLParser.HTMLParser().unescape(s)

Note that this always returns a Unicode string, so if the input contains any non-ASCII bytes you will need to call s.decode(encoding) first.
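
For anyone on Python 3, where the HTMLParser module above no longer exists under that name, a rough equivalent is the documented html.unescape function (available since Python 3.4). This is only a minimal sketch: the helper name strip_entities and the sample line are made up for illustration.

```python
import html


def strip_entities(text):
    """Replace HTML character references (named or numeric) with the characters they stand for."""
    return html.unescape(text)


# Hypothetical sample line in the ';'-delimited layout described in the question.
sample = "Smith &amp; Sons;12 High St &gt; Rear;555-0100"
cleaned = strip_entities(sample)
print(cleaned)  # Smith & Sons;12 High St > Rear;555-0100
```

In Python 3 the input is already a str, so no separate decode step is needed as long as the file was opened with the right encoding.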

Answer 2

Take a look at this code:

import re, htmlentitydefs

##
# Removes HTML or XML character references and entities from a text string.
#
# @param text The HTML (or XML) source text.
# @return The plain text, as a Unicode string, if necessary.

def unescape(text):
    def fixup(m):
        text = m.group(0)
        if text[:2] == "&#":
            # character reference
            try:
                if text[:3] == "&#x":
                    return unichr(int(text[3:-1], 16))
                else:
                    return unichr(int(text[2:-1]))
            except (ValueError, OverflowError):
                pass
        else:
            # named entity
            try:
                text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
            except KeyError:
                pass
        return text # leave as is
    return re.sub(r"&#?\w+;", fixup, text)

Of course, this only takes care of HTML entities. You may have other semicolons in the text that mess with your CSV parser. But I guess you already know that...

UPDATE: added catch for possible OverflowError.
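
To tie this back to the csv.reader part of the question, here is a minimal sketch of unescaping each line before it reaches the parser. It assumes Python 3 (so it uses html.unescape rather than the function above); the name read_rows and the sample data are hypothetical.

```python
import csv
import html
import io


def read_rows(lines):
    """Unescape HTML entities in each line, then parse the result as ';'-delimited text."""
    unescaped = (html.unescape(line) for line in lines)
    return list(csv.reader(unescaped, delimiter=";"))


# Hypothetical stand-in for the real file; any iterable of lines works,
# for example an open file object.
sample = io.StringIO(
    "Smith &amp; Sons;12 High St;555-0100\n"
    "A &gt; B Ltd;3 Low Rd;555-0199\n"
)
rows = read_rows(sample)
for row in rows:
    print(row)
```

This removes the semicolons that belong to entities, but, as noted above, any other stray semicolons in the data will still split a field. An entity that decodes to a semicolon or a double quote (such as &#59; or &quot;) can also confuse the parser, so it is worth checking the field count of each row.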

Answer 3

On most Unix systems (including your Mac OS X), you can recode the input text file with:

recode html.. file_with_html.txt

This replaces &gt; by ">", etc.

You can call this through Python's subprocess module, for instance.
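
A minimal sketch of the subprocess call, assuming Python 3 and that the recode utility is installed and on the PATH. The function names are made up for illustration. As far as I know, recode rewrites the named file in place, so try it on a copy first.

```python
import shutil
import subprocess


def recode_command(path):
    """Build the argument list for the recode command shown above."""
    return ["recode", "html..", path]


def recode_in_place(path):
    """Run recode on the given file, raising if the tool is missing or exits with an error."""
    if shutil.which("recode") is None:
        raise RuntimeError("the recode utility was not found on PATH")
    # Passing a list (no shell=True) avoids shell quoting problems with odd file names.
    subprocess.run(recode_command(path), check=True)


print(recode_command("file_with_html.txt"))
```

Calling recode_in_place("file_with_html.txt") would then convert the file, after which it can be read with csv.reader as usual.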

3 answers

solution1
6 ACCEPTED 2009-10-28 13:41:44

solution2
3 2009-10-28 13:39:17

solution3
1 2010-01-02 10:59:58
