
Replace or remove END OF TRANSMISSION BLOCK with Python 2.7

I am trying to import data from a database encoded in "latin1", convert it to Unicode, and load it into my app. Normally this is no problem. But now I have some new data where one field contains a strange character: "\x17"

How do I deal with this in Python?

What I have so far is a function that replaces this character. But I think there must be much better ways than this:

def replace_problem_characters(self, text):
    replace_store = {u"\x17" : ""}
    for key, value in replace_store.items():
        if key in text:
            text = text.replace(key, value)
    return text

If the database is encoded in "latin", why are you messing with utf-8? Note that in line 4 of your code snippet text is presumed to be encoded in latin but in line 5 the fixed record ends up encoded in utf-8.

When accessing text columns in your database:

1. If not done for you, immediately decode from latin into Unicode.
2. Process your text using Unicode methods.
3. If not done for you, encode your Unicode text into latin just before updating the database.
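The three steps above can be sketched roughly as follows. This is only an illustration: the helper names `clean_field` and `to_database` and the sample bytes are made up, and it assumes your database driver hands you raw byte strings rather than already-decoded Unicode.

```python
from __future__ import unicode_literals  # plain literals are Unicode on Python 2.7


def clean_field(raw_bytes, encoding="latin-1"):
    """Hypothetical helper: decode a raw column value and clean it as Unicode."""
    text = raw_bytes.decode(encoding)   # step 1: bytes -> Unicode
    text = text.replace("\x17", "")     # step 2: process as Unicode only
    return text


def to_database(text, encoding="latin-1"):
    """Hypothetical helper: encode Unicode just before writing it back."""
    return text.encode(encoding)        # step 3: Unicode -> bytes


raw = b"M\xfcller\x17"                  # made-up sample value with a trailing ETB
cleaned = clean_field(raw)
stored = to_database(cleaned)
```

The point of the split is that the replacement only ever sees Unicode text, so no byte-level encoding is mixed into the cleaning step.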

For data like names, you are highly likely not to want any of the 32 C0 controls (\x00 up to \x1f).
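If you want to drop every C0 control rather than just \x17, one possible approach is a regular expression over the decoded text. This is a sketch, not the only way to do it; `strip_c0_controls` is a made-up name, and note that it also removes tab and newline, which may or may not be what you want for fields other than names.

```python
import re

# Matches the 32 C0 control characters U+0000 .. U+001F.
# The \x escapes are interpreted by the regex engine, not by the string literal.
C0_CONTROLS = re.compile(u"[\\x00-\\x1f]")


def strip_c0_controls(text):
    """Hypothetical helper: remove all C0 controls from already-decoded text."""
    return C0_CONTROLS.sub(u"", text)


name = strip_c0_controls(u"Smith\x17 \x00Jones\n")
```

Here `name` ends up as `u"Smith Jones"`: the ETB, NUL and newline are gone, while the ordinary space (\x20) is kept.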

If your database is truly latin aka latin_1 aka ISO-8859-1, you don't want the 32 C1 controls (\x80 up to \x9f) either. However, if you find that you have these in your database, then it is likely that you should have been using cp1252 or similar, which treats most of the bytes \x80 up to \x9f as valid characters (more accented letters and punctuation).
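A quick way to check for this is to decode as latin-1 and look for C1 controls in the result. The following is a heuristic sketch only; `looks_like_cp1252` and the sample bytes are invented for illustration.

```python
import re

# Matches the 32 C1 control characters U+0080 .. U+009F.
C1_CONTROLS = re.compile(u"[\\x80-\\x9f]")


def looks_like_cp1252(raw_bytes):
    """Hypothetical heuristic: True if a latin-1 decode yields C1 controls."""
    return C1_CONTROLS.search(raw_bytes.decode("latin-1")) is not None


raw = b"caf\xe9 \x96 5\x80"              # made-up sample value
as_latin1 = raw.decode("latin-1")        # \x96 and \x80 stay as C1 controls
as_cp1252 = raw.decode("cp1252")         # \x96 -> en dash, \x80 -> euro sign
suspicious = looks_like_cp1252(raw)
```

Be aware that Python's cp1252 codec leaves a few bytes (\x81, \x8d, \x8f, \x90, \x9d) undefined, so decoding data that contains them as cp1252 raises UnicodeDecodeError unless you pass an `errors` argument.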

And in any case it would be a lot better if the database was encoded in utf-8, and if you could use Python 3.x instead of 2.7.


 