Replace or remove END OF TRANSMISSION BLOCK with Python 2.7

Question

I am trying to import data from a database encoded in "latin1", convert it to Unicode, and import it into my app. Normally this is no problem. But now I have some new data where a field contains a strange character: "\x17".

How do I deal with this in Python?

For now I have written a function that replaces this character. But I think there are much better ways than this:

def replace_problem_characters(self, text):
    replace_store = {u"\x17" : ""}
    for key, value in replace_store.items():
        if key in text:
            text = text.replace(key, value)
    return text

Answer 1

If the database is encoded in "latin", why are you messing with utf-8? Note that in line 4 of your code snippet text is presumed to be encoded in latin but in line 5 the fixed record ends up encoded in utf-8.

When accessing text columns in your database:

1. If not done for you, immediately decode from latin into Unicode.
2. Process your text using Unicode methods.
3. If not done for you, encode your Unicode text into latin just before updating the database.
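The three steps above can be sketched roughly as follows. This is a minimal illustration, not the asker's actual code: the sample bytes and the variable names `raw`, `text` and `stored` are made up, and it assumes the driver hands back raw byte strings rather than Unicode.

```python
# Hypothetical bytes as they might arrive from a latin-1 column.
raw = b"M\xfcller\x17"

# 1. Decode from latin-1 into Unicode as early as possible.
text = raw.decode("latin-1")

# 2. Do all processing on the Unicode string.
text = text.replace(u"\x17", u"")

# 3. Encode back to latin-1 only just before writing to the database.
stored = text.encode("latin-1")
```

If the database driver already returns Unicode and accepts Unicode on write, steps 1 and 3 are done for you and only step 2 remains.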

For data like names, you are highly likely not to want any of the 32 C0 controls (\x00 up to \x1f).
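One possible way to drop all C0 controls at once, instead of listing them one by one, is a translation table. This is only a sketch: the function name `strip_c0_controls` and the `keep` parameter are invented here, and it assumes `text` is already a Unicode string (`unicode` in Python 2.7, `str` in Python 3).

```python
# Map every C0 control ordinal (0x00-0x1f) to None, which means "delete".
C0_CONTROLS = dict.fromkeys(range(0x20))

def strip_c0_controls(text, keep=u""):
    """Remove C0 control characters from a Unicode string.

    `keep` lists controls to preserve, e.g. u"\\t\\n" for free-text fields.
    """
    table = dict(C0_CONTROLS)
    for ch in keep:
        table.pop(ord(ch), None)
    return text.translate(table)

cleaned = strip_c0_controls(u"Anna\x17 Smith\x00")
```

For multi-line text fields you would probably pass something like `keep=u"\t\n\r"` so that tabs and line breaks survive.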

If your database is truly latin aka latin_1 aka ISO-8859-1, you don't want the 32 C1 controls (\x80 up to \x9f). However, if you find these in your database, then it is likely that you should have been using cp1252 or similar, which treats most of \x80 up to \x9f as valid code points with more accented letters and punctuation.
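A quick check along these lines can hint at whether the data is really cp1252. The helper name `has_c1_controls` and the sample bytes are hypothetical; the idea is simply that text decoded as latin-1 which contains C1 controls often turns into sensible punctuation when decoded as cp1252 instead.

```python
def has_c1_controls(text):
    """Return True if a Unicode string contains any C1 control (0x80-0x9f)."""
    return any(u"\x80" <= ch <= u"\x9f" for ch in text)

# Hypothetical sample: curly quotes and a euro sign as cp1252 bytes.
raw = b"\x93quoted\x94 \x80 5"

as_latin1 = raw.decode("latin-1")  # yields C1 controls
as_cp1252 = raw.decode("cp1252")   # yields typographic quotes and a euro sign
```

Note that Python's cp1252 codec leaves a few bytes in that range undefined, so decoding arbitrary data as cp1252 can raise `UnicodeDecodeError`.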

And in any case it would be a lot better if the database was encoded in utf-8, and if you could use Python 3.x instead of 2.7.

solution1
0 2016-12-18 04:47:50