
Python: remove ctrl-characters from a string

I have a bunch of XML files dumped to disk in batches. When I tried to parse them, I found that some had a control character inserted into an attribute.

It looked like this:

<root ^KIND="A"></root>

When it was supposed to look like this:

<root KIND="A"></root>

In this case it was easily fixed with some regexp magic:

import re
xml = re.sub(r'<([^>]*)\v([^>]*)>', r'<\1K\2>', xml)
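
As a quick check, here is that substitution applied to the broken sample from above. This is only a sketch: the `broken` and `fixed` names are illustrative, and the control character is written as `\x0b` (ctrl-K, the vertical tab) so it is visible in the source.

```python
import re

# '\x0b' is ctrl-K (vertical tab), standing where the 'K' of KIND should be.
broken = '<root \x0bIND="A"></root>'

# Inside any tag, replace a vertical tab with the letter 'K'.
fixed = re.sub(r'<([^>]*)\v([^>]*)>', r'<\1K\2>', broken)
print(fixed)
```

Because the first group is greedy, a single pass rewrites only the last vertical tab in each tag, and the pattern knows nothing about any other control character.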

But then the requirements changed: I had to dump the docs out to disk individually. Naturally, I ran the substitution before saving so I wouldn't have that problem again.

There are a lot of these documents, you see: many millions...

And so, I was getting ready to extract some data from them again.

This time however I got a new error:

<root KIND="A"><CLASSIFICATION></CLASSIFICATIO^N></root>

When it was supposed to look like this:

<root KIND="A"><CLASSIFICATION></CLASSIFICATION></root>

I am not sure why I keep getting these errors, nor why it's always ctrl-characters that are inserted. It might be pure luck so far.

The regexp I used in the first case won't work in general. ^K translates to a vertical tab, so I could match against that. But is there some way to filter out any ctrl-character?
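
Before filtering anything, it may help to see which control characters are actually present and where. Below is a minimal sketch, assuming the documents are read into `str` objects. `CONTROL_CHARS` and `find_control_chars` are illustrative names, not part of any library. The character class covers the C0 control characters that XML 1.0 does not allow, and leaves tab, newline and carriage return alone.

```python
import re

# C0 control characters that are illegal in XML 1.0. Tab (0x09),
# newline (0x0A) and carriage return (0x0D) are legal, so they are excluded.
CONTROL_CHARS = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f]')

def find_control_chars(text):
    """Return a list of (offset, caret notation) pairs, e.g. (46, '^N')."""
    return [(m.start(), '^' + chr(ord(m.group()) + 64))
            for m in CONTROL_CHARS.finditer(text)]

# '\x0e' is ctrl-N, standing where the final 'N' of CLASSIFICATION should be.
sample = '<root KIND="A"><CLASSIFICATION></CLASSIFICATIO\x0e></root>'
print(find_control_chars(sample))
```

Running this over a batch of files would show whether the damage is limited to a few characters or spread across the whole control range.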

Try using a translate table to map ctrl-A through ctrl-Z back to the letters A through Z:

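# Map ctrl-A (0x01) through ctrl-Z (0x1A) to the letters 'A' through 'Z'.
# Note: this range also contains tab (0x09), newline (0x0A) and
# carriage return (0x0D), which become 'I', 'J' and 'M'.
in_chars = ''.join(chr(x) for x in range(1, 27))
out_chars = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
tr_table = str.maketrans(in_chars, out_chars)

# Pass all strings through the translate table:
x = input('Enter text: ')
print(x.translate(tr_table))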

Prints:

Enter text: abc^Kdef
abcKdef
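
For whole XML documents, a variant that leaves ordinary whitespace untouched may be safer. The sketch below assumes, based only on the two samples in the question, that each stray control character stands in for the matching uppercase letter. `KEEP`, `RESTORE_TABLE` and `restore_letters` are illustrative names. It relies on `str.translate` accepting a dict that maps code points to replacement strings.

```python
# Tab, newline and carriage return are legal in XML, so leave them alone.
KEEP = {0x09, 0x0A, 0x0D}

# Map every other ctrl-A..ctrl-Z code point to its letter (ctrl-K -> 'K').
RESTORE_TABLE = {code: chr(code + 64)
                 for code in range(1, 27) if code not in KEEP}

def restore_letters(text):
    """Replace stray control characters with the letters they stand for."""
    return text.translate(RESTORE_TABLE)

doc = '<root \x0bIND="A">\n\t<CLASSIFICATION></CLASSIFICATIO\x0e>\n</root>'
print(restore_letters(doc))
```

The trade-off is that a corrupted I, J or M would look exactly like a tab, newline or carriage return, so this table cannot repair those three letters.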
