简体   繁体   中英

Remove Weird Characters using python

I have this large SQL file with about 1 milllion inserts in it, some of the inserts are corrupted (about 6000) with weird characters that i need to remove so i can insert them into my DB.

Ex: INSERT INTO BX-Books VALUES ('2268032019','Petite histoire de la d�©sinformation','Vladimir Volkoff',1999,'Editions du Rocher',' http://images.amazon.com/images/P/2268032019.01.THUMBZZZ.jpg ',' http://images.amazon.com/images/P/2268032019.01.MZZZZZZZ.jpg ',' http://images.amazon.com/images/P/2268032019.01.LZZZZZZZ.jpg ');

i want to remove only the weird characters and leave all of the normal ones

I tried using the following code to do so:

import fileinput
import string

fileOld = open('text1.txt', 'r+')
file = open("newfile.txt", "w")

for line in fileOld: #in fileinput.input(['C:\Users\Vashista\Desktop\BX-SQL-Dump\test1.txt']):
    print(line)
    s = line
    printable = set(string.printable)
    filter(lambda x: x in printable, s)
    print(s)
    file.write(s)

but it doesnt seem to be working, when i print s it is the same as what is printed during line and whats stranger is that nothing gets written to the file.

Any advice or tips on how to solve this would be useful

    import string

strg = "'2268032019', Petite histoire de la d�©sinformation','Vladimir Volkoff',1999,'Editions du Rocher','http://images.amazon.com/images/P/2268032019.01.THUMBZZZ.jpg','http://images.amazon.com/images/P/2268032019.01.MZZZZZZZ.jpg','http://images.amazon.com/images/P/2268032019.01.LZZZZZZZ.jpg');"
newstrg = ""
acc = """ '",{}[].`;:  """
for x in strg:
    if x in string.ascii_letters or x in string.digits or x in acc:
        newstrg += x
print (newstrg)

Output;

'2268032019', Petite histoire de la dsinformation','Vladimir Volkoff',1999,'Editions du Rocher','http:images.amazon.comimagesP2268032019.01.THUMBZZZ.jpg','http:images.amazon.comimagesP2268032019.01.MZZZZZZZ.jpg','http:images.amazon.comimagesP2268032019.01.LZZZZZZZ.jpg';
>>>

You can check if the element of the string is in ASCII letters and then create a new string without non-ASCII letters.

Also it depends on your variable type. If you work with lists, you don't have to define a new variable. Just del mylist[x] will work.

You can use regular expressions sub() to do simple string replacements. https://docs.python.org/2/library/re.html#re.sub

# -*- coding: utf-8 -*-

import re

dirty_string = u'©sinformation'
# in first param, put a regex to screen for, in this case I negated the desired characters.
clean_string = re.sub(r'[^a-zA-Z0-9./]', r'', dirty_string)

print clean_string
# Outputs
>>> sinformation

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM