Delete special characters using Linux or Python

I am copying a csv file of the following form into postgres:

 0   "the"
 1   "parative Philosophy 62 June 2007 pp 125130 More on Jonas and Process Philosophy in The Legacy of Hans Jonas Judaism and the Phenomenon of Life Edited by Havakp TiroschSamuelson"

When copying this csv file into postgres I am getting the following error:

copy dict from '/home/r.csv' with delimiter E'\t';
ERROR:  invalid byte sequence for encoding "UTF8": 0x00

I tried to remove the special characters using "sed s/\/\g' ./r.csv"; however, the special characters are not getting deleted. Is there some way I can delete the special characters using Linux or Python?

My operating system is Ubuntu 12.04 LTS.

I'm willing to bet the problem is that the file is actually UTF-16-LE, not UTF-8.

A string of ASCII characters like "abc", when encoded as UTF-16-LE and then decoded as UTF-8, will look like "a\0b\0c\0", causing exactly this kind of error.

But the solution is not to strip out the \0 nul bytes. That will appear to work as long as your data are all ASCII (or all ASCII plus a certain subset of Latin-1), but it will give you either garbage or errors as soon as it's anything else. For example, the CJK character U+5000 ('倀'), encoded as UTF-16-LE and then decoded as UTF-8, looks like '\0P', and you certainly don't want to strip out the nul byte and turn that into 'P'. (For that matter, you don't want to interpret U+5050, '偐', as 'PP'.)
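
If you want to confirm that guess before converting anything, you can just look at the raw bytes. Here's a minimal sketch (the file name r.csv is from the question; the 4096-byte sample size and the "more than a quarter nul bytes" threshold are arbitrary choices for illustration):

with open('r.csv', 'rb') as f:
    sample = f.read(4096)

if sample.startswith(b'\xff\xfe'):
    print('UTF-16-LE BOM at the start of the file')
elif sample.count(b'\0') > len(sample) // 4:
    # Mostly-ASCII text stored as UTF-16-LE has a nul in every other byte.
    print('Lots of nul bytes: this looks like UTF-16, not UTF-8')
else:
    print('No obvious sign of UTF-16 here')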

The right thing to do is to recode the file. For example:

iconv -f UTF-16-LE -t UTF-8 r.csv >r8.csv

Not every installation of iconv supports the same names, and I don't know which of the names are the canonical ones. iconv --list | grep -i utf should give you a list of names, and it should be obvious which one(s) mean UTF-16-LE and which UTF-8, so you can pick the appropriate one.

Of course not every system comes with iconv ; you may need to use a different tool instead. If worst comes to worst, you can always write one in a few lines of Python.
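
For example, a rough Python equivalent of the iconv command above might look like this (a sketch, assuming the input really is UTF-16-LE; if it isn't, the decode will raise an error rather than silently mangle the data):

import codecs

# Decode the input as UTF-16-LE and write it back out as UTF-8.
with codecs.open('r.csv', 'r', encoding='utf-16-le') as fin, \
     codecs.open('r8.csv', 'w', encoding='utf-8') as fout:
    for line in fin:
        fout.write(line)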

If you don't want to figure out where these nul bytes came from, and would rather just get rid of them and cross your fingers:

I don't believe there's anything in either GNU sed or BSD sed that lets you specify any special characters besides \n for newline. There are lots of ways to get a literal nul byte into the argument to sed … but I'll bet sed will just treat that as the end of the string anyway.

Rather than fighting with sed , let's do it in Python. No need for regexps, just plain str.replace . If the file is small enough that it's no problem to read it into memory:

# Open in binary mode and drop every nul byte in one go.
with open('r.csv', 'rb') as fin, open('r2.csv', 'wb') as fout:
    fout.write(fin.read().replace(b'\0', b''))

… if it's too big for that, but it's close enough to valid ASCII that it makes sense to think of it as lines:

# Same idea, but one line at a time so the whole file never sits in memory.
with open('r.csv', 'rb') as fin, open('r2.csv', 'wb') as fout:
    for line in fin:
        fout.write(line.replace(b'\0', b''))
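
… and if it's too big to slurp but doesn't have usable line breaks either, the same replace works on fixed-size chunks (a sketch; the 1 MiB chunk size is arbitrary, and since we're only removing single bytes, nothing can be split across a chunk boundary):

with open('r.csv', 'rb') as fin, open('r2.csv', 'wb') as fout:
    while True:
        chunk = fin.read(1024 * 1024)   # read 1 MiB at a time
        if not chunk:
            break
        fout.write(chunk.replace(b'\0', b''))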
