Non-ASCII character elimination without changing char count

I have data in fixed-width format that I'd like to convert to CSV/tab-delimited in Python, using only ASCII characters. I know very little about encodings, and some of the characters in the original file are non-ASCII. I can replace these with placeholders easily enough (I don't really care what they are), but this throws the character count off. I've tried subsequently replacing each sequence of more than one placeholder with a single placeholder, but there are some situations where the special characters occur in sequence.

I don't know what encoding was used for the original file, but I wouldn't be surprised if it was copied and pasted from MS Word and features characters like ½, « etc.

For example, consider the following file test.txt, which contains fields of length 1, 2 and 1, separated by a space (including a trailing newline):

1 AA A
2 BB B
3 ¾  C
4 «¾ D
5 C  E

The simple Python script:

with open('./test.txt', 'r') as f:
    for line in f:
        print len(line)

outputs

7
7
8
9
7

I've tried replacing the offending characters, but since each one is read as two bytes, this results in two placeholders being inserted. I can then replace multiple placeholders with a single placeholder... but then consecutive non-ASCII characters throw the count off.

import re
r = re.compile(r'\?\?+')

with open('./test.txt', 'r') as f, \
   open('./test_out1.txt', 'w') as w1, \
   open('./test_out2.txt', 'w') as w2:
  for line in f:
    q1 = line.decode('ascii', 'replace').replace(u'\ufffd', '?')
    w1.write(q1)
    q2 = r.sub('?', q1)
    w2.write(q2)

Results: test_out1.txt

1 AA A
2 BB B
3 ??  C
4 ???? D
5 C  E

test_out2.txt

1 AA A
2 BB B
3 ?  C
4 ? D
5 C  E

This will obviously also have issues if there's ever an actual '?' character next to a non-ASCII character in the source.

Am I missing a really simple way to do this?

Thanks in advance.

Given that your simple Python script outputs different line lengths, you are dealing with a multi-byte encoding of some description.
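
To make the byte-versus-character distinction concrete, here is a minimal sketch. It assumes the file is UTF-8, which the question does not confirm, and rebuilds the fourth sample line as a string rather than reading it from disk.

```python
# Fourth sample line from the question, "4 «¾ D", plus its trailing newline.
# Assumption: the file on disk is UTF-8 encoded.
text = "4 \u00ab\u00be D\n"

utf8_bytes = text.encode("utf-8")

char_count = len(text)        # counts characters (code points)
byte_count = len(utf8_bytes)  # counts bytes; « and ¾ take two bytes each in UTF-8

print(char_count, byte_count)
```

Under that assumption the byte count matches the 9 reported by the original script, while the character count is the same as for the pure-ASCII lines.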

The best approach would be to determine the encoding of the file. If the data is supposed to be fixed-width, this will be an encoding in which every line has the same number of characters (as opposed to bytes).

For example:

$ cat test.txt
1 AA A
2 BB B
3 ¾  C
4 «¾ D
5 C  E

$ python3
Python 3.5.0
>>> with open("test.txt", "r", encoding="utf-8") as f:
...     for line in f:
...         print(len(line))
... 
7
7
7
7
7
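
The same equal-width check can be run over several candidate encodings at once. The following is only a sketch: the helper names `lines_have_equal_width` and `guess_encodings` and the candidate list are made up for illustration, and the sample bytes stand in for the contents of test.txt.

```python
def lines_have_equal_width(raw, encoding):
    """Return True if `raw` decodes without error and all lines have equal length."""
    try:
        text = raw.decode(encoding)
    except UnicodeDecodeError:
        return False
    # Split on "\n" only, and tolerate Windows-style "\r\n" line endings.
    lines = [line.rstrip("\r") for line in text.rstrip("\n").split("\n")]
    return len({len(line) for line in lines}) == 1


def guess_encodings(raw, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the candidate encodings under which the data looks fixed-width."""
    return [enc for enc in candidates if lines_have_equal_width(raw, enc)]


# Stand-in for open("test.txt", "rb").read(), assuming the file is UTF-8.
sample = "1 AA A\n2 BB B\n3 \u00be  C\n4 \u00ab\u00be D\n5 C  E\n".encode("utf-8")
matches = guess_encodings(sample)
print(matches)
```

This narrows the list rather than proving anything: single-byte encodings such as latin-1 never fail to decode, and a file containing only ASCII will pass under every candidate.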

If you get different lengths with utf-8, try other multi-byte encodings until you find the right one. Once you've determined the input encoding, you can easily output the file with the non-ASCII characters replaced with placeholders:

$ python3
Python 3.5.0
>>> with open("test.txt", "r", encoding="utf-8") as infile:
...     with open("output.txt", "w", encoding="ascii", errors="replace") as outfile:
...         for line in infile:
...             outfile.write(line)

$ cat output.txt 
1 AA A
2 BB B
3 ?  C
4 ?? D
5 C  E
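
Once each non-ASCII character maps to exactly one placeholder, the fixed-width columns line up again and can be sliced into tab-delimited output, which was the original goal. This is a rough sketch, not part of the answer above: `FIELD_SLICES`, `to_ascii` and `fixed_width_to_rows` are illustrative names, the column positions are taken from the sample file (widths 1, 2 and 1, separated by single spaces), and the lines are assumed to be already decoded with the correct encoding.

```python
import csv
import io

# Column positions for the sample layout: widths 1, 2, 1 with one space between.
FIELD_SLICES = [slice(0, 1), slice(2, 4), slice(5, 6)]


def to_ascii(line, placeholder="?"):
    """Replace each non-ASCII character with exactly one placeholder."""
    return "".join(ch if ord(ch) < 128 else placeholder for ch in line)


def fixed_width_to_rows(lines):
    """Yield one list of field strings per already-decoded input line."""
    for line in lines:
        clean = to_ascii(line.rstrip("\n"))
        yield [clean[s] for s in FIELD_SLICES]


# Two decoded lines from the sample file, standing in for an open text file.
decoded = ["3 \u00be  C\n", "4 \u00ab\u00be D\n"]

out = io.StringIO()
writer = csv.writer(out, delimiter="\t", lineterminator="\n")
writer.writerows(fixed_width_to_rows(decoded))
result = out.getvalue()
print(result)
```

Passing a different `placeholder`, for example `"#"`, avoids confusion with genuine `?` characters in the data.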
