
numpy.genfromtxt csv file with null characters

I'm working on a scientific graphing script designed to create graphs from CSV files output by Agilent's Chemstation software.

The script works perfectly when the files come from one version of Chemstation (the version for liquid chromatography).

Now I'm trying to port it to work on our GC (gas chromatography). For some reason, this version of Chemstation inserts nulls between the characters in every text file it outputs.

I'm trying to use numpy.genfromtxt to get the x,y data into Python in order to create the graphs (using matplotlib).

I originally used:

data = genfromtxt(directory+signal, delimiter = ',') 

to load the data. When I do this with a CSV file generated by our GC, I get an array of all 'nan' values. If I set dtype to None, I get byte strings that look like this:

b'\x00 \x008\x008\x005\x00.\x002\x005\x002\x001\x007\x001\x00\r'

What I need is a float; for the above string it would be 885.252171.

Does anyone have any idea how I can get where I need to go?

And just to be clear, I couldn't find any setting in Chemstation that would make it stop writing files with nulls.

Thanks

Jeff

Given that your file is encoded as utf-16-le with a BOM, and all the actual Unicode code points (except the BOM) are less than 128, you should be able to use an instance of codecs.EncodedFile to transcode the file from utf-16 to ascii. The following example works for me.

Here's my test file:

$ cat utf_16_le_with_bom.csv 
??2.0,19
1.5,17
2.5,23
1.0,10
3.0,5

The first two bytes, ff and fe, are the BOM U+FEFF in little-endian byte order:

$ hexdump utf_16_le_with_bom.csv 
0000000 ff fe 32 00 2e 00 30 00 2c 00 31 00 39 00 0a 00
0000010 31 00 2e 00 35 00 2c 00 31 00 37 00 0a 00 32 00
0000020 2e 00 35 00 2c 00 32 00 33 00 0a 00 31 00 2e 00
0000030 30 00 2c 00 31 00 30 00 0a 00 33 00 2e 00 30 00
0000040 2c 00 35 00 0a 00                              
0000046
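
The hexdump also shows where the question's null-prefixed byte strings come from. The following is a minimal sketch, assuming the same five rows as the test file; it rebuilds the bytes in memory (the variable names are illustrative and not part of the original answer) and splits them on the newline byte without decoding first.

```python
import codecs

# Same five rows as the test file (assumed contents, taken from the listing above).
text = "2.0,19\n1.5,17\n2.5,23\n1.0,10\n3.0,5\n"

# Explicit BOM plus little-endian payload, so the result does not
# depend on the byte order of the machine running this sketch.
raw = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# First eight bytes, formatted like the hexdump output.
first8 = " ".join(f"{b:02x}" for b in raw[:8])
print(first8)
print(len(raw))

# Splitting on b"\n" without decoding leaves the newline's second byte
# (a null) at the start of the next line, and a null after every character.
naive_lines = raw.split(b"\n")
print(naive_lines[1])
```

Each ASCII character becomes the character byte followed by a null byte, so a reader that treats the file as single-byte text sees nulls between all characters.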

Here's the Python script genfromtxt_utf16.py (updated for Python 3):

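import codecs
import numpy as np

# Open in binary mode; EncodedFile transcodes utf-16 to ascii on the fly.
with open('utf_16_le_with_bom.csv', 'rb') as fh:
    # 'utf-16' (no endian suffix) uses the BOM to pick the byte order and strips it.
    efh = codecs.EncodedFile(fh, data_encoding='ascii', file_encoding='utf-16')
    a = np.genfromtxt(efh, delimiter=',')

print("a:")
print(a)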

With Python 3.4.1 and numpy 1.8.1, the script works:

$ python3.4 genfromtxt_utf16.py 
a:
[[  2.   19. ]
 [  1.5  17. ]
 [  2.5  23. ]
 [  1.   10. ]
 [  3.    5. ]]
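
The transcoding step can be checked on its own, without NumPy. This is only a sketch under stated assumptions: an in-memory io.BytesIO buffer stands in for the file opened in 'rb' mode, and a plain list comprehension stands in for np.genfromtxt, so it shows what codecs.EncodedFile hands to the parser rather than how genfromtxt parses it.

```python
import codecs
import io

# Assumed contents: the same five rows as the test file, UTF-16-LE with a BOM.
text = "2.0,19\n1.5,17\n2.5,23\n1.0,10\n3.0,5\n"
raw = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# In-memory stand-in for open('utf_16_le_with_bom.csv', 'rb').
fh = io.BytesIO(raw)
efh = codecs.EncodedFile(fh, data_encoding="ascii", file_encoding="utf-16")

# Reading from the wrapper returns plain ASCII bytes: no BOM, no nulls.
ascii_bytes = efh.read()
print(ascii_bytes)

# Hypothetical stand-in for the parsing step: convert each field to float.
rows = [
    [float(field) for field in line.decode("ascii").split(",")]
    for line in ascii_bytes.splitlines()
]
print(rows)
```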

Be sure that you don't specify the encoding as file_encoding='utf-16-le'. If the endian suffix is included, the BOM is not stripped, and it can't be transcoded to ascii.
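
The difference between the two codec names can be seen with the standard library alone. A small sketch, assuming a one-row sample with a BOM (the sample data and variable names are made up for illustration):

```python
import codecs

# One sample row, UTF-16-LE with a leading BOM.
raw = codecs.BOM_UTF16_LE + "2.0,19\n".encode("utf-16-le")

# 'utf-16' consumes the BOM while detecting the byte order.
with_generic = raw.decode("utf-16")
# 'utf-16-le' treats the BOM as data and keeps it as U+FEFF.
with_suffix = raw.decode("utf-16-le")

print(repr(with_generic))
print(repr(with_suffix))

# U+FEFF is outside the ASCII range, so encoding to ascii fails.
try:
    with_suffix.encode("ascii")
    bom_blocks_ascii = False
except UnicodeEncodeError:
    bom_blocks_ascii = True
print(bom_blocks_ascii)
```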
