Pandas read_csv failing on columns with null characters

Column y below should be ['Reg', 'Reg', 'Swp', 'Swp']:

In [1]: pd.read_csv('/tmp/test3.csv')
Out[1]:
     x    y
0
1  NaN  NaN
2    I  Swp
3    I  Swp

In [2]: ! cat /tmp/test3.csv
x,y
 ^@^@^@,Reg
 ^@^@^@,Reg
I,Swp
I,Swp

In [3]: f = open('/tmp/test3.csv', 'rb'); print(repr(f.read()))  
'x,y\n \x00\x00\x00,Reg\n \x00\x00\x00,Reg\nI,Swp\nI,Swp\n'

Yes, I could reproduce the problem, but I don't know how to fix it with pd.read_csv. Here is a workaround:

In [46]: import numpy as np
In [47]: arr = np.genfromtxt('test3.csv', delimiter = ',', 
                             dtype = None, names = True)

In [48]: df = pd.DataFrame(arr)

In [49]: df
Out[49]: 
   x    y
0     Reg
1     Reg
2  I  Swp
3  I  Swp
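
As an aside for readers who would rather stay within pandas: the sketch below is not part of the workaround above. It swaps in a different technique, namely removing the NUL bytes from the raw data before pd.read_csv parses it. It assumes the file is small enough to hold in memory and that the NUL bytes carry no meaning; the in-memory `raw` bytes stand in for /tmp/test3.csv.

```python
import io

import pandas as pd

# Same bytes as /tmp/test3.csv in the question, held in memory for the demo.
raw = b"x,y\n \x00\x00\x00,Reg\n \x00\x00\x00,Reg\nI,Swp\nI,Swp\n"

# Drop every NUL byte, then let pandas parse the cleaned buffer.
cleaned = raw.replace(b"\x00", b"")
df = pd.read_csv(io.BytesIO(cleaned))

print(df["y"].tolist())  # ['Reg', 'Reg', 'Swp', 'Swp']
```

For a file on disk, the same idea would apply to the bytes returned by open(path, 'rb').read().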

Note that with names = True the first valid line of the CSV is interpreted as column names (and therefore does not affect the dtype of the values on the subsequent lines). Thus, if the CSV file contains numerical data such as

In [22]: with open('/tmp/test.csv','r') as f:
   ....:     print(repr(f.read()))
   ....:     
'x,y,z\n \x00\x00\x00,Reg,1\n \x00\x00\x00,Reg,2\nI,Swp,3\nI,Swp,4\n'

then genfromtxt will assign a numerical dtype to the third column (<i4 in this case):

In [19]: arr = np.genfromtxt('/tmp/test.csv', delimiter = ',', dtype = None, names = True)

In [20]: arr
Out[20]: 
array([('', 'Reg', 1), ('', 'Reg', 2), ('I', 'Swp', 3), ('I', 'Swp', 4)], 
      dtype=[('x', '|S3'), ('y', '|S3'), ('z', '<i4')])

However, if the numerical data is intermingled with bytes such as '\x00', then genfromtxt will be unable to recognize the column as numerical and will fall back to assigning a string dtype. Nevertheless, you can force the dtype of the columns by setting the dtype parameter manually. For example,

In [11]: arr = np.genfromtxt('/tmp/test.csv', delimiter = ',', dtype = [('x', '|i4'), ('y', '|S3')], names = True)

sets the first column x to have dtype |i4 (4-byte integers) and the second column y to have dtype |S3 (3-byte strings). See the NumPy documentation on data types for more information on available dtypes.
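
A self-contained sketch of forcing dtypes, under a few assumptions: the data mirrors the three-column /tmp/test.csv shown earlier but is read from an in-memory string, the list named `forced` is illustrative and has one entry per column, and Unicode string dtypes (U4, U3) stand in for the byte-string dtypes so that it runs on Python 3.

```python
import io

import numpy as np

# Three-column sample matching /tmp/test.csv from the answer.
text = "x,y,z\n \x00\x00\x00,Reg,1\n \x00\x00\x00,Reg,2\nI,Swp,3\nI,Swp,4\n"

# One (name, dtype) pair per column; with names=True the field names
# are taken from the header line.
forced = [("x", "U4"), ("y", "U3"), ("z", "<i4")]
arr = np.genfromtxt(io.StringIO(text), delimiter=",", dtype=forced,
                    names=True, encoding="utf-8")

print(arr.dtype.names)
print(arr["z"].dtype, arr["z"].tolist())
```

With the dtype given explicitly, genfromtxt does not have to infer a type per column, so the third column comes back as 4-byte integers regardless of what the other columns contain.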
