Stop pandas from grouping the data when reading csv
I have a problem: when I read a CSV file with pandas and set the header to row 0 with the following:
df = pd.read_csv(fileName, header = [0])
The first x columns of each row are grouped together and surrounded by parentheses. For example, if I have the following:
Version, temp, altitude, oxygen, pressure, gas_temp, NH3, NO2
1,189,2980,489.9,594,345,345,22,00
2,11,33,423,554.9,2345,32,22,01
When I try to print each row from the data frame, I get this:
(1,189,2980,489.9,594) 345 345 22 00
(2,11,33,423,554.9) 2345 32 22 01
And if I call df['temp'] I get back [345, 2345], which is not correct because pandas is grouping the first x columns together.
The first important discrepancy in your data sample is that it has 8 column names but 9 values in each data row.
Another problem is that column names should be separated only by commas, but your header row also contains spaces after the commas.
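Both discrepancies can be checked before pandas is involved at all. The following is a minimal sketch using only the standard library; SAMPLE is simply the data from the question pasted into a string, and the variable names are illustrative.

```python
import csv
import io

# The sample from the question: the header row and two data rows.
SAMPLE = """Version, temp, altitude, oxygen, pressure, gas_temp, NH3, NO2
1,189,2980,489.9,594,345,345,22,00
2,11,33,423,554.9,2345,32,22,01
"""

# Split every line on commas only (csv.reader keeps leading spaces by default).
rows = list(csv.reader(io.StringIO(SAMPLE)))

# Number of fields on each line: header first, then the data rows.
field_counts = [len(row) for row in rows]
print(field_counts)

# Header names that still carry a leading space.
padded_names = [name for name in rows[0] if name.startswith(" ")]
print(padded_names)
```

If the first count is smaller than the others, the header is missing a name; any entry in padded_names is a column that will not match its plain spelling.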
The cumulative effect of the above is that read_csv is "fooled" and can read the data in such a weird way. Actually, on my first attempt I reproduced your case, but the next time (and every time after that) I got an apparently proper result.
The reason I wrote "apparently" is that when you print df.columns, you will see another flaw:
Index(['Version', ' temp', ' altitude', ' oxygen', ' pressure', ' gas_temp',
' NH3', ' NO2'],
dtype='object')
i.e. the column names contain leading spaces, so an attempt to refer to e.g. the temp column (as df.temp) raises an exception:
AttributeError: 'DataFrame' object has no attribute 'temp'
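To see the effect of the leading spaces in isolation, here is a small sketch that reads the question's sample from a string instead of a file (SAMPLE and the variable names are illustrative, and header=0 is used as the scalar form of header=[0]).

```python
import io

import pandas as pd

# The sample from the question, held in a string instead of a file.
SAMPLE = """Version, temp, altitude, oxygen, pressure, gas_temp, NH3, NO2
1,189,2980,489.9,594,345,345,22,00
2,11,33,423,554.9,2345,32,22,01
"""

# Read without skipinitialspace, so header names keep their leading space.
df_raw = pd.read_csv(io.StringIO(SAMPLE), header=0)

# The plain name is missing; only the space-prefixed name exists.
has_plain_temp = "temp" in df_raw.columns
has_padded_temp = " temp" in df_raw.columns
print(has_plain_temp, has_padded_temp)
```

With the names read this way, the column is reachable only under its padded spelling, e.g. df_raw[' temp'].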
One thing you can do to read this file correctly is to pass the skipinitialspace=True parameter:
df = pd.read_csv(fileName, skipinitialspace=True, header=[0])
The result of reading is:
   Version  temp  altitude  oxygen  pressure  gas_temp  NH3  NO2
1      189  2980     489.9   594.0       345       345   22    0
2       11    33     423.0   554.9      2345        32   22    1
When you print df.columns, you will see that the column names no longer have leading spaces, so the extra spaces in the header row have been stripped.
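A quick way to confirm this is to read the sample with skipinitialspace=True and inspect the result. This is only a sketch: the data is read from a string (SAMPLE) rather than from fileName, and header=0 stands in for header=[0].

```python
import io

import pandas as pd

# The sample from the question, held in a string instead of a file.
SAMPLE = """Version, temp, altitude, oxygen, pressure, gas_temp, NH3, NO2
1,189,2980,489.9,594,345,345,22,00
2,11,33,423,554.9,2345,32,22,01
"""

# skipinitialspace=True drops the spaces that follow each delimiter.
df = pd.read_csv(io.StringIO(SAMPLE), skipinitialspace=True, header=0)

# Column names are now clean, so plain-name lookups work.
column_names = list(df.columns)
temp_values = df["temp"].tolist()
print(column_names)
print(temp_values)
```

Note that the values under temp are the ones from the third field of each data row, because the first field has been taken as the index.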
Another detail of how read_csv operates is that it matches column names with data columns starting from the end, so the "additional" column (in the data rows) is taken as the index column, with no name.
You can also add the index_col=[0] parameter:
df = pd.read_csv(fileName, skipinitialspace=True, index_col=[0], header=[0])
to specify explicitly that the first column is the index.
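The sketch below compares the implicit and the explicit index on the question's sample. It assumes the scalar form index_col=0 behaves like the index_col=[0] used above, and it reads from a string (SAMPLE) rather than a file; all names are illustrative.

```python
import io

import pandas as pd

# The sample from the question, held in a string instead of a file.
SAMPLE = """Version, temp, altitude, oxygen, pressure, gas_temp, NH3, NO2
1,189,2980,489.9,594,345,345,22,00
2,11,33,423,554.9,2345,32,22,01
"""

# Implicit: one more data field than header names, so the first field becomes the index.
df_implicit = pd.read_csv(io.StringIO(SAMPLE), skipinitialspace=True)

# Explicit: state that the first field is the index.
df_explicit = pd.read_csv(io.StringIO(SAMPLE), skipinitialspace=True, index_col=0)

# Compare the row labels and column names of both results.
index_values = df_explicit.index.tolist()
same_index = df_implicit.index.tolist() == index_values
same_columns = list(df_implicit.columns) == list(df_explicit.columns)
print(index_values, same_index, same_columns)
```

Passing index_col explicitly mainly documents the intent, so a reader of the code does not have to know about the implicit rule.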