Stop pandas from grouping the data when reading csv

I have a problem when I read a CSV file with pandas and tell it to use row 0 as the header:

df = pd.read_csv(fileName, header = [0])

The first x columns of each row are grouped together and surrounded by parentheses. For example, if I have the following:

Version, temp, altitude, oxygen, pressure, gas_temp, NH3, NO2
1,189,2980,489.9,594,345,345,22,00
2,11,33,423,554.9,2345,32,22,01

When I try to print each row of the DataFrame, I get this:

(1,189,2980,489.9,594) 345 345 22 00
(2,11,33,423,554.9) 2345 32 22 01

And if I call df['temp'], I get back [345, 2345], which is not correct; pandas is grouping the first x columns together.

The first important discrepancy in your data sample is that it:

  • contains only 8 column names,
  • but there are 9 data columns.
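
The mismatch can be checked without pandas at all. Below is a minimal sketch using Python's csv module; the sample text is copied from the question, and the variable names (sample, header_fields, row_widths) are only illustrative:

```python
import csv
import io

# Sample data copied from the question (note the space after each comma
# in the header row).
sample = """Version, temp, altitude, oxygen, pressure, gas_temp, NH3, NO2
1,189,2980,489.9,594,345,345,22,00
2,11,33,423,554.9,2345,32,22,01
"""

rows = list(csv.reader(io.StringIO(sample)))
header_fields = len(rows[0])              # number of column names
row_widths = [len(r) for r in rows[1:]]   # number of fields in each data row

print(header_fields, row_widths)
```

If the two numbers differ, the header and the data rows do not describe the same set of columns.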

Another problem is that the column names should be separated by commas only, but your header row also contains a space after each comma.

The cumulative effect of the above is that read_csv is "fooled" and can read the data in such a weird way. Actually, on my first attempt I replicated your case, but the next time (and every time after that) I got an apparently proper result.

The reason I wrote "apparently" is that when you print df.columns, you will see another flaw:

Index(['Version', ' temp', ' altitude', ' oxygen', ' pressure', ' gas_temp',
       ' NH3', ' NO2'],
      dtype='object')

i.e. the column names contain leading spaces, so an attempt to refer to, e.g., the temp column as df.temp throws an exception:

AttributeError: 'DataFrame' object has no attribute 'temp'
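
To see the leading spaces in isolation, here is a small sketch that reads the question's sample from a string. io.StringIO stands in for the real file, and header=0 is used in place of the post's header=[0]; both are assumptions made for the demo. Stripping the names with df.columns.str.strip() is an alternative clean-up, not something the original post used:

```python
import io

import pandas as pd

# Sample data copied from the question.
sample = """Version, temp, altitude, oxygen, pressure, gas_temp, NH3, NO2
1,189,2980,489.9,594,345,345,22,00
2,11,33,423,554.9,2345,32,22,01
"""

df = pd.read_csv(io.StringIO(sample), header=0)

# The names keep the space that follows each comma in the header row.
has_plain_name = "temp" in df.columns      # no column is called "temp"
has_spaced_name = " temp" in df.columns    # the real name is " temp"

# Alternative clean-up: strip whitespace from the names after reading.
df.columns = df.columns.str.strip()
temp_values = df["temp"].tolist()
```

After stripping, both df["temp"] and df.temp resolve to the same column.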

One thing you can do to read this file correctly is to pass the skipinitialspace=True parameter:

df = pd.read_csv(fileName, skipinitialspace=True, header=[0])

The result of reading is:

   Version  temp  altitude  oxygen  pressure  gas_temp  NH3  NO2
1      189  2980     489.9   594.0       345       345   22    0
2       11    33     423.0   554.9      2345        32   22    1

When you print df.columns now, you will see that the column names have no leading spaces: the extra spaces in the header row have been stripped.
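
The effect of skipinitialspace=True can be reproduced from a string instead of a file. This is only a sketch: io.StringIO replaces the question's fileName, and header=0 is used in place of header=[0].

```python
import io

import pandas as pd

# Sample data copied from the question.
sample = """Version, temp, altitude, oxygen, pressure, gas_temp, NH3, NO2
1,189,2980,489.9,594,345,345,22,00
2,11,33,423,554.9,2345,32,22,01
"""

# skipinitialspace=True drops the whitespace that follows each delimiter,
# so the header names come out without a leading space.
df = pd.read_csv(io.StringIO(sample), skipinitialspace=True, header=0)

column_names = list(df.columns)
temp_values = df["temp"].tolist()

print(column_names)
print(temp_values)
```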

Another detail of how read_csv operates is that it matches column names with data columns from the end, so the "additional" column (in the data rows) is taken as the index column, with no name.
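
The implicit index can be inspected directly. A short sketch under the same assumptions as before (the sample is read from a string through io.StringIO, and header=0 is used in place of header=[0]):

```python
import io

import pandas as pd

# Sample data copied from the question: 8 names, 9 fields per data row.
sample = """Version, temp, altitude, oxygen, pressure, gas_temp, NH3, NO2
1,189,2980,489.9,594,345,345,22,00
2,11,33,423,554.9,2345,32,22,01
"""

df = pd.read_csv(io.StringIO(sample), skipinitialspace=True, header=0)

index_values = df.index.tolist()   # values of the first data column
index_name = df.index.name         # the index has no name
shape = df.shape                   # 2 rows, 8 named columns

print(index_values, index_name, shape)
```

Note that the column named Version therefore holds the second value of each data row (189 and 11), not the first.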

You can also add the index_col=[0] parameter:

df = pd.read_csv(fileName, skipinitialspace=True, index_col=[0], header=[0])

to specify explicitly that the initial column is the index.
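
For this particular sample, naming the index column explicitly should give the same frame as letting read_csv infer it. The comparison below is a sketch rather than a guarantee for other files; it reads from a string through io.StringIO and uses the scalar forms index_col=0 and header=0, which are assumptions made for the demo:

```python
import io

import pandas as pd

# Sample data copied from the question.
sample = """Version, temp, altitude, oxygen, pressure, gas_temp, NH3, NO2
1,189,2980,489.9,594,345,345,22,00
2,11,33,423,554.9,2345,32,22,01
"""

# Index inferred from the extra data column.
implicit = pd.read_csv(io.StringIO(sample), skipinitialspace=True, header=0)

# Index requested explicitly.
explicit = pd.read_csv(
    io.StringIO(sample), skipinitialspace=True, index_col=0, header=0
)

same = implicit.equals(explicit)
print(same)
```

Spelling out index_col mainly documents the intent for whoever reads the code later.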
