Stop pandas from grouping data when reading a csv

Question

I have a problem: when I read a csv file with pandas and set the header to be row 0 with the following:

df = pd.read_csv(fileName, header = [0])

The first x columns of each row are grouped and surrounded by parentheses. For example, if I have the following:

Version, temp, altitude, oxygen, pressure, gas_temp, NH3, NO2
1,189,2980,489.9,594,345,345,22,00
2,11,33,423,554.9,2345,32,22,01

When I try to print each row from the data frame, I get this:

(1,189,2980,489.9,594) 345 345 22 00
(2,11,33,423,554.9) 2345 32 22 01

And if I call df['temp'] I get back [345, 2345], which is not correct, because pandas is grouping the first x columns together.

Answer 1

The first important discrepancy in your data sample is that it:

contains only 8 column names,
but has 9 data columns.

Another problem is that column names should be separated by commas only, but your header row also contains a space after each comma.
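Both discrepancies can be checked without pandas, using Python's standard csv module. This is only a rough sketch: the `sample` string is the data from the question pasted inline, and the variable names are made up for illustration.

```python
import csv
import io

# Sample from the question, pasted inline (assumption: the real file looks like this).
sample = """Version, temp, altitude, oxygen, pressure, gas_temp, NH3, NO2
1,189,2980,489.9,594,345,345,22,00
2,11,33,423,554.9,2345,32,22,01
"""

rows = list(csv.reader(io.StringIO(sample)))
header, data = rows[0], rows[1:]

# Number of names in the header row vs. number of fields in each data row.
header_width = len(header)
data_widths = [len(row) for row in data]

# csv.reader keeps the space after each comma by default,
# so these names start with a blank.
names_with_leading_space = [name for name in header if name.startswith(" ")]

print("header fields:", header_width)
print("data fields per row:", data_widths)
print("names with a leading space:", names_with_leading_space)
```

If the header count and the data counts differ, the file itself is inconsistent, and any reader has to guess how to line the names up with the values.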

The cumulative effect of the above is that read_csv is "fooled" and can read the data in such a weird way. Actually, on my first attempt I replicated your case, but the next time (and every time after that) I got an apparently proper result.

The reason I wrote "apparently" is that when you print df.columns, you will see another flaw:

Index(['Version', ' temp', ' altitude', ' oxygen', ' pressure', ' gas_temp',
       ' NH3', ' NO2'],
      dtype='object')

i.e. the column names contain leading spaces, so an attempt to refer to e.g. the temp column (for instance as df.temp) throws an exception:

AttributeError: 'DataFrame' object has no attribute 'temp'
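The leading-space problem can be reproduced from an in-memory copy of the sample. A minimal sketch, assuming the file contents match the sample in the question (the `sample` string and variable names are illustrative only):

```python
import io

import pandas as pd

# Sample from the question, pasted inline (assumption: the real file looks like this).
sample = """Version, temp, altitude, oxygen, pressure, gas_temp, NH3, NO2
1,189,2980,489.9,594,345,345,22,00
2,11,33,423,554.9,2345,32,22,01
"""

# Same call as in the question, without skipinitialspace.
df = pd.read_csv(io.StringIO(sample), header=[0])

print(df.columns.tolist())

# The name is stored with its leading blank, so the plain name is unknown.
has_plain_name = "temp" in df.columns
has_padded_name = " temp" in df.columns

# Bracket access with the plain name fails with KeyError;
# attribute access (df.temp) fails with AttributeError instead.
try:
    df["temp"]
    lookup_failed = False
except KeyError:
    lookup_failed = True

print(has_plain_name, has_padded_name, lookup_failed)
```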

One thing you can do to read this file correctly is to pass the skipinitialspace=True parameter:

df = pd.read_csv(fileName, skipinitialspace=True, header=[0])

The result of reading is:

   Version  temp  altitude  oxygen  pressure  gas_temp  NH3  NO2
1      189  2980     489.9   594.0       345       345   22    0
2       11    33     423.0   554.9      2345        32   22    1

When you print df.columns now, you will see that the column names have no leading spaces: the extra spaces in the header row have been stripped.

Another detail of how read_csv operates is that it matches column names with data columns from the end, so the "additional" column (in the data rows) is taken as the index column, with no name.
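Both effects (stripped names and the unnamed index taken from the extra leading column) can be seen by inspecting the frame after reading. Again just a sketch on an inline copy of the sample from the question; the variable names are illustrative.

```python
import io

import pandas as pd

# Sample from the question, pasted inline (assumption: the real file looks like this).
sample = """Version, temp, altitude, oxygen, pressure, gas_temp, NH3, NO2
1,189,2980,489.9,594,345,345,22,00
2,11,33,423,554.9,2345,32,22,01
"""

df = pd.read_csv(io.StringIO(sample), skipinitialspace=True, header=[0])

# Column names no longer carry the blank that followed each comma.
columns = df.columns.tolist()

# The first data field of each row became the index, and it has no name.
index_values = df.index.tolist()
index_name = df.index.name

# Because of that shift, 'temp' holds the third field of each data row.
temp_values = df["temp"].tolist()

print(columns)
print(index_values, index_name)
print(temp_values)
```

Note that the values under each name are shifted by one relative to what the header suggests, which is a sign that the header row is missing a name rather than that the data is wrong.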

You can also add the index_col=[0] parameter:

df = pd.read_csv(fileName, skipinitialspace=True, index_col=[0], header=[0])

to specify explicitly that the initial column is the index.
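As a quick check, the read with an explicit index_col=[0] can be compared against the read that relies on the implicit index. This is a sketch under the assumption that the file matches the sample in the question; `df_implicit` and `df_explicit` are names chosen here for illustration.

```python
import io

import pandas as pd

# Sample from the question, pasted inline (assumption: the real file looks like this).
sample = """Version, temp, altitude, oxygen, pressure, gas_temp, NH3, NO2
1,189,2980,489.9,594,345,345,22,00
2,11,33,423,554.9,2345,32,22,01
"""

# Index inferred by read_csv from the extra leading data column.
df_implicit = pd.read_csv(io.StringIO(sample), skipinitialspace=True, header=[0])

# Index requested explicitly.
df_explicit = pd.read_csv(
    io.StringIO(sample), skipinitialspace=True, index_col=[0], header=[0]
)

same_index = df_explicit.index.tolist() == df_implicit.index.tolist()
same_columns = df_explicit.columns.tolist() == df_implicit.columns.tolist()
version_values = df_explicit["Version"].tolist()

print(same_index, same_columns)
print(version_values)
```

Spelling out index_col does not change the result for this sample; it mainly documents the intent for whoever reads the code later.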

Solution 1
0 2020-05-29 06:19:48