Pandas: CSV input has different columns than defined in the "names" field

Question

I'm using Python Pandas to read a CSV file:

col1\tcol2\tcol3\tcol4\tcol5

So in principle this file contains one row and 5 columns separated by a tab character '\t'.

While reading the file, I specify a list of names, like so (I assume my file should have 3 columns, not 5 as in the file above):

df = pd.read_csv("test.txt", sep = "\t", names = ["COL1", "COL2", "COL3"])

Pandas doesn't complain about it. In fact, it appears to take the first 3 columns and read them as one, the first column, so when I print the DataFrame I get the following:

print(df.head())
                COL1    COL2    COL3
col1    col2    col3    col4    col5

To me this means that the file is wrongly formatted, but I don't know how to catch this programmatically. For example, when I check the size of the columns, it returns 3 (the number of columns I have defined), and if I check the shape of the DataFrame, it also reports 3 columns.

My question is: how can I detect whether the file I try to load with read_csv contains a certain number of columns? Of course I could just read the first line of the file in the traditional way, parse it and check what it is, but is there a way to do this with Pandas?

Answer 1

I don't think anything is wrong. Pandas assumes there are only three columns because you gave just 3 names.

If I do this, for example:

import io
import pandas as pd
raw = """col1\tcol2\tcol3\tcol4\tcol5
1\t2\t3\t4\t5"""
df = pd.read_csv(io.StringIO(raw), sep='\t')

I get:

Out[545]: 
   col1  col2  col3  col4  col5
0     1     2     3     4     5

However, if I set the names of three columns as in your example, I get:

df= pd.read_csv(io.StringIO(raw), sep='\t', names = ["COL1", "COL2", "COL3"])
Out[547]: 
           COL1  COL2  COL3
col1 col2  col3  col4  col5
1    2        3     4     5
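
Note that in the output above the two surplus leading fields are not dropped: they show up as the row index. As a rough sketch (assuming the default `index_col` is used; the variable names `has_surplus` and `surplus_columns` are only illustrative), the mismatch could be caught after loading by looking at the index:

```python
import io
import pandas as pd

raw = "col1\tcol2\tcol3\tcol4\tcol5\n1\t2\t3\t4\t5"
names = ["COL1", "COL2", "COL3"]

df = pd.read_csv(io.StringIO(raw), sep="\t", names=names)

# With fewer names than fields, the surplus leading fields become index levels.
# A plain RangeIndex would mean the names matched the width of the file.
has_surplus = not isinstance(df.index, pd.RangeIndex)
surplus_columns = df.index.nlevels if has_surplus else 0
print(df.shape[1], surplus_columns)
```

This only detects a file that is wider than the list of names; it says nothing about a file that has fewer fields than expected.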

So now it depends on what you actually want to do. If you want to skip the header and read just the first three columns, you can do:

df= pd.read_csv(io.StringIO(raw), sep='\t', usecols=range(3), names = ["COL1", "COL2", "COL3"], skiprows=1)

Out[549]: 
   COL1  COL2  COL3
0     1     2     3

If you would rather read everything and replace the names of the first three columns, you could do it like this:

df= pd.read_csv(io.StringIO(raw), sep='\t')
df.columns= ["COL1", "COL2", "COL3"] + list(df.columns)[3:]
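
If the goal is to check the number of columns before applying any names, one option is to let pandas read just the first line without a header and count the fields. This is only a sketch: `count_columns` is a made-up helper name, and it assumes the first line is representative of the rest of the file.

```python
import io
import pandas as pd

def count_columns(source, sep="\t"):
    """Return the number of fields in the first line of a delimited file."""
    # header=None keeps the first line as data; nrows=1 avoids reading the rest.
    first_row = pd.read_csv(source, sep=sep, header=None, nrows=1)
    return first_row.shape[1]

raw = "col1\tcol2\tcol3\tcol4\tcol5\n1\t2\t3\t4\t5"
names = ["COL1", "COL2", "COL3"]

found = count_columns(io.StringIO(raw))
if found != len(names):
    print(f"expected {len(names)} columns, found {found}")
```

For a file on disk, pass the path (for example "test.txt") instead of the StringIO object.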
