
Importing a CSV with a random number of columns

I have a script that reads some measurement data in CSV form, and then does all kinds of plotting and stuff with it.

Now I have a new dataset, where some idiot deemed it helpful to add some random comments at the end of the line, like so:

01.02.1988 00:00:00   ;   204.94     
01.03.1988 00:00:00   ;   204.87 ; something
01.04.1988 00:00:00   ;   205.41     
01.05.1988 00:00:00   ;   205.64 ; something ; something else    
01.06.1988 00:00:00   ;   205.59 ; also something    
01.07.1988 00:00:00   ;   205.24

which gives me a nice

ValueError: Expected 2 fields in line 36, saw 3

and so on.

According to this and this I have to use the names=['whatever','else'] argument when reading it.
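For reference, the following is a minimal sketch of the shape of call those answers suggest; the file name, the skiprows value, and the padding column names are placeholders, and the idea is just to pass more names than any line can have fields, then keep the two real columns with usecols:

import pandas as pd

# Hypothetical file name and placeholder names: 'extra1'..'extra3' just absorb
# any trailing comment fields; usecols then keeps only the two real columns.
# skiprows=3 assumes two junk lines plus the handmade header line.
df = pd.read_csv('measurements.csv', sep=';', header=None, skiprows=3,
                 names=['Date', 'level', 'extra1', 'extra2', 'extra3'],
                 usecols=['Date', 'level'],
                 index_col='Date', dayfirst=True, parse_dates=True)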

But somehow this goes all kinds of wrong. So here are some examples:

CSV file

Stuff
more stuff I dont need
Date;level;crap1;crap2;crap3;crap4;crap5;crap6
01.01.1988 00:00:00   ;   204.87     
01.02.1988 00:00:00   ;   204.94     
01.03.1988 00:00:00   ;   204.87 

The "nice" header is obviously "handmade", but I should just be able to skip it!?

CSV reader

ValReader = pd.read_csv(csv_list[counter], sep=r'\s*;', skiprows=DateStart,
                        names=['Date', 'level', 'crap1', 'crap2', 'crap3',
                               'crap4', 'crap5', 'crap6'],
                        usecols=['Date', 'level'], index_col='Date',
                        dayfirst=True, parse_dates=True)

What I get

print 'ValReader'
                         level
Date                          
Date                     level
01.04.2003 00:00:00     200.76
01.05.2003 00:00:00     200.64
01.06.2003 00:00:00     200.53

Which, following that, causes level to get handled as a string, since the literal header text ends up as a data row and the whole column becomes object dtype.

OK, easy: that manual header line in the CSV (which worked well in a previous version that only had to handle good data) is the culprit, so I just set skiprows to skiprows=DateStart+1, but that results in

ValueError: Number of passed names did not match number of header fields in the file

So obviously I got utterly lost in how pandas handles the names and positions of columns.

I used to have this issue as well, but here is a solution.

One way to resolve it is to NOT use a regex to parse the separator, as that falls back to the python engine; with the C engine you can skip the bad-lines warning, and you can specify which columns you want.

For example:

In [1]: import io

In [2]: import pandas as pd

In [3]: s = io.StringIO(u'''Stuff
more stuff I dont need
Date;level;crap1;crap2;crap3;crap4;crap5;crap6
01.01.1988 00:00:00   ;   204.87
01.02.1988 00:00:00   ;   204.94
01.03.1988 00:00:00   ;   204.87 ''')
# I use skiprows=2 instead of DateStart here
# after setting error_bad_lines=False, you can parse the csv OK...
In [4]: ValReader = pd.read_csv(s, sep=';', skiprows=2, usecols=['Date', 'level'], 
                                index_col='Date', dayfirst=True, parse_dates=True, 
                                error_bad_lines=False)

In [5]: ValReader
Out[5]:
             level
Date
1988-01-01  204.87
1988-02-01  204.94
1988-03-01  204.87

In [6]: ValReader['level'].dtype
Out[6]: dtype('float64')
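
A hedged note for newer pandas: error_bad_lines was deprecated in 1.3 and removed in 2.0, and the replacement is on_bad_lines='skip'. Assuming such a version, the same call would look roughly like this:

import io
import pandas as pd

s = io.StringIO(u'''Stuff
more stuff I dont need
Date;level;crap1;crap2;crap3;crap4;crap5;crap6
01.01.1988 00:00:00   ;   204.87
01.02.1988 00:00:00   ;   204.94
01.03.1988 00:00:00   ;   204.87 ''')

# same call as above, with on_bad_lines='skip' in place of error_bad_lines=False
ValReader = pd.read_csv(s, sep=';', skiprows=2, usecols=['Date', 'level'],
                        index_col='Date', dayfirst=True, parse_dates=True,
                        on_bad_lines='skip')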

Hope this helps with the issues you have.
