
Importing a CSV with a random number of columns

I have a script that reads some measurement data in CSV form, and then does all kinds of plotting and stuff with it.

Now I have a new dataset, where some idiot deemed it helpful to add some random comments at the end of the line, like so:

01.02.1988 00:00:00   ;   204.94     
01.03.1988 00:00:00   ;   204.87 ; something
01.04.1988 00:00:00   ;   205.41     
01.05.1988 00:00:00   ;   205.64 ; something ; something else    
01.06.1988 00:00:00   ;   205.59 ; also something    
01.07.1988 00:00:00   ;   205.24

which gives me a nice

ValueError: Expected 2 fields in line 36, saw 3

and so on.

According to this and this I have to use the names=['whatever','else'] argument when reading it.
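For reference, the following is a minimal sketch of the shape of call those answers suggest; the file name, the skiprows value, and the padding column names are placeholders, and the idea is just to pass more names than any line can have fields, then keep the two real columns with usecols:

import pandas as pd

# Hypothetical file name and placeholder names: 'extra1'..'extra3' just absorb
# any trailing comment fields; usecols then keeps only the two real columns.
# skiprows=3 assumes two junk lines plus the handmade header line.
df = pd.read_csv('measurements.csv', sep=';', header=None, skiprows=3,
                 names=['Date', 'level', 'extra1', 'extra2', 'extra3'],
                 usecols=['Date', 'level'],
                 index_col='Date', dayfirst=True, parse_dates=True)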

But somehow this goes all kinds of wrong. So here are some examples:

CSV file

Stuff
more stuff I dont need
Date;level;crap1;crap2;crap3;crap4;crap5;crap6
01.01.1988 00:00:00   ;   204.87     
01.02.1988 00:00:00   ;   204.94     
01.03.1988 00:00:00   ;   204.87 

The "nice" header is obviously "handmade", but I should just be able to skip it!?

CSV reader

ValReader = pd.read_csv(csv_list[counter], sep=r'\s*;', skiprows=DateStart,
                        names=['Date', 'level', 'crap1', 'crap2', 'crap3',
                               'crap4', 'crap5', 'crap6'],
                        usecols=['Date', 'level'], index_col='Date',
                        dayfirst=True, parse_dates=True)

What I get

print 'ValReader'
                         level
Date                          
Date                     level
01.04.2003 00:00:00     200.76
01.05.2003 00:00:00     200.64
01.06.2003 00:00:00     200.53

Which, following that, causes level to get handled as a string, since the literal header text ends up as a data row and the whole column becomes object dtype.

OK, easy: that manual header line in the CSV (which worked well in a previous version that only had to handle good data) is the culprit, so I just set skiprows to skiprows=DateStart+1, but that results in

ValueError: Number of passed names did not match number of header fields in the file

So obviously I got utterly lost in how pandas handles the names and positions of columns.

I used to have this issue as well, but here is a solution.

One way to resolve it is to NOT use a regex to parse the separator, as that falls back to the python engine; with the C engine you can skip the bad-lines warning, and you can specify which columns you want.

For example:

In [1]: import io

In [2]: import pandas as pd

In [3]: s = io.StringIO(u'''Stuff
more stuff I dont need
Date;level;crap1;crap2;crap3;crap4;crap5;crap6
01.01.1988 00:00:00   ;   204.87
01.02.1988 00:00:00   ;   204.94
01.03.1988 00:00:00   ;   204.87 ''')
# I use skiprows=2 instead of DateStart here
# after setting error_bad_lines=False, you can parse the csv OK...
In [4]: ValReader = pd.read_csv(s, sep=';', skiprows=2, usecols=['Date', 'level'], 
                                index_col='Date', dayfirst=True, parse_dates=True, 
                                error_bad_lines=False)

In [5]: ValReader
Out[5]:
             level
Date
1988-01-01  204.87
1988-02-01  204.94
1988-03-01  204.87

In [6]: ValReader['level'].dtype
Out[6]: dtype('float64')
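
A hedged note for newer pandas: error_bad_lines was deprecated in 1.3 and removed in 2.0, and the replacement is on_bad_lines='skip'. Assuming such a version, the same call would look roughly like this:

import io
import pandas as pd

s = io.StringIO(u'''Stuff
more stuff I dont need
Date;level;crap1;crap2;crap3;crap4;crap5;crap6
01.01.1988 00:00:00   ;   204.87
01.02.1988 00:00:00   ;   204.94
01.03.1988 00:00:00   ;   204.87 ''')

# same call as above, with on_bad_lines='skip' in place of error_bad_lines=False
ValReader = pd.read_csv(s, sep=';', skiprows=2, usecols=['Date', 'level'],
                        index_col='Date', dayfirst=True, parse_dates=True,
                        on_bad_lines='skip')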

Hope this helps with the issues you have.
