Importing a CSV with a random number of columns
I have a script that reads some measurement data in CSV form and then does all kinds of plotting and stuff with it.
Now I have a new dataset, where some idiot deemed it helpful to add some random comments at the end of the line, like so:
01.02.1988 00:00:00 ; 204.94
01.03.1988 00:00:00 ; 204.87 ; something
01.04.1988 00:00:00 ; 205.41
01.05.1988 00:00:00 ; 205.64 ; something ; something else
01.06.1988 00:00:00 ; 205.59 ; also something
01.07.1988 00:00:00 ; 205.24
which gives me a nice
ValueError: Expected 2 fields in line 36, saw 3
and so on.
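(For reference, the error is easy to reproduce with a couple of inline rows standing in for the file above; the ragged second row is what trips the parser:)

```python
import io
import pandas as pd

# Two inline rows standing in for the file: the second row carries a
# trailing comment, so it has one field more than the first.
s = io.StringIO(
    "01.02.1988 00:00:00 ; 204.94\n"
    "01.03.1988 00:00:00 ; 204.87 ; something\n"
)

try:
    pd.read_csv(s, sep=";", header=None)
except ValueError as exc:  # pandas' ParserError is a ValueError subclass
    print(exc)             # ... Expected 2 fields in line 2, saw 3
```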
According to this and this, I have to use the names=['whatever','else'] argument when reading it.
But somehow this goes all kinds of wrong. So here are some examples:
CSV file
Stuff
more stuff I dont need
Date;level;crap1;crap2;crap3;crap4;crap5;crap6
01.01.1988 00:00:00 ; 204.87
01.02.1988 00:00:00 ; 204.94
01.03.1988 00:00:00 ; 204.87
The "nice" header is obviously handmade, but I should just be able to skip it!?
CSV reader
ValReader = pd.read_csv(csv_list[counter], sep=r'\s*;', skiprows=DateStart, names=['Date', 'level', 'crap1', 'crap2', 'crap3', 'crap4', 'crap5', 'crap6'], usecols=['Date', 'level'], index_col='Date', dayfirst=True, parse_dates=True)
What I get
print 'ValReader'
level
Date
Date level
01.04.2003 00:00:00 200.76
01.05.2003 00:00:00 200.64
01.06.2003 00:00:00 200.53
Which, following that, causes level to get handled as a string.
OK, easy: that manual header line in the CSV (which worked well in a previous version that only had to handle good data) is the culprit, so I just set skiprows to skiprows=DateStart+1, but that results in
ValueError: Number of passed names did not match number of header fields in the file
So obviously I got utterly lost in how pandas handles the names and positions of columns.
I used to have this issue as well, but here is a solution.
One way to resolve it is to NOT use a regex to parse the separator, since that falls back to the Python engine; with the C engine you can skip the bad lines (with a warning) and specify which columns you want.
For example:
In [1]: import io
In [2]: import pandas as pd
In [3]: s = io.StringIO(u'''Stuff
more stuff I dont need
Date;level;crap1;crap2;crap3;crap4;crap5;crap6
01.01.1988 00:00:00 ; 204.87
01.02.1988 00:00:00 ; 204.94
01.03.1988 00:00:00 ; 204.87 ''')
# I use skiprows=2 instead of DateStart here
# after settings error_bad_lines=False, you can parse the csv OK...
In [4]: ValReader = pd.read_csv(s, sep=';', skiprows=2, usecols=['Date', 'level'],
index_col='Date', dayfirst=True, parse_dates=True,
error_bad_lines=False)
In [5]: ValReader
Out[5]:
level
Date
1988-01-01 204.87
1988-02-01 204.94
1988-03-01 204.87
In [6]: ValReader['level'].dtype
Out[6]: dtype('float64')
Hope this helps with the issues you have.
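(One caveat to add, not part of the original answer: error_bad_lines was deprecated in pandas 1.3 and removed in 2.0. On recent versions the equivalent spelling is on_bad_lines='skip':)

```python
import io
import pandas as pd

s = io.StringIO(
    "Stuff\n"
    "more stuff I dont need\n"
    "Date;level;crap1;crap2;crap3;crap4;crap5;crap6\n"
    "01.01.1988 00:00:00 ; 204.87\n"
    "01.02.1988 00:00:00 ; 204.94 ; something\n"
)

# Same call as in the answer, but with the pandas >= 1.3 parameter name.
df = pd.read_csv(
    s, sep=";", skiprows=2, usecols=["Date", "level"],
    index_col="Date", dayfirst=True, parse_dates=True,
    on_bad_lines="skip",
)
print(df["level"].dtype)
```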