Reading malformed 'csv' file with pandas
I have a malformed "csv" file:
txt = """NAME;a;b;c
ATTR1;1;2;3
ATTR2;1;2;3;;;
ATTR3;1;2;3;
ATTR4;1;2;3"""
Is there a way to use the pandas pd.read_* toolbox to get the following pd.DataFrame:
|---+-------+---+---+---|
| | 0 | 1 | 2 | 3 |
|---+-------+---+---+---|
| 0 | NAME | a | b | c |
| 1 | ATTR1 | 1 | 2 | 3 |
| 2 | ATTR2 | 1 | 2 | 3 |
| 3 | ATTR3 | 1 | 2 | 3 |
| 4 | ATTR4 | 1 | 2 | 3 |
|---+-------+---+---+---|
?
PS: I know how to do it with import csv.
Thank you for ideas and BR, Lex
EDIT
This was a toy example from a real file (which I again had to modify) ...
import pandas as pd

SRC = 'https://dl.dropboxusercontent.com/u/40513206/test.csv'
NA_VALUES = ['', '#N/A N/A', '#N/A Field Not Applicable', '#N/A Invalid Field',
'#N/A Invalid Security', '#N/AN/A', '#N/A Limit', '#####', '#DIV/0!',
'#N/A', '#NAME?', '#NULL!', '#NUM!', '#REF!', '#VALUE!']
CSV_ENCODING = 'WINDOWS-1252'
S_ROWS = 6
NR_ROWS = 60
NR_COLS = 52 # correct nr. of columns, but not always known
dat_m = pd.read_csv(SRC, sep = ';', header = None, index_col = None, skiprows = S_ROWS,
nrows = NR_ROWS, encoding = CSV_ENCODING, na_values = NA_VALUES, names = range(NR_COLS))
It seems that if we use the names parameter, then NR_COLS must be >= the actual number of columns in the first row. If it is not, an Index or MultiIndex is formed (based on the actual columns): for example, if NR_COLS = 50 then the index has 2 levels, if NR_COLS = 49 then 3 levels, etc.
All this is the result of saving an Excel file to csv: it seems to add extra sep = ';' characters to some rows, and for some other reason I cannot use (read) xls files directly.
So I will use a large NR_COLS value or continue with the csv library.
Thank you!
How about:
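For reference, the csv-library fallback mentioned in the edit could look roughly like the sketch below. It is only an illustration on the toy data: the helper name read_ragged is hypothetical, and it assumes that the surplus separators only ever appear at the end of a row, so trailing empty fields can be stripped safely.

```python
import csv
from io import StringIO

import pandas as pd

txt = """NAME;a;b;c
ATTR1;1;2;3
ATTR2;1;2;3;;;
ATTR3;1;2;3;
ATTR4;1;2;3"""


def read_ragged(text, sep=";"):
    """Hypothetical helper: parse rows with csv and drop trailing empty fields."""
    rows = []
    for row in csv.reader(StringIO(text), delimiter=sep):
        # Remove empty fields at the end of the row only;
        # empty fields in the middle of a row are kept.
        while row and row[-1] == "":
            row.pop()
        rows.append(row)
    # Rows of unequal length would be padded with None by the constructor.
    return pd.DataFrame(rows)


df = read_ragged(txt)
print(df)
```

All values come back as strings here, so numeric columns would still need an explicit conversion afterwards.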
>>> from io import StringIO
>>> import pandas as pd
>>> txt = 'NAME;a;b;c\nATTR1;1;2;3\nATTR2;1;2;3;;;\nATTR3;1;2;3;\nATTR4;1;2;3'
>>> pd.read_csv(StringIO(txt), sep=";", names=range(4))
       0  1  2  3
0   NAME  a  b  c
1  ATTR1  1  2  3
2  ATTR2  1  2  3
3  ATTR3  1  2  3
4  ATTR4  1  2  3

[5 rows x 4 columns]
Sometimes when I don't know how many columns there are beforehand I do something silly like names=range(128) and then .dropna(how='all', axis=1).
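Put together, the oversized-names trick might look like the following sketch on the toy data from the question. MAX_COLS is an assumed upper bound on the number of fields per row, and dtype=str is my addition to keep every cell as text; neither comes from the original posts.

```python
from io import StringIO

import pandas as pd

txt = """NAME;a;b;c
ATTR1;1;2;3
ATTR2;1;2;3;;;
ATTR3;1;2;3;
ATTR4;1;2;3"""

MAX_COLS = 128  # assumed upper bound; must be >= the widest row in the file

df = pd.read_csv(
    StringIO(txt),
    sep=";",
    header=None,
    names=list(range(MAX_COLS)),  # more names than fields: extras become NaN
    dtype=str,                    # assumption: keep all values as strings
)

# Drop the padding columns, i.e. every column that is NaN in all rows.
df = df.dropna(how="all", axis=1)
print(df)
```

Note that dropna(how='all', axis=1) would also remove a genuinely empty column in the middle of the data, so this only suits files where a fully empty column carries no meaning.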