Reading a malformed "csv" file with pandas

Question

I have a malformed "csv" file:

txt = """NAME;a;b;c
ATTR1;1;2;3
ATTR2;1;2;3;;;
ATTR3;1;2;3;
ATTR4;1;2;3"""

Is there a way to use the pandas pd.read_* toolbox to get the following pd.DataFrame:

|---+-------+---+---+---|
|   | 0     | 1 | 2 | 3 |
|---+-------+---+---+---|
| 0 | NAME  | a | b | c |
| 1 | ATTR1 | 1 | 2 | 3 |
| 2 | ATTR2 | 1 | 2 | 3 |
| 3 | ATTR3 | 1 | 2 | 3 |
| 4 | ATTR4 | 1 | 2 | 3 |
|---+-------+---+---+---|

?

PS: I know how to do it with import csv.

Thank you for ideas and BR, Lex

EDIT

This was a toy example from a real file (which I again had to modify) ...

import pandas as pd

SRC = 'https://dl.dropboxusercontent.com/u/40513206/test.csv'
NA_VALUES = ['', '#N/A N/A', '#N/A Field Not Applicable', '#N/A Invalid Field',
             '#N/A Invalid Security', '#N/AN/A', '#N/A Limit', '#####', '#DIV/0!',
             '#N/A', '#NAME?', '#NULL!', '#NUM!', '#REF!', '#VALUE!']
CSV_ENCODING = 'WINDOWS-1252'
S_ROWS = 6
NR_ROWS = 60
NR_COLS = 52  # correct number of columns, but not always known

dat_m = pd.read_csv(SRC, sep=';', header=None, index_col=None, skiprows=S_ROWS,
                    nrows=NR_ROWS, encoding=CSV_ENCODING, na_values=NA_VALUES,
                    names=range(NR_COLS))

It seems that if we use the names parameter, then NR_COLS must be >= the actual number of columns in the first row. If it is not, an Index or MultiIndex is formed (based on the actual columns); for example, if NR_COLS = 50 the index has 2 levels, if NR_COLS = 49 it has 3 levels, etc.

All this is the result of saving an Excel sheet to csv: it seems to add sep = ';' to some rows, and for some other reason I cannot read the xls files directly.

So I will use a large NR_COLS value or continue with the csv library.
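When the correct number of columns is not known in advance, one option is to pre-scan the data with the csv library and count the widest row. The snippet below is only a sketch of that idea: the helper name max_field_count is made up for illustration, and it assumes the data fits in memory as a string.

```python
import csv
from io import StringIO

# Toy data from the question: some rows carry extra trailing separators.
txt = """NAME;a;b;c
ATTR1;1;2;3
ATTR2;1;2;3;;;
ATTR3;1;2;3;
ATTR4;1;2;3"""


def max_field_count(text, sep=";"):
    """Return the largest number of fields found on any row (hypothetical helper)."""
    reader = csv.reader(StringIO(text), delimiter=sep)
    # default=0 covers empty input, where there are no rows to measure.
    return max((len(row) for row in reader), default=0)


n_cols = max_field_count(txt)
print(n_cols)
```

The resulting count could then be passed as names=range(n_cols) to pd.read_csv, so that names is never shorter than the widest row.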

Thank you!

Answer 1

How about:

>>> from io import StringIO
>>> import pandas as pd
>>> txt = 'NAME;a;b;c\nATTR1;1;2;3\nATTR2;1;2;3;;;\nATTR3;1;2;3;\nATTR4;1;2;3'
>>> pd.read_csv(StringIO(txt), sep=";", names=range(4))
       0  1  2  3
0   NAME  a  b  c
1  ATTR1  1  2  3
2  ATTR2  1  2  3
3  ATTR3  1  2  3
4  ATTR4  1  2  3

[5 rows x 4 columns]

Sometimes, when I don't know beforehand how many columns there are, I do something silly like names=range(128) and then .dropna(how='all', axis=1).
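Spelled out, the names=range(128) trick might look like the sketch below. The function name read_ragged and the MAX_COLS constant are illustrative only, and 128 is just an assumed upper bound that has to be at least as large as the widest row in the file.

```python
from io import StringIO

import pandas as pd

txt = """NAME;a;b;c
ATTR1;1;2;3
ATTR2;1;2;3;;;
ATTR3;1;2;3;
ATTR4;1;2;3"""

MAX_COLS = 128  # assumption: no row has more fields than this


def read_ragged(text, sep=";", max_cols=MAX_COLS):
    """Read ragged delimited text, then drop the columns that are entirely empty."""
    # Over-provision the column names so short rows are padded with NaN.
    wide = pd.read_csv(StringIO(text), sep=sep, header=None, names=range(max_cols))
    # Remove every column that holds nothing but NaN.
    return wide.dropna(how="all", axis=1)


df = read_ragged(txt)
print(df)
```

Note that .dropna(how='all', axis=1) removes every all-empty column, so a genuinely empty column in the middle of the data would be dropped as well.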

Answer 1: score 5, accepted, 2013-12-02 19:33:25
