
pandas read_csv does not capture final (unnamed) column into dataframe

I am trying to read a csv file in the following format

myHeader
myJunk
myDate
A, B, C, D
, b, c, d
dataA, dataB, dataC, dataD, EXTRA_INFO_STRING
dataA, dataB, dataC, dataD, EXTRA_INFO_STRING
dataA, dataB, dataC, dataD, EXTRA_INFO_STRING

When I create my data frame using

dlogframe = pd.read_csv(myPath, header=3)

I get the following error (my data is more complex than the example above, but functionally identical)

pandas._parser.CParserError: Error tokenizing data. C error: Expected 393 fields in line 9, saw 394

How can I give the EXTRA_INFO column a name and have those strings included in my dataframe?

[EDIT]

I figured out how to skip the troublesome row, but now the data is not aligned properly

import pandas as pd
from io import StringIO  # Python 3; on Python 2 use: from StringIO import StringIO
s = """myHeader
myJunk
myDate
A, B, C, D
, b, c, d
dataA, dataB, dataC, dataD, EXTRA_INFO_STRING
dataA, dataB, dataC, dataD, EXTRA_INFO_STRING
dataA, dataB, dataC, dataD, EXTRA_INFO_STRING"""

df = pd.read_csv(StringIO(s), header=3, skiprows=[4])
print(df)

            A       B       C                   D
dataA   dataB   dataC   dataD   EXTRA_INFO_STRING
dataA   dataB   dataC   dataD   EXTRA_INFO_STRING
dataA   dataB   dataC   dataD   EXTRA_INFO_STRING

What I want is:

A       B       C       D       MY_INFO
dataA   dataB   dataC   dataD   EXTRA_INFO_STRING
dataA   dataB   dataC   dataD   EXTRA_INFO_STRING
dataA   dataB   dataC   dataD   EXTRA_INFO_STRING

If only the header line and the row right after it are missing the EXTRA_INFO_STRING field, you can load the column names and the data separately:

from io import StringIO  # Python 3; on Python 2 use: from StringIO import StringIO
df = pd.read_csv(StringIO(s), header=None, skiprows=5)

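The following code (perhaps not the most elegant) loads the column names from the header line and appends a name for the extra column:

header = pd.read_csv(StringIO(s), header=None, nrows=1, skiprows=3)
df.columns = [str(c).strip() for c in header.iloc[0]] + ['MY_INFO']
#        A       B       C       D             MY_INFO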
# 0  dataA   dataB   dataC   dataD   EXTRA_INFO_STRING
# 1  dataA   dataB   dataC   dataD   EXTRA_INFO_STRING
# 2  dataA   dataB   dataC   dataD   EXTRA_INFO_STRING
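
The same idea can be sketched for Python 3 without a second read_csv call. This is only an illustration under assumptions: `read_with_extra_column` and its parameters are hypothetical names, the header line is assumed to sit at a known position, and `skipinitialspace=True` is added here to drop the blank after each comma.

```python
from io import StringIO
from itertools import islice

import pandas as pd

s = """myHeader
myJunk
myDate
A, B, C, D
, b, c, d
dataA, dataB, dataC, dataD, EXTRA_INFO_STRING
dataA, dataB, dataC, dataD, EXTRA_INFO_STRING
dataA, dataB, dataC, dataD, EXTRA_INFO_STRING"""


def read_with_extra_column(buf, header_line=3, data_start=5, extra_name="MY_INFO"):
    """Hypothetical helper: names come from one line, data from the lines after it."""
    # Pull out the header line by its 0-based position and split it by hand.
    header = next(islice(buf, header_line, header_line + 1))
    names = [name.strip() for name in header.split(",")] + [extra_name]
    buf.seek(0)  # rewind so read_csv sees the whole input again
    return pd.read_csv(buf, header=None, names=names,
                       skiprows=data_start, skipinitialspace=True)


df = read_with_extra_column(StringIO(s))
print(df)
```

For a file on disk, an object from `open(myPath)` should work in place of `StringIO(s)`, since it is also iterable line by line and seekable.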

Data used in test:

s = """myHeader
myJunk
myDate
A, B, C, D
, b, c, d
dataA, dataB, dataC, dataD, EXTRA_INFO_STRING
dataA, dataB, dataC, dataD, EXTRA_INFO_STRING
dataA, dataB, dataC, dataD, EXTRA_INFO_STRING"""

How about:

df = pd.read_csv(StringIO(s), skiprows=5, header=None, index_col=False)
df.columns = list("ABCDE")

If read_csv's numeric conversions cause problems, you can pass dtype=object to read_csv and handle the conversions yourself afterwards with DataFrame.astype.
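
A minimal sketch of that dtype=object route, using made-up numeric sample data (the `sample` string and the column names are assumptions for illustration, not part of the question's file):

```python
from io import StringIO

import pandas as pd

# Hypothetical sample: two numeric columns and one text column.
sample = "1, 2.5, x\n3, 4.5, y\n"

# dtype=object keeps every field as the string that was read from the file.
raw = pd.read_csv(StringIO(sample), header=None, names=["n", "ratio", "label"],
                  dtype=object, skipinitialspace=True)

# Convert only the columns that should be numeric; "label" stays as text.
typed = raw.astype({"n": "int64", "ratio": "float64"})
print(typed.dtypes)
```

Converting column by column this way makes it easier to see which column fails if a value cannot be parsed.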

Here is something I tried that seems to get the data into the format you want. The basic idea is to ignore all the problematic rows (which is possible if you know the file structure).

x = pd.read_csv(StringIO(s), names=['a', 'b', 'c', 'd', 'more_info'], header=None, skiprows=5)

This gives output in the format you desire.
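
One thing to watch with this sample: every comma is followed by a space, so the parsed strings keep that leading blank unless told otherwise. The sketch below (Python 3, with `x_raw` and `x_clean` as illustrative names) compares the call above with a variant that adds `skipinitialspace=True`:

```python
from io import StringIO

import pandas as pd

s = """myHeader
myJunk
myDate
A, B, C, D
, b, c, d
dataA, dataB, dataC, dataD, EXTRA_INFO_STRING
dataA, dataB, dataC, dataD, EXTRA_INFO_STRING
dataA, dataB, dataC, dataD, EXTRA_INFO_STRING"""

names = ['a', 'b', 'c', 'd', 'more_info']

# As in the answer above: fields after the first keep their leading space.
x_raw = pd.read_csv(StringIO(s), names=names, header=None, skiprows=5)

# Same call, but whitespace right after each delimiter is skipped.
x_clean = pd.read_csv(StringIO(s), names=names, header=None, skiprows=5,
                      skipinitialspace=True)

print(repr(x_raw['b'].iloc[0]))    # ' dataB'
print(repr(x_clean['b'].iloc[0]))  # 'dataB'
```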

In my experience with read_csv, you have to try a few combinations of options before you get what you want.

Hope this helps.


 