简体   繁体   中英

How do I read a fixed width format text file in pandas?

I just got my hands on pandas and am figuring out how I can read a file. The file is from the WRDS database and is the SP500 constituents list all the way back to the 1960s. I checked the file and no matter what I do to import it using read_csv , I still can't display the data correctly.

df = read_csv('sp500-sb.txt')

df

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1231 entries, 0 to 1230
Data columns: gvkeyx      from      thru     conm
                                        gvkey      co_conm
...(the column names)
dtypes: object(1)

What does the above chunk of output mean? Anything would be helpful.

pandas.read_fwf() was added in pandas 0.7.3 ( April 2012 ) to handle fixed-width files.

  1. API reference

  2. An example from other question

Wes answered me in an email. Cheers.

This is a fixed-width-format file (not delimited by commas or tabs as usual). I realize that pandas does not have a fixed-width reader like R does, though one can be fashioned very easily. I'll see what I can do. In the meantime if you can export the data in another format (like csv--truly comma separated) you'll be able to read it with read_csv. I suspect with some unix magic you can transform a FWF file into a CSV file.

I recommend following the issue on github as your e-mail is about to disappear from my inbox :)

https://github.com/pydata/pandas/issues/920

best, Wes

What do you mean by display? Doesn't df['gvkey'] give you the data in the gvkey column?

If what you do is print the whole data frame to the console, then take a look at df.to_string() , but it'll be hard to read if you have too many columns. Pandas won't print the whole thing by default if you have too many columns:

import pandas
import numpy 

df1 = pandas.DataFrame(numpy.random.randn(10, 3), columns=['col%d' % d for d in range(3)] )
df2 = pandas.DataFrame(numpy.random.randn(10, 30), columns=['col%d' % d for d in range(30)] )

print df1   # <--- substitute by df2 to see the difference
print
print df1['col1']
print
print df1.to_string()

user, if you need to deal with the fixed format right now, you can use something like the following:

def fixed_width_to_items(filename, fields, first_column_is_index=False, ignore_first_rows=0):
    reader = open(filename, 'r')
    # skip first rows 
    for i in xrange(ignore_first_rows):
        reader.next()
    if first_column_is_index:
        index = slice(0, fields[1])
        fields = [slice(*x) for x  in zip(fields[1:-1], fields[2:])]
        return ((line[index], [line[x].strip() for x in fields]) for line in reader)
    else:
        fields = [slice(*x) for x  in zip(fields[:-1], fields[1:])]
        return ((i, [line[x].strip() for x in fields]) for i,line in enumerate(reader)) 

Here's a test program:

import pandas
import numpy
import tempfile

# create a data frame
df = pandas.DataFrame(numpy.random.randn(100, 5))
file_ = tempfile.NamedTemporaryFile(delete=True)
file_.write(df.to_string())
file_.flush()

# specify fields
fields = [0, 3, 12, 22, 32, 42, 52]
df2 = pandas.DataFrame.from_items( fixed_width_to_items(file_.name, fields, first_column_is_index=True, ignore_first_rows=1) ).T

# need to specify the datatypes, otherwise everything is a string
df2 = pandas.DataFrame(df2, dtype=float)
df2.index = [int(x) for x in df2.index]

# check
assert (df - df2).abs().max().max() < 1E-6

This should do the trick if you need it right now, but bear in mind that the function above is very simple, in particular it doesn't do anything about data types.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM