
How do I read a fixed width format text file in pandas?

I just got my hands on pandas and am figuring out how to read a file. The file is from the WRDS database and is the SP500 constituents list going all the way back to the 1960s. I checked the file, and no matter how I import it with read_csv, I still can't display the data correctly.

df = read_csv('sp500-sb.txt')

df

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1231 entries, 0 to 1230
Data columns: gvkeyx      from      thru     conm
                                        gvkey      co_conm
...(the column names)
dtypes: object(1)

What does the above chunk of output mean? Anything would be helpful.

pandas.read_fwf() was added in pandas 0.7.3 (April 2012) to handle fixed-width files.

  1. API reference

  2. An example from another question
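For a quick feel of how read_fwf is called, here is a minimal sketch. The column names come from the output in the question, but the sample rows and the character offsets in colspecs are made up for illustration; a real WRDS file will have its own widths.

```python
import io

import pandas as pd

# Hypothetical sample in fixed-width layout (values and widths are invented).
text = (
    "gvkeyx from     thru     conm\n"
    "000003 19640331 20001231 S&P 500 Comp-Ltd\n"
    "000003 19700101 19991231 S&P 500 Comp-Ltd\n"
)

names = ["gvkeyx", "from", "thru", "conm"]
# Half-open (start, end) character offsets for each column.
colspecs = [(0, 6), (7, 15), (16, 24), (25, 45)]

df = pd.read_fwf(
    io.StringIO(text),
    colspecs=colspecs,
    names=names,
    header=None,
    skiprows=1,               # skip the header line, since names are given explicitly
    dtype={"gvkeyx": str},    # keep leading zeros in the key
)
print(df)
```

If all columns are contiguous, the widths argument (a list of field widths) can be used instead of colspecs.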

Wes answered me in an email. Cheers.

This is a fixed-width-format file (not delimited by commas or tabs as usual). I realize that pandas does not have a fixed-width reader like R does, though one can be fashioned very easily. I'll see what I can do. In the meantime if you can export the data in another format (like csv--truly comma separated) you'll be able to read it with read_csv. I suspect with some unix magic you can transform a FWF file into a CSV file.

I recommend following the issue on github as your e-mail is about to disappear from my inbox :)

https://github.com/pydata/pandas/issues/920

best, Wes
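The FWF-to-CSV conversion mentioned in the email can also be done in plain Python rather than with unix tools. The following is only a sketch of that idea: fwf_to_csv is a hypothetical helper, and the two-field layout and sample values are invented.

```python
import csv
import io


def fwf_to_csv(src, dst, boundaries):
    """Copy fixed-width lines from src to dst as comma-separated rows.

    boundaries lists the start offset of every field plus the end of the last one.
    """
    spans = list(zip(boundaries[:-1], boundaries[1:]))
    writer = csv.writer(dst, lineterminator="\n")
    for line in src:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        writer.writerow([line[start:end].strip() for start, end in spans])


# Hypothetical layout: field 1 is chars 0-6, field 2 is chars 6-16.
src = io.StringIO("AAPL  150.25    \nMSFT  99.5      \n")
dst = io.StringIO()
fwf_to_csv(src, dst, [0, 6, 16])
csv_text = dst.getvalue()
print(csv_text)
```

The result can then be passed to read_csv as usual. Using the csv module rather than joining with commas by hand means fields that themselves contain commas get quoted.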

What do you mean by display? Doesn't df['gvkey'] give you the data in the gvkey column?

If you are printing the whole data frame to the console, take a look at df.to_string(), but it will be hard to read if you have too many columns. Pandas won't print the whole thing by default if you have too many columns:

import pandas
import numpy

df1 = pandas.DataFrame(numpy.random.randn(10, 3), columns=['col%d' % d for d in range(3)])
df2 = pandas.DataFrame(numpy.random.randn(10, 30), columns=['col%d' % d for d in range(30)])

print(df1)   # <--- substitute df2 to see the difference
print()
print(df1['col1'])
print()
print(df1.to_string())
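As an alternative to to_string(), the display limit itself can be lifted temporarily. This is a small sketch using pandas.option_context with the display.max_columns option; the frame here is random data, as in the snippet above.

```python
import numpy as np
import pandas as pd

wide = pd.DataFrame(np.random.randn(10, 30), columns=["col%d" % d for d in range(30)])

# Inside the block, no column limit applies; the previous setting is restored on exit.
with pd.option_context("display.max_columns", None):
    full = repr(wide)

print(full)
```

How the default view looks depends on the environment (terminal, notebook, script), so only the unrestricted rendering is shown here.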

user, if you need to deal with the fixed format right now, you can use something like the following:

def fixed_width_to_items(filename, fields, first_column_is_index=False, ignore_first_rows=0):
    # fields holds the character offset where each column starts, plus the end offset
    reader = open(filename, 'r')
    # skip first rows
    for i in range(ignore_first_rows):
        next(reader)
    if first_column_is_index:
        # the first field becomes the row label, the remaining fields become the values
        index = slice(0, fields[1])
        fields = [slice(*x) for x in zip(fields[1:-1], fields[2:])]
        return ((line[index], [line[x].strip() for x in fields]) for line in reader)
    else:
        # no index column: label rows by line number
        fields = [slice(*x) for x in zip(fields[:-1], fields[1:])]
        return ((i, [line[x].strip() for x in fields]) for i, line in enumerate(reader))

Here's a test program:

import pandas
import numpy
import tempfile

# create a data frame
df = pandas.DataFrame(numpy.random.randn(100, 5))
# open in text mode so the string from to_string() can be written
file_ = tempfile.NamedTemporaryFile(mode='w', delete=True)
file_.write(df.to_string())
file_.flush()

# specify fields
fields = [0, 3, 12, 22, 32, 42, 52]
items = fixed_width_to_items(file_.name, fields, first_column_is_index=True, ignore_first_rows=1)
# DataFrame.from_items was removed in pandas 1.0; from_dict with orient='index' builds the same rows
df2 = pandas.DataFrame.from_dict(dict(items), orient='index')

# need to specify the datatypes, otherwise everything is a string
df2 = df2.astype(float)
df2.index = [int(x) for x in df2.index]

# check
assert (df - df2).abs().max().max() < 1E-6

This should do the trick if you need it right now, but bear in mind that the function above is very simple; in particular, it does nothing about data types, so every value comes back as a string.
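One way to handle the data types afterwards is to convert column by column. The sketch below assumes a frame of strings like the one such a parser would produce; the values and the weight column are hypothetical, and the date format is an assumption about how the from column is encoded.

```python
import pandas as pd

# Hypothetical all-string frame, as produced by a hand-rolled fixed-width parser.
raw = pd.DataFrame(
    {
        "gvkey": ["001045", "001078"],
        "from": ["19640331", "19700101"],
        "weight": ["0.25", "1.5"],  # invented numeric column for illustration
    }
)

typed = raw.copy()
typed["from"] = pd.to_datetime(typed["from"], format="%Y%m%d")  # assumes YYYYMMDD dates
typed["weight"] = pd.to_numeric(typed["weight"])
# gvkey is left as a string so its leading zeros survive

print(typed.dtypes)
```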
