How do I read a fixed width format text file in pandas?
I just got my hands on pandas and am figuring out how to read a file. The file is from the WRDS database and is the S&P 500 constituents list going all the way back to the 1960s. I checked the file, and no matter what I do to import it using read_csv, I still can't display the data correctly.
df = read_csv('sp500-sb.txt')
df
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1231 entries, 0 to 1230
Data columns: gvkeyx from thru conm
gvkey co_conm
...(the column names)
dtypes: object(1)
What does the above chunk of output mean? Anything would be helpful.
pandas.read_fwf() was added in pandas 0.7.3 (April 2012) to handle fixed-width files.
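For anyone on a pandas version that has it, a minimal sketch of read_fwf could look like the following. The sample rows, column offsets, and the dtype choice are made up for illustration and are not the real WRDS layout; colspecs takes half-open (start, end) character offsets, so you would need to check the actual positions in your own file.

```python
import io

import pandas as pd

# Made-up sample rows; the real WRDS file will have different columns and widths.
SAMPLE = (
    "001000 19640331 19701231 Example Industries\n"
    "001001 19710101 19791231 Sample Holdings\n"
)

# Half-open [start, end) character offsets for each column (assumed layout).
colspecs = [(0, 6), (7, 15), (16, 24), (25, 45)]
names = ["gvkeyx", "from", "thru", "conm"]

df = pd.read_fwf(
    io.StringIO(SAMPLE),
    colspecs=colspecs,
    names=names,
    header=None,            # the sample has no header row
    dtype={"gvkeyx": str},  # keep leading zeros in the key
)
print(df)
```

With a real file you would pass the file path instead of the StringIO object, and drop header=None and names if the first row already holds usable column names.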
Wes answered me in an email. Cheers.
This is a fixed-width-format file (not delimited by commas or tabs as usual). I realize that pandas does not have a fixed-width reader like R does, though one can be fashioned very easily. I'll see what I can do. In the meantime if you can export the data in another format (like csv--truly comma separated) you'll be able to read it with read_csv. I suspect with some unix magic you can transform a FWF file into a CSV file.
I recommend following the issue on github as your e-mail is about to disappear from my inbox :)
https://github.com/pydata/pandas/issues/920
best, Wes
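The FWF-to-CSV conversion mentioned in the email can also be done in plain Python instead of unix tools. This is only a rough sketch: fwf_to_csv is a hypothetical helper, and the sample rows and column boundaries are invented, so adjust them to the real file.

```python
import csv
import io


def fwf_to_csv(lines, boundaries, out):
    """Slice each line at the given character boundaries and write CSV rows."""
    writer = csv.writer(out)
    # Consecutive boundaries form half-open (start, end) spans.
    spans = list(zip(boundaries[:-1], boundaries[1:]))
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        writer.writerow([line[a:b].strip() for a, b in spans])


# Invented sample rows; the boundaries below are assumptions for this sample only.
sample = io.StringIO(
    "001000 19640331 Example Industries, Inc.\n"
    "001001 19710101 Sample Holdings\n"
)
out = io.StringIO()
fwf_to_csv(sample, [0, 7, 16, 45], out)
csv_text = out.getvalue()
print(csv_text)
```

The csv module quotes any field that contains a comma, so company names with commas survive the round trip into read_csv.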
What do you mean by display? Doesn't df['gvkey'] give you the data in the gvkey column?
If what you do is print the whole data frame to the console, then take a look at df.to_string(), but it'll be hard to read if you have too many columns. Pandas won't print the whole thing by default if you have too many columns:
import pandas
import numpy
df1 = pandas.DataFrame(numpy.random.randn(10, 3), columns=['col%d' % d for d in range(3)])
df2 = pandas.DataFrame(numpy.random.randn(10, 30), columns=['col%d' % d for d in range(30)])
print(df1)  # <--- substitute df2 to see the difference
print()
print(df1['col1'])
print()
print(df1.to_string())
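If you would rather keep the normal repr but see every column, pandas also has display options you can change temporarily. A small sketch, assuming a reasonably recent pandas; the width of 1000 is an arbitrary value picked so the 30 columns are not wrapped:

```python
import numpy as np
import pandas as pd

wide = pd.DataFrame(
    np.random.randn(10, 30),
    columns=['col%d' % d for d in range(30)],
)

# Lift the column limit only inside this block; the defaults come back afterwards.
with pd.option_context('display.max_columns', None, 'display.width', 1000):
    full_text = repr(wide)

print(full_text)
```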
user, if you need to deal with the fixed format right now, you can use something like the following:
def fixed_width_to_items(filename, fields, first_column_is_index=False, ignore_first_rows=0):
    # fields lists the character offset where each column starts, plus the end of the last column
    reader = open(filename, 'r')
    # skip first rows
    for i in range(ignore_first_rows):
        next(reader)
    if first_column_is_index:
        # the first column becomes the key of each (key, values) item
        index = slice(0, fields[1])
        fields = [slice(*x) for x in zip(fields[1:-1], fields[2:])]
        return ((line[index], [line[x].strip() for x in fields]) for line in reader)
    else:
        # no index column: use the line number as the key
        fields = [slice(*x) for x in zip(fields[:-1], fields[1:])]
        return ((i, [line[x].strip() for x in fields]) for i, line in enumerate(reader))
Here's a test program:
import pandas
import numpy
import tempfile
# create a data frame
df = pandas.DataFrame(numpy.random.randn(100, 5))
# open in text mode so the string from to_string() can be written directly
file_ = tempfile.NamedTemporaryFile(mode='w', delete=True)
file_.write(df.to_string())
file_.flush()
# specify fields
fields = [0, 3, 12, 22, 32, 42, 52]
# DataFrame.from_items is gone from current pandas; from_dict(..., orient='index') builds the same rows
df2 = pandas.DataFrame.from_dict(dict(fixed_width_to_items(file_.name, fields, first_column_is_index=True, ignore_first_rows=1)), orient='index')
# need to specify the datatypes, otherwise everything is a string
df2 = pandas.DataFrame(df2, dtype=float)
df2.index = [int(x) for x in df2.index]
# check
assert (df - df2).abs().max().max() < 1E-6
This should do the trick if you need it right now, but bear in mind that the function above is very simple; in particular, it doesn't do anything about data types.
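Since everything comes back as strings, you would convert the columns yourself afterwards. A rough sketch of one way to do that; the frame below is invented, and the weight column in particular is hypothetical and only there to show a numeric conversion:

```python
import pandas as pd

# Invented all-string frame, like what a plain slicing reader would hand back.
raw = pd.DataFrame({
    'gvkeyx': ['001000', '001001'],
    'from': ['19640331', '19710101'],
    'weight': ['0.25', '1.5'],  # hypothetical numeric column
})

typed = raw.copy()
# Parse YYYYMMDD strings into real timestamps.
typed['from'] = pd.to_datetime(typed['from'], format='%Y%m%d')
# Turn numeric strings into floats; gvkeyx stays a string to keep its leading zeros.
typed['weight'] = pd.to_numeric(typed['weight'])

print(typed.dtypes)
```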