Splitting a Dataframe Column in Python
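I am trying to drop some rows from a pandas DataFrame df. It looks like this and has 180 rows and 2745 columns. I want to get rid of the rows that have a curv_typ of PYC_RT or YCIF_RT. I also want to get rid of the geo\time column. I am pulling this data from a CSV file, and I have realized that curv_typ,maturity,bonds,geo\time and the values below it, such as PYC_RT,Y1,GBAAA,EA, are all in a single column: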
curv_typ,maturity,bonds,geo\time 2015M06D16 2015M06D15 2015M06D11 \
0 PYC_RT,Y1,GBAAA,EA -0.24 -0.24 -0.24
1 PYC_RT,Y1,GBA_AAA,EA -0.02 -0.03 -0.10
2 PYC_RT,Y10,GBAAA,EA 0.94 0.92 0.99
3 PYC_RT,Y10,GBA_AAA,EA 1.67 1.70 1.60
4 PYC_RT,Y11,GBAAA,EA 1.03 1.01 1.09
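I decided to try splitting this column and then dropping the resulting individual columns, but I get the error KeyError: 'curv_typ,maturity,bonds,geo\\time' on the last line of the code: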
df_new = pd.DataFrame(df['curv_typ,maturity,bonds,geo\\time'].str.split(',').tolist(), df[1:]).stack()
import os
import urllib2
import gzip
import StringIO
import pandas as pd
baseURL = "http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file="
filename = "data/irt_euryld_d.tsv.gz"
outFilePath = filename.split('/')[1][:-3]
response = urllib2.urlopen(baseURL + filename)
compressedFile = StringIO.StringIO()
compressedFile.write(response.read())
compressedFile.seek(0)
decompressedFile = gzip.GzipFile(fileobj=compressedFile, mode='rb')
with open(outFilePath, 'w') as outfile:
    outfile.write(decompressedFile.read())
#Now have to deal with tsv file
import csv
with open(outFilePath,'rb') as tsvin, open('ECB.csv', 'wb') as csvout:
    tsvin = csv.reader(tsvin, delimiter='\t')
    writer = csv.writer(csvout)
    for data in tsvin:
        writer.writerow(data)
csvout = 'C:\Users\Sidney\ECB.csv'
#df = pd.DataFrame.from_csv(csvout)
df = pd.read_csv('C:\Users\Sidney\ECB.csv', delimiter=',', encoding="utf-8-sig")
print df
df_new = pd.DataFrame(df['curv_typ,maturity,bonds,geo\\time'].str.split(',').tolist(), df[1:]).stack()
Edit: Based on reptilicus's answer, I used the following code:
#Now have to deal with tsv file
import csv
outFilePath = filename.split('/')[1][:-3] #As in the code above, just put here for reference
csvout = 'C:\Users\Sidney\ECB.tsv'
outfile = open(csvout, "w")
with open(outFilePath, "rb") as f:
    for line in f.read():
        line.replace(",", "\t")
        outfile.write(line)
outfile.close()
df = pd.DataFrame.from_csv("ECB.tsv", sep="\t", index_col=False)
I still get exactly the same output as before.
Thank you
That CSV is badly formatted: it mixes comma-separated and tab-separated data.
Get rid of the commas first:
tr ',' '\t' < irt_euryld_d.tsv > test.tsv
If you can't use tr, you can do it in Python:
# Write to test.tsv, the same output name the tr command uses
with open("irt_euryld_d.tsv", "r") as f, open("test.tsv", "w") as outfile:
    for line in f:
        # str.replace returns a new string, so write its result
        outfile.write(line.replace(",", "\t"))
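Two details matter when adapting a loop like this. Iterating over f.read() yields single characters rather than lines, and str.replace returns a new string instead of changing the original, so the result has to be the thing that gets written. The sketch below shows one way to do the conversion in Python 3. It uses in-memory buffers so it runs without the download; the function name commas_to_tabs and the two-line sample are assumptions made for illustration, not part of the original answer.

```python
import io


def commas_to_tabs(src, dst):
    """Copy text from src to dst, replacing every comma with a tab."""
    # Iterating over the file object yields whole lines.
    for line in src:
        # str.replace returns a new string and leaves `line` unchanged,
        # so the returned value is what gets written out.
        dst.write(line.replace(",", "\t"))


# Hypothetical sample in the same mixed layout as the downloaded file:
# the first field is comma-separated, the rest are tab-separated.
sample = io.StringIO(
    "curv_typ,maturity,bonds,geo\\time\t2015M06D17\n"
    "PYC_RT,Y1,GBAAA,EA\t-0.23\n"
)
converted = io.StringIO()
commas_to_tabs(sample, converted)
print(converted.getvalue())
```

With real files, the same function could be called with two handles opened in text mode, one for reading irt_euryld_d.tsv and one for writing test.tsv.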
Then it loads nicely in pandas:
In [9]: df = DataFrame.from_csv("test.tsv", sep="\t", index_col=False)
In [10]: df
Out[10]:
curv_typ maturity bonds geo\time 2015M06D17 2015M06D16 \
0 PYC_RT Y1 GBAAA EA -0.23 -0.24
1 PYC_RT Y1 GBA_AAA EA -0.05 -0.02
2 PYC_RT Y10 GBAAA EA 0.94 0.94
3 PYC_RT Y10 GBA_AAA EA 1.66 1.67
In [11]: df[df["curv_typ"] != "PYC_RT"]
Out[11]:
curv_typ maturity bonds geo\time 2015M06D17 2015M06D16 \
60 YCIF_RT Y1 GBAAA EA -0.22 -0.23
61 YCIF_RT Y1 GBA_AAA EA 0.04 0.08
62 YCIF_RT Y10 GBAAA EA 2.00 1.97
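The session above filters out only PYC_RT, while the question also asks to remove YCIF_RT rows and the geo\time column. The following is a minimal sketch of the whole cleanup done inside pandas, without rewriting the file first. It assumes a recent pandas, where pd.read_csv is used because DataFrame.from_csv was deprecated and later removed. The helper name load_and_clean, the inline SAMPLE text, and the OTHER_RT row are hypothetical and exist only to make the example runnable.

```python
import io

import pandas as pd

# Hypothetical sample in the same mixed layout as the Eurostat file:
# the first field is comma-separated, the remaining fields are tab-separated.
SAMPLE = (
    "curv_typ,maturity,bonds,geo\\time\t2015M06D17\t2015M06D16\n"
    "PYC_RT,Y1,GBAAA,EA\t-0.23\t-0.24\n"
    "YCIF_RT,Y1,GBAAA,EA\t-0.22\t-0.23\n"
    "OTHER_RT,Y1,GBAAA,EA\t-0.25\t-0.26\n"  # made-up curve type for the demo
)


def load_and_clean(source, unwanted=("PYC_RT", "YCIF_RT")):
    # Parse on tabs only, so the comma-joined key field stays in one column.
    df = pd.read_csv(source, sep="\t")
    combined = df.columns[0]

    # Split the combined values on commas and name the new columns
    # after the pieces of the combined header.
    keys = df[combined].str.split(",", expand=True)
    keys.columns = combined.split(",")
    df = pd.concat([keys, df.drop(columns=combined)], axis=1)

    # Keep only rows whose curv_typ is not in the unwanted set.
    df = df[~df["curv_typ"].isin(list(unwanted))]

    # Drop the geo\time column and renumber the remaining rows.
    return df.drop(columns="geo\\time").reset_index(drop=True)


cleaned = load_and_clean(io.StringIO(SAMPLE))
print(cleaned)
```

Splitting with str.split(",", expand=True) keeps the original file untouched; passing a path instead of the StringIO buffer should work the same way, provided the real file follows the layout assumed here.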