New pandas DataFrame from another DataFrame based on a unique multiple column index
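I am trying to create a new pandas.DataFrame from another pandas.DataFrame based on a unique multi-column index. I can get a pandas.core.index.MultiIndex with the correct result using df.index.drop_duplicates(), but I can't figure out how to convert it into a pandas.DataFrame.
The following script creates the original DataFrame using a SQL query.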
import sqlite3 as db
import pandas as pd
conn = db.connect('C:/data.db')
query = """SELECT TimeStamp, UnderlyingSymbol, Expiry, Strike, CP, BisectIV, OTMperc FROM ActiveOptions
WHERE TimeStamp = '2015-11-09 16:00:00' AND UnderlyingSymbol = 'INTC' AND
Expiry < '2015-11-27 16:00:00' AND OTMperc < .02 AND OTMperc > -.02
ORDER BY UnderlyingSymbol, Expiry, ABS(OTMperc)"""
df = pd.read_sql_query(sql=query, con=conn,
                       index_col=['TimeStamp', 'UnderlyingSymbol', 'Expiry'],
                       parse_dates=['TimeStamp', 'Expiry'])
The script creates the following DataFrame:
In[6]: df
Out[6]:
                                                          Strike  CP  BisectIV  OTMperc
TimeStamp           UnderlyingSymbol Expiry
2015-11-09 16:00:00 INTC             2015-11-13 16:00:00    33.5  -1    0.2302  -0.0045
                                     2015-11-13 16:00:00    33.5   1    0.2257   0.0045
                                     2015-11-13 16:00:00    33.0  -1    0.2442   0.0105
                                     2015-11-13 16:00:00    33.0   1    0.2426  -0.0106
                                     2015-11-13 16:00:00    34.0   1    0.2240   0.0191
                                     2015-11-13 16:00:00    34.0  -1    0.2295  -0.0195
                                     2015-11-20 16:00:00    33.5   1    0.2817   0.0045
                                     2015-11-20 16:00:00    33.5  -1    0.2840  -0.0045
                                     2015-11-20 16:00:00    33.0  -1    0.2935   0.0105
                                     2015-11-20 16:00:00    33.0   1    0.2914  -0.0106
                                     2015-11-20 16:00:00    34.0   1    0.2718   0.0191
                                     2015-11-20 16:00:00    34.0  -1    0.2784  -0.0195
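For anyone without access to data.db, here is a rough way to reproduce a frame of the same shape. This is only a sketch: the in-memory database, the simplified query, the four sample rows (copied from the output above) and the name `sample` are stand-ins, not the original setup.

```python
import sqlite3
import pandas as pd

# In-memory database standing in for C:/data.db (assumption: same column names).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ActiveOptions (TimeStamp TEXT, UnderlyingSymbol TEXT, "
    "Expiry TEXT, Strike REAL, CP INTEGER, BisectIV REAL, OTMperc REAL)"
)

# A handful of rows taken from the output shown in the question.
rows = [
    ("2015-11-09 16:00:00", "INTC", "2015-11-13 16:00:00", 33.5, -1, 0.2302, -0.0045),
    ("2015-11-09 16:00:00", "INTC", "2015-11-13 16:00:00", 33.5, 1, 0.2257, 0.0045),
    ("2015-11-09 16:00:00", "INTC", "2015-11-20 16:00:00", 33.5, 1, 0.2817, 0.0045),
    ("2015-11-09 16:00:00", "INTC", "2015-11-20 16:00:00", 33.5, -1, 0.2840, -0.0045),
]
conn.executemany("INSERT INTO ActiveOptions VALUES (?, ?, ?, ?, ?, ?, ?)", rows)

# Same index_col / parse_dates arguments as in the question, simplified query.
sample = pd.read_sql_query(
    "SELECT * FROM ActiveOptions ORDER BY UnderlyingSymbol, Expiry, ABS(OTMperc)",
    con=conn,
    index_col=["TimeStamp", "UnderlyingSymbol", "Expiry"],
    parse_dates=["TimeStamp", "Expiry"],
)

# The three-level index repeats, which is the situation described above.
print(sample.index.is_unique)
print(len(sample.index.drop_duplicates()))
```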
Creating the new DataFrame from the unique multi-column index produces the following output:
In[10]: new_df = df.index.drop_duplicates()
In[11]: new_df
Out[11]:
MultiIndex(levels=[[2015-11-09 16:00:00], [u'INTC'], [2015-11-13 16:00:00, 2015-11-20 16:00:00]],
labels=[[0, 0], [0, 0], [0, 1]],
names=[u'TimeStamp', u'UnderlyingSymbol', u'Expiry'])
In[12]: type(new_df)
Out[12]: pandas.core.index.MultiIndex
Any ideas?
The problem is that you are setting new_df to the index itself, with the duplicates removed:
new_df = df.index.drop_duplicates()
You just need to select the rows whose index is not a duplicate. You can use the duplicated method of the index to filter the old DataFrame:
new_df = df[~df.index.duplicated()]
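To make the mask explicit, here is a minimal sketch on a toy frame. The frame `toy` and its values are made up for illustration. By default `duplicated` marks every repeat after the first occurrence, so negating it keeps the first row per index key; `keep='last'` keeps the last row instead.

```python
import pandas as pd

# Toy frame with a repeated two-level index (illustrative data only).
idx = pd.MultiIndex.from_tuples(
    [("INTC", "2015-11-13"), ("INTC", "2015-11-13"),
     ("INTC", "2015-11-20"), ("INTC", "2015-11-20")],
    names=["UnderlyingSymbol", "Expiry"],
)
toy = pd.DataFrame({"Strike": [33.5, 33.0, 33.5, 34.0]}, index=idx)

# Boolean array: True for every occurrence after the first of each key.
mask = toy.index.duplicated()

# Keep the first row for each unique index key.
first_rows = toy[~mask]

# Keep the last row for each unique index key instead.
last_rows = toy[~toy.index.duplicated(keep="last")]

print(first_rows)
print(last_rows)
```

Which row survives depends on row order, so sorting beforehand (as the ORDER BY in the question does) decides which row is kept.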
A small example, based on this:
import numpy as np
import pandas as pd

# create a data sample with a MultiIndex
arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'one', 'one', 'two', 'one', 'two', 'one', 'one']]
# (the first two entries and the last two entries are duplicates)
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
s = pd.Series(np.random.randn(8), index=index)
The original data:
>>> s
first  second
bar    one      -0.932521
       one       1.969771
baz    one       1.574908
       two       0.125159
foo    one      -0.075174
       two       0.777039
qux    one      -0.992862
       one      -1.099260
dtype: float64
And with the duplicates filtered out:
>>> s[~s.index.duplicated()]
first  second
bar    one      -0.932521
baz    one       1.574908
       two       0.125159
foo    one      -0.075174
       two       0.777039
qux    one      -0.992862
dtype: float64
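If what you actually want is the unique index keys themselves as a DataFrame, or one value per key, two other routes may be worth a look. This is a sketch with made-up data; `s2`, `unique_keys` and `first_per_key` are illustrative names. Note that `groupby(...).first()` returns the first non-null value per group and sorts the keys, so it is not always identical to the `duplicated` mask.

```python
import pandas as pd

# Small series with one repeated index key (illustrative data only).
index = pd.MultiIndex.from_tuples(
    [("bar", "one"), ("bar", "one"), ("baz", "one"), ("baz", "two")],
    names=["first", "second"],
)
s2 = pd.Series([1.0, 2.0, 3.0, 4.0], index=index)

# Turn the de-duplicated MultiIndex into a plain DataFrame of key columns.
unique_keys = s2.index.drop_duplicates().to_frame(index=False)

# One value per unique key: the first non-null value in each group.
first_per_key = s2.groupby(level=["first", "second"]).first()

print(unique_keys)
print(first_per_key)
```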