
New pandas DataFrame from another DataFrame based on a unique multiple column index

I'm trying to create a new pandas.DataFrame from another pandas.DataFrame based on a unique multiple-column index. I can create a pandas.core.index.MultiIndex with the correct results using df.index.drop_duplicates(), but I can't figure out how to convert it to a pandas.DataFrame.

The following script creates the original DataFrame using a SQL query.

import sqlite3 as db
import pandas as pd

conn = db.connect('C:/data.db')
query = """SELECT TimeStamp, UnderlyingSymbol, Expiry, Strike, CP, BisectIV, OTMperc FROM ActiveOptions
           WHERE TimeStamp = '2015-11-09 16:00:00' AND UnderlyingSymbol = 'INTC' AND
           Expiry < '2015-11-27 16:00:00' AND OTMperc < .02  AND OTMperc > -.02
           ORDER BY UnderlyingSymbol, Expiry, ABS(OTMperc)"""

df = pd.read_sql_query(sql=query, con=conn,index_col=['TimeStamp', 'UnderlyingSymbol', 'Expiry'],
                       parse_dates=['TimeStamp', 'Expiry'])

The script creates the following DataFrame:

In[6]: df
Out[6]: 
                                                          Strike  CP  BisectIV  OTMperc
TimeStamp           UnderlyingSymbol Expiry                                            
2015-11-09 16:00:00 INTC             2015-11-13 16:00:00    33.5  -1    0.2302  -0.0045
                                     2015-11-13 16:00:00    33.5   1    0.2257   0.0045
                                     2015-11-13 16:00:00    33.0  -1    0.2442   0.0105
                                     2015-11-13 16:00:00    33.0   1    0.2426  -0.0106
                                     2015-11-13 16:00:00    34.0   1    0.2240   0.0191
                                     2015-11-13 16:00:00    34.0  -1    0.2295  -0.0195

                                     2015-11-20 16:00:00    33.5   1    0.2817   0.0045
                                     2015-11-20 16:00:00    33.5  -1    0.2840  -0.0045
                                     2015-11-20 16:00:00    33.0  -1    0.2935   0.0105
                                     2015-11-20 16:00:00    33.0   1    0.2914  -0.0106
                                     2015-11-20 16:00:00    34.0   1    0.2718   0.0191
                                     2015-11-20 16:00:00    34.0  -1    0.2784  -0.0195

Trying to create a new DataFrame with a unique multiple-column index generates the following output:

In[10]: new_df = df.index.drop_duplicates()
In[11]: new_df
Out[11]: 
MultiIndex(levels=[[2015-11-09 16:00:00], [u'INTC'], [2015-11-13 16:00:00, 2015-11-20 16:00:00]],
           labels=[[0, 0], [0, 0], [0, 1]],
           names=[u'TimeStamp', u'UnderlyingSymbol', u'Expiry'])

In[12]: type(new_df)
Out[12]: pandas.core.index.MultiIndex

Any ideas?

The problem is that you set new_df to the index itself with the duplicates removed, not to the rows of the DataFrame:

new_df = df.index.drop_duplicates()

What you want is to keep one row per unique index value. The index's duplicated method returns a boolean mask that marks every repeat of an index value after its first occurrence, so negating the mask filters the original data frame down to the first row for each value:

new_df = df[~df.index.duplicated()]

A small example:

import numpy as np
import pandas as pd

# create a data sample with a two-level MultiIndex
arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'one', 'one', 'two', 'one', 'two', 'one', 'one']]
# (the first two entries and the last two entries are duplicate pairs)
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
s = pd.Series(np.random.randn(8), index=index)

The original data:

>>> s
first  second
bar    one      -0.932521
       one       1.969771
baz    one       1.574908
       two       0.125159
foo    one      -0.075174
       two       0.777039
qux    one      -0.992862
       one      -1.099260
dtype: float64

And with the duplicates filtered out:

>>> s[~s.index.duplicated()]
first  second
bar    one      -0.932521
baz    one       1.574908
       two       0.125159
foo    one      -0.075174
       two       0.777039
qux    one      -0.992862
dtype: float64
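
The same filter works on a DataFrame. The following is a minimal sketch rather than the asker's actual data: the index values and the Strike and CP numbers are made up for illustration, and only two of the three index levels from the question are used. It also shows the keep argument of duplicated, and, assuming a reasonably recent pandas version where MultiIndex.to_frame is available, one way to turn the de-duplicated index itself into a DataFrame, which is what the question literally asked for.

```python
import pandas as pd

# Hypothetical data shaped like the question's frame (values are illustrative).
index = pd.MultiIndex.from_tuples(
    [
        ("INTC", "2015-11-13"),
        ("INTC", "2015-11-13"),
        ("INTC", "2015-11-20"),
        ("INTC", "2015-11-20"),
    ],
    names=["UnderlyingSymbol", "Expiry"],
)
df = pd.DataFrame(
    {"Strike": [33.5, 33.0, 33.5, 34.0], "CP": [-1, 1, 1, -1]},
    index=index,
)

# Boolean mask: True for every repeat of an index entry that was already seen.
mask = df.index.duplicated(keep="first")

# Keep the first row for each unique index entry.
first_rows = df[~mask]

# keep="last" keeps the final row of each group of duplicates instead.
last_rows = df[~df.index.duplicated(keep="last")]

# If only the unique index values are wanted, as ordinary columns,
# the de-duplicated MultiIndex can be converted to a frame directly.
unique_keys = df.index.drop_duplicates().to_frame(index=False)

print(first_rows)
print(last_rows)
print(unique_keys)
```

Which variant fits depends on what the new frame should contain: first_rows and last_rows carry the data columns along with one representative row per index value, while unique_keys holds only the index levels.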
