How to modify DataFrames so that they only have rows with shared index values in Pandas?

So, I'm a data science student working with some data in Python Pandas, and I have two dataframes whose indices are dates (each generated by reading CSV files with pandas.read_csv("filepath", index_col="DATE", parse_dates=True, dayfirst=True)). What I want to do is modify these dataframes so that each contains only the rows whose index value appears in both of them. That way, I can directly compare them to look for correlations in the data.

I've spent the last few hours searching the documentation and SO for ways to do this, and at the moment, I've got the following code:

common_dates = list(set(df1.index.values).intersection(df2.index.values))
print(common_dates)
print(df1.index.values)
df1 = df1.take(common_dates)
df2 = df2.take(common_dates)

However, this is giving me an index out of bounds error, even though common_dates should be constructed from the items in the index.values array. When I look at the output of the print() statements I added as part of my troubleshooting, I see the following for common_dates:

[numpy.datetime64('2000-12-31T00:00:00.000000000'), numpy.datetime64('2001-12-31T00:00:00.000000000'), numpy.datetime64('2004-12-31T00:00:00.000000000'), numpy.datetime64('2003-12-31T00:00:00.000000000'), #and more values

And the following for df1.index.values:

['2000-12-31T00:00:00.000000000' '2001-12-31T00:00:00.000000000'
 '2002-12-31T00:00:00.000000000' '2003-12-31T00:00:00.000000000' #and more values

The values for df2.index.values look similar to df1:

['1947-12-31T00:00:00.000000000' '1948-12-31T00:00:00.000000000'
#lots of values
 '1997-12-31T00:00:00.000000000' '1998-12-31T00:00:00.000000000'
 '1999-12-31T00:00:00.000000000' '2000-12-31T00:00:00.000000000'
 '2001-12-31T00:00:00.000000000' '2002-12-31T00:00:00.000000000'
#more values

This gives an "indices out of bounds" error. I've tried using list(map(str, common_dates)) to convert common_dates to strings, since it looks like there might be some sort of type mismatch, but this gives an "invalid literal for int() with base 10: '2000-12-31T00:00:00.000000000'" error instead. I've tried to similarly convert them to int or numpy.datetime64, but both of these give "index out of bounds" errors.

I've also tried an alternate approach using df1.iterrows():

droplist = []
for date, value in df1.iterrows():
    if date not in common_dates:
        droplist.append(date)
df1= df1.drop(droplist)

I also tried a version of this that compares each row's date directly to the values of df2.index.values. Both of these simply result in all rows being dropped from the table, rather than only the non-matching rows.

What am I doing wrong here? Am I simply taking the wrong approach to this, or is there something I'm missing?

I think the problem here is take, which selects rows by integer position rather than by index label. For me, DataFrame.loc works for selecting by the common index values:

import pandas as pd

a = pd.DatetimeIndex(['2000-12-31T00:00:00.000000000',
                      '2001-12-31T00:00:00.000000000',
                      '2002-12-31T00:00:00.000000000',
                      '2003-12-31T00:00:00.000000000'])

b = pd.DatetimeIndex(['1947-12-31T00:00:00.000000000',
                      '1948-12-31T00:00:00.000000000',
                      '1997-12-31T00:00:00.000000000',
                      '1998-12-31T00:00:00.000000000',
                      '1999-12-31T00:00:00.000000000',
                      '2000-12-31T00:00:00.000000000',
                      '2001-12-31T00:00:00.000000000',
                      '2002-12-31T00:00:00.000000000'])

df1 = pd.DataFrame(index=a)
df2 = pd.DataFrame(index=b)

common_dates = list(set(df1.index.values).intersection(df2.index.values))
print(common_dates)
[numpy.datetime64('2000-12-31T00:00:00.000000000'), 
 numpy.datetime64('2001-12-31T00:00:00.000000000'), 
 numpy.datetime64('2002-12-31T00:00:00.000000000')]
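
To illustrate the difference between take and loc mentioned above, here is a minimal sketch. The column name `x` and its values are made up for the example and are not part of the original data; the point is only that take expects integer positions while loc expects index labels such as dates.

```python
import pandas as pd

# Hypothetical data: the column "x" and its values are invented for illustration.
idx = pd.DatetimeIndex(["2000-12-31", "2001-12-31", "2002-12-31"])
df = pd.DataFrame({"x": [10, 20, 30]}, index=idx)

# take() selects rows by 0-based integer position, not by index label.
by_position = df.take([0, 2])

# loc[] selects rows by index label, so date values are valid here.
wanted = pd.DatetimeIndex(["2000-12-31", "2002-12-31"])
by_label = df.loc[wanted]

print(by_position)
print(by_label)
```

Both selections pick the first and third rows, but only loc can be driven by the dates themselves, which is why passing date values to take runs into position errors.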

It is also possible to use Index.intersection to get the common index values:

common_dates = df1.index.intersection(df2.index)
print(common_dates)
DatetimeIndex(['2000-12-31', '2001-12-31', '2002-12-31'], 
              dtype='datetime64[ns]', freq='A-DEC')

df1 = df1.loc[common_dates]
df2 = df2.loc[common_dates]
print(df1)
Empty DataFrame
Columns: []
Index: [2000-12-31 00:00:00, 2001-12-31 00:00:00, 2002-12-31 00:00:00]

print(df2)
Empty DataFrame
Columns: []
Index: [2000-12-31 00:00:00, 2001-12-31 00:00:00, 2002-12-31 00:00:00]
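
Since the stated goal is to compare the two frames for correlations, two other routes may be worth considering. The sketch below is only an illustration under assumptions: the column names `a` and `b` and all values are hypothetical. It filters each frame with Index.isin, and separately combines both frames with an inner join on the index so that only shared dates remain.

```python
import pandas as pd

# Hypothetical data: column names "a" and "b" and all values are invented.
df1 = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0, 4.0]},
    index=pd.DatetimeIndex(["2000-12-31", "2001-12-31", "2002-12-31", "2003-12-31"]),
)
df2 = pd.DataFrame(
    {"b": [5.0, 2.0, 4.0, 6.5]},
    index=pd.DatetimeIndex(["1999-12-31", "2000-12-31", "2001-12-31", "2002-12-31"]),
)

# Option 1: keep only the rows whose index label also appears in the other frame.
df1_common = df1[df1.index.isin(df2.index)]
df2_common = df2[df2.index.isin(df1.index)]

# Option 2: an inner join on the index keeps only the shared dates
# and places the columns side by side for direct comparison.
combined = df1.join(df2, how="inner")
print(combined)

# Pearson correlation between the two (hypothetical) columns.
print(combined["a"].corr(combined["b"]))
```

The inner join avoids building the list of common dates by hand, at the cost of producing one combined frame instead of two separate ones.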


