How to modify DataFrames so that they only have rows with shared index values in Pandas?
I'm a data science student working with some data in Python Pandas, and I have two dataframes whose indices are dates (each generated by reading a CSV file with pandas.read_csv("filepath", index_col="DATE", parse_dates=True, dayfirst=True)). I want to modify these dataframes so that each contains only the rows whose index value is present in both of them; that way, I can compare them directly to look for correlations in the data.
I've spent the last few hours searching the documentation and SO for ways to do this, and at the moment I've got the following code:
common_dates = list(set(df1.index.values).intersection(df2.index.values))
print(common_dates)
print(df1.index.values)
df1= df1.take(common_dates)
df2= df2.take(common_dates)
However, this gives me an index out of bounds error, even though common_dates should be constructed from the items in the index.values array. When I look at the output of the print() statements I added as part of my troubleshooting, I see the following for common_dates:
[numpy.datetime64('2000-12-31T00:00:00.000000000'), numpy.datetime64('2001-12-31T00:00:00.000000000'), numpy.datetime64('2004-12-31T00:00:00.000000000'), numpy.datetime64('2003-12-31T00:00:00.000000000'), #and more values
And the following for df1.index.values:
['2000-12-31T00:00:00.000000000' '2001-12-31T00:00:00.000000000'
'2002-12-31T00:00:00.000000000' '2003-12-31T00:00:00.000000000' #and more values
The values for df2.index.values look similar to df1:
['1947-12-31T00:00:00.000000000' '1948-12-31T00:00:00.000000000'
#lots of values
'1997-12-31T00:00:00.000000000' '1998-12-31T00:00:00.000000000'
'1999-12-31T00:00:00.000000000' '2000-12-31T00:00:00.000000000'
'2001-12-31T00:00:00.000000000' '2002-12-31T00:00:00.000000000'
#more values
This gives an "indices out of bounds" error. I've tried using list(map(str, common_dates)) to convert common_dates to strings, since it looks like there might be some sort of type mismatch, but this gives an "invalid literal for int() with base 10: '2000-12-31T00:00:00.000000000'" error instead. I've similarly tried converting them to int or numpy.datetime64, but both of these give "index out of bounds" errors.
I've also tried an alternate approach using df1.iterrows():
droplist = []
for date, value in df1.iterrows():
    if date not in common_dates:
        droplist.append(date)
df1 = df1.drop(droplist)
I also tried a version of this that compares each row's date directly to the values of df2.index.values. Both of these simply result in all rows being dropped from the table, rather than only the non-matching rows.
What am I doing wrong here? Am I simply taking the wrong approach, or is there something I'm missing?
I think the problem here is take, which selects by integer position rather than by index label. For me, DataFrame.loc works for selecting by the common indices:
import pandas as pd

a = pd.DatetimeIndex(['2000-12-31T00:00:00.000000000',
                      '2001-12-31T00:00:00.000000000',
                      '2002-12-31T00:00:00.000000000',
                      '2003-12-31T00:00:00.000000000'])
b = pd.DatetimeIndex(['1947-12-31T00:00:00.000000000',
                      '1948-12-31T00:00:00.000000000',
                      '1997-12-31T00:00:00.000000000',
                      '1998-12-31T00:00:00.000000000',
                      '1999-12-31T00:00:00.000000000',
                      '2000-12-31T00:00:00.000000000',
                      '2001-12-31T00:00:00.000000000',
                      '2002-12-31T00:00:00.000000000'])
df1 = pd.DataFrame(index=a)
df2 = pd.DataFrame(index=b)
common_dates = list(set(df1.index.values).intersection(df2.index.values))
print(common_dates)
[numpy.datetime64('2000-12-31T00:00:00.000000000'),
numpy.datetime64('2001-12-31T00:00:00.000000000'),
numpy.datetime64('2002-12-31T00:00:00.000000000')]
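As a small illustration of the difference between the two selectors, here is a minimal sketch with made-up data (the column name value and the three dates are assumptions, not taken from the question). DataFrame.take works on integer positions, while DataFrame.loc works on index labels, which would explain why passing dates to take fails:

```python
import pandas as pd

# Hypothetical sample data; the column name "value" is an assumption.
idx = pd.DatetimeIndex(["2000-12-31", "2001-12-31", "2002-12-31"])
df = pd.DataFrame({"value": [10, 20, 30]}, index=idx)

# take() expects integer positions (0, 1, 2, ...), not index labels.
by_position = df.take([0, 2])

# loc selects by label, so dates can be passed directly.
by_label = df.loc[[pd.Timestamp("2000-12-31"), pd.Timestamp("2002-12-31")]]

print(by_position)
print(by_label)
```

Both calls pick the first and third rows here, but only loc lets you ask for them by date.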
It is also possible to use Index.intersection to get the common indices:
common_dates = df1.index.intersection(df2.index)
print(common_dates)
DatetimeIndex(['2000-12-31', '2001-12-31', '2002-12-31'],
dtype='datetime64[ns]', freq='A-DEC')
df1= df1.loc[common_dates]
df2= df2.loc[common_dates]
print (df1)
Empty DataFrame
Columns: []
Index: [2000-12-31 00:00:00, 2001-12-31 00:00:00, 2002-12-31 00:00:00]
print (df2)
Empty DataFrame
Columns: []
Index: [2000-12-31 00:00:00, 2001-12-31 00:00:00, 2002-12-31 00:00:00]
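Since the goal in the question is to compare the two frames, here is a hedged sketch of two other ways to get the same result, which are not part of the answer above: DataFrame.align with join="inner", and a boolean mask built with Index.isin. The column names house_index and income and all values are hypothetical, chosen only for illustration:

```python
import pandas as pd

# Hypothetical frames; column names and values are made up for illustration.
df1 = pd.DataFrame(
    {"house_index": [1.0, 2.0, 3.0, 4.0]},
    index=pd.to_datetime(["2000-12-31", "2001-12-31", "2002-12-31", "2003-12-31"]),
)
df2 = pd.DataFrame(
    {"income": [5.0, 10.0, 20.0, 30.0]},
    index=pd.to_datetime(["1999-12-31", "2000-12-31", "2001-12-31", "2002-12-31"]),
)

# align with join="inner" keeps only the index labels present in both frames.
left, right = df1.align(df2, join="inner", axis=0)

# Boolean-mask alternative: keep rows of df1 whose index appears in df2.
masked = df1[df1.index.isin(df2.index)]

# With matching indices, the columns can be compared directly.
corr = left["house_index"].corr(right["income"])
print(left.index.equals(right.index))
print(corr)
```

align returns both trimmed frames in one call, while the isin mask is convenient when only one of the frames needs filtering.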