How to select rows in a DataFrame based on rows in another DataFrame using Python
I have two DataFrames. df1 looks like this:
id year CalendarWeek DayName interval counts
1 2014 1 sun 10:30 3
1 2014 1 sun 11:30 4
1 2014 2 wed 12:00 5
1 2014 2 fri 9:00 2
2 2014 1 sun 13:00 3
2 2014 1 sun 14:30 1
2 2014 1 mon 10:30 2
2 2014 2 wed 14:00 3
2 2014 2 fri 15:00 5
3 2014 1 thu 16:30 2
3 2014 1 thu 17:00 1
3 2014 2 sat 12:00 2
3 2014 2 sat 13:30 3
df2 looks like this:
id year CalendarWeek DayName interval NewCounts
1 2014 1 sun 10:00 2
1 2014 1 sun 10:30 4
1 2014 1 sun 11:30 5
1 2014 2 wed 10:30 6
1 2014 2 wed 12:00 3
1 2014 2 fri 8:30 1
1 2014 2 fri 9:00 2
2 2014 1 sun 12:30 3
2 2014 1 sun 13:00 4
2 2014 1 sun 14:30 4
2 2014 1 mon 9:00 35
2 2014 1 mon 10:30 1
2 2014 2 wed 12:30 23
2 2014 2 wed 14:00 4
2 2014 2 fri 15:00 3
3 2014 1 thu 14:30 1
3 2014 1 thu 15:00 3
3 2014 1 thu 16:30 34
3 2014 1 thu 17:00 5
3 2014 2 sat 12:00 3
3 2014 2 sat 13:30 4
3 2014 2 sat 14:00 2
I want to pick out all rows in df2 that match df1 on the columns id, year, CalendarWeek, DayName and interval. The result I want should look like this:
id year CalendarWeek DayName interval NewCounts
1 2014 1 sun 10:30 4
1 2014 1 sun 11:30 5
1 2014 2 wed 12:00 3
1 2014 2 fri 9:00 2
2 2014 1 sun 13:00 4
2 2014 1 sun 14:30 4
2 2014 1 mon 10:30 1
2 2014 2 wed 14:00 4
2 2014 2 fri 15:00 3
3 2014 1 thu 16:30 34
3 2014 1 thu 17:00 5
3 2014 2 sat 12:00 3
3 2014 2 sat 13:30 4
In Python, how can I select these specific rows in one DataFrame based on the columns of another DataFrame?
Thank you!
Perform a merge and pass the list of columns to the on= parameter. The default type of merge is 'inner', which keeps only the rows whose key values exist in both DataFrames. (Judging by the column order of the output, df below holds the question's df1 and df1 holds the question's df2.)
In [2]:
df.merge(df1, on=['id','year','CalendarWeek','DayName','interval'])
Out[2]:
id year CalendarWeek DayName interval counts NewCounts
0 1 2014 1 sun 10:30 3 4
1 1 2014 1 sun 11:30 4 5
2 1 2014 2 wed 12:00 5 3
3 1 2014 2 fri 9:00 2 2
4 2 2014 1 sun 13:00 3 4
5 2 2014 1 sun 14:30 1 4
6 2 2014 1 mon 10:30 2 1
7 2 2014 2 wed 14:00 3 4
8 2 2014 2 fri 15:00 5 3
9 3 2014 1 thu 16:30 2 34
10 3 2014 1 thu 17:00 1 5
11 3 2014 2 sat 12:00 2 3
12 3 2014 2 sat 13:30 3 4
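The merge above also carries the counts column along, whereas the question asks only for the rows of df2. A minimal sketch of one way to get exactly that, assuming the question's naming (df1, df2) and using a hand-typed subset of the data (id 1 only): merge df2 against just the key columns of df1. The drop_duplicates() call is an added precaution, not part of the original answer, in case df1 repeats a key combination.

```python
import pandas as pd

keys = ["id", "year", "CalendarWeek", "DayName", "interval"]

# Subset of the question's data (id 1 only), typed in by hand for illustration.
df1 = pd.DataFrame({
    "id": [1, 1, 1, 1],
    "year": [2014] * 4,
    "CalendarWeek": [1, 1, 2, 2],
    "DayName": ["sun", "sun", "wed", "fri"],
    "interval": ["10:30", "11:30", "12:00", "9:00"],
    "counts": [3, 4, 5, 2],
})
df2 = pd.DataFrame({
    "id": [1] * 7,
    "year": [2014] * 7,
    "CalendarWeek": [1, 1, 1, 2, 2, 2, 2],
    "DayName": ["sun", "sun", "sun", "wed", "wed", "fri", "fri"],
    "interval": ["10:00", "10:30", "11:30", "10:30", "12:00", "8:30", "9:00"],
    "NewCounts": [2, 4, 5, 6, 3, 1, 2],
})

# Merge against only the key columns of df1 so that 'counts' is not carried over.
# drop_duplicates() keeps repeated keys in df1 from multiplying rows of df2.
result = df2.merge(df1[keys].drop_duplicates(), on=keys)
print(result)
```

The result keeps df2's columns only, with the four rows whose keys also appear in df1.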
If your 'id' column is your index, you have to reset the index on both DataFrames so that it becomes a regular column in each. This is because the inner join produces an incorrect result if you specify the on= list of columns and also specify left_index=True and right_index=True. The rows are then matched on the index alone, so every row is paired with every row of the other DataFrame that has the same id, which is where the 96 rows below come from (4×7 + 5×8 + 4×7):
In [4]:
df.merge(df1, on=['year','CalendarWeek','DayName','interval'], left_index=True, right_index=True)
Out[4]:
year CalendarWeek DayName interval counts NewCounts
id
1 2014 1 sun 10:30 3 2
1 2014 1 sun 10:30 3 4
1 2014 1 sun 10:30 3 5
1 2014 1 sun 10:30 3 6
1 2014 1 sun 10:30 3 3
1 2014 1 sun 10:30 3 1
1 2014 1 sun 10:30 3 2
1 2014 1 sun 11:30 4 2
1 2014 1 sun 11:30 4 4
1 2014 1 sun 11:30 4 5
1 2014 1 sun 11:30 4 6
1 2014 1 sun 11:30 4 3
1 2014 1 sun 11:30 4 1
1 2014 1 sun 11:30 4 2
1 2014 2 wed 12:00 5 2
1 2014 2 wed 12:00 5 4
1 2014 2 wed 12:00 5 5
1 2014 2 wed 12:00 5 6
1 2014 2 wed 12:00 5 3
1 2014 2 wed 12:00 5 1
1 2014 2 wed 12:00 5 2
1 2014 2 fri 9:00 2 2
1 2014 2 fri 9:00 2 4
1 2014 2 fri 9:00 2 5
1 2014 2 fri 9:00 2 6
1 2014 2 fri 9:00 2 3
1 2014 2 fri 9:00 2 1
1 2014 2 fri 9:00 2 2
2 2014 1 sun 13:00 3 3
2 2014 1 sun 13:00 3 4
.. ... ... ... ... ... ...
2 2014 2 fri 15:00 5 4
2 2014 2 fri 15:00 5 3
3 2014 1 thu 16:30 2 1
3 2014 1 thu 16:30 2 3
3 2014 1 thu 16:30 2 34
3 2014 1 thu 16:30 2 5
3 2014 1 thu 16:30 2 3
3 2014 1 thu 16:30 2 4
3 2014 1 thu 16:30 2 2
3 2014 1 thu 17:00 1 1
3 2014 1 thu 17:00 1 3
3 2014 1 thu 17:00 1 34
3 2014 1 thu 17:00 1 5
3 2014 1 thu 17:00 1 3
3 2014 1 thu 17:00 1 4
3 2014 1 thu 17:00 1 2
3 2014 2 sat 12:00 2 1
3 2014 2 sat 12:00 2 3
3 2014 2 sat 12:00 2 34
3 2014 2 sat 12:00 2 5
3 2014 2 sat 12:00 2 3
3 2014 2 sat 12:00 2 4
3 2014 2 sat 12:00 2 2
3 2014 2 sat 13:30 3 1
3 2014 2 sat 13:30 3 3
3 2014 2 sat 13:30 3 34
3 2014 2 sat 13:30 3 5
3 2014 2 sat 13:30 3 3
3 2014 2 sat 13:30 3 4
3 2014 2 sat 13:30 3 2
[96 rows x 6 columns]
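The row count can be reproduced with a small sketch that joins on the index alone. Only the per-id row counts from the question's tables are used here (4, 5, 4 rows in one DataFrame and 7, 8, 7 in the other); the column values are placeholders, and the names left and right are hypothetical. The on= list is left out on purpose, since recent pandas versions, as far as I know, raise a MergeError when on= is combined with left_index/right_index instead of returning the output above.

```python
import pandas as pd

# Placeholder frames: only the 'id' index matters for the row count.
left = pd.DataFrame(
    {"counts": range(13)},
    index=pd.Index([1] * 4 + [2] * 5 + [3] * 4, name="id"),
)
right = pd.DataFrame(
    {"NewCounts": range(22)},
    index=pd.Index([1] * 7 + [2] * 8 + [3] * 7, name="id"),
)

# Joining on a non-unique index pairs every left row with every right row
# that shares the same id: 4*7 + 5*8 + 4*7 = 96 rows.
joined = left.merge(right, left_index=True, right_index=True)
print(len(joined))
```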
So to reset the index just do df = df.reset_index(0), and likewise for the other DataFrame. After merging you can then set the index back to id:
merged = df.merge(df1, on=['id','year','CalendarWeek','DayName','interval'])
merged = merged.set_index('id')
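Put together, the reset, merge and re-index steps might look like the following sketch. The frames left and right are hypothetical miniatures that already have id as their index; the values are a hand-picked subset of the question's data.

```python
import pandas as pd

keys = ["id", "year", "CalendarWeek", "DayName", "interval"]

# Hypothetical miniature frames with 'id' already set as the index.
left = pd.DataFrame({
    "id": [1, 1, 2],
    "year": [2014, 2014, 2014],
    "CalendarWeek": [1, 1, 1],
    "DayName": ["sun", "sun", "sun"],
    "interval": ["10:30", "11:30", "13:00"],
    "counts": [3, 4, 3],
}).set_index("id")
right = pd.DataFrame({
    "id": [1, 1, 1, 2, 2],
    "year": [2014] * 5,
    "CalendarWeek": [1] * 5,
    "DayName": ["sun"] * 5,
    "interval": ["10:00", "10:30", "11:30", "12:30", "13:00"],
    "NewCounts": [2, 4, 5, 3, 4],
}).set_index("id")

merged = (
    left.reset_index()                    # turn 'id' back into a regular column
        .merge(right.reset_index(), on=keys)  # inner join on all five key columns
        .set_index("id")                  # restore 'id' as the index
)
print(merged)
```

Chaining the three calls avoids overwriting the original frames, which still keep id as their index afterwards.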