Compare Same and Different in Two Columns of Dataframe
I have a small dataframe, like this:
import pandas as pd
import numpy as np
# data's stored in dictionary
details = {
'address_id': [1, 1, 1, 2, 2],
'business': ['verizon', 'verizon', 'comcast', 'sprint', 'att']
}
df = pd.DataFrame(details)
print(df)
I am trying to find out if, and when, a person switched to a different cell phone service.
I tried this logic, but it didn't work:
df['new'] = df.Column1.isin(df.Column1) & df[~df.Column2.isin(df.Column2)]
Basically, in index rows 0 and 1 the address_id is the same and the business is the same, but in index row 2 the business changed from verizon to comcast. Likewise, in index rows 3 and 4 the address_id is the same, but the business changed from sprint to att in index row 4. I'd like to add a new column to the dataframe to flag these changes. How can I do that?
UPDATE: Here is an even simpler way than my original answer using join() (see below) to do what your question asks:
df['new'] = df.address_id.map(df.groupby('address_id').first().business) != df.business
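The one-liner above can be unpacked into its intermediate steps. This is only a sketch that rebuilds the question's dataframe; the variable names first_business and first_per_row are introduced here for illustration and are not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({
    'address_id': [1, 1, 1, 2, 2],
    'business': ['verizon', 'verizon', 'comcast', 'sprint', 'att'],
})

# Step 1: first business seen for each address_id
# (a Series indexed by address_id)
first_business = df.groupby('address_id')['business'].first()

# Step 2: look up that first business for every row of df
first_per_row = df['address_id'].map(first_business)

# Step 3: flag rows whose business differs from the first one
df['new'] = first_per_row != df['business']
print(df)
```

Keeping the intermediate Series around makes it easy to inspect what each address_id is being compared against.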
Explanation:
- Use groupby() and first() to create a dataframe whose business column contains the first one encountered for each address_id.
- Use Series.map() to transform the original dataframe's address_id column into this first business value.
- Add column new, which is True only if this mapped first business differs from the original business column.

Here is a simple way to do what you've asked using groupby() and join():
df = df.join(df.groupby('address_id').first(), on='address_id', rsuffix='_first')
df = df.assign(new=df.business != df.business_first).drop(columns='business_first')
Output:
address_id business new
0 1 verizon False
1 1 verizon False
2 1 comcast True
3 2 sprint False
4 2 att True
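Note that this approach compares each row with the first business recorded for its address_id, not with the row immediately before it. The sketch below uses hypothetical data (the switch-back sequence is an assumption, not from the question) and uses transform('first') as a compact stand-in for the join, to show how the flag behaves when a customer stays with the new provider or switches back:

```python
import pandas as pd

# Hypothetical data: address 1 switches to comcast, stays, then switches back
df = pd.DataFrame({
    'address_id': [1, 1, 1, 1],
    'business': ['verizon', 'comcast', 'comcast', 'verizon'],
})

# First business per address_id, broadcast back to every row
first = df.groupby('address_id')['business'].transform('first')

# True wherever the row's business is not the first one for that address
df['differs_from_first'] = df['business'] != first
print(df)
```

Here both comcast rows are flagged, and the final switch back to verizon is not. If the goal is to mark only the row where a change happens, comparing against the previous row within each group is the closer fit.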
Explanation:
- Use groupby() and first() to create a dataframe whose business column contains the first one encountered for each address_id.
- Use join() to add a column business_first to df containing the corresponding first business for each address_id.
- Use assign() to add a column new containing a boolean indicating whether the row contains a new business with an existing address_id.
- Use drop() to eliminate the business_first column.

First, groupby on address_id.
groups = df.groupby("address_id")
Then, iterate over the groups, and find where the value of business changes:
for address_id, grp_data in groups:
    changed = grp_data['business'].ne(grp_data['business'].shift().bfill())
    df.loc[grp_data.index, "changed"] = changed
.shift().bfill() shifts all data one index over (0 -> 1, 1 -> 2, and so on), which leaves NaN in the first row, and then backfills that first value. For example, shown here on the whole column rather than per group:
>>> df["business"]
0 verizon
1 verizon
2 comcast
3 sprint
4 att
Name: business, dtype: object
>>> df["business"].shift()
0 NaN
1 verizon
2 verizon
3 comcast
4 sprint
Name: business, dtype: object
>>> df["business"].shift().bfill()
0 verizon
1 verizon
2 verizon
3 comcast
4 sprint
Name: business, dtype: object
Running the loop makes the following dataframe:
address_id business changed
0 1 verizon False
1 1 verizon False
2 1 comcast True
3 2 sprint False
4 2 att True
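The loop above can also be written without explicit iteration. The following is a minimal sketch, assuming the same dataframe as in the question; it swaps the per-group loop for GroupBy.shift(), and the name prev is introduced here for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'address_id': [1, 1, 1, 2, 2],
    'business': ['verizon', 'verizon', 'comcast', 'sprint', 'att'],
})

# Previous business within the same address_id
# (NaN on the first row of each group)
prev = df.groupby('address_id')['business'].shift()

# A change is a row that has a previous value and differs from it
df['changed'] = prev.notna() & df['business'].ne(prev)
print(df)
```

Because the shift happens inside each group, the first row of an address_id is never compared with the last row of the previous one, and the resulting column has a plain boolean dtype.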