Replace value from a column based on condition of another column, Pandas
Starting DataFrame
import numpy as np
import pandas as pd

df = pd.DataFrame({'Column A': ['red', 'green', 'yellow', 'orange', 'red', 'blue'],
                   'Column B': [np.nan, 'blue', 'purple', np.nan, np.nan, np.nan],
                   'Column C': [1, 2, 3, 2, 3, 7]})
Column A | Column B | Column C
---|---|---
'red' | NaN | 1
'green' | 'blue' | 2
'yellow' | 'purple' | 3
'orange' | NaN | 2
'red' | NaN | 3
'blue' | NaN | 7
Desired Result
Column A | Column B | Column C
---|---|---
'red' | NaN | 1
'blue' | 'blue' | 2
'purple' | 'purple' | 3
'orange' | NaN | 2
'red' | NaN | 3
'blue' | NaN | 7
I want to replace values in Column A only if the value in Column B is not NaN, and to replace Column A with the value in Column B.
So that I can run the following code:
df[['Column A', 'Column C']].groupby('Column A').sum()
Which would result in the following DataFrame:
Column A | Column C
---|---
'red' | 4
'blue' | 9
'purple' | 3
'orange' | 2
I am trying to replace categories before doing a groupby call.
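As a quick sanity check on the target, here is a minimal sketch that types in the Desired Result table by hand (the name `desired` is only for illustration) and runs the groupby on it. One thing worth knowing: `groupby` sorts the group keys by default, so `sort=False` is passed here to get the groups in order of first appearance, as in the table above.

```python
import numpy as np
import pandas as pd

# The "Desired Result" table from the question, typed in by hand.
desired = pd.DataFrame({
    'Column A': ['red', 'blue', 'purple', 'orange', 'red', 'blue'],
    'Column B': [np.nan, 'blue', 'purple', np.nan, np.nan, np.nan],
    'Column C': [1, 2, 3, 2, 3, 7],
})

# sort=False keeps groups in order of first appearance instead of
# sorting the keys alphabetically (the default behaviour).
totals = desired[['Column A', 'Column C']].groupby('Column A', sort=False).sum()
print(totals)
```

Without `sort=False` the sums are the same, only the row order changes to alphabetical.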
Attempts:
The DataFrame I am working with has a sequential numeric index going from 0 to N.
So I could hard code the following:
df.iloc[[index], column] = some_string
I do not want to do this as it is not dynamic and the DataFrame data could change.
I believe I could use .agg() or .apply() on either the df or the df.groupby(), but this is where I have struggled.
Particularly with how to write a function to use with .agg() or .apply().
Say:
def my_func(x):
    print(x)
Then:
df.apply(my_func)
The result is the first column of df printed.
Or:
df.apply(my_func, axis = 1)
The result is the following format for each row:
Column A    red
Column B    NaN
Column C      1
Name: 0, dtype: object
Column A    green
Column B     blue
Column C        2
Name: 1, dtype: object
I am not sure how to access each column per row in my_func.
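For what it's worth, with `axis=1` the argument passed to the function is a Series indexed by the column names, so each column can be read with `row['Column A']` and so on. A minimal sketch, assuming the starting DataFrame from the question (the function name `pick_category` is made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Column A': ['red', 'green', 'yellow', 'orange', 'red', 'blue'],
    'Column B': [np.nan, 'blue', 'purple', np.nan, np.nan, np.nan],
    'Column C': [1, 2, 3, 2, 3, 7],
})

def pick_category(row):
    # With axis=1, `row` is a Series whose index holds the column names,
    # so individual cells are read by column label.
    if pd.notna(row['Column B']):
        return row['Column B']
    return row['Column A']

# One return value per row, collected into a new Series.
new_a = df.apply(pick_category, axis=1)
print(new_a.tolist())
```

The returned Series can then be assigned back with `df['Column A'] = new_a`.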
Edit:
I am trying to find a way to change the value in Column A if the value, for that row, in Column B is not NaN. The value to use for replacing is the value in Column B; the value to be replaced is the value in Column A, and only when Column B is not NaN.
But I want to do this dynamically, meaning not hardcoded as I showed with:
df.iloc[[index], column] = some_string
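One way to make that kind of assignment dynamic, sketched here under the assumption that the data looks like the starting DataFrame above, is to let a boolean mask choose the rows instead of hardcoded positions. `Series.notna()` builds the mask and `.loc` applies it on both sides of the assignment:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Column A': ['red', 'green', 'yellow', 'orange', 'red', 'blue'],
    'Column B': [np.nan, 'blue', 'purple', np.nan, np.nan, np.nan],
    'Column C': [1, 2, 3, 2, 3, 7],
})

# True for the rows where Column B holds a real value.
mask = df['Column B'].notna()

# Overwrite Column A only on those rows; the other rows are left untouched.
df.loc[mask, 'Column A'] = df.loc[mask, 'Column B']
print(df)
```

The mask is recomputed from the data each time, so nothing needs to change if rows are added or reordered.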
As you mentioned, you could use DataFrame.apply like this:
df['Column A'] = df.apply(lambda x: x['Column B'] if pd.notna(x['Column B']) else x['Column A'], axis=1)
  Column A Column B  Column C
0      red      NaN         1
1     blue     blue         2
2   purple   purple         3
3   orange      NaN         2
4      red      NaN         3
5     blue      NaN         7
Note that apply with axis=1 runs a Python function once per row, so it is slow and not advisable for very large datasets. There are some good answers out there for alternative methods.
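As a rough sketch of what such alternatives might look like (these are swapped-in techniques, not the code from this answer), the same replacement can be written with column-wise operations. The variable names below are only for illustration, and the starting DataFrame from the question is assumed:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Column A': ['red', 'green', 'yellow', 'orange', 'red', 'blue'],
    'Column B': [np.nan, 'blue', 'purple', np.nan, np.nan, np.nan],
    'Column C': [1, 2, 3, 2, 3, 7],
})

# Option 1: take Column B and fill its gaps from Column A.
via_fillna = df['Column B'].fillna(df['Column A'])

# Option 2: keep Column A where B is missing, otherwise take B.
via_where = df['Column A'].where(df['Column B'].isna(), df['Column B'])

# Option 3: numpy.where(condition, value_if_true, value_if_false).
via_numpy = np.where(df['Column B'].notna(), df['Column B'], df['Column A'])

# Any of the three can be assigned back before the groupby.
df['Column A'] = via_fillna
print(df[['Column A', 'Column C']].groupby('Column A').sum())
```

All three give the same values on this data; `fillna` is probably the shortest to read, while `np.where` returns a plain NumPy array rather than a Series.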