Replacing specific values in a Pandas DataFrame based on the values of another column
I have a DataFrame similar to this:
Chr Start_Position End_Position Type
1 10000 10001 SNP
5 45321 45327 INS
12 44700 44710 DEL
I need to change the values of some cells depending on what Type is:
SNP needs Start_Position + 1
INS needs End_Position + 1
DEL needs Start_Position + 1
My issue is that my current solutions are extremely verbose. What I've tried (dataframe is the original data source):
snp_records = dataframe.loc[dataframe["Type"] == "SNP", :]
del_records = dataframe.loc[dataframe["Type"] == "DEL", :]
ins_records = dataframe.loc[dataframe["Type"] == "INS", :]
snp_records.loc[:, "Start_Position"] = snp_records["Start_Position"].add(1)
del_records.loc[:, "Start_Position"] = del_records["Start_Position"].add(1)
ins_records.loc[:, "End_Position"] = ins_records["End_Position"].add(1)
dataframe.loc[snp_records.index, "Start_Position"] = snp_records["Start_Position"]
dataframe.loc[del_records.index, "Start_Position"] = del_records["Start_Position"]
dataframe.loc[ins_records.index, "End_Position"] = ins_records["End_Position"]
Because I have to do this for more columns than in the example (the concept is the same, though), this becomes very long and verbose, and possibly error-prone because of all the duplicated lines (in fact, I made several mistakes just typing out the example).
This question is similar to mine, but there the values are predefined, while I need to get them from the data themselves.
You can just do:
df.loc[df['Type'].isin(['SNP','DEL']), 'Start_Position'] += 1
df.loc[df['Type'].eq('INS'), 'End_Position'] += 1
For a general solution, you can pass lists to Series.isin and use the resulting boolean masks in DataFrame.loc to set the values:
start = ['SNP','DEL']
end = ['INS']
df.loc[df['Type'].isin(start), 'Start_Position'] += 1
df.loc[df['Type'].isin(end), 'End_Position'] += 1
print (df)
Chr Start_Position End_Position Type
0 1 10001 10001 SNP
1 5 45321 45328 INS
2 12 44701 44710 DEL
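Since the question mentions needing this for more columns than the example shows, the same mask-and-increment pattern could be driven from a lookup table. The following is only a sketch: the `rules` dictionary is a hypothetical name introduced here, not part of the original answer, and it assumes every rule is a plain "+ 1" on one column.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Chr": [1, 5, 12],
    "Start_Position": [10000, 45321, 44700],
    "End_Position": [10001, 45327, 44710],
    "Type": ["SNP", "INS", "DEL"],
})

# Hypothetical rule table: column name -> Type values whose rows get +1.
rules = {
    "Start_Position": ["SNP", "DEL"],
    "End_Position": ["INS"],
}

for column, types in rules.items():
    # Boolean mask selects the rows; .loc updates only those rows in place.
    df.loc[df["Type"].isin(types), column] += 1

print(df)
```

Adding another column then means adding one entry to `rules` rather than another pair of near-identical lines.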
Another idea is to build one mask per column and update both columns in a single assignment:
m = pd.concat([df['Type'].isin(start), df['Type'].isin(end)], axis=1)
df[[ 'Start_Position', 'End_Position']] += m.to_numpy()
print (df)
Chr Start_Position End_Position Type
0 1 10001 10001 SNP
1 5 45321 45328 INS
2 12 44701 44710 DEL
Or:
m = np.vstack((df['Type'].isin(start), df['Type'].isin(end))).T
df[[ 'Start_Position', 'End_Position']] += m
print (df)
Chr Start_Position End_Position Type
0 1 10001 10001 SNP
1 5 45321 45328 INS
2 12 44701 44710 DEL
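The two-column variants above work because a boolean behaves as 0 or 1 in arithmetic, so adding the mask adds 1 exactly where it is True. A minimal self-contained sketch of that idea follows; the explicit `astype(int)` step and the names `mask` and `increment` are added here for illustration and are not in the original answer.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Chr": [1, 5, 12],
    "Start_Position": [10000, 45321, 44700],
    "End_Position": [10001, 45327, 44710],
    "Type": ["SNP", "INS", "DEL"],
})
start = ["SNP", "DEL"]
end = ["INS"]

# One boolean column per target column, in the same order as the
# column list used in the assignment below.
mask = np.column_stack((
    df["Type"].isin(start).to_numpy(),
    df["Type"].isin(end).to_numpy(),
))

# Make the 0/1 increments explicit: True -> 1, False -> 0.
increment = mask.astype(int)

df[["Start_Position", "End_Position"]] += increment
print(df)
```

The order of the masks has to match the order of the column names; swapping one without the other would silently increment the wrong column.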
Try with np.where:
start = ['SNP','DEL']
end = ['INS']
df['Start_Position'] = np.where(df['Type'].isin(start),df['Start_Position']+1,df['Start_Position'])
df['End_Position'] = np.where(df['Type'].isin(end ),df['End_Position']+1,df['End_Position'])
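For reference, here is a self-contained version of the np.where approach with the imports and the question's sample data filled in. It also compares the result against the .loc approach; the `expected` frame exists only for that comparison and is an addition for illustration.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Chr": [1, 5, 12],
    "Start_Position": [10000, 45321, 44700],
    "End_Position": [10001, 45327, 44710],
    "Type": ["SNP", "INS", "DEL"],
})
start = ["SNP", "DEL"]
end = ["INS"]

# Reference result built with boolean masks and .loc.
expected = df.copy()
expected.loc[expected["Type"].isin(start), "Start_Position"] += 1
expected.loc[expected["Type"].isin(end), "End_Position"] += 1

# np.where(condition, value_if_true, value_if_false) builds a complete
# new column, which then replaces the old one.
df["Start_Position"] = np.where(
    df["Type"].isin(start), df["Start_Position"] + 1, df["Start_Position"]
)
df["End_Position"] = np.where(
    df["Type"].isin(end), df["End_Position"] + 1, df["End_Position"]
)

# Both approaches should produce the same values.
same_values = (
    df["Start_Position"].tolist() == expected["Start_Position"].tolist()
    and df["End_Position"].tolist() == expected["End_Position"].tolist()
)
print(same_values)
```

The practical difference is that np.where recomputes and reassigns the whole column, while .loc with a mask touches only the selected rows.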