
Replacing specific values in a Pandas DataFrame based on the values of another column

I have a DataFrame similar to this:

Chr  Start_Position End_Position Type
1    10000          10001        SNP
5    45321          45327        INS
12   44700          44710        DEL
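
For anyone who wants to reproduce this, the table above can be built as follows. This is only a sketch: the integer dtypes and the default 0..2 index are assumptions, since the question shows only the printed values.

```python
import pandas as pd

# Sample data copied from the table above. The name `dataframe` matches
# the variable used in the question; dtypes and index are assumed.
dataframe = pd.DataFrame(
    {
        "Chr": [1, 5, 12],
        "Start_Position": [10000, 45321, 44700],
        "End_Position": [10001, 45327, 44710],
        "Type": ["SNP", "INS", "DEL"],
    }
)
print(dataframe)
```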

I need to change the values of some cells depending on what Type is:

  • SNP needs Start_Position + 1
  • INS needs End_Position + 1
  • DEL needs Start_Position + 1

My issue is that my current solutions are extremely verbose. What I've tried (dataframe is the original data source):

snp_records = dataframe.loc[dataframe["Type"] == "SNP", :]
del_records = dataframe.loc[dataframe["Type"] == "DEL", :]
ins_records = dataframe.loc[dataframe["Type"] == "INS", :]

snp_records.loc[:, "Start_Position"] = snp_records["Start_Position"].add(1)
del_records.loc[:, "Start_Position"] = del_records["Start_Position"].add(1)
ins_records.loc[:, "End_Position"] = ins_records["End_Position"].add(1)

dataframe.loc[snp_records.index, "Start_Position"] = snp_records["Start_Position"]
dataframe.loc[del_records.index, "Start_Position"] = del_records["Start_Position"]
dataframe.loc[ins_records.index, "End_Position"] = ins_records["End_Position"]

As I have to do this for more columns than in the example (the concept is the same, though), this becomes very long and verbose, and possibly error-prone due to all the duplicated lines (in fact, I made several mistakes just typing out the example).

This question is similar to mine, but there the values are predefined, while I need to derive them from the data itself.

You can just do:

df.loc[df['Type'].isin(['SNP','DEL']), 'Start_Position'] += 1
df.loc[df['Type'].eq('INS'), 'End_Position'] += 1

For a general solution, pass lists of types to Series.isin and use the resulting boolean masks with DataFrame.loc to update the values:

start = ['SNP','DEL']
end = ['INS']

df.loc[df['Type'].isin(start), 'Start_Position'] += 1
df.loc[df['Type'].isin(end), 'End_Position'] += 1
print(df)
   Chr  Start_Position  End_Position Type
0    1           10001         10001  SNP
1    5           45321         45328  INS
2   12           44701         44710  DEL
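
Since the question mentions more columns than the example shows, the same pattern can be driven by a mapping. The sketch below is one possible way to do that; the `rules` dictionary and its layout are hypothetical names introduced here, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Chr": [1, 5, 12],
        "Start_Position": [10000, 45321, 44700],
        "End_Position": [10001, 45327, 44710],
        "Type": ["SNP", "INS", "DEL"],
    }
)

# Hypothetical mapping: column to increment -> Type values that trigger it.
# Adding another column means adding one more entry here.
rules = {
    "Start_Position": ["SNP", "DEL"],
    "End_Position": ["INS"],
}

for column, types in rules.items():
    # Boolean mask selects the rows; only `column` is modified in place.
    df.loc[df["Type"].isin(types), column] += 1

print(df)
```

Each rule is applied independently, so a Type may appear under more than one column if it needs several adjustments.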

Another idea is to update both columns in a single assignment by adding a two-column boolean mask (True counts as 1, False as 0):

# one boolean column per target column, in the same order
m = pd.concat([df['Type'].isin(start), df['Type'].isin(end)], axis=1)
df[['Start_Position', 'End_Position']] += m.to_numpy()
print(df)
   Chr  Start_Position  End_Position Type
0    1           10001         10001  SNP
1    5           45321         45328  INS
2   12           44701         44710  DEL

Or, building the same mask with NumPy:

import numpy as np

# stack the two masks as rows, then transpose to get one column per target
m = np.vstack((df['Type'].isin(start), df['Type'].isin(end))).T
df[['Start_Position', 'End_Position']] += m
print(df)
   Chr  Start_Position  End_Position Type
0    1           10001         10001  SNP
1    5           45321         45328  INS
2   12           44701         44710  DEL
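
The two mask-based variants rely on booleans being treated as 0 and 1 in arithmetic. A minimal sketch that makes the conversion explicit is given below; it uses np.column_stack in place of np.vstack(...).T, and the `increments` name is introduced here purely for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Chr": [1, 5, 12],
        "Start_Position": [10000, 45321, 44700],
        "End_Position": [10001, 45327, 44710],
        "Type": ["SNP", "INS", "DEL"],
    }
)
start = ["SNP", "DEL"]
end = ["INS"]

# One boolean column per target column, in the same order as the targets.
m = np.column_stack((df["Type"].isin(start), df["Type"].isin(end)))

# Casting to int spells out the intent: +1 where True, +0 where False.
increments = m.astype(int)

df[["Start_Position", "End_Position"]] += increments
print(df)
```

The column order of the mask must match the column order on the left-hand side, because the addition is positional once the mask is a plain NumPy array.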

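Another option is np.where, which picks the incremented value where the condition is True and keeps the original value otherwise: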

import numpy as np

start = ['SNP','DEL']
end = ['INS']

# np.where(condition, value_if_true, value_if_false)
df['Start_Position'] = np.where(df['Type'].isin(start), df['Start_Position'] + 1, df['Start_Position'])
df['End_Position'] = np.where(df['Type'].isin(end), df['End_Position'] + 1, df['End_Position'])
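
As a quick sanity check, the np.where approach and the DataFrame.loc approach can be run side by side on copies of the sample data. This is only a sketch; the names `via_where` and `via_loc` are made up for the comparison.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Chr": [1, 5, 12],
        "Start_Position": [10000, 45321, 44700],
        "End_Position": [10001, 45327, 44710],
        "Type": ["SNP", "INS", "DEL"],
    }
)
start = ["SNP", "DEL"]
end = ["INS"]

# Variant 1: np.where rebuilds each column in full.
via_where = df.copy()
via_where["Start_Position"] = np.where(
    via_where["Type"].isin(start),
    via_where["Start_Position"] + 1,
    via_where["Start_Position"],
)
via_where["End_Position"] = np.where(
    via_where["Type"].isin(end),
    via_where["End_Position"] + 1,
    via_where["End_Position"],
)

# Variant 2: .loc with a boolean mask touches only the selected rows.
via_loc = df.copy()
via_loc.loc[via_loc["Type"].isin(start), "Start_Position"] += 1
via_loc.loc[via_loc["Type"].isin(end), "End_Position"] += 1

# Compare the resulting values column by column.
same = (via_where[["Start_Position", "End_Position"]].to_numpy()
        == via_loc[["Start_Position", "End_Position"]].to_numpy()).all()
print(same)
```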


 