I have a large dataframe that looks like this:
Start        End           Alm_No1  Val_No1  Alm_No2  Val_No2  Alm_No3  Val_No3
1/1/19 0:00  1/2/19 0:00   1        0        2        1        3        0
1/2/19 0:00  1/3/19 0:00   1        0        2        0        3        1
1/3/19 0:00  1/4/19 0:00   1        1        2        0        3        0
1/4/19 0:00  1/5/19 0:00   1        0        2        0        3        1
1/5/19 0:00  1/6/19 0:00   1        1        2        0        3        0
1/6/19 0:00  1/7/19 0:00   1        0        2        1        3        1
1/7/19 0:00  1/8/19 0:00   4        0        5        1        6        0
1/8/19 0:00  1/9/19 0:00   4        0        5        1        6        1
1/9/19 0:00  1/10/19 0:00  4        1        5        1        6        0
I want to update all values in columns "Val" with the number from the associated "Alm" column if the value is 1 so that I can get rid of the "Alm" columns.
The outcome would look like this:
Start        End           Alm_No1  Val_No1  Alm_No2  Val_No2  Alm_No3  Val_No3
1/1/19 0:00  1/2/19 0:00   1        0        2        2        3        0
1/2/19 0:00  1/3/19 0:00   1        0        2        0        3        3
1/3/19 0:00  1/4/19 0:00   1        1        2        0        3        0
1/4/19 0:00  1/5/19 0:00   1        0        2        0        3        3
1/5/19 0:00  1/6/19 0:00   1        1        2        0        3        0
1/6/19 0:00  1/7/19 0:00   1        0        2        2        3        3
1/7/19 0:00  1/8/19 0:00   4        0        5        5        6        0
1/8/19 0:00  1/9/19 0:00   4        0        5        5        6        6
1/9/19 0:00  1/10/19 0:00  4        4        5        5        6        0
I have created a list of the positions of the columns whose values should be changed:
val_col = df.columns.tolist()
val_list = []
for i in range(0, len(val_col)):
    if val_col[i].startswith('Val'):
        val_list.append(i)
Then I tried writing a while loop to iterate over the rows of each of those columns:
for x in val_list:
    i = 0
    while i < len(df):
        if df.iloc[i, x] == 1:
            df.iloc[i, x] = df.iloc[i, x-1]
        i += 1
It takes forever to run, and I am having a hard time finding something that works with lambda or apply. Any hints? Thanks in advance!
Never loop over the rows of a dataframe. You should set columns all in one operation.
for i in range(1, 4):
    df[f'Val_No{i}'] *= df[f'Alm_No{i}']
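The multiplication above relies on every Val column holding only 0 or 1, and on there being exactly three pairs. As a rough sketch of the same idea for any number of pairs, the column names can be discovered instead of hard-coded. The sample frame below uses three rows taken from the question, and the names `val_cols` and `alm_cols` are only illustrative.

```python
import pandas as pd

# Three rows copied from the question's table, for illustration only.
df = pd.DataFrame({
    "Start": ["1/1/19 0:00", "1/7/19 0:00", "1/9/19 0:00"],
    "End": ["1/2/19 0:00", "1/8/19 0:00", "1/10/19 0:00"],
    "Alm_No1": [1, 4, 4], "Val_No1": [0, 0, 1],
    "Alm_No2": [2, 5, 5], "Val_No2": [1, 1, 1],
    "Alm_No3": [3, 6, 6], "Val_No3": [0, 0, 0],
})

# Discover the Val_* columns instead of hard-coding range(1, 4).
# Assumption: every Val_NoK column has a matching Alm_NoK column.
val_cols = [c for c in df.columns if c.startswith("Val_")]
alm_cols = [c.replace("Val_", "Alm_", 1) for c in val_cols]

# Multiply all pairs in one operation. Converting to NumPy arrays avoids
# label alignment between the differently named Val_* and Alm_* columns.
df[val_cols] = df[val_cols].to_numpy() * df[alm_cols].to_numpy()

# The Alm_* columns are no longer needed once the values are merged.
df = df.drop(columns=alm_cols)
print(df)
```

If a Val column can hold values other than 0 and 1, multiplication would scale them rather than replace them, so a mask-based assignment is the safer choice in that case.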
I feel silly answering my own question just a few minutes later, but I think I found something that works:
for x in val_list:
    df.loc[df.iloc[:, x] == 1, df.columns[x]] = df.iloc[:, x-1]
Worked like a charm!
234 ms ± 15.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
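The boolean-mask assignment above depends on each Alm column sitting immediately to the left of its Val column. A possible variant, sketched here under the assumption that the columns are named Val_NoK and Alm_NoK, pairs them by name and uses `Series.mask` to replace only the entries equal to 1. The small frame is made up for illustration.

```python
import pandas as pd

# Illustrative data only; column order no longer matters in this variant.
df = pd.DataFrame({
    "Alm_No1": [1, 1, 4], "Val_No1": [0, 1, 1],
    "Alm_No2": [2, 2, 5], "Val_No2": [1, 0, 1],
})

for val in [c for c in df.columns if c.startswith("Val_")]:
    # Assumption: the partner column has the same suffix with an Alm_ prefix.
    alm = val.replace("Val_", "Alm_", 1)
    # Where the condition is True, take the value from the Alm column;
    # everywhere else the original Val entry is kept.
    df[val] = df[val].mask(df[val] == 1, df[alm])

print(df)
```

This still loops over columns, but each iteration is a single vectorized operation over all rows, which is where the time goes in the original while loop.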
I came up with a solution that works for an arbitrary number of Alm_No... / Val_No... columns.
Let's start with a function to be applied to each row:
def fn(row):
    for i in range(2, row.size, 2):
        j = i + 1
        if row.iloc[j]:
            row.iloc[j] = row.iloc[i]
    return row
Note the construction of the for loop: it starts from 2 (the position of the Alm_No1 column) and advances in steps of 2 (the distance to the Alm_No2 column).
j holds the position of the next column (Val_No...).
If the "current" Val_No is nonzero, it is replaced with the value from the "current" Alm_No.
After the loop completes the changed row is returned.
So the only thing to do is to apply this function to each row:
df = df.apply(fn, axis=1)
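The same positional layout (Alm columns at positions 2, 4, 6, ... and Val columns at 3, 5, 7, ...) can also be handled without calling a function per row. The following is only a sketch of that alternative, not the method described above: it slices the column index by position and uses `numpy.where` on whole arrays. The two sample rows are taken from the question.

```python
import numpy as np
import pandas as pd

# Two rows copied from the question's table, for illustration only.
df = pd.DataFrame({
    "Start": ["1/1/19 0:00", "1/9/19 0:00"],
    "End": ["1/2/19 0:00", "1/10/19 0:00"],
    "Alm_No1": [1, 4], "Val_No1": [0, 1],
    "Alm_No2": [2, 5], "Val_No2": [1, 1],
    "Alm_No3": [3, 6], "Val_No3": [0, 0],
})

# Assumption: the first two columns are Start and End, followed by
# alternating Alm/Val columns, exactly as in the question.
alm_cols = df.columns[2::2]  # positions 2, 4, 6, ...
val_cols = df.columns[3::2]  # positions 3, 5, 7, ...

alm = df[alm_cols].to_numpy()
val = df[val_cols].to_numpy()

# Where Val is nonzero take the Alm value, otherwise keep Val (which is 0).
df[val_cols] = np.where(val != 0, alm, val)
print(df)
```

Because the whole block of columns is processed as one array, the cost no longer grows with a Python-level call per row.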
My timeit measurements indicated that my solution runs slightly (7%) faster than yours and about 35 times faster than the one proposed by BallpointBen.
Apparently, the use of f-strings accounts for part of this (quite significant) difference.
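Timings like these depend heavily on the number of rows, the pandas version, and the machine, so they are worth re-measuring locally. A minimal sketch of such a comparison is below. The frame size, the random data, and the helper names `by_multiply` and `by_row_apply` are all assumptions made for illustration, and the frame has no Start/End columns, so the row loop starts at position 0 rather than 2.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 1_000  # hypothetical size; increase it to resemble a "large" frame

# Synthetic data: constant Alm values and random 0/1 Val values.
base = pd.DataFrame({
    "Alm_No1": 1, "Val_No1": rng.integers(0, 2, n_rows),
    "Alm_No2": 2, "Val_No2": rng.integers(0, 2, n_rows),
    "Alm_No3": 3, "Val_No3": rng.integers(0, 2, n_rows),
})


def by_multiply(frame):
    """Column-wise multiplication, as in the f-string loop."""
    out = frame.copy()
    for i in range(1, 4):
        out[f"Val_No{i}"] *= out[f"Alm_No{i}"]
    return out


def by_row_apply(frame):
    """Row-wise apply, as in the fn(row) approach."""
    def fn(row):
        row = row.copy()  # work on a copy so the input frame is untouched
        for i in range(0, row.size, 2):
            if row.iloc[i + 1]:
                row.iloc[i + 1] = row.iloc[i]
        return row
    return frame.apply(fn, axis=1)


# Average seconds per call over a few repetitions.
for func in (by_multiply, by_row_apply):
    seconds = timeit.timeit(lambda: func(base), number=3) / 3
    print(f"{func.__name__}: {seconds * 1000:.2f} ms per call")
```

Both helpers return a new frame and leave `base` unchanged, so each repetition starts from the same input.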