
Pandas dataframe: How to set values after an index to 0

I have a Pandas dataframe in which each row contains a name followed by many numeric columns. For each row, there is a specific column index (calculated separately for every row) after which I want to set all the remaining values in that row to 0.

I tried a few things and ended up with the following working code:

for i in range(n):
    index = np.where(df.columns == df['match_this_value'][i])[0].item()
    df.iloc[i, index] = df['take_this_value'][i].day 
    df.iloc[i, (index+1):] = 0

However, this is slow because my dataset is very large. The runtime is about 70 seconds on my sample dataset alone, and my full dataset is much larger. Is there a faster way to do this, ideally one that avoids looping through each row?


EDIT: Sorry, I should have specified how the index is calculated. For each row, it is found with np.where by comparing the dataframe's column labels against the value in one specific column and taking the position of the match, something like:

index = np.where(df.columns == df['match_this_value'][i])[0].item()

Once I have this index, I set the value at that column to the day of another column in the df (take_this_value) and then zero everything after it, which gives the loop shown at the top of the question.

You could do:


import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(4, 4), columns=list('ABCD'))

#           A         B         C         D
# 0  0.750017  0.582230  1.411253 -0.379428
# 1 -0.747129  1.800677 -1.243459 -0.098760
# 2 -0.742997 -0.035036  1.012052 -0.767602
# 3 -0.694679  1.013968 -1.000412  0.752191

indexes = np.random.choice(range(df.shape[1]), df.shape[0])
# array([0, 3, 1, 1])
df_indexes = np.tile(range(df.shape[1]), (df.shape[0], 1))
df[df_indexes>indexes[:, None]] = 0
print(df) 
#           A         B         C        D
# 0  0.750017  0.000000  0.000000  0.00000
# 1 -0.747129  1.800677 -1.243459 -0.09876
# 2 -0.742997 -0.035036  0.000000  0.00000
# 3 -0.694679  1.013968  0.000000  0.00000

Here the boolean mask df_indexes > indexes[:, None] is True wherever a column position lies after that row's index, so only those cells are set to 0. Replace indexes with your own per-row "specific indexes".
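To connect this to the question, here is a minimal sketch of how the per-row index and the assignment from take_this_value might be vectorized as well. The sample frame, the data_cols list and the cut array are hypothetical names introduced for illustration. It assumes the data columns have unique labels, that every value in match_this_value matches one of them, and that the helper columns are not among the columns being zeroed.

```python
import numpy as np
import pandas as pd

# Hypothetical sample frame: two helper columns followed by the data columns.
df = pd.DataFrame({
    "match_this_value": ["b", "a", "c"],
    "take_this_value": pd.to_datetime(["2021-01-05", "2021-02-10", "2021-03-15"]),
    "a": [1.0, 2.0, 3.0],
    "b": [4.0, 5.0, 6.0],
    "c": [7.0, 8.0, 9.0],
})
data_cols = ["a", "b", "c"]  # assumption: the columns to be modified

# Position of each row's matching label within data_cols (-1 means no match).
cut = pd.Index(data_cols).get_indexer(df["match_this_value"])
assert (cut >= 0).all(), "every row is assumed to have a matching column"

values = df[data_cols].to_numpy(dtype=float)
rows = np.arange(len(df))

# Write the day of the month at each row's matching position.
values[rows, cut] = df["take_this_value"].dt.day.to_numpy()

# Zero everything strictly to the right of the matching position.
values[np.arange(len(data_cols)) > cut[:, None]] = 0

df[data_cols] = values
print(df)
```

Working on a plain NumPy array and writing it back once avoids repeated df.iloc assignments, which is usually where a row-wise loop spends most of its time.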

Consider the following approach:

import numpy as np
import pandas as pd

# dataframe size
R, C = 10_000_000, 10

# sample data
df = pd.DataFrame(
    np.random.random((R, C)),
    columns=['name', *(f'c_{idx}' for idx in range(C - 1))])

# calculating specific index
cut_column = np.random.randint(1, C, (R,))

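# handling data column by column: values with idx < cut_column are kept,
# the rest become 0 (use >= instead of > to also keep the value at cut_column)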
for idx, col in enumerate(df.columns[1:], 1):
    df[col] = np.where(cut_column > idx, df[col], 0)

Running time is on the order of seconds for 10 million rows on my machine.
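As a quick sanity check, the column-by-column loop can be compared against a single broadcast mask on a small frame. This is only a sketch: the small sizes, the seeded generator rng and the names by_column, keep and by_mask are assumptions made for illustration, and it makes no claim about which variant is faster on real data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
R, C = 6, 5  # small, hypothetical size for a quick check

df = pd.DataFrame(
    rng.random((R, C)),
    columns=['name', *(f'c_{idx}' for idx in range(C - 1))])
cut_column = rng.integers(1, C, size=R)

# Column-by-column version, as in the answer above.
by_column = df.copy()
for idx, col in enumerate(by_column.columns[1:], 1):
    by_column[col] = np.where(cut_column > idx, by_column[col], 0)

# Single broadcast mask: True where the value is kept.
keep = np.arange(C) < cut_column[:, None]
by_mask = df.where(keep, 0)

print(np.array_equal(by_column.to_numpy(), by_mask.to_numpy()))
```

Both variants apply the same rule (keep positions left of cut_column, zero the rest), so either can be used depending on which is easier to read or fits in memory.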
