
Pandas DataFrame: round multiple columns but enforce that they sum to one

Let's say I have the following data frame, where columns A, B, and C are weights and the three must sum to one in every row:

import pandas as pd
df = pd.DataFrame(data=[[0.56, 0.36, 0.08], [0.42, 0.13, 0.45]], columns=['A', 'B', 'C'])

If we apply df.sum(axis=1), we see that each row does add up to one. My goal is to keep the same set of columns, but rounded to a single decimal place (I need to bin my weights into 10% buckets). The problem appears when we do this:

df.round(1).sum(axis=1)

We find that the first row sums to 1.1 (0.6 + 0.4 + 0.1), and the second row to 0.9 (0.4 + 0.1 + 0.4). Is there a way in pandas to round while enforcing a "sums to 1" constraint on a number of columns?

No, not as a built-in. There are various algorithms you can use to do this job, but they require detailed processing (i.e., working through each row).

Perhaps the simplest is what we used to call "truncate-allocate". Split each element of the row at the rounding point, keeping the truncated amount and the leftover (the part you use for rounding). For instance, your first row above would leave us:

trunc = [0.50, 0.30, 0.00]
alloc = [0.06, 0.06, 0.08]

Now, observe that sum(trunc) is 0.8, so there are 2 units of 0.1 left to allocate. Find the two largest elements of alloc: these are the last one (0.08) and either of the other two, which tie at 0.06 (in practice the tie is likely decided by the last bit of the binary representation). Add one unit to each of the corresponding elements of trunc:

trunc = [0.6, 0.3, 0.1]

Now it sums to 1.
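The steps above can be sketched in pandas/NumPy as follows. This is only one possible implementation, not a built-in pandas feature: the function name round_preserving_sum and its parameters decimals and total are made up for illustration, and it assumes non-negative weights whose rows already sum to total (up to floating-point error). To avoid the binary-representation tie mentioned above, it works in whole units of 0.1 and rounds away floating-point noise at 9 decimals (an arbitrary tolerance), so ties go to the leftmost column.

```python
import numpy as np
import pandas as pd


def round_preserving_sum(df, decimals=1, total=1.0):
    """Truncate-allocate rounding (hypothetical helper, not part of pandas)."""
    scale = 10 ** decimals
    # Express every weight in units of 10**-decimals; the 9-decimal round
    # strips float noise such as 0.56 * 10 == 5.6000000000000005.
    scaled = np.round(df.to_numpy(dtype=float) * scale, 9)
    trunc = np.floor(scaled)                  # truncated part, in whole units
    alloc = np.round(scaled - trunc, 9)       # leftover used to rank columns
    # Number of units each row still needs to reach the target sum.
    units = int(round(total * scale)) - trunc.sum(axis=1).astype(int)
    if (units < 0).any() or (units > df.shape[1]).any():
        # Only catches rows whose sum is far from `total`.
        raise ValueError("each row must sum to `total`")
    # Rank columns within each row by leftover, largest first; the stable
    # sort breaks ties in favour of the leftmost column.
    order = np.argsort(-alloc, axis=1, kind="stable")
    ranks = np.argsort(order, axis=1, kind="stable")
    bump = ranks < units[:, None]             # True where one unit is added
    return pd.DataFrame((trunc + bump) / scale,
                        index=df.index, columns=df.columns)


df = pd.DataFrame(data=[[0.56, 0.36, 0.08], [0.42, 0.13, 0.45]],
                  columns=['A', 'B', 'C'])
rounded = round_preserving_sum(df)
print(rounded)
# Rows become [0.6, 0.3, 0.1] and [0.4, 0.1, 0.5]
```

Because the results are floats, compare row sums with a tolerance (for example np.isclose(rounded.sum(axis=1), 1.0)) rather than with ==, since 0.6 + 0.3 + 0.1 is not exactly 1.0 in binary floating point.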

Can you work with that? Is it simple enough to solve your problem? I know it's not a built-in function, but it's easy enough to understand, implement, and maintain.
