
Pandas Subtracting between two Data Frames

DFOne

ID   NumberValueCol1
1    10
2    11
3    20
4    13
5    15

DFTwo

ID   NumberValueCol2
1    5
2    7
3    9
4    6
5    3

For each value in DFOne.NumberValueCol1, I need to subtract the values in DFTwo.NumberValueCol2 from it and keep the pairing with the smallest difference.

The first iteration would subtract every value in DFTwo.NumberValueCol2 from DFOne.NumberValueCol1 value 10, which would result in:

ID and result (DFOne.NumberValueCol1 value 10 minus each DFTwo.NumberValueCol2 value):

 1. Result - 5
 2. Result - 3
 3. Result - 1
 4. Result - 4
 5. Result - 7

In this case, ID 3 (DFTwo.NumberValueCol2 value 9) yields the smallest difference, 1, so I would like to map that DFTwo value to DFOne.NumberValueCol1 value 10.
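For concreteness, that first iteration boils down to something like the following (a minimal sketch on the sample data above; using idxmin to locate the smallest difference is just one option):

import pandas as pd

df2 = pd.DataFrame({'id': [1, 2, 3, 4, 5], 'value': [5, 7, 9, 6, 3]})
diffs = (10 - df2['value']).abs()  # first DFOne value minus every DFTwo value
print(diffs.idxmin())              # 2, i.e. the row with id 3, whose difference is 1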

The second iteration would start with ID 2, DFOne.NumberValueCol1 value 11. However, instead of restarting the subtraction from the beginning of DFTwo.NumberValueCol2, it would start at the next available ID after the previous match. Since the match was at ID 3, the next starting point would be ID 4, and the same smallest-difference logic would apply from there.

I hope this is not too confusing. I come from the T-SQL world, so I'm trying to understand how to do this type of calculation with pandas instead of traditional SQL Server cursors.

Your problem can be summarized as:

  1. Find the maximum value in DFTwo, subtract that from the first value in DFOne.
  2. Using the index of that maximum value, slice DFTwo from just past that index onwards.
  3. Go to step 1, using the second row of DFOne.

A working example:

import pandas as pd

df1 = {'id': [1, 2, 3, 4, 5], 'value': [10, 11, 20, 13, 15]}
df2 = {'id': [1, 2, 3, 4, 5], 'value': [5, 7, 9, 6, 3]}

df1 = pd.DataFrame(data=df1)
df2 = pd.DataFrame(data=df2)
print("DFTwo")
print(df2)
print('\n')
df_output = []
for i in df1['value']:
    try:
        new_val = i - max(df2['value'])
        # positional index of the largest remaining value in df2
        max_pos = df2['value'].values.argmax()
        matched_id = int(df2['id'].iloc[max_pos])
        # drop the matched row and everything before it
        df2 = df2.iloc[max_pos + 1:]
        df_output.append((matched_id, new_val))
    except ValueError:
        # df2 has been exhausted
        break
print("Output")
print(pd.DataFrame(df_output, columns=['id', 'result']))

However, we run into the issue here that DFTwo is eventually exhausted. Without the try/except, per-iteration debug prints of the matched position, the result, and the remaining DFTwo rows show the failure:

2 -- 1
   id  value
3   4      6
4   5      3
0 -- 5
   id  value
4   5      3
0 -- 17
Empty DataFrame
Columns: [id, value]
Index: []
Traceback (most recent call last):
  File "C:/Users/Tyler/Desktop/pd_test.py", line 11, in <module>
    new_val = i - max(df2['value'])
ValueError: max() arg is an empty sequence

The output with the new except clause:

DFTwo
   id  value
0   1      5
1   2      7
2   3      9
3   4      6
4   5      3


Output
   id  result
0   3       1
1   4       5
2   5      17

Presumably this won't be an issue in your real-world use case, as DFTwo is large enough to support this slicing. Without more information on the actual business logic, this is my best attempt.
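Note that taking the largest remaining DFTwo value only matches the question's "smallest difference" rule while every remaining DFTwo value sits below the current DFOne value. If the rule is really the smallest absolute difference, a variant with an explicit emptiness check might look like this (a sketch under that assumption, not a drop-in replacement):

import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3, 4, 5], 'value': [10, 11, 20, 13, 15]})
df2 = pd.DataFrame({'id': [1, 2, 3, 4, 5], 'value': [5, 7, 9, 6, 3]})

df_output = []
for i in df1['value']:
    if df2.empty:                     # DFTwo exhausted: nothing left to match
        break
    diffs = (i - df2['value']).abs()  # distance to every remaining DFTwo value
    pos = diffs.values.argmin()       # positional index of the closest one
    df_output.append((int(df2['id'].iloc[pos]), i - df2['value'].iloc[pos]))
    df2 = df2.iloc[pos + 1:]          # continue from just past the match

print(pd.DataFrame(df_output, columns=['id', 'result']))

On this sample data it produces the same three rows, but it also handles a DFOne value that sits below the largest remaining DFTwo value.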
