Find shared sub-ranges defined by start and endpoints in pandas dataframe

I need to combine two dataframes that contain information about train track sections. "Line" identifies a track section, and the two attributes "A" and "B" are given for subsections of the Line, each defined by a start point and an end point on the line. These subsections do not match between the two dataframes:

df1
Line    startpoint  endpoint    Attribute_A 
100     2.506       2.809       B-70
100     2.809       2.924       B-91
100     2.924       4.065       B-84
100     4.065       4.21        B-70
100     4.21        4.224       B-91
...

df2
Line    startpoint  endpoint    Attribute_B 
100     2.5         2.6         140
100     2.6         2.7         158
100     2.7         2.8         131
100     2.8         2.9         124
100     2.9         3.0         178

...

What I need is a merged dataframe that gives the combination of Attributes A and B for each minimal subsection that the two dataframes share:

df3
Line    startpoint  endpoint    Attribute_A Attribute_B
100     2.5         2.506       nan         140
100     2.506       2.6         B-70        140
100     2.6         2.7         B-70        158
100     2.7         2.8         B-70        131
100     2.8         2.809       B-70        124
100     2.809       2.9         B-91        124
100     2.9         2.924       B-91        178
100     2.924       3.0         B-84        178
...

What is the best way to do this in Python? I'm somewhat new to it, and while I can manage basic calculations between rows and columns, I'm at my wit's end with this problem. Merging and sorting the two dataframes and then calculating the differences between start and end points didn't get me very far, and I can't find applicable information on the forums. I'm grateful for any hint!

Here is my solution, a bit long but it works:

First step is finding the intervals:

all_start_points = set(df1['startpoint'].values.tolist() + df2['startpoint'].values.tolist())
all_end_points = set(df1['endpoint'].values.tolist() + df2['endpoint'].values.tolist())

all_points = sorted(list(all_start_points.union(all_end_points)))

intervals = [(start, end) for start, end in zip(all_points[:-1], all_points[1:])]
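As a minimal sketch of the same step, the breakpoints can also be collected in one call with np.unique, which returns the sorted distinct values. The frames below are rebuilt from the rows shown in the question (only the rows visible there, so this is sample data, not the full dataset):

```python
import numpy as np
import pandas as pd

# Sample frames rebuilt from the rows shown in the question
df1 = pd.DataFrame({
    "Line": 100,
    "startpoint": [2.506, 2.809, 2.924, 4.065, 4.21],
    "endpoint": [2.809, 2.924, 4.065, 4.21, 4.224],
    "Attribute_A": ["B-70", "B-91", "B-84", "B-70", "B-91"],
})
df2 = pd.DataFrame({
    "Line": 100,
    "startpoint": [2.5, 2.6, 2.7, 2.8, 2.9],
    "endpoint": [2.6, 2.7, 2.8, 2.9, 3.0],
    "Attribute_B": [140, 158, 131, 124, 178],
})

# Every start point and end point of either frame is a breakpoint;
# np.unique sorts them and drops duplicates
points = np.unique(np.concatenate([
    df1[["startpoint", "endpoint"]].to_numpy().ravel(),
    df2[["startpoint", "endpoint"]].to_numpy().ravel(),
]))

# Consecutive breakpoints form the minimal subsections
intervals = list(zip(points[:-1], points[1:]))
```

With this sample data there are 12 distinct breakpoints, so 11 minimal subsections.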

Then we need to find the relevant interval in each dataframe (if present):

import numpy as np
import pandas as pd
def find_interval(df, interval):
    return df[(df['startpoint']<=interval[0]) &
              (df['endpoint']>=interval[1])]

attr_A = [find_interval(df1, intv)['Attribute_A'] for intv in intervals]
attr_A = [el.iloc[0] if len(el)>0 else np.nan for el in attr_A]

attr_B = [find_interval(df2, intv)['Attribute_B'] for intv in intervals]
attr_B = [el.iloc[0] if len(el)>0 else np.nan for el in attr_B]
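The lookup above filters the whole dataframe once per interval. For larger inputs, one possible alternative (not part of the original answer) is to look up the midpoint of each minimal subsection in a pd.IntervalIndex. The sketch below assumes that the sections within each dataframe do not overlap; the helper name lookup is made up for illustration:

```python
import numpy as np
import pandas as pd

# Sample frames rebuilt from the rows shown in the question
df1 = pd.DataFrame({
    "startpoint": [2.506, 2.809, 2.924, 4.065, 4.21],
    "endpoint": [2.809, 2.924, 4.065, 4.21, 4.224],
    "Attribute_A": ["B-70", "B-91", "B-84", "B-70", "B-91"],
})
df2 = pd.DataFrame({
    "startpoint": [2.5, 2.6, 2.7, 2.8, 2.9],
    "endpoint": [2.6, 2.7, 2.8, 2.9, 3.0],
    "Attribute_B": [140, 158, 131, 124, 178],
})

points = np.unique(np.concatenate([
    df1[["startpoint", "endpoint"]].to_numpy().ravel(),
    df2[["startpoint", "endpoint"]].to_numpy().ravel(),
]))
# The midpoint of a minimal subsection lies strictly inside at most one section
mids = (points[:-1] + points[1:]) / 2

def lookup(df, column, mids):
    """Hypothetical helper: value of `column` for the section containing each midpoint."""
    # closed="left" keeps adjacent sections from overlapping at shared endpoints
    sections = pd.IntervalIndex.from_arrays(df["startpoint"], df["endpoint"], closed="left")
    pos = sections.get_indexer(mids)        # -1 where no section contains the midpoint
    values = df[column].reset_index(drop=True)
    # Reindexing by position turns the missing label -1 into NaN
    return values.reindex(pos).reset_index(drop=True)

attr_A = lookup(df1, "Attribute_A", mids)
attr_B = lookup(df2, "Attribute_B", mids)
```

If sections within one dataframe can overlap, get_indexer raises an error, and the row-by-row filter from the answer is the safer choice.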

Finally, we put everything together:

out = pd.DataFrame(intervals, columns = ['startpoint', 'endpoint'])
out = pd.concat([out, pd.Series(attr_A).to_frame('Attribute_A'), pd.Series(attr_B).to_frame('Attribute_B')], axis = 1)
out['Line'] = 100
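The hard-coded Line value works for the sample, which contains only line 100. If the dataframes hold several lines, the same steps could be applied per line and the pieces concatenated. This is a sketch under that assumption; the function names first_match and merge_one_line and the rows for line 200 are invented for illustration:

```python
import numpy as np
import pandas as pd

def first_match(df, column, start, end):
    """Value of `column` for the first section covering [start, end], else NaN."""
    hit = df[(df["startpoint"] <= start) & (df["endpoint"] >= end)]
    return hit[column].iloc[0] if len(hit) else np.nan

def merge_one_line(line, part1, part2):
    """Split one line into minimal subsections and attach both attributes."""
    points = np.unique(np.concatenate([
        part1[["startpoint", "endpoint"]].to_numpy().ravel(),
        part2[["startpoint", "endpoint"]].to_numpy().ravel(),
    ]))
    rows = [
        {"Line": line, "startpoint": s, "endpoint": e,
         "Attribute_A": first_match(part1, "Attribute_A", s, e),
         "Attribute_B": first_match(part2, "Attribute_B", s, e)}
        for s, e in zip(points[:-1], points[1:])
    ]
    return pd.DataFrame(rows)

# Made-up two-line sample; the rows for line 200 are not from the question
df1 = pd.DataFrame({
    "Line": [100, 100, 200],
    "startpoint": [2.506, 2.809, 0.0],
    "endpoint": [2.809, 2.924, 1.0],
    "Attribute_A": ["B-70", "B-91", "B-84"],
})
df2 = pd.DataFrame({
    "Line": [100, 100, 200],
    "startpoint": [2.5, 2.6, 0.5],
    "endpoint": [2.6, 2.9, 1.5],
    "Attribute_B": [140, 158, 131],
})

# Process each line separately so breakpoints of different lines never mix
lines = sorted(set(df1["Line"]) | set(df2["Line"]))
out = pd.concat(
    [merge_one_line(line, df1[df1["Line"] == line], df2[df2["Line"] == line])
     for line in lines],
    ignore_index=True,
)
```

Keeping the lines separate matters because start and end points are positions along one line, so a breakpoint from line 200 has no meaning on line 100.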

And I get the expected result:

out
Out[111]: 
    startpoint  endpoint Attribute_A  Attribute_B  Line
0        2.500     2.506         NaN        140.0   100
1        2.506     2.600        B-70        140.0   100
2        2.600     2.700        B-70        158.0   100
3        2.700     2.800        B-70        131.0   100
4        2.800     2.809        B-70        124.0   100
5        2.809     2.900        B-91        124.0   100
6        2.900     2.924        B-91        178.0   100
7        2.924     3.000        B-84        178.0   100
8        3.000     4.065        B-84          NaN   100
9        4.065     4.210        B-70          NaN   100
10       4.210     4.224        B-91          NaN   100

