I have two datasets that share an ID, and their From/To intervals overlap. To keep this short, I'm only posting the one ID that overlaps. Wherever the From/To intervals overlap, I want to keep the rows from the second dataset, df2, but I don't know how to do this in Python. I know it's probably easiest in SQL, but I'd like to know whether it is possible in Python. df2 has extra variables that I want to carry along; for the variables the two datasets share, I want df2's values instead of df1's wherever the intervals overlap.
df1
ID | From | To | Q | RM | RQ |
---|---|---|---|---|---|
MRC-17 | 447 | 472 | 0.63 | 42 | 10 |
MRC-17 | 472 | 502 | 2.5 | 42 | 20 |
MRC-17 | 502 | 503.8 | 2.5 | 37 | 10 |
MRC-17 | 503.8 | 509.7 | 0.42 | 29 | 10 |
MRC-17 | 509.7 | 527 | 0.38 | 32 | 10 |
MRC-17 | 527 | 545 | 0.38 | 32 | 10 |
MRC-17 | 545 | 551 | 3.33 | 47 | 26.67 |
MRC-17 | 551 | 576 | 0.38 | 32 | 10 |
MRC-17 | 576 | 579.5 | 6.07 | 47 | 48.57 |
MRC-17 | 579.5 | 597 | 0.38 | 32 | 10 |
MRC-17 | 597 | 616 | 0.38 | 32 | 10 |
MRC-17 | 616 | 626 | 4.75 | 47 | 38 |
MRC-17 | 626 | 647 | 0.38 | 32 | 10 |
MRC-17 | 647 | 662 | 0.83 | 34 | 10 |
MRC-17 | 662 | 677 | 0.38 | 37 | 10 |
df2
ID | From | To | H | DP | DR | IV | No | RQ | RM | Q |
---|---|---|---|---|---|---|---|---|---|---|
MRC-17 | 499 | 504 | 1 | U | S | D | 7 | 50 | 32 | 2.08 |
MRC-17 | 504 | 510 | 2 | P | R | D | 7 | 25 | 32 | 0.78 |
MRC-17 | 510 | 545 | 0 | P | K | F | 9 | 5 | 18 | 0.02 |
MRC-17 | 545 | 565 | 0 | P | K | F | 8 | 60 | 28 | 0.33 |
MRC-17 | 565 | 575 | 0 | P | K | F | 9 | 5 | 18 | 0.02 |
MRC-17 | 575 | 581 | 1 | P | K | F | 7 | 70 | 34 | 0.49 |
MRC-17 | 581 | 600 | 0 | P | K | F | 8 | 20 | 23 | 0.11 |
MRC-17 | 600 | 612 | 0 | P | K | F | 9 | 5 | 18 | 0.02 |
MRC-17 | 612 | 634 | 1 | P | S | C | 7 | 70 | 38 | 2.92 |
MRC-17 | 634 | 647 | 0 | P | S | F | 9 | 5 | 22 | 0.04 |
MRC-17 | 647 | 662 | 2 | P | S | B | 7 | 55 | 39 | 4.58 |
MRC-17 | 662 | 677 | 0 | P | S | F | 9 | 15 | 22 | 0.13 |
The result should be Final (-99 marks a missing numeric value, X a missing character value):
ID | From | To | H | DP | DR | IV | No | RQ | RM | Q |
---|---|---|---|---|---|---|---|---|---|---|
MRC-17 | 447 | 472 | -99 | X | X | X | -99 | 10 | 42 | 0.63 |
MRC-17 | 472 | 499 | -99 | X | X | X | -99 | 20 | 42 | 2.50 |
MRC-17 | 499 | 504 | 1 | U | S | D | 7 | 50 | 32 | 2.08 |
MRC-17 | 504 | 510 | 2 | P | R | D | 7 | 25 | 32 | 0.78 |
MRC-17 | 510 | 545 | 0 | P | K | F | 9 | 5 | 18 | 0.02 |
MRC-17 | 545 | 565 | 0 | P | K | F | 8 | 60 | 28 | 0.33 |
MRC-17 | 565 | 575 | 0 | P | K | F | 9 | 5 | 18 | 0.02 |
MRC-17 | 575 | 581 | 1 | P | K | F | 7 | 70 | 34 | 0.49 |
MRC-17 | 581 | 600 | 0 | P | K | F | 8 | 20 | 23 | 0.11 |
MRC-17 | 600 | 612 | 0 | P | K | F | 9 | 5 | 18 | 0.02 |
MRC-17 | 612 | 634 | 1 | P | S | C | 7 | 70 | 38 | 2.92 |
MRC-17 | 634 | 647 | 0 | P | S | F | 9 | 5 | 22 | 0.04 |
MRC-17 | 647 | 662 | 2 | P | S | B | 7 | 55 | 39 | 4.58 |
MRC-17 | 662 | 677 | 0 | P | S | F | 9 | 15 | 22 | 0.13 |
Thanks in advance for all the help!
So far all I've done is load the data:
# Load libraries
import pandas as pd
import numpy as np
from scipy import stats
df1 = pd.read_csv('LOGGED_DATA.csv')
df2 = pd.read_csv('PHOTOLOGGED_DATA.csv')
But I'm having a hard time figuring out how to go about this. I looked at inner, outer, and other joins, but the interval overlap is throwing me off!
With the dataframes you provided:
df1 = pd.DataFrame({'ID': ['MRC-17', 'MRC-17', 'MRC-17', 'MRC-17', 'MRC-17', 'MRC-17', 'MRC-17', 'MRC-17', 'MRC-17', 'MRC-17', 'MRC-17', 'MRC-17', 'MRC-17', 'MRC-17', 'MRC-17'], 'From': [447.0, 472.0, 502.0, 503.8, 509.7, 527.0, 545.0, 551.0, 576.0, 579.5, 597.0, 616.0, 626.0, 647.0, 662.0], 'To': [472.0, 502.0, 503.8, 509.7, 527.0, 545.0, 551.0, 576.0, 579.5, 597.0, 616.0, 626.0, 647.0, 662.0, 677.0], 'Q': [0.63, 2.5, 2.5, 0.42, 0.38, 0.38, 3.33, 0.38, 6.07, 0.38, 0.38, 4.75, 0.38, 0.83, 0.38], 'RM': [42, 42, 37, 29, 32, 32, 47, 32, 47, 32, 32, 47, 32, 34, 37], 'RQ': [10.0, 20.0, 10.0, 10.0, 10.0, 10.0, 26.67, 10.0, 48.57, 10.0, 10.0, 38.0, 10.0, 10.0, 10.0]})
df2 = pd.DataFrame({'ID': ['MRC-17', 'MRC-17', 'MRC-17', 'MRC-17', 'MRC-17', 'MRC-17', 'MRC-17', 'MRC-17', 'MRC-17', 'MRC-17', 'MRC-17', 'MRC-17'], 'From': [499, 504, 510, 545, 565, 575, 581, 600, 612, 634, 647, 662], 'To': [504, 510, 545, 565, 575, 581, 600, 612, 634, 647, 662, 677], 'H': [1, 2, 0, 0, 0, 1, 0, 0, 1, 0, 2, 0], 'DP': ['U', 'P', 'P', 'P', 'P', 'P', 'P', 'P', 'P', 'P', 'P', 'P'], 'DR': ['S', 'R', 'K', 'K', 'K', 'K', 'K', 'K', 'S', 'S', 'S', 'S'], 'IV': ['D', 'D', 'F', 'F', 'F', 'F', 'F', 'F', 'C', 'F', 'B', 'F'], 'No': [7, 7, 9, 8, 9, 7, 8, 9, 7, 9, 7, 9], 'RQ': [50, 25, 5, 60, 5, 70, 20, 5, 70, 5, 55, 15], 'RM': [32, 32, 18, 28, 18, 34, 23, 18, 38, 22, 39, 22], 'Q': [2.08, 0.78, 0.02, 0.33, 0.02, 0.49, 0.11, 0.02, 2.92, 0.04, 4.58, 0.13]})
Here is one way to do it:
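# Drop df1 rows whose From lies within df2's From range, or whose To lies within df2's To range
mask = (df1["From"] >= df2["From"].min()) & (df1["From"] <= df2["From"].max()) | (
    df1["To"] >= df2["To"].min()
) & (df1["To"] <= df2["To"].max())
df1 = df1[~mask].copy()  # copy so the assignment below does not write to a view
# Clip the last remaining df1 row so it ends where df2 begins
# (assumes all remaining df1 rows precede df2, as they do here)
df1.loc[df1.index[-1], "To"] = df2["From"].min()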
# Make new dataframe from sliced df1 and df2, do some cleanup
new_df = (
    pd.concat([df1, df2])
    .fillna(value={"H": -99, "No": -99, "DP": "X", "DR": "X", "IV": "X"})
    .reindex(
        ["ID", "From", "To", "H", "DP", "DR", "IV", "No", "RQ", "RM", "Q"],
        axis="columns",
    )
    .astype(
        {"From": "int32", "To": "int32", "H": "int32", "No": "int32", "RQ": "int32"}
    )
)
And so:
ID From To H DP DR IV No RQ RM Q
0 MRC-17 447 472 -99 X X X -99 10 42 0.63
1 MRC-17 472 499 -99 X X X -99 20 42 2.50
0 MRC-17 499 504 1 U S D 7 50 32 2.08
1 MRC-17 504 510 2 P R D 7 25 32 0.78
2 MRC-17 510 545 0 P K F 9 5 18 0.02
3 MRC-17 545 565 0 P K F 8 60 28 0.33
4 MRC-17 565 575 0 P K F 9 5 18 0.02
5 MRC-17 575 581 1 P K F 7 70 34 0.49
6 MRC-17 581 600 0 P K F 8 20 23 0.11
7 MRC-17 600 612 0 P K F 9 5 18 0.02
8 MRC-17 612 634 1 P S C 7 70 38 2.92
9 MRC-17 634 647 0 P S F 9 5 22 0.04
10 MRC-17 647 662 2 P S B 7 55 39 4.58
11 MRC-17 662 677 0 P S F 9 15 22 0.13
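The approach above relies on df2 running to the end of df1 for a single ID. If that does not hold, one possible generalization is sketched below. It is only a sketch under stated assumptions: the function name `prefer_second` and the toy frames are made up for illustration, and it assumes that for each ID the df2 intervals form one contiguous block with no internal gaps. It trims each df1 interval to whatever part lies before or after df2's coverage, and keeps df1 rows unchanged for IDs that df2 does not contain.

```python
import pandas as pd


def prefer_second(df1, df2, key="ID", start="From", end="To"):
    """Keep df2 wherever it has coverage; trim df1 to the parts outside it.

    Assumption: per key, df2's intervals are contiguous (no gaps), so its
    coverage is the single range [min(start), max(end)].
    """
    # Coverage of df2 per key, as floats so missing keys can become +inf
    cover = df2.groupby(key).agg(lo=(start, "min"), hi=(end, "max")).astype(float)
    left = df1.join(cover, on=key)

    # Keys absent from df2: +inf bounds mean "nothing is covered"
    left["lo"] = left["lo"].fillna(float("inf"))
    left["hi"] = left["hi"].fillna(float("inf"))

    # Part of each df1 interval that lies before df2's coverage
    before = left[left[start] < left["lo"]].copy()
    before[end] = before[[end, "lo"]].min(axis=1)

    # Part of each df1 interval that lies after df2's coverage
    after = left[left[end] > left["hi"]].copy()
    after[start] = after[[start, "hi"]].max(axis=1)

    trimmed = pd.concat([before, after]).drop(columns=["lo", "hi"])
    out = pd.concat([trimmed, df2], ignore_index=True)
    return out.sort_values([key, start], ignore_index=True)


# Toy data (hypothetical, not the MRC-17 rows from the question)
toy1 = pd.DataFrame({
    "ID": ["A", "A", "A", "A", "B"],
    "From": [0, 10, 20, 30, 0],
    "To": [10, 20, 30, 40, 5],
    "Q": [1.0, 2.0, 3.0, 4.0, 9.0],
})
toy2 = pd.DataFrame({
    "ID": ["A", "A"],
    "From": [15, 22],
    "To": [22, 32],
    "H": [1, 2],
    "Q": [0.5, 0.6],
})

result = prefer_second(toy1, toy2)
print(result)
```

Columns that exist only in df2 (here `H`) come back as NaN for the df1 rows, so the same `fillna` and `astype` cleanup used above can be chained onto the result. On the df1 and df2 from the question this should keep the same rows as `new_df`, with `From` and `To` as floats until they are cast.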