Check if series of numbers is between two columns in pandas dataframe

Question

I am trying to classify genomic locations, and I have a DataFrame like the following, with all locations and their respective classification types. The Type column does not have unique classifications, but each row has a unique combination of Chr, Low, and High.

pd.DataFrame({
    'Chr':[1,1,3],
    'Low':[100,200,300],
    'High':[150,250,350],
    'Type':['Foo','Bar','Foo']
})

I then have my sample set that needs to be classified like the DataFrame below.

pd.DataFrame({
    'Chr':[1,1,5],
    'Loc':[125,325,325]
})

To classify the data, each location in the sample set must meet two conditions: its Chr value must match a Chr value in the reference DataFrame, and its Loc value must be >= the Low value and <= the High value of that same reference row. When both hold, the sample row should be labeled with the respective Type from the reference DataFrame. In the example I provide, the sample set should be labeled like the following.

pd.DataFrame({
    'Chr':[1,1,5],
    'Loc':[125,325,325],
    'Type':['Foo','None','None']
})

which looks like:

   Chr  Loc  Type
0    1  125   Foo
1    1  325  None
2    5  325  None

Answer 1

You could merge the two on "Chr" (below, df is the reference DataFrame and sample is the sample set). Then, on the merged DataFrame, check whether "Loc" falls between "Low" and "High", and use where to replace "Type" with NaN where it doesn't. Finally, drop the irrelevant columns and the duplicate rows:

merged = sample.merge(df, on='Chr', how='left')
merged['Type'] = merged['Type'].where(merged['Loc'].between(merged['Low'], merged['High']))
out = merged.drop(columns=['Low','High']).drop_duplicates(subset=['Chr','Loc'])

Output:

   Chr  Loc Type
0    1  125  Foo
2    1  325  NaN
4    5  325  NaN
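
One caveat, offered as a hedged sketch rather than a correction of the answer: when a chromosome has several intervals in the reference frame and a sample location falls in a later one, drop_duplicates keeps the first merged row for that location, which may be a non-matching one. A possible variant is shown below. It assumes the same df and sample frames as above, plus one extra sample row (Loc 225) that is not in the question and is added only to illustrate the case. It filters the merged frame down to matching pairs first and then left-merges them back onto the sample.

```python
import pandas as pd

df = pd.DataFrame({
    'Chr': [1, 1, 3],
    'Low': [100, 200, 300],
    'High': [150, 250, 350],
    'Type': ['Foo', 'Bar', 'Foo']
})
# Loc 225 is an illustrative extra row: it lies in the second Chr 1 interval
sample = pd.DataFrame({'Chr': [1, 1, 1, 5], 'Loc': [125, 225, 325, 325]})

# Pair every sample row with every reference interval on the same chromosome
merged = sample.merge(df, on='Chr', how='left')

# Keep only the pairs where Loc lies inside [Low, High], inclusive on both ends
hits = merged[merged['Loc'].between(merged['Low'], merged['High'])]
hits = hits.drop_duplicates(subset=['Chr', 'Loc'])[['Chr', 'Loc', 'Type']]

# Left-merge back so unmatched sample rows are kept with a missing Type
out = sample.merge(hits, on=['Chr', 'Loc'], how='left')
print(out)
```

With this ordering of steps, a location is labeled whenever any interval on its chromosome contains it, and the first such interval wins if several do.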

Answer 2

You can use apply() to check, for each row, whether the conditions return a Type:

import pandas as pd

# Create your data frames
cat = pd.DataFrame({
    'Chr':[1,1,3],
    'Low':[100,200,300],
    'High':[150,250,350],
    'Type':['Foo','Bar','Foo']
})
test = pd.DataFrame({
    'Chr':[1,1,5],
    'Loc':[125,325,325]
})

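# Return the Type of the first matching reference row, or NaN if none matches
# (bounds are inclusive, as the question asks for >= Low and <= High)
def find_type(x):
    hits = cat.loc[(cat['Chr'] == x['Chr']) & (cat['Low'] <= x['Loc']) & (cat['High'] >= x['Loc']), 'Type']
    return hits.iloc[0] if len(hits) else float('nan')

# Check if there is a type for each row
test['Type'] = test.apply(find_type, axis=1)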

test

Output:

    Chr Loc Type
0   1   125 Foo
1   1   325 NaN
2   5   325 NaN

Answer 3

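You could try this loop, which compares df2 and df1 row by row, so it assumes both frames have the same length and the default integer index:

df2 = df2.assign(Type=None)

# enumerate gives the row position directly, which also works when Loc values repeat
for i, loc in enumerate(df2["Loc"]):
    # Require the same chromosome and Low <= Loc <= High (inclusive bounds)
    if df1["Chr"][i] == df2["Chr"][i] and df1["Low"][i] <= loc <= df1["High"][i]:
        df2.loc[i, "Type"] = df1["Type"][i]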

Output:

   Chr  Loc  Type
0    1  125   Foo
1    1  325  None
2    5  325  None

Answer 4

I'm not clear on the requirements for your question, as noted in a comment on the question above. However, if the currently accepted solution, which only checks the conditions you specified row-to-row, is what you want, then I think this is a better solution:

import pandas as pd

df1 = pd.DataFrame({
    'Chr':[1,1,3],
    'Low':[100,200,300],
    'High':[150,250,350],
    'Type':['Foo','Bar','Foo']
})
df2 = pd.DataFrame({
    'Chr':[1,1,5],
    'Loc':[125,325,325]
})
# Row-to-row comparison: both frames must share the same index
mask = (df2['Chr'] == df1['Chr']) & (df2['Loc'] >= df1['Low']) & (df2['Loc'] <= df1['High'])
df2.loc[mask, 'Type'] = df1.loc[mask, 'Type']

Output:

   Chr  Loc Type
0    1  125  Foo
1    1  325  NaN
2    5  325  NaN

If the requirement is to check each value of 'Chr' in the sample dataframe with ALL rows in the classification dataframe, df1 , that match the 'Chr' value, then this is a bit more complicated and I can update this answer accordingly, but I think @enke's solution handles this case correctly.
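
For the "check against ALL rows" case described above, one possible sketch (an assumption on my part, and not necessarily how @enke's solution does it) is to use NumPy broadcasting to build a boolean matrix of sample rows against reference rows. The variable names chr_ok, in_range, match and first are illustrative only.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'Chr': [1, 1, 3],
    'Low': [100, 200, 300],
    'High': [150, 250, 350],
    'Type': ['Foo', 'Bar', 'Foo']
})
df2 = pd.DataFrame({
    'Chr': [1, 1, 5],
    'Loc': [125, 325, 325]
})

# Shape (len(df2), len(df1)): entry [i, j] is True when sample row i is on the
# same chromosome as reference row j and lies within [Low, High]
chr_ok = df2['Chr'].to_numpy()[:, None] == df1['Chr'].to_numpy()
loc = df2['Loc'].to_numpy()[:, None]
in_range = (loc >= df1['Low'].to_numpy()) & (loc <= df1['High'].to_numpy())
match = chr_ok & in_range

# Position of the first matching reference row per sample row (0 if none match)
first = match.argmax(axis=1)
types = df1['Type'].to_numpy()[first]

# Blank out the sample rows that had no matching reference row at all
df2['Type'] = pd.Series(types, index=df2.index).where(match.any(axis=1))
print(df2)
```

The intermediate matrix has len(df2) * len(df1) entries, so this sketch suits moderately sized frames; for very large inputs a merge-based approach may use memory more predictably.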

4 answers

solution1
2 ACCEPTED

solution2
1 2022-02-09 23:37:01

solution3
1 2022-02-09 23:37:35

solution4
1 2022-02-11 01:46:56
