简体   繁体   中英

Check if series of numbers is between two columns in pandas dataframe

I am trying to classify genomic locations, and I have DataFrame like the following, with all locations and their respective classification types. The Type column does not have unique classifications, but each row will have a unique combination of Chr , Low , High .

pd.DataFrame({
    'Chr':[1,1,3],
    'Low':[100,200,300],
    'High':[150,250,350],
    'Type':['Foo','Bar','Foo']
})

I then have my sample set that needs to be classified like the DataFrame below.

pd.DataFrame({
    'Chr':[1,1,5],
    'Loc':[125,325,325]
})

To classify the data, for each location in the sample set, the Chromosome position found within the Chr column must match a Chr value found within the reference DataFrame and the Loc value must be >= the Low value and <= the High value. If this happens, that row should then be labeled with the respective Type in the reference DataFrame. In the example I provide, the sample set should be labeled like the following.

pd.DataFrame({
    'Chr':[1,1,5],
    'Loc':[125,325,325],
    'Type':['Foo','None','None']
})

which looks like:

   Chr  Loc  Type
0    1  125   Foo
1    1  325  None
2    5  325  None

You could merge the two on "Chr". Then on the merged DataFrame, see if "Loc" falls between "Low" and "High" and use where to fill "Type" with NaN values if it doesn't. Finally, drop irrelevant columns and duplicate rows:

merged = sample.merge(df, on='Chr', how='left')
merged['Type'] = merged['Type'].where(merged['Loc'].between(merged['Low'], merged['High']))
out = merged.drop(columns=['Low','High']).drop_duplicates(subset=['Chr','Loc'])

Output:

   Chr  Loc Type
0    1  125  Foo
2    1  325  NaN
4    5  325  NaN

You can try apply() to check for each row if the conditions return a Type like this:

# Create your data frames
cat = pd.DataFrame({
    'Chr':[1,1,3],
    'Low':[100,200,300],
    'High':[150,250,350],
    'Type':['Foo','Bar','Foo']
})
test = pd.DataFrame({
    'Chr':[1,1,5],
    'Loc':[125,325,325]
})

# Check if there is type for each row
test['Type'] = test.apply(lambda x: cat[(cat['Chr']==x['Chr']) & (cat['Low'] < x['Loc']) & (cat['High'] > x['Loc'])]['Type'], axis=1)

test

Output:

    Chr Loc Type
0   1   125 Foo
1   1   325 NaN
2   5   325 NaN

You could try this:

df2 = df2.assign(Type=None)

for l in df2["Loc"]:
    i = list(df2["Loc"]).index(l)
    if df1["Low"][i] < l < df1["High"][i]:
        df2["Type"][i] = df1["Type"][i]

Output:

   Chr  Loc  Type
0    1  125   Foo
1    1  325  None
2    5  325  None

I'm not clear on the requirements for your question, as noted in a comment on the question above. However, if the currently accepted solution that only checks the conditions you specified row-to-row, then I think this is a better solution:

df1 = pd.DataFrame({
    'Chr':[1,1,3],
    'Low':[100,200,300],
    'High':[150,250,350],
    'Type':['Foo','Bar','Foo']
})
df2 = pd.DataFrame({
    'Chr':[1,1,5],
    'Loc':[125,325,325]
})
mask = (df2['Chr']==df1['Chr'])&(df2['Loc']>=df1['Low'])&(df2['Loc']<=df1['High'])
df2.loc[mask, 'Type'] = df1.loc[mask, 'Type']

Output:

在此处输入图像描述

If the requirement is to check each value of 'Chr' in the sample dataframe with ALL rows in the classification dataframe, df1 , that match the 'Chr' value, then this is a bit more complicated and I can update this answer accordingly, but I think @enke's solution handles this case correctly.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM