
Pandas: Selecting multiple rows based on column pair

I am converting my data analysis pipeline from wide to tidy/long format and have run into a filtering problem that I just cannot wrap my head around.

My data (simplified) looks like this (microscopy intensity data): in each measurement of a group I have several regions of interest (= roi) whose intensity (= value) I track over several timepoints.

A roi is basically a single cell in the microscope image. I follow the change of intensity (= value) over time (= timepoint). I repeat this experiment a couple of times (= measurement), looking at several cells (= roi) each time.

My aim is to filter out, across all timepoints, those ROIs of a measurement that have an intensity value higher than my set threshold at timepoint 0 (I consider these ROIs pre-activated).

import pandas as pd

data = {"timepoint": [0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3],
        "measurement": [1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,3,3,3,3,3,3,3,3],
        "roi": [1,1,1,1,2,2,2,2,3,3,3,3,1,1,1,1,1,1,1,1,2,2,2,2],
        "value": [0.1,0.2,0.3,0.4,0.1,0.2,0.3,0.4,0.5,0.6,0.8,0.9,0.1,0.2,0.3,0.4,0.5,0.6,0.8,0.9,0.1,0.2,0.3,0.4],
        "group": "control"}
df = pd.DataFrame(data)
df
df

returns

  timepoint     measurement     roi     value   group
0   0                 1          1       0.1    control
1   1                 1          1       0.2    control
2   2                 1          1       0.3    control
3   3                 1          1       0.4    control
4   0                 1          2       0.1    control
5   1                 1          2       0.2    control
6   2                 1          2       0.3    control
7   3                 1          2       0.4    control
8   0                 1          3       0.5    control
9   1                 1          3       0.6    control
10  2                 1          3       0.8    control
11  3                 1          3       0.9    control
12  0                 2          1       0.1    control
13  1                 2          1       0.2    control
14  2                 2          1       0.3    control
15  3                 2          1       0.4    control
16  0                 3          1       0.5    control
17  1                 3          1       0.6    control
18  2                 3          1       0.8    control
19  3                 3          1       0.9    control
20  0                 3          2       0.1    control
21  1                 3          2       0.2    control
22  2                 3          2       0.3    control
23  3                 3          2       0.4    control

Now I can select the rows containing ROIs with a value at timepoint 0 higher than my threshold with

    threshold = 0.4
    pre_activated = df.loc[(df['timepoint'] == 0) & (df['value'] > threshold)]
    pre_activated

which returns

timepoint   measurement     roi     value   group
8   0            1           3       0.5    control
16  0            3           1       0.5    control

Now I want to filter out those cells (e.g., measurement 1, roi 3) for all timepoints 0 to 3 from the original dataframe df. This is where I am stuck right now.

If I use .isin

df.loc[~(df['measurement'].isin(pre_activated["measurement"]) & df['roi'].isin(pre_activated["roi"]))]

I get close, but everything with the measurement 1 / roi 1 pair is also missing, so I assume the problem is in the condition expression:

   timepoint       measurement    roi    value      group
4   0                   1          2      0.1       control
5   1                   1          2      0.2       control
6   2                   1          2      0.3       control
7   3                   1          2      0.4       control
12  0                   2          1      0.1       control
13  1                   2          1      0.2       control
14  2                   2          1      0.3       control
15  3                   2          1      0.4       control
20  0                   3          2      0.1       control
21  1                   3          2      0.2       control
22  2                   3          2      0.3       control
23  3                   3          2      0.4       control
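To illustrate what goes wrong above: each `.isin` call matches its column independently, so any row whose measurement appears somewhere in `pre_activated` and whose roi also appears somewhere in it gets dropped, even when the two values come from different rows. A minimal sketch of matching (measurement, roi) as a pair instead, using the sample data from the question (this is just one possible approach, not the only fix):

```python
import pandas as pd

# rebuild the sample data from the question
data = {
    "timepoint": [0, 1, 2, 3] * 6,
    "measurement": [1] * 12 + [2] * 4 + [3] * 8,
    "roi": [1] * 4 + [2] * 4 + [3] * 4 + [1] * 4 + [1] * 4 + [2] * 4,
    "value": [0.1, 0.2, 0.3, 0.4, 0.1, 0.2, 0.3, 0.4,
              0.5, 0.6, 0.8, 0.9, 0.1, 0.2, 0.3, 0.4,
              0.5, 0.6, 0.8, 0.9, 0.1, 0.2, 0.3, 0.4],
    "group": "control",
}
df = pd.DataFrame(data)

threshold = 0.4
pre_activated = df[(df["timepoint"] == 0) & (df["value"] > threshold)]

# collect the (measurement, roi) tuples to exclude, then match rows
# against those tuples as pairs rather than column by column
pairs = set(zip(pre_activated["measurement"], pre_activated["roi"]))
is_pre = pd.MultiIndex.from_frame(df[["measurement", "roi"]]).isin(list(pairs))
df_filtered = df[~is_pre]
```

Here `MultiIndex.isin` compares whole (measurement, roi) tuples, so the (1, 1) rows survive while the (1, 3) and (3, 1) rows are removed for all timepoints.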

I know I can use .query for at least a single measurement & roi pair

df[~df.isin(df.query('measurement == 1 & roi == 3'))]

which gets somewhat close, although all integers are converted to float, and the matched rows are set to NaN instead of being dropped. The "group" column of those rows is also NaN, which would make things difficult when there are multiple groups with multiple measurements and ROIs per dataframe

   timepoint    measurement          roi     value  group
    0   0.0                   1.0        1.0     0.1    control
    1   1.0                   1.0        1.0     0.2    control
    2   2.0                   1.0        1.0     0.3    control
    3   3.0                   1.0        1.0     0.4    control
    4   0.0                   1.0        2.0     0.1    control
    5   1.0                   1.0        2.0     0.2    control
    6   2.0                   1.0        2.0     0.3    control
    7   3.0                   1.0        2.0     0.4    control
    8   NaN                   NaN        NaN     NaN    NaN
    9   NaN                   NaN        NaN     NaN    NaN
    10  NaN                   NaN        NaN     NaN    NaN
    11  NaN                   NaN        NaN     NaN    NaN
    12  0.0                   2.0        1.0     0.1    control
    13  1.0                   2.0        1.0     0.2    control
    14  2.0                   2.0        1.0     0.3    control
    15  3.0                   2.0        1.0     0.4    control
    16  0.0                   3.0        1.0     0.5    control
    17  1.0                   3.0        1.0     0.6    control
    18  2.0                   3.0        1.0     0.8    control
    19  3.0                   3.0        1.0     0.9    control
    20  0.0                   3.0        2.0     0.1    control
    21  1.0                   3.0        2.0     0.2    control
    22  2.0                   3.0        2.0     0.3    control
    23  3.0                   3.0        2.0     0.4    control

I tried to use a dict that stores the measurement: roi pairs to avoid any mix-up, but I do not really know whether this is useful (a dict would also silently overwrite an entry if one measurement contained more than one pre-activated roi):

msmt_list = pre_activated["measurement"].values
roi_list = pre_activated["roi"].values

mydict={}
for i in range(len(msmt_list)):
    mydict[msmt_list[i]]=roi_list[i]

output

   mydict
    {1: 3, 3: 1}

What is the best way to achieve this? I would appreciate any input, also in terms of efficiency, as I usually deal with 3-4 groups with 4-8 measurements, up to 200 ROIs each, and usually 360 timepoints.

Thanks!

Edit: Just to clarify what my desired output dataframes should look like

`df_pre_activated` (these are the "roi" that have a value higher than my threshold at timepoint 0)

  timepoint     measurement     roi     value   group
8   0                 1          3       0.5    control
9   1                 1          3       0.6    control
10  2                 1          3       0.8    control
11  3                 1          3       0.9    control
16  0                 3          1       0.5    control
17  1                 3          1       0.6    control
18  2                 3          1       0.8    control
19  3                 3          1       0.9    control

`df_filtered` (this is basically the initial `df` without the data in `df_pre_activated` shown above)

      timepoint     measurement     roi     value   group
0   0                 1          1       0.1    control
1   1                 1          1       0.2    control
2   2                 1          1       0.3    control
3   3                 1          1       0.4    control
4   0                 1          2       0.1    control
5   1                 1          2       0.2    control
6   2                 1          2       0.3    control
7   3                 1          2       0.4    control
12  0                 2          1       0.1    control
13  1                 2          1       0.2    control
14  2                 2          1       0.3    control
15  3                 2          1       0.4    control
20  0                 3          2       0.1    control
21  1                 3          2       0.2    control
22  2                 3          2       0.3    control
23  3                 3          2       0.4    control

A solution would be the following:

First, we calculate df_pre_activated_t0 by filtering df with the condition:

threshold = 0.4
df_pre_activated_t0 = df[(df['timepoint'] == 0) & (df['value'] > threshold)]

df_pre_activated_t0 looks like this:

    timepoint  measurement  roi  value    group
8           0            1    3    0.5  control
16          0            3    1    0.5  control

We calculate df_pre_activated by merging df and df_pre_activated_t0 (inner merge):

df_pre_activated = df.merge(
    df_pre_activated_t0[["measurement", "roi"]], how="inner", on=["measurement", "roi"]
)

df_pre_activated looks like this:

   timepoint  measurement  roi  value    group
0          0            1    3    0.5  control
1          1            1    3    0.6  control
2          2            1    3    0.8  control
3          3            1    3    0.9  control
4          0            3    1    0.5  control
5          1            3    1    0.6  control
6          2            3    1    0.8  control
7          3            3    1    0.9  control

To calculate df_filtered (df without the rows of df_pre_activated), we do a left merge between df and df_pre_activated and keep the rows that have no match in df_pre_activated:

df_filtered = df.merge(
    df_pre_activated,
    how="left",
    on=["timepoint", "measurement", "roi", "value"]
)

df_filtered = df_filtered[pd.isna(df_filtered["group_y"])]

df_filtered looks like this:

    timepoint  measurement  roi  value  group_x group_y
0           0            1    1    0.1  control     NaN
1           1            1    1    0.2  control     NaN
2           2            1    1    0.3  control     NaN
3           3            1    1    0.4  control     NaN
4           0            1    2    0.1  control     NaN
5           1            1    2    0.2  control     NaN
6           2            1    2    0.3  control     NaN
7           3            1    2    0.4  control     NaN
12          0            2    1    0.1  control     NaN
13          1            2    1    0.2  control     NaN
14          2            2    1    0.3  control     NaN
15          3            2    1    0.4  control     NaN
20          0            3    2    0.1  control     NaN
21          1            3    2    0.2  control     NaN
22          2            3    2    0.3  control     NaN
23          3            3    2    0.4  control     NaN

Finally, we drop the group_y column and we set the column names to their original values:

df_filtered.drop("group_y", axis=1, inplace=True)
df_filtered.columns = list(df.columns)

df_filtered looks like this:

    timepoint  measurement  roi  value    group
0           0            1    1    0.1  control
1           1            1    1    0.2  control
2           2            1    1    0.3  control
3           3            1    1    0.4  control
4           0            1    2    0.1  control
5           1            1    2    0.2  control
6           2            1    2    0.3  control
7           3            1    2    0.4  control
12          0            2    1    0.1  control
13          1            2    1    0.2  control
14          2            2    1    0.3  control
15          3            2    1    0.4  control
20          0            3    2    0.1  control
21          1            3    2    0.2  control
22          2            3    2    0.3  control
23          3            3    2    0.4  control
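The left-merge-then-drop-NaN steps above can be condensed with pandas' built-in merge `indicator` parameter, which avoids the `group_x`/`group_y` suffix cleanup entirely because we merge only on the key columns. A sketch of this anti-join variant, using the question's sample data:

```python
import pandas as pd

df = pd.DataFrame({
    "timepoint": [0, 1, 2, 3] * 6,
    "measurement": [1] * 12 + [2] * 4 + [3] * 8,
    "roi": [1] * 4 + [2] * 4 + [3] * 4 + [1] * 4 + [1] * 4 + [2] * 4,
    "value": [0.1, 0.2, 0.3, 0.4, 0.1, 0.2, 0.3, 0.4,
              0.5, 0.6, 0.8, 0.9, 0.1, 0.2, 0.3, 0.4,
              0.5, 0.6, 0.8, 0.9, 0.1, 0.2, 0.3, 0.4],
    "group": "control",
})

threshold = 0.4
t0 = df[(df["timepoint"] == 0) & (df["value"] > threshold)]

# left merge on the key pair only; indicator=True adds a "_merge"
# column telling us whether each row found a match in t0
merged = df.merge(
    t0[["measurement", "roi"]].drop_duplicates(),
    how="left", on=["measurement", "roi"], indicator=True,
)
df_filtered = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
```

Rows whose (measurement, roi) pair appears in `t0` are flagged "both" and dropped; everything else is "left_only" and kept, so no suffixed columns ever appear.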

As simple as this:

In:

df[(df["measurement"] != 1) | (df["roi"] != 3)]

Out:

timepoint   measurement     roi     value   group
0   0   1   1   0.1     control
1   1   1   1   0.2     control
2   2   1   1   0.3     control
3   3   1   1   0.4     control
4   0   1   2   0.1     control
5   1   1   2   0.2     control
6   2   1   2   0.3     control
7   3   1   2   0.4     control
12  0   2   1   0.1     control
13  1   2   1   0.2     control
14  2   2   1   0.3     control
15  3   2   1   0.4     control
16  0   3   1   0.5     control
17  1   3   1   0.6     control
18  2   3   1   0.8     control
19  3   3   1   0.9     control
20  0   3   2   0.1     control
21  1   3   2   0.2     control
22  2   3   2   0.3     control
23  3   3   2   0.4     control

This happens because of the logic of the condition. If you write "measurement is not 1 AND roi is not 3", then by De Morgan's laws that is the same as "NOT (measurement is 1 OR roi is 3)", which removes every row that has either the measurement 1 or the roi 3.

You must use "measurement is not 1 OR roi is not 3", which is the same as "NOT (measurement is 1 AND roi is 3)", so only the rows where both values match are removed.

Hope this helped. It is a one-liner.

Edit: To remove both the 1: 3 and the 3: 1 pair, combine the two OR conditions with AND:

df[((df["measurement"] != 1) | (df["roi"] != 3)) & ((df["measurement"] != 3) | (df["roi"] != 1))]
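With many pre-activated pairs, writing these conditions out by hand gets unwieldy. A sketch of building the same combined mask in a loop (the `pairs` list is a stand-in for whatever (measurement, roi) tuples you want to drop):

```python
import numpy as np
import pandas as pd

# toy frame: one row per (measurement, roi) pair for brevity
df = pd.DataFrame({
    "measurement": [1, 1, 1, 2, 3, 3],
    "roi":         [1, 2, 3, 1, 1, 2],
})

pairs = [(1, 3), (3, 1)]  # (measurement, roi) pairs to drop

# AND together one "(measurement != m) | (roi != r)" term per pair
mask = np.ones(len(df), dtype=bool)
for m, r in pairs:
    mask &= (df["measurement"] != m) | (df["roi"] != r)

df_kept = df[mask]
```

Each loop iteration removes exactly one pair from the mask, so the result is identical to the hand-written chain of OR terms joined by AND.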

Edit 2: To drop the filtered rows directly, you can invert the original filter condition:

In:

threshold = 0.4
full_activated = df5[(df5['timepoint'] != 0) | (df5['value'] <= threshold)]
full_activated

Out:

    timepoint   measurement     roi     value   group
0   0   1   1   0.1     control
1   1   1   1   0.2     control
2   2   1   1   0.3     control
3   3   1   1   0.4     control
4   0   1   2   0.1     control
5   1   1   2   0.2     control
6   2   1   2   0.3     control
7   3   1   2   0.4     control
9   1   1   3   0.6     control
10  2   1   3   0.8     control
11  3   1   3   0.9     control
12  0   2   1   0.1     control
13  1   2   1   0.2     control
14  2   2   1   0.3     control
15  3   2   1   0.4     control
17  1   3   1   0.6     control
18  2   3   1   0.8     control
19  3   3   1   0.9     control
20  0   3   2   0.1     control
21  1   3   2   0.2     control
22  2   3   2   0.3     control
23  3   3   2   0.4     control

Edit 3:

Multiple conditions

threshold = 0.4
full_activated = df5[
    ((df5['timepoint'] != 0) | (df5['value'] <= threshold))
    & ((df5["measurement"] != 1) | (df5["roi"] != 3))
    & ((df5["measurement"] != 3) | (df5["roi"] != 1))
    & ((df5["measurement"] != 1) | (df5["roi"] != 1))
]
full_activated

Output:

timepoint   measurement     roi     value   group
4   0   1   2   0.1     control
5   1   1   2   0.2     control
6   2   1   2   0.3     control
7   3   1   2   0.4     control
12  0   2   1   0.1     control
13  1   2   1   0.2     control
14  2   2   1   0.3     control
15  3   2   1   0.4     control
20  0   3   2   0.1     control
21  1   3   2   0.2     control
22  2   3   2   0.3     control
23  3   3   2   0.4     control

Thanks to @Jose A. Jimenez and @Vioxini for their answers. I went with Jose's suggestion, and it gives me the output I wanted. I further improved performance using dask.

inputdf.shape
(73124, 5)

Using only pandas:

import pandas as pd

threshold = 0.4
pre_activated_t0 = inputdf[(inputdf['timepoint'] == 0) & (inputdf['value'] > threshold)]
pre_activated = inputdf.merge(pre_activated_t0[["measurement", "roi"]], how="inner", on=["measurement", "roi"])

filtereddf = inputdf.merge(
    pre_activated,
    how="left",
    on=["timepoint", "measurement", "roi", "value"],
)
filtereddf = filtereddf[pd.isna(filtereddf["group_y"])]
filtereddf.drop("group_y", axis=1, inplace=True)
filtereddf.columns = list(inputdf.columns)

This takes 2 min 9 s.

Now with dask:

import dask.dataframe as dd
threshold = 0.4
pre_activated_t0 = inputdf[(inputdf['timepoint'] == 0) & (inputdf['value'] > threshold)]   
pre_activated = inputdf.merge(pre_activated_t0[["measurement", "roi"]], how="inner", on=["measurement", "roi"])

input_dd = dd.from_pandas(inputdf, npartitions=3)
pre_dd = dd.from_pandas(pre_activated, npartitions=3)

merger = dd.merge(input_dd,pre_dd, how="left", on=["timepoint", "measurement", "roi", "value"])
filtereddf = merger.compute()
filtereddf = filtereddf[pd.isna(filtereddf["group_y"])] 
filtereddf.drop("group_y", axis=1, inplace=True)
filtereddf.columns = list(inputdf.columns)

This now takes only 42.6 s :-)

It is my first time using dask so there might be some options I am not aware of that could further improve speed, but for now that is ok.

Thanks again for the help!

Edit:

I played around with the npartitions option when converting the pandas dataframe to a dask dataframe; increasing it from npartitions=3 to npartitions=30 further improves performance: it now takes only 9.87 seconds.
