Pandas: selecting multiple rows based on column pairs

Question

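I am currently converting my data analysis pipeline from wide format to tidy/long format, and I have run into a filtering problem that I just can't wrap my head around.

My data (simplified) looks like this (microscopy intensity data): in each measurement of a group I have several regions of interest (= roi), and I look at their intensity (= value) at several time points.

An roi is basically a single cell in the microscopy image. I am following the intensity (= value) over time (= timepoint). I repeat the experiment several times (= measurement), looking at several cells (= roi) each time.

My goal is to filter out, for all time points, every measurement/ROI pair whose intensity at timepoint 0 is above a threshold I set (I consider these ROIs pre-activated).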

import pandas as pd

data = {
    "timepoint": [0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3],
    "measurement": [1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,3,3,3,3,3,3,3,3],
    "roi": [1,1,1,1,2,2,2,2,3,3,3,3,1,1,1,1,1,1,1,1,2,2,2,2],
    "value": [0.1,0.2,0.3,0.4,0.1,0.2,0.3,0.4,0.5,0.6,0.8,0.9,0.1,0.2,0.3,0.4,0.5,0.6,0.8,0.9,0.1,0.2,0.3,0.4],
    "group": "control",  # scalar, broadcast to every row
}
df = pd.DataFrame(data)
df

returns

  timepoint     measurement     roi     value   group
0   0                 1          1       0.1    control
1   1                 1          1       0.2    control
2   2                 1          1       0.3    control
3   3                 1          1       0.4    control
4   0                 1          2       0.1    control
5   1                 1          2       0.2    control
6   2                 1          2       0.3    control
7   3                 1          2       0.4    control
8   0                 1          3       0.5    control
9   1                 1          3       0.6    control
10  2                 1          3       0.8    control
11  3                 1          3       0.9    control
12  0                 2          1       0.1    control
13  1                 2          1       0.2    control
14  2                 2          1       0.3    control
15  3                 2          1       0.4    control
16  0                 3          1       0.5    control
17  1                 3          1       0.6    control
18  2                 3          1       0.8    control
19  3                 3          1       0.9    control
20  0                 3          2       0.1    control
21  1                 3          2       0.2    control
22  2                 3          2       0.3    control
23  3                 3          2       0.4    control

Now I can select the rows of the ROIs whose value at timepoint 0 is above my threshold:

    threshold = 0.4
    pre_activated = df.loc[(df['timepoint'] == 0) & (df['value'] > threshold)]
    pre_activated

returns

    timepoint  measurement  roi  value    group
8           0            1    3    0.5  control
16          0            3    1    0.5  control

Now I want to filter those cells (e.g. measurement 1, roi 3) out of the original DataFrame df for all time points 0 to 3, and this is where I am stuck.

If I use .isin

df.loc[~(df['measurement'].isin(pre_activated["measurement"]) & df['roi'].isin(pre_activated["roi"]))]

I get close, but everything for the measurement 1 / roi 1 pair is missing as well (so I think the problem is in the conditional expression):

   timepoint       measurement    roi    value      group
4   0                   1          2      0.1       control
5   1                   1          2      0.2       control
6   2                   1          2      0.3       control
7   3                   1          2      0.4       control
12  0                   2          1      0.1       control
13  1                   2          1      0.2       control
14  2                   2          1      0.3       control
15  3                   2          1      0.4       control
20  0                   3          2      0.1       control
21  1                   3          2      0.2       control
22  2                   3          2      0.3       control
23  3                   3          2      0.4       control
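A likely reason for the missing measurement 1 / roi 1 rows: the two .isin calls are evaluated one column at a time, so every row whose measurement is in {1, 3} and whose roi is in {3, 1} matches, and that includes the pair (1, 1). A minimal sketch of a pairwise check, rebuilding the example df so that it runs on its own (the names keys, row_keys and is_pre are only illustrative):

```python
import pandas as pd

# Same example data as above, written compactly.
df = pd.DataFrame({
    "timepoint": [0, 1, 2, 3] * 6,
    "measurement": [1] * 12 + [2] * 4 + [3] * 8,
    "roi": [1] * 4 + [2] * 4 + [3] * 4 + [1] * 8 + [2] * 4,
    "value": [0.1, 0.2, 0.3, 0.4, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9,
              0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9, 0.1, 0.2, 0.3, 0.4],
    "group": "control",
})

threshold = 0.4
pre_activated = df[(df["timepoint"] == 0) & (df["value"] > threshold)]

# Treat (measurement, roi) as one composite key instead of two separate columns.
keys = pd.MultiIndex.from_frame(pre_activated[["measurement", "roi"]])
row_keys = pd.MultiIndex.from_frame(df[["measurement", "roi"]])
is_pre = row_keys.isin(keys)  # boolean array, True only where the whole pair matches

df_pre_activated = df[is_pre]
df_filtered = df[~is_pre]
print(df_filtered)
```

Because the mask is applied to df directly, the original index and dtypes are kept.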

I know I can use .query for at least one measurement/roi pair

df[~df.isin(df.query('measurement == 1 & roi == 3'))]

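which gets somewhat closer, although all integers are converted to floats. Also, the "group" column is now NaN for the masked rows, which will become difficult once each DataFrame holds several groups with several measurements and ROIs.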

    timepoint  measurement  roi  value    group
0         0.0          1.0  1.0    0.1  control
1         1.0          1.0  1.0    0.2  control
2         2.0          1.0  1.0    0.3  control
3         3.0          1.0  1.0    0.4  control
4         0.0          1.0  2.0    0.1  control
5         1.0          1.0  2.0    0.2  control
6         2.0          1.0  2.0    0.3  control
7         3.0          1.0  2.0    0.4  control
8         NaN          NaN  NaN    NaN      NaN
9         NaN          NaN  NaN    NaN      NaN
10        NaN          NaN  NaN    NaN      NaN
11        NaN          NaN  NaN    NaN      NaN
12        0.0          2.0  1.0    0.1  control
13        1.0          2.0  1.0    0.2  control
14        2.0          2.0  1.0    0.3  control
15        3.0          2.0  1.0    0.4  control
16        0.0          3.0  1.0    0.5  control
17        1.0          3.0  1.0    0.6  control
18        2.0          3.0  1.0    0.8  control
19        3.0          3.0  1.0    0.9  control
20        0.0          3.0  2.0    0.1  control
21        1.0          3.0  2.0    0.2  control
22        2.0          3.0  2.0    0.3  control
23        3.0          3.0  2.0    0.4  control

I tried using a dictionary that stores measurement: roi pairs to avoid any mix-ups, but I don't know whether this is useful:

msmt_list = pre_activated["measurement"].values
roi_list = pre_activated["roi"].values

mydict={}
for i in range(len(msmt_list)):
    mydict[msmt_list[i]]=roi_list[i]

output

   mydict
    {1: 3, 3: 1}
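One caveat with the dictionary: a measurement can appear only once as a key, so a second pre-activated ROI in the same measurement would overwrite the first. A small sketch with a hypothetical pre_activated frame (the extra pair (1, 2) is made up for illustration) shows the difference between a dict and a set of tuples:

```python
import pandas as pd

# Hypothetical case: measurement 1 has two pre-activated ROIs (2 and 3).
pre_activated = pd.DataFrame({"measurement": [1, 1, 3], "roi": [2, 3, 1]})

msmt_list = pre_activated["measurement"].tolist()
roi_list = pre_activated["roi"].tolist()

as_dict = dict(zip(msmt_list, roi_list))   # later entries overwrite earlier keys
as_pairs = set(zip(msmt_list, roi_list))   # every (measurement, roi) pair is kept

print(as_dict)            # {1: 3, 3: 1}
print(sorted(as_pairs))   # [(1, 2), (1, 3), (3, 1)]
```

The pair (1, 2) is silently lost in the dict, which is why a set (or list) of tuples seems the safer container here.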

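What is the best way to achieve what I want? I would appreciate any input, also with respect to efficiency, since I usually deal with 3-4 groups with 4-8 measurements, up to 200 ROIs each, and typically 360 time points.

Thanks!

Edit: just to clarify what my desired output DataFrames should look like

'df_pre_activated' (the "roi"s whose value at timepoint 0 is above my threshold)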

  timepoint     measurement     roi     value   group
8   0                 1          3       0.5    control
9   1                 1          3       0.6    control
10  2                 1          3       0.8    control
11  3                 1          3       0.9    control
16  0                 3          1       0.5    control
17  1                 3          1       0.6    control
18  2                 3          1       0.8    control
19  3                 3          1       0.9    control

"df_filtered" (basically the initial "df" without the data in "df_pre_activated" shown above)

      timepoint     measurement     roi     value   group
0   0                 1          1       0.1    control
1   1                 1          1       0.2    control
2   2                 1          1       0.3    control
3   3                 1          1       0.4    control
4   0                 1          2       0.1    control
5   1                 1          2       0.2    control
6   2                 1          2       0.3    control
7   3                 1          2       0.4    control
12  0                 2          1       0.1    control
13  1                 2          1       0.2    control
14  2                 2          1       0.3    control
15  3                 2          1       0.4    control
20  0                 3          2       0.1    control
21  1                 3          2       0.2    control
22  2                 3          2       0.3    control
23  3                 3          2       0.4    control

Answer 1

The solution is as follows:

First, we compute df_pre_activated_t0 by filtering df with the condition:

threshold = 0.4
df_pre_activated_t0 = df[(df['timepoint'] == 0) & (df['value'] > threshold)]

df_pre_activated_t0 looks like this:

    timepoint  measurement  roi  value    group
8           0            1    3    0.5  control
16          0            3    1    0.5  control

We compute df_pre_activated by merging df with df_pre_activated_t0 (inner merge):

df_pre_activated = df.merge(
    df_pre_activated_t0[["measurement", "roi"]], how="inner", on=["measurement", "roi"]
)

df_pre_activated looks like this:

   timepoint  measurement  roi  value    group
0          0            1    3    0.5  control
1          1            1    3    0.6  control
2          2            1    3    0.8  control
3          3            1    3    0.9  control
4          0            3    1    0.5  control
5          1            3    1    0.6  control
6          2            3    1    0.8  control
7          3            3    1    0.9  control

To compute df_filtered (df without the rows of df_pre_activated), we do a left merge between df and df_pre_activated and keep the rows that have no match in df_pre_activated:

df_filtered = df.merge(
    df_pre_activated,
    how="left",
    on=["timepoint", "measurement", "roi", "value"]
)

df_filtered = df_filtered[pd.isna(df_filtered["group_y"])]

df_filtered looks like this:

    timepoint  measurement  roi  value  group_x group_y
0           0            1    1    0.1  control     NaN
1           1            1    1    0.2  control     NaN
2           2            1    1    0.3  control     NaN
3           3            1    1    0.4  control     NaN
4           0            1    2    0.1  control     NaN
5           1            1    2    0.2  control     NaN
6           2            1    2    0.3  control     NaN
7           3            1    2    0.4  control     NaN
12          0            2    1    0.1  control     NaN
13          1            2    1    0.2  control     NaN
14          2            2    1    0.3  control     NaN
15          3            2    1    0.4  control     NaN
20          0            3    2    0.1  control     NaN
21          1            3    2    0.2  control     NaN
22          2            3    2    0.3  control     NaN
23          3            3    2    0.4  control     NaN

Finally, we drop the group_y column and set the column names back to their original values:

df_filtered.drop("group_y", axis=1, inplace=True)
df_filtered.columns = list(df.columns)

df_filtered looks like this:

    timepoint  measurement  roi  value    group
0           0            1    1    0.1  control
1           1            1    1    0.2  control
2           2            1    1    0.3  control
3           3            1    1    0.4  control
4           0            1    2    0.1  control
5           1            1    2    0.2  control
6           2            1    2    0.3  control
7           3            1    2    0.4  control
12          0            2    1    0.1  control
13          1            2    1    0.2  control
14          2            2    1    0.3  control
15          3            2    1    0.4  control
20          0            3    2    0.1  control
21          1            3    2    0.2  control
22          2            3    2    0.3  control
23          3            3    2    0.4  control
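As a variation on the left merge above (not part of the original answer), merging only on the measurement/roi keys with indicator=True avoids the group_x/group_y suffixes and the float comparison on value. This is a sketch that rebuilds the example df so it runs standalone; keys, merged and is_pre are illustrative names:

```python
import pandas as pd

df = pd.DataFrame({
    "timepoint": [0, 1, 2, 3] * 6,
    "measurement": [1] * 12 + [2] * 4 + [3] * 8,
    "roi": [1] * 4 + [2] * 4 + [3] * 4 + [1] * 8 + [2] * 4,
    "value": [0.1, 0.2, 0.3, 0.4, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9,
              0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9, 0.1, 0.2, 0.3, 0.4],
    "group": "control",
})

threshold = 0.4
# One row per pre-activated (measurement, roi) pair.
keys = df.loc[
    (df["timepoint"] == 0) & (df["value"] > threshold), ["measurement", "roi"]
].drop_duplicates()

# indicator=True adds a "_merge" column: "both" for matched rows, "left_only" otherwise.
# validate="many_to_one" raises if keys is not unique, so the row count cannot grow.
merged = df.merge(
    keys, how="left", on=["measurement", "roi"],
    indicator=True, validate="many_to_one",
)

# A left merge keeps the left row order, so the mask lines up with df by position.
is_pre = (merged["_merge"] == "both").to_numpy()
df_pre_activated = df[is_pre]
df_filtered = df[~is_pre]
print(df_filtered)
```

Applying the mask to df itself means no columns need to be dropped or renamed afterwards.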

Answer 2

Like this:

In:

df[(df["measurement"] != 1) | (df["roi"] != 3)]

Out:

    timepoint  measurement  roi  value    group
0           0            1    1    0.1  control
1           1            1    1    0.2  control
2           2            1    1    0.3  control
3           3            1    1    0.4  control
4           0            1    2    0.1  control
5           1            1    2    0.2  control
6           2            1    2    0.3  control
7           3            1    2    0.4  control
12          0            2    1    0.1  control
13          1            2    1    0.2  control
14          2            2    1    0.3  control
15          3            2    1    0.4  control
16          0            3    1    0.5  control
17          1            3    1    0.6  control
18          2            3    1    0.8  control
19          3            3    1    0.9  control
20          0            3    2    0.1  control
21          1            3    2    0.2  control
22          2            3    2    0.3  control
23          3            3    2    0.4  control

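This comes down to boolean logic. You are thinking: "show me the DataFrame where a is not 1 and b is not 3". That is the same as "show me the DataFrame where not (a is 1 or b is 3)", which removes every row with a = 1 and every row with b = 3 from the DataFrame.

You have to use "a is not 1 or b is not 3", which is the same as "not (a is 1 and b is 3)".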
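The two equivalences above are De Morgan's laws, and they can be checked directly on a small frame. The data below is a made-up toy example, with a and b standing for the measurement and roi columns:

```python
import pandas as pd

# Toy data covering the interesting combinations of (measurement, roi).
df = pd.DataFrame({"measurement": [1, 1, 3, 3, 2], "roi": [1, 3, 1, 3, 3]})
a, b = df["measurement"], df["roi"]

# "a is not 1 AND b is not 3" == "not (a is 1 OR b is 3)": drops too much.
both_differ = (a != 1) & (b != 3)
not_either = ~((a == 1) | (b == 3))

# "a is not 1 OR b is not 3" == "not (a is 1 AND b is 3)": drops only the pair (1, 3).
either_differs = (a != 1) | (b != 3)
not_pair = ~((a == 1) & (b == 3))

print(df[both_differ])     # only the (3, 1) row survives
print(df[either_differs])  # every row except (1, 3) survives
```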

Hope this helps. All in one line.

Edit: to remove both 1:3 and 3:1, combine the two OR conditions with AND:

df[((df["measurement"] != 1) | (df["roi"] != 3)) & ((df["measurement"] != 3) | (df["roi"] != 1))]
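With more than a couple of pairs, writing these conditions by hand gets tedious. One possible way to build the same mask from the pre-activated pairs is to combine the per-pair conditions with functools.reduce. This is only a sketch on the example data; t0_hits, pairs and keep are illustrative names:

```python
from functools import reduce
import operator

import pandas as pd

df = pd.DataFrame({
    "timepoint": [0, 1, 2, 3] * 6,
    "measurement": [1] * 12 + [2] * 4 + [3] * 8,
    "roi": [1] * 4 + [2] * 4 + [3] * 4 + [1] * 8 + [2] * 4,
    "value": [0.1, 0.2, 0.3, 0.4, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9,
              0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9, 0.1, 0.2, 0.3, 0.4],
    "group": "control",
})

threshold = 0.4
t0_hits = df[(df["timepoint"] == 0) & (df["value"] > threshold)]
pairs = list(zip(t0_hits["measurement"].tolist(), t0_hits["roi"].tolist()))

# One "not this pair" condition per (measurement, roi), as in the hand-written version.
conditions = [(df["measurement"] != m) | (df["roi"] != r) for m, r in pairs]

# AND them together; the all-True start value also covers an empty list of pairs.
keep = reduce(operator.and_, conditions, pd.Series(True, index=df.index))
df_filtered = df[keep]
print(df_filtered)
```

Each pair adds one pass over the columns, so for hundreds of pairs a merge or a composite-key lookup is probably the better fit.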

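Edit 2: to remove the matching rows directly, filter with the inverse of the condition instead of selecting first and deleting afterwards. Note that this drops only the timepoint-0 rows above the threshold; the later time points of those ROIs remain, as the output shows.

In:

threshold = 0.4
# Inverse of (timepoint == 0) & (value > threshold)
full_activated = df[(df['timepoint'] != 0) | (df['value'] <= threshold)]
full_activated

Out: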

    timepoint  measurement  roi  value    group
0           0            1    1    0.1  control
1           1            1    1    0.2  control
2           2            1    1    0.3  control
3           3            1    1    0.4  control
4           0            1    2    0.1  control
5           1            1    2    0.2  control
6           2            1    2    0.3  control
7           3            1    2    0.4  control
9           1            1    3    0.6  control
10          2            1    3    0.8  control
11          3            1    3    0.9  control
12          0            2    1    0.1  control
13          1            2    1    0.2  control
14          2            2    1    0.3  control
15          3            2    1    0.4  control
17          1            3    1    0.6  control
18          2            3    1    0.8  control
19          3            3    1    0.9  control
20          0            3    2    0.1  control
21          1            3    2    0.2  control
22          2            3    2    0.3  control
23          3            3    2    0.4  control

Edit 3:

Multiple conditions

threshold = 0.4
full_activated = df[
    ((df['timepoint'] != 0) | (df['value'] <= threshold))
    & ((df["measurement"] != 1) | (df["roi"] != 3))
    & ((df["measurement"] != 3) | (df["roi"] != 1))
    & ((df["measurement"] != 1) | (df["roi"] != 1))
]
full_activated

Output:

    timepoint  measurement  roi  value    group
4           0            1    2    0.1  control
5           1            1    2    0.2  control
6           2            1    2    0.3  control
7           3            1    2    0.4  control
12          0            2    1    0.1  control
13          1            2    1    0.2  control
14          2            2    1    0.3  control
15          3            2    1    0.4  control
20          0            3    2    0.1  control
21          1            3    2    0.2  control
22          2            3    2    0.3  control
23          3            3    2    0.4  control

Answer 3

Thanks @Jose A. Jimenez and @Vioxini for your answers. I went with Jose's suggestion, which gave me the output I wanted. I further improved the performance by using dask.

inputdf.shape
(73124, 5)

Using pandas only:

import pandas as pd
threshold = 0.4
pre_activated_t0 = inputdf[(inputdf['timepoint'] == 0) & (inputdf['value'] > threshold)]
    
pre_activated = inputdf.merge(pre_activated_t0[["measurement", "roi"]], how="inner", on=["measurement", "roi"])
filtereddf = inputdf.merge(
    pre_activated,
    how="left",
    on=["timepoint", "measurement", "roi", "value"],  
    )
filtereddf = filtereddf[pd.isna(filtereddf["group_y"])]
filtereddf.drop("group_y", axis=1, inplace=True)
filtereddf.columns = list(inputdf.columns)

This takes 2 min 9 s.

Now with dask:

import pandas as pd  # still needed for pd.isna below
import dask.dataframe as dd

threshold = 0.4
pre_activated_t0 = inputdf[(inputdf['timepoint'] == 0) & (inputdf['value'] > threshold)]
pre_activated = inputdf.merge(pre_activated_t0[["measurement", "roi"]], how="inner", on=["measurement", "roi"])

# Convert both frames to dask DataFrames split into 3 partitions
input_dd = dd.from_pandas(inputdf, npartitions=3)
pre_dd = dd.from_pandas(pre_activated, npartitions=3)

# The merge is lazy; compute() runs it and returns a pandas DataFrame
merger = dd.merge(input_dd, pre_dd, how="left", on=["timepoint", "measurement", "roi", "value"])
filtereddf = merger.compute()
filtereddf = filtereddf[pd.isna(filtereddf["group_y"])]
filtereddf.drop("group_y", axis=1, inplace=True)
filtereddf.columns = list(inputdf.columns)

Now it only takes 42.6 s :-)

This is my first time using dask, so there may be options I don't know about that could speed it up further, but it's fine for now.
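Before reaching for dask, it may also be worth timing a variant that avoids the merge altogether and only computes a boolean mask per ROI. The sketch below is untested on the real data and makes no claim about speed; it uses made-up random data with reduced sizes, assumes that "group" should be part of the key when several groups share one DataFrame, and all names other than inputdf, threshold, filtereddf and pre_activated are illustrative:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for inputdf: every combination of group/measurement/roi/timepoint.
rng = np.random.default_rng(0)
n_groups, n_meas, n_rois, n_tp = 3, 4, 50, 36
index = pd.MultiIndex.from_product(
    [[f"group{g}" for g in range(n_groups)],
     range(1, n_meas + 1), range(1, n_rois + 1), range(n_tp)],
    names=["group", "measurement", "roi", "timepoint"],
)
inputdf = index.to_frame(index=False)
inputdf["value"] = rng.random(len(inputdf))

threshold = 0.4
key_cols = ["group", "measurement", "roi"]

# Keep the value only at timepoint 0 (NaN elsewhere), then broadcast it to
# every row of the same ROI; "max" ignores the NaN entries.
t0 = inputdf["value"].where(inputdf["timepoint"] == 0)
t0_per_roi = t0.groupby([inputdf[c] for c in key_cols]).transform("max")

is_pre = t0_per_roi > threshold
pre_activated = inputdf[is_pre]
filtereddf = inputdf[~is_pre]
print(len(inputdf), len(pre_activated), len(filtereddf))
```

An ROI without a timepoint-0 row ends up with NaN and is therefore kept, which may or may not be the desired behaviour.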

Thanks again for your help!

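Edit:

I experimented with the npartitions option when converting the pandas DataFrame to a dask DataFrame and raised it from 3 to npartitions=30, which brought the run time down further. The new figure is given as 380; the unit appears to be milliseconds, since it is reported as an improvement over 42.6 s.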
