Pandas: Selecting multiple rows based on column pair
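I am currently converting my data analysis pipeline from wide format to tidy/long format, and I have run into a filtering problem that I just can't wrap my head around.
My (simplified) data looks like this (microscopy intensity data): each measurement within a group contains several regions of interest (roi), and for each of them I look at the intensity (value) at several time points.
An roi is basically a single cell in a microscopy image. I track how its intensity (value) changes over time (timepoint). I repeat the experiment several times (measurement), and each time I look at several cells (roi).
My goal is to filter out, across all time points, every measurement/ROI pair whose intensity at timepoint 0 is above a threshold I set (I consider these ROIs pre-activated).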
import pandas as pd

data = {
    "timepoint": [0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3],
    "measurement": [1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,3,3,3,3,3,3,3,3],
    "roi": [1,1,1,1,2,2,2,2,3,3,3,3,1,1,1,1,1,1,1,1,2,2,2,2],
    "value": [0.1,0.2,0.3,0.4,0.1,0.2,0.3,0.4,0.5,0.6,0.8,0.9,0.1,0.2,0.3,0.4,0.5,0.6,0.8,0.9,0.1,0.2,0.3,0.4],
    "group": "control"  # scalar, broadcast to every row
}
df = pd.DataFrame(data)
df
returns
timepoint measurement roi value group
0 0 1 1 0.1 control
1 1 1 1 0.2 control
2 2 1 1 0.3 control
3 3 1 1 0.4 control
4 0 1 2 0.1 control
5 1 1 2 0.2 control
6 2 1 2 0.3 control
7 3 1 2 0.4 control
8 0 1 3 0.5 control
9 1 1 3 0.6 control
10 2 1 3 0.8 control
11 3 1 3 0.9 control
12 0 2 1 0.1 control
13 1 2 1 0.2 control
14 2 2 1 0.3 control
15 3 2 1 0.4 control
16 0 3 1 0.5 control
17 1 3 1 0.6 control
18 2 3 1 0.8 control
19 3 3 1 0.9 control
20 0 3 2 0.1 control
21 1 3 2 0.2 control
22 2 3 2 0.3 control
23 3 3 2 0.4 control
Now I can select the rows of the ROIs whose value at timepoint 0 is above my threshold:
threshold = 0.4
pre_activated = df.loc[(df['timepoint'] == 0) & (df['value'] > threshold)]
pre_activated
returns
timepoint measurement roi value group
8 0 1 3 0.5 control
16 0 3 1 0.5 control
Now I would like to filter those cells (e.g. measurement 1, roi 3) out of the original dataframe df for all time points 0 to 3, and this is where I'm stuck.
If I use .isin:
df.loc[~(df['measurement'].isin(pre_activated["measurement"]) & df['roi'].isin(pre_activated["roi"]))]
I get close, but everything for the measurement 1 / roi 1 pair is missing as well (so I think the problem lies in the conditional expression):
timepoint measurement roi value group
4 0 1 2 0.1 control
5 1 1 2 0.2 control
6 2 1 2 0.3 control
7 3 1 2 0.4 control
12 0 2 1 0.1 control
13 1 2 1 0.2 control
14 2 2 1 0.3 control
15 3 2 1 0.4 control
20 0 3 2 0.1 control
21 1 3 2 0.2 control
22 2 3 2 0.3 control
23 3 3 2 0.4 control
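The (1, 1) pair disappears because each .isin call tests one column on its own: measurement 1 is in {1, 3} and roi 1 is in {3, 1}, so the row passes both tests although the pair (1, 1) was never flagged. Below is a minimal sketch of a pair-wise membership test, assuming the example data from above. The names keys, flagged and pair_wise are made up for illustration.

```python
import pandas as pd

# Same example data as in the question, written compactly
df = pd.DataFrame({
    "timepoint": [0, 1, 2, 3] * 6,
    "measurement": [1] * 12 + [2] * 4 + [3] * 8,
    "roi": [1] * 4 + [2] * 4 + [3] * 4 + [1] * 8 + [2] * 4,
    "value": [0.1, 0.2, 0.3, 0.4, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9,
              0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9, 0.1, 0.2, 0.3, 0.4],
    "group": "control",
})

threshold = 0.4
pre_activated = df.loc[(df["timepoint"] == 0) & (df["value"] > threshold)]

# Column-wise test from the question: each column is checked independently
column_wise = (df["measurement"].isin(pre_activated["measurement"])
               & df["roi"].isin(pre_activated["roi"]))

# Pair-wise test: treat (measurement, roi) as one key
keys = ["measurement", "roi"]  # illustrative name
flagged = list(pre_activated[keys].itertuples(index=False, name=None))
pair_wise = pd.MultiIndex.from_frame(df[keys]).isin(flagged)

df_pre_activated = df[pair_wise]   # all time points of the flagged ROIs
df_filtered = df[~pair_wise]       # everything else
```

If the real data holds several groups, "group" would presumably have to be added to keys so that equal measurement/roi numbers in different groups are not mixed up.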
I know that I can use .query for at least a single measurement and roi pair:
df[~df.isin(df.query('measurement == 1 & roi == 3'))]
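which gets somewhat close, although all integers are converted to floats. In addition, the rows in question are not dropped but filled with NaN, including the "group" column, which will become difficult once a dataframe holds multiple groups, each with several measurements and ROIs: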
timepoint measurement roi value group
0 0.0 1.0 1.0 0.1 control
1 1.0 1.0 1.0 0.2 control
2 2.0 1.0 1.0 0.3 control
3 3.0 1.0 1.0 0.4 control
4 0.0 1.0 2.0 0.1 control
5 1.0 1.0 2.0 0.2 control
6 2.0 1.0 2.0 0.3 control
7 3.0 1.0 2.0 0.4 control
8 NaN NaN NaN NaN NaN
9 NaN NaN NaN NaN NaN
10 NaN NaN NaN NaN NaN
11 NaN NaN NaN NaN NaN
12 0.0 2.0 1.0 0.1 control
13 1.0 2.0 1.0 0.2 control
14 2.0 2.0 1.0 0.3 control
15 3.0 2.0 1.0 0.4 control
16 0.0 3.0 1.0 0.5 control
17 1.0 3.0 1.0 0.6 control
18 2.0 3.0 1.0 0.8 control
19 3.0 3.0 1.0 0.9 control
20 0.0 3.0 2.0 0.1 control
21 1.0 3.0 2.0 0.2 control
22 2.0 3.0 2.0 0.3 control
23 3.0 3.0 2.0 0.4 control
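The NaN rows and the float conversion follow from how this expression works: DataFrame.isin with a DataFrame argument matches on index and column labels, and indexing with the resulting boolean DataFrame masks cells instead of removing rows. NaN then forces the integer columns to float. A minimal sketch that removes the rows instead, assuming the index of df is unique (unwanted and df_without are illustrative names):

```python
import pandas as pd

# Same example data as in the question, written compactly
df = pd.DataFrame({
    "timepoint": [0, 1, 2, 3] * 6,
    "measurement": [1] * 12 + [2] * 4 + [3] * 8,
    "roi": [1] * 4 + [2] * 4 + [3] * 4 + [1] * 8 + [2] * 4,
    "value": [0.1, 0.2, 0.3, 0.4, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9,
              0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9, 0.1, 0.2, 0.3, 0.4],
    "group": "control",
})

# Rows of the unwanted pair, selected with .query as in the question
unwanted = df.query("measurement == 1 and roi == 3")

# Drop those rows by index label; dtypes stay untouched because no NaN is introduced
df_without = df.drop(index=unwanted.index)
print(df_without.dtypes)
```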
I tried using a dictionary that stores measurement: roi pairs to avoid any mix-up, but I don't know whether that is useful:
msmt_list = pre_activated["measurement"].values
roi_list = pre_activated["roi"].values
mydict={}
for i in range(len(msmt_list)):
mydict[msmt_list[i]]=roi_list[i]
output
mydict
{1: 3, 3: 1}
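One caveat with a dictionary keyed by measurement: it can hold only one roi per measurement, so a second pre-activated ROI in the same measurement would overwrite the first. A small sketch of the difference, using a hypothetical pre_activated frame (not the data above) in which measurement 1 has two flagged ROIs:

```python
import pandas as pd

# Hypothetical timepoint-0 rows: measurement 1 has TWO flagged ROIs (2 and 3)
pre_activated = pd.DataFrame({"measurement": [1, 1, 3], "roi": [2, 3, 1]})

# A dict keyed by measurement keeps only the last roi seen for each measurement
as_dict = dict(zip(pre_activated["measurement"], pre_activated["roi"]))

# A set of (measurement, roi) tuples keeps every pair
as_pairs = set(zip(pre_activated["measurement"], pre_activated["roi"]))

print(len(as_dict), len(as_pairs))
```

A set (or list) of tuples therefore seems the safer container if the pairs are needed outside pandas.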
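What is the best way to achieve what I want? I would appreciate any input, also regarding efficiency, since I usually work with 3-4 groups of 4-8 measurements each, up to 200 ROIs per group, and typically 360 time points.
Thanks!
Edit: just to clarify what my desired output dataframes should look like:
'df_pre_activated' (these are the "roi"s whose value at timepoint 0 is above my threshold):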
timepoint measurement roi value group
8 0 1 3 0.5 control
9 1 1 3 0.6 control
10 2 1 3 0.8 control
11 3 1 3 0.9 control
16 0 3 1 0.5 control
17 1 3 1 0.6 control
18 2 3 1 0.8 control
19 3 3 1 0.9 control
"df_filtered" (this is basically the initial "df" without the data in "df_pre_activated" shown above):
timepoint measurement roi value group
0 0 1 1 0.1 control
1 1 1 1 0.2 control
2 2 1 1 0.3 control
3 3 1 1 0.4 control
4 0 1 2 0.1 control
5 1 1 2 0.2 control
6 2 1 2 0.3 control
7 3 1 2 0.4 control
12 0 2 1 0.1 control
13 1 2 1 0.2 control
14 2 2 1 0.3 control
15 3 2 1 0.4 control
20 0 3 2 0.1 control
21 1 3 2 0.2 control
22 2 3 2 0.3 control
23 3 3 2 0.4 control
The solution is as follows:
First, we compute df_pre_activated_t0 by filtering df with a condition:
threshold = 0.4
df_pre_activated_t0 = df[(df['timepoint'] == 0) & (df['value'] > threshold)]
df_pre_activated_t0
which looks like this:
timepoint measurement roi value group
8 0 1 3 0.5 control
16 0 3 1 0.5 control
We compute df_pre_activated by merging df with df_pre_activated_t0 (inner merge):
df_pre_activated = df.merge(
df_pre_activated_t0[["measurement", "roi"]], how="inner", on=["measurement", "roi"]
)
df_pre_activated
which looks like this:
timepoint measurement roi value group
0 0 1 3 0.5 control
1 1 1 3 0.6 control
2 2 1 3 0.8 control
3 3 1 3 0.9 control
4 0 3 1 0.5 control
5 1 3 1 0.6 control
6 2 3 1 0.8 control
7 3 3 1 0.9 control
To compute df_filtered (df without the rows of df_pre_activated), we do a left merge between df and df_pre_activated and keep the rows that have no match in df_pre_activated:
df_filtered = df.merge(
df_pre_activated,
how="left",
on=["timepoint", "measurement", "roi", "value"]
)
df_filtered = df_filtered[pd.isna(df_filtered["group_y"])]
df_filtered
which looks like this:
timepoint measurement roi value group_x group_y
0 0 1 1 0.1 control NaN
1 1 1 1 0.2 control NaN
2 2 1 1 0.3 control NaN
3 3 1 1 0.4 control NaN
4 0 1 2 0.1 control NaN
5 1 1 2 0.2 control NaN
6 2 1 2 0.3 control NaN
7 3 1 2 0.4 control NaN
12 0 2 1 0.1 control NaN
13 1 2 1 0.2 control NaN
14 2 2 1 0.3 control NaN
15 3 2 1 0.4 control NaN
20 0 3 2 0.1 control NaN
21 1 3 2 0.2 control NaN
22 2 3 2 0.3 control NaN
23 3 3 2 0.4 control NaN
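Finally, we drop the group_y column and set the column names back to their original values:
df_filtered = df_filtered.drop(columns="group_y")  # reassign instead of inplace on a filtered slice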
df_filtered.columns = list(df.columns)
df_filtered
which looks like this:
timepoint measurement roi value group
0 0 1 1 0.1 control
1 1 1 1 0.2 control
2 2 1 1 0.3 control
3 3 1 1 0.4 control
4 0 1 2 0.1 control
5 1 1 2 0.2 control
6 2 1 2 0.3 control
7 3 1 2 0.4 control
12 0 2 1 0.1 control
13 1 2 1 0.2 control
14 2 2 1 0.3 control
15 3 2 1 0.4 control
20 0 3 2 0.1 control
21 1 3 2 0.2 control
22 2 3 2 0.3 control
23 3 3 2 0.4 control
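A variant of the same idea, offered only as a sketch: merge's indicator=True option adds a _merge column that says where each row came from, so the group_x/group_y bookkeeping is not needed and the float "value" column does not have to serve as a join key. The names keys, flagged and merged are illustrative, and the index reassignment assumes that flagged holds each pair only once, so the left merge returns exactly one row per row of df.

```python
import pandas as pd

# Same example data as in the question, written compactly
df = pd.DataFrame({
    "timepoint": [0, 1, 2, 3] * 6,
    "measurement": [1] * 12 + [2] * 4 + [3] * 8,
    "roi": [1] * 4 + [2] * 4 + [3] * 4 + [1] * 8 + [2] * 4,
    "value": [0.1, 0.2, 0.3, 0.4, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9,
              0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9, 0.1, 0.2, 0.3, 0.4],
    "group": "control",
})

threshold = 0.4
keys = ["measurement", "roi"]

# Unique (measurement, roi) pairs that exceed the threshold at timepoint 0
flagged = df.loc[(df["timepoint"] == 0) & (df["value"] > threshold), keys].drop_duplicates()

# Left merge on the key columns only; _merge is "both" for flagged pairs
merged = df.merge(flagged, on=keys, how="left", indicator=True)
merged.index = df.index  # a left merge keeps the row order of df

df_pre_activated = merged[merged["_merge"] == "both"].drop(columns="_merge")
df_filtered = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
```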
Like this:
In:
df[(df["measurement"] != 1) | (df["roi"] != 3)]
Out:
timepoint measurement roi value group
0 0 1 1 0.1 control
1 1 1 1 0.2 control
2 2 1 1 0.3 control
3 3 1 1 0.4 control
4 0 1 2 0.1 control
5 1 1 2 0.2 control
6 2 1 2 0.3 control
7 3 1 2 0.4 control
12 0 2 1 0.1 control
13 1 2 1 0.2 control
14 2 2 1 0.3 control
15 3 2 1 0.4 control
16 0 3 1 0.5 control
17 1 3 1 0.6 control
18 2 3 1 0.8 control
19 3 3 1 0.9 control
20 0 3 2 0.1 control
21 1 3 2 0.2 control
22 2 3 2 0.3 control
23 3 3 2 0.4 control
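This comes down to Boolean logic. You are thinking "show me the dataframe where a is not 1 and b is not 3", but that removes every row with a == 1 and every row with b == 3 from the dataframe, not just the rows where both hold.
You have to use "a is not 1 or b is not 3", which is the same as "not (a is 1 and b is 3)".
Hope this helps. It all fits in one line.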
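The equivalence used here is De Morgan's law, and it can be checked directly on the example data. A short sketch (drop_pair, de_morgan and too_strict are illustrative names):

```python
import pandas as pd

# Same example data as in the question, written compactly
df = pd.DataFrame({
    "timepoint": [0, 1, 2, 3] * 6,
    "measurement": [1] * 12 + [2] * 4 + [3] * 8,
    "roi": [1] * 4 + [2] * 4 + [3] * 4 + [1] * 8 + [2] * 4,
    "value": [0.1, 0.2, 0.3, 0.4, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9,
              0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9, 0.1, 0.2, 0.3, 0.4],
    "group": "control",
})

a, b = df["measurement"], df["roi"]

drop_pair = ~((a == 1) & (b == 3))   # "not (a is 1 and b is 3)"
de_morgan = (a != 1) | (b != 3)      # same mask, rewritten with De Morgan's law
too_strict = (a != 1) & (b != 3)     # drops every a == 1 row and every b == 3 row

print(drop_pair.equals(de_morgan))                 # the two masks are identical
print(int(drop_pair.sum()), int(too_strict.sum())) # 20 rows kept vs. 12 rows kept
```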
Edit: To remove both 1:3 and 3:1, combine the two OR conditions with an AND:
df[((df["measurement"] != 1) | (df["roi"] != 3)) & ((df["measurement"] != 3) | (df["roi"] != 1))]
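Edit2: To drop the filtered rows directly, use the inverse of the filter condition instead of filtering first and dropping afterwards.
In: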
threshold = 0.4
full_activated = df[(df['timepoint'] != 0) | (df['value'] <= threshold)]
full_activated
Out:
timepoint measurement roi value group
0 0 1 1 0.1 control
1 1 1 1 0.2 control
2 2 1 1 0.3 control
3 3 1 1 0.4 control
4 0 1 2 0.1 control
5 1 1 2 0.2 control
6 2 1 2 0.3 control
7 3 1 2 0.4 control
9 1 1 3 0.6 control
10 2 1 3 0.8 control
11 3 1 3 0.9 control
12 0 2 1 0.1 control
13 1 2 1 0.2 control
14 2 2 1 0.3 control
15 3 2 1 0.4 control
17 1 3 1 0.6 control
18 2 3 1 0.8 control
19 3 3 1 0.9 control
20 0 3 2 0.1 control
21 1 3 2 0.2 control
22 2 3 2 0.3 control
23 3 3 2 0.4 control
Edit3:
Multiple conditions
threshold = 0.4
# Note: the last clause also drops the (measurement 1, roi 1) pair, which is why it is absent from the output
full_activated = df[((df['timepoint'] != 0) | (df['value'] <= threshold)) & ((df["measurement"] != 1) | (df["roi"] != 3)) & ((df["measurement"] != 3) | (df["roi"] != 1)) & ((df["measurement"] != 1) | (df["roi"] != 1))]
full_activated
Output:
timepoint measurement roi value group
4 0 1 2 0.1 control
5 1 1 2 0.2 control
6 2 1 2 0.3 control
7 3 1 2 0.4 control
12 0 2 1 0.1 control
13 1 2 1 0.2 control
14 2 2 1 0.3 control
15 3 2 1 0.4 control
20 0 3 2 0.1 control
21 1 3 2 0.2 control
22 2 3 2 0.3 control
23 3 3 2 0.4 control
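Writing one OR clause per pair by hand does not scale to hundreds of ROIs. As a rough sketch, the same chain of conditions can be generated from a list of pairs; pairs_to_drop is a hypothetical list here, and in practice it would come from the timepoint-0 filter.

```python
from functools import reduce
import operator

import pandas as pd

# Same example data as in the question, written compactly
df = pd.DataFrame({
    "timepoint": [0, 1, 2, 3] * 6,
    "measurement": [1] * 12 + [2] * 4 + [3] * 8,
    "roi": [1] * 4 + [2] * 4 + [3] * 4 + [1] * 8 + [2] * 4,
    "value": [0.1, 0.2, 0.3, 0.4, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9,
              0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9, 0.1, 0.2, 0.3, 0.4],
    "group": "control",
})

# Hypothetical list of (measurement, roi) pairs to remove
pairs_to_drop = [(1, 3), (3, 1)]

# One "(measurement != m) | (roi != r)" mask per pair
conditions = [(df["measurement"] != m) | (df["roi"] != r) for m, r in pairs_to_drop]

# AND all masks together: a row survives only if it matches none of the pairs
keep = reduce(operator.and_, conditions)
df_filtered = df[keep]
```

This builds one boolean mask per pair, so for many pairs a merge or a pair-wise membership test is likely to be cheaper.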
Thanks to @Jose A. Jimenez and @Vioxini for your answers. I went with Jose's suggestion, and it gave me the output I wanted. I further improved the performance using dask:
inputdf.shape
(73124, 5)
Using only pandas:
import pandas as pd
threshold = 0.4
pre_activated_t0 = inputdf[(inputdf['timepoint'] == 0) & (inputdf['value'] > threshold)]
pre_activated = inputdf.merge(pre_activated_t0[["measurement", "roi"]], how="inner", on=["measurement", "roi"])
filtereddf = inputdf.merge(
pre_activated,
how="left",
on=["timepoint", "measurement", "roi", "value"],
)
filtereddf = filtereddf[pd.isna(filtereddf["group_y"])]
filtereddf.drop("group_y", axis=1, inplace=True)
filtereddf.columns = list(inputdf.columns)
This takes 2 min 9 s.
Now with dask:
import dask.dataframe as dd
threshold = 0.4
pre_activated_t0 = inputdf[(inputdf['timepoint'] == 0) & (inputdf['value'] > threshold)]
pre_activated = inputdf.merge(pre_activated_t0[["measurement", "roi"]], how="inner", on=["measurement", "roi"])
input_dd = dd.from_pandas(inputdf, npartitions=3)
pre_dd = dd.from_pandas(pre_activated, npartitions=3)
merger = dd.merge(input_dd,pre_dd, how="left", on=["timepoint", "measurement", "roi", "value"])
filtereddf = merger.compute()
filtereddf = filtereddf[pd.isna(filtereddf["group_y"])]
filtereddf.drop("group_y", axis=1, inplace=True)
filtereddf.columns = list(inputdf.columns)
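Now it only takes 42.6 seconds :-)
This is my first time using dask, so there may be options I don't know about that could speed things up further, but for now this is fine.
Thanks again for your help!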
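For comparison, a merge-free sketch in plain pandas: broadcast each ROI's timepoint-0 value to all of its rows with groupby(...).transform and compare it with the threshold. This was not benchmarked against the timings above, and the names t0_value and is_pre_activated are illustrative. It assumes each (measurement, roi) pair has exactly one row at timepoint 0.

```python
import pandas as pd

# Same example data as in the question, written compactly
inputdf = pd.DataFrame({
    "timepoint": [0, 1, 2, 3] * 6,
    "measurement": [1] * 12 + [2] * 4 + [3] * 8,
    "roi": [1] * 4 + [2] * 4 + [3] * 4 + [1] * 8 + [2] * 4,
    "value": [0.1, 0.2, 0.3, 0.4, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9,
              0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9, 0.1, 0.2, 0.3, 0.4],
    "group": "control",
})

threshold = 0.4

# Keep the value only at timepoint 0 (NaN elsewhere), then spread it over
# every row of the same (measurement, roi) pair; "max" ignores the NaNs
t0_value = (
    inputdf["value"]
    .where(inputdf["timepoint"] == 0)
    .groupby([inputdf["measurement"], inputdf["roi"]])
    .transform("max")
)

is_pre_activated = t0_value > threshold
pre_activated = inputdf[is_pre_activated]
filtereddf = inputdf[~is_pre_activated]
```

With several groups in one dataframe, inputdf["group"] would presumably need to be added to the grouping keys.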
Edit:
When converting the pandas dataframe to a dask dataframe, I experimented with the npartitions option. Raising it from 3 to npartitions=30 improved the performance further: it now takes only 380 ms.