Find all unique combinations of pandas DataFrame columns

Question

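I have a data balancing problem at hand involving images with multiple classes, i.e. each image can have one class or several. I have a label file whose columns are the classes, named A through G, plus fn (the image name). Each class column holds 0 or 1, where 0 means the class is absent from the image and 1 means it is present. Now I want to subset the data frame so that I get separate data frames, each containing a different combination of classes.

The problem is that I have to combine multiple conditions in a single dataframe command, for example (here pp denotes the dataframe):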

pp_A_B = pp[(pp['A']==1) & (pp['B']==1) & (pp['C']==0) & (pp['D']==0) & (pp['E']==0) & (pp['F']==0) & (pp['G']==0)]

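Here, pp_A_B gives me a dataframe of the images that have only classes A and B.

I would have to write many such variables to cover the various combinations. Please help me automate this so I can get all possible combinations faster.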

Answer 1

Hi, you should use the groupby and get_group methods to extract the elements you need.

Here is an example, in case you are trying to get the data where A = 0 & B = 0:

# Simulate the data
import numpy as np
import pandas as pd

nb_rows = 10000
nb_columns = 5
df_array = np.random.randint(0, 2, size=(nb_rows, nb_columns))
df = pd.DataFrame(df_array, columns=["A", "B", "C", "D", "E"])
df["infos"] = [f"Example of data {i}" for i in range(len(df))]

Update:

Now, using the method described above:

df.groupby(["A", "B"]).get_group((0, 0))

Here, you can easily find all the data where A = 0 and B = 0.

You can now iterate over all combinations of the target columns like this:

from itertools import product

columns_to_explore = ["A", "B", "C"]
grouped = df.groupby(columns_to_explore)  # group once, outside the loop
# Enumerate every 0/1 assignment of the target columns
for values in product([0, 1], repeat=len(columns_to_explore)):
    if values not in grouped.groups:
        continue  # get_group raises KeyError for a combination absent from the data
    df_selected = grouped.get_group(values)
    # Do something with df_selected ...
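
Since the original problem is about data balancing, it may also help to see how many images fall into each combination. The following is only a minimal sketch under assumed data: the column names, the fn values and the random labels are made up for illustration. It relies on the fact that iterating over a groupby object yields only the combinations that actually occur, and that size() counts the rows in each of them.

```python
import numpy as np
import pandas as pd

# Hypothetical label table: three class columns plus an image-name column.
rng = np.random.default_rng(0)
class_cols = ["A", "B", "C"]
df = pd.DataFrame(rng.integers(0, 2, size=(200, len(class_cols))), columns=class_cols)
df["fn"] = [f"img_{i}.png" for i in range(len(df))]

# Iterating over the groupby yields (key, sub-frame) pairs for existing combinations only.
subsets = {}
for key, group in df.groupby(class_cols):
    subsets[tuple(int(v) for v in key)] = group

# Number of images per combination, useful for judging class balance.
counts = df.groupby(class_cols).size()
print(counts)
```

Each key of subsets is a tuple such as (1, 1, 0), so subsets[(1, 1, 0)] would hold the images that contain A and B but not C, provided that combination exists in the data.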

Answer 2

Let us assume you have the following data frame:

import pandas as pd
import random


attr = [0, 1]
N = 10000
rg = range(N)

df = pd.DataFrame(
    {
        'A': [random.choice(attr) for i in rg],
        'B': [random.choice(attr) for i in rg],
        'C': [random.choice(attr) for i in rg],
        'D': [random.choice(attr) for i in rg],
        'E': [random.choice(attr) for i in rg],
        'F': [random.choice(attr) for i in rg],
        'G': [random.choice(attr) for i in rg],
    }
)

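and you want to store the data frames for all combinations in a list. You can then write the following function to get all the indices that correspond to the same combination of 0s and 1s: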

from numba import njit

@njit
def _get_index_combinations(possible_combinations, values):
    index_outpus = []
    for combination in possible_combinations:
        mask = values == combination
        _temp = [i for i in range(len(mask)) if mask[i].all()]
        index_outpus.append(_temp)
    return index_outpus

possible_combinations = df.drop_duplicates().values
index_outpus = _get_index_combinations(possible_combinations, df.values)

Finally, you can split the data frame into chunks by iterating over all the index combinations:

sliced_dfs = [df.loc[df.index.isin(index)] for index in index_outpus]

If you then run, for example,

print(sliced_dfs[0])

you will get the rows for one possible combination.
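
If numba is not available, a similar split can probably be obtained with NumPy alone. This is a sketch of a swapped-in technique, not the method used above: np.unique with axis=0 and return_inverse=True labels every row with the index of its unique combination. The data frame here is random and only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: seven binary class columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 2, size=(500, 7)), columns=list("ABCDEFG"))

values = df.to_numpy()
# combos: the unique rows; inverse: for each row, the index of its combination in combos.
combos, inverse = np.unique(values, axis=0, return_inverse=True)
inverse = inverse.ravel()  # flatten defensively in case the inverse is not 1-D

# One sub-frame per unique combination, in the same order as combos.
sliced = [df.iloc[np.flatnonzero(inverse == k)] for k in range(len(combos))]
print(combos[0], len(sliced[0]))
```

Here sliced[k] contains exactly the rows equal to combos[k], so the two sequences can be zipped together when the combination itself is needed alongside its rows.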

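Note:

You can even go a step further and create a separate data frame for every possible combination (instead of storing them in a list), if you are willing to get your hands dirty and use something like this: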

import numpy as np

col_names = "ABCDEFG"
final_output = {"all_names": [], "all_querys": []}
for numerator, i in enumerate(possible_combinations):
    df_name = ""
    col_pos = np.where(i)[0]
    for pos in col_pos:
        df_name += col_names[pos]
    final_output["all_names"].append(f"df_{df_name}")
    query_code = f"df_{df_name} = df.loc[df.index.isin({index_outpus[numerator]})]"
    final_output["all_querys"].append(query_code)
    exec(query_code)

This creates a dictionary named final_output for you, which stores the names of all the data frames that were created. For example:

{'all_names': ['df_ABG', 'df_G', 'df_AC', ...], 'all_querys': [...]}

You can then print any of the frames listed in all_names, for example df_ABG, which returns:

      A  B  C  D  E  F  G
0     1  1  0  0  0  0  1
59    1  1  0  0  0  0  1
92    1  1  0  0  0  0  1
207   1  1  0  0  0  0  1
211   1  1  0  0  0  0  1
284   1  1  0  0  0  0  1
321   1  1  0  0  0  0  1
387   1  1  0  0  0  0  1
415   1  1  0  0  0  0  1
637   1  1  0  0  0  0  1
....
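
Creating variables with exec can be hard to debug, so one possible alternative is to keep the same df_ABG-style names as keys of an ordinary dictionary. The sketch below uses a tiny made-up data frame; the dictionary name frames and the label "none" for the all-zero combination are assumptions introduced here, not part of the answer above.

```python
import pandas as pd

# Tiny hypothetical label table with three class columns.
df = pd.DataFrame({
    "A": [1, 1, 0, 0, 1],
    "B": [1, 0, 0, 1, 1],
    "G": [1, 0, 0, 0, 1],
})
class_cols = list(df.columns)

frames = {}
for key, group in df.groupby(class_cols):
    # Join the names of the classes that are present (flag == 1) in this combination.
    present = "".join(col for col, flag in zip(class_cols, key) if flag)
    frames[f"df_{present or 'none'}"] = group

print(sorted(frames))
print(frames["df_ABG"])
```

A frame is then looked up as frames["df_ABG"] rather than through a generated variable, and list(frames) plays the role of all_names.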
