Groupby and count columns with multiple values

Question

Given this DataFrame:

import pandas as pd

df = pd.DataFrame({
    "names": [["Kevin, Jack"], ["Antoine, Mary, Johanne, Iv"], ["Ali"]],
    "commented": [["Kevin, Antoine, Iv"], ["Antoine, Mary, Ali"], ["Mary, Jack"]],
}, index=["1", "2", "3"])

which looks like this:

   names                         commented
1  [Kevin, Jack]                 [Kevin, Antoine, Iv]
2  [Antoine, Mary, Johanne, Iv]  [Antoine, Mary, Ali]
3  [Ali]                         [Mary, Jack]

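I would like to get a new DataFrame that counts all the comments each person made on every other person. Something like:

         Kevin  Jack  Antoine  Mary  Iv  Ali
Kevin        1     0        1     0   1    0
Jack         1     0        1     0   1    0
Antoine      0     0        1     1   0    1
Mary         0     0        1     1   0    1
Johanne      0     0        1     1   0    1
Iv           0     0        1     1   0    1
Ali          0     1        0     1   0    0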

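This DataFrame may be too small to show the point, but my original DataFrame has 100k rows, and the counts will go higher than 0 and 1.

I have looked at various options using pivot_table and several variants of groupby, but I can't seem to figure this out.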

df.pivot_table(index = 'names', columns= 'commented', aggfunc= 'count')

df.groupby('names').commented.apply(list).reset_index()
df.explode('names')['commented'].value_counts()

df.set_index('names').apply(pd.Series.explode).reset_index()

Almost every solution I have tried gives me the error: TypeError: unhashable type: 'list'

Answer 1

You can try exploding the lists of strings into rows and then using pandas.crosstab:

df = (df.explode(df.columns.tolist())
      .apply(lambda col: col.str.split(', '))
      .explode('names')
      .explode('commented'))

out = pd.crosstab(df['names'], df['commented'])

print(df)

     names commented
1    Kevin     Kevin
1    Kevin   Antoine
1    Kevin        Iv
1     Jack     Kevin
1     Jack   Antoine
1     Jack        Iv
2  Antoine   Antoine
2  Antoine      Mary
2  Antoine       Ali
2     Mary   Antoine
2     Mary      Mary
2     Mary       Ali
2  Johanne   Antoine
2  Johanne      Mary
2  Johanne       Ali
2       Iv   Antoine
2       Iv      Mary
2       Iv       Ali
3      Ali      Mary
3      Ali      Jack

print(out)

commented  Ali  Antoine  Iv  Jack  Kevin  Mary
names
Ali          0        0   0     1      0     1
Antoine      1        1   0     0      0     1
Iv           1        1   0     0      0     1
Jack         0        1   1     0      1     0
Johanne      1        1   0     0      0     1
Kevin        0        1   1     0      1     0
Mary         1        1   0     0      0     1
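
pd.crosstab sorts both axes alphabetically, whereas the table in the question lists people in order of first appearance. Here is a minimal sketch, assuming the sample data from the question, that reorders the result with reindex. The names `long`, `row_order`, `col_order`, and `ordered` are only illustrative, and `.str[0]` is used here instead of the multi-column explode to unwrap the one-element lists:

```python
import pandas as pd

df = pd.DataFrame({
    "names": [["Kevin, Jack"], ["Antoine, Mary, Johanne, Iv"], ["Ali"]],
    "commented": [["Kevin, Antoine, Iv"], ["Antoine, Mary, Ali"], ["Mary, Jack"]],
}, index=["1", "2", "3"])

# Unwrap each one-element list, split the string, then build the long format.
long = (df.apply(lambda col: col.str[0].str.split(", "))
          .explode("names")
          .explode("commented")
          .reset_index(drop=True))

out = pd.crosstab(long["names"], long["commented"])

# Rows: order of first appearance in "names".
row_order = long["names"].drop_duplicates().tolist()
# Columns: same order, restricted to people who actually received a comment.
col_order = [name for name in row_order if name in out.columns]

ordered = out.reindex(index=row_order, columns=col_order, fill_value=0)
print(ordered)
```

With the sample data this should give the row and column order shown in the question; on real data the order depends on where each name first appears.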

Answer 2

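In your sample input, each element of the names and commented columns is a list containing just 1 element (a string). I'm not sure whether your real data looks like that.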
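
If it is unclear which shape the real cells have, one option is to normalize them first. This is only a sketch: `to_name_list` is a hypothetical helper, not part of pandas, and it assumes every cell is a list of strings, where each string holds either one name or several comma-separated names:

```python
import pandas as pd

def to_name_list(cell):
    """Flatten a cell into a list of names, whichever of the two shapes it has."""
    names = []
    for item in cell:
        # "Kevin, Jack" becomes two names; "Kevin" stays a single name.
        names.extend(part.strip() for part in item.split(","))
    return names

# Hypothetical mix of both shapes in one column.
cells = pd.Series([["Kevin, Jack"], ["Antoine", "Mary"], ["Ali"]])
normalized = cells.apply(to_name_list)
print(normalized.tolist())
```

After this step every cell is a real list of names, so explode can be applied directly.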

You can split each string on the comma, then explode and pivot the DataFrame:

split = lambda x: x[0].split(", ")
(
    df.assign(
        names=df["names"].apply(split),
        commented=df["commented"].apply(split),
        dummy=1
    )
    .explode("names")
    .explode("commented")
    .pivot_table(index="names", columns="commented", values="dummy", aggfunc="count", fill_value=0)
)
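
Since the question asks about groupby, the same long-format frame can also be counted with groupby plus size instead of pivot_table. This is a sketch of that alternative on the question's sample data, not the code from this answer; the names `long` and `counts` are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "names": [["Kevin, Jack"], ["Antoine, Mary, Johanne, Iv"], ["Ali"]],
    "commented": [["Kevin, Antoine, Iv"], ["Antoine, Mary, Ali"], ["Mary, Jack"]],
}, index=["1", "2", "3"])

split = lambda x: x[0].split(", ")

# One row per (name, commented-on) pair.
long = (df.assign(names=df["names"].apply(split),
                  commented=df["commented"].apply(split))
          .explode("names")
          .explode("commented"))

# Count each pair, then spread the "commented" level into columns.
counts = long.groupby(["names", "commented"]).size().unstack(fill_value=0)
print(counts)
```

Pairs that occur more than once in the real data are counted as many times as they occur, so values above 1 appear naturally.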

Answer 3

Here is another way, using str.get_dummies():

(df.assign(names = df['names'].str[0].str.split(', '))
.explode('names')
.set_index('names')
.squeeze()
.str[0]
.str.get_dummies(sep=', '))

Output:

         Ali  Antoine  Iv  Jack  Kevin  Mary
names                                       
Kevin      0        1   1     0      1     0
Jack       0        1   1     0      1     0
Antoine    1        1   0     0      0     1
Mary       1        1   0     0      0     1
Johanne    1        1   0     0      0     1
Iv         1        1   0     0      0     1
Ali        0        0   0     1      0     1
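
Note that str.get_dummies() produces 0/1 indicators with one row per exploded name, so a person who appears in several rows of the real data gets several rows here. A possible follow-up, sketched on hypothetical data in which Kevin appears twice, is to sum the indicator rows per name. Selecting `["commented"]` is used here in place of squeeze(), and `dummies` and `counts` are illustrative names:

```python
import pandas as pd

# Hypothetical data: Kevin appears in two rows and comments on Antoine both times.
df = pd.DataFrame({
    "names": [["Kevin, Jack"], ["Kevin"]],
    "commented": [["Kevin, Antoine"], ["Antoine, Mary"]],
})

dummies = (df.assign(names=df["names"].str[0].str.split(", "))
             .explode("names")
             .set_index("names")["commented"]
             .str[0]
             .str.get_dummies(sep=", "))

# Collapse repeated names by summing their indicator rows.
counts = dummies.groupby(level=0).sum()
print(counts)
```

The groupby sorts the names alphabetically; pass sort=False to keep the order of first appearance.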


Answer 1: score 1, accepted, 2022-05-28 15:44:03
Answer 2: score 0, 2022-05-28 14:44:18
Answer 3: score 0, 2022-05-28 17:06:04
