Split a column into multiple columns with unique values in pandas

Question

I have the following dataframe:

   Col
0  A,B,C
1  B,A,D
2  C
3  A,D,E,F
4  B,C,F

df = pd.DataFrame({'Col': ['A,B,C', 'B,A,D', 'C', 'A,D,E,F', 'B,C,F']})

I need it to become:

   A B C D E F
0  A B C
1  A B   D
2      C
3  A     D E F
4    B C     F

Answer 1

You can use str.get_dummies to get the dummy variables, then multiply them by the column labels:

tmp = df['Col'].str.get_dummies(sep=',')
out = tmp * tmp.columns
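
The multiplication works because Python repeats a string when it is multiplied by an integer: 'A' * 1 is 'A' and 'A' * 0 is ''. The same idea can be written out explicitly with np.where, which may be easier to read. This is only a sketch; the names mask, labels, and explicit are illustrative and do not come from the answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col': ['A,B,C', 'B,A,D', 'C', 'A,D,E,F', 'B,C,F']})

# 0/1 indicator matrix with one column per distinct label
tmp = df['Col'].str.get_dummies(sep=',')

# Keep the column label where the indicator is set, otherwise use ''.
mask = tmp.to_numpy().astype(bool)       # shape (n_rows, n_labels)
labels = tmp.columns.to_numpy()          # shape (n_labels,), broadcast across rows
explicit = pd.DataFrame(np.where(mask, labels, ''),
                        index=tmp.index, columns=tmp.columns)
print(explicit)
```

Because the label is chosen by a boolean mask rather than by string repetition, this variant does not depend on how the installed pandas version handles integer-by-string multiplication.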

A one-liner suggested by @piRSquared:

out = df.Col.str.get_dummies(',').pipe(lambda d: d*[*d])

Output:

   A  B  C  D  E  F
0  A  B  C         
1  A  B     D      
2        C         
3  A        D  E  F
4     B  C        F
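
In the one-liner above, `[*d]` relies on the fact that iterating over a DataFrame yields its column labels, so it builds the same list as `list(d.columns)`. A small illustration on a toy frame (the frame d here is made up for the example):

```python
import pandas as pd

# Toy indicator frame, standing in for the result of str.get_dummies
d = pd.DataFrame({'A': [1, 0], 'B': [0, 1]})

# Unpacking a DataFrame iterates over its column labels
cols = [*d]
print(cols)
```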

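Benchmark:

On data created by duplicating the data in the OP:

@piRSquared's first approach, which uses numpy methods, is the fastest solution here.

On randomly generated DataFrames of increasing size:

Code to reproduce the plot: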

import perfplot
import pandas as pd
import numpy as np
from string import ascii_uppercase  # needed by constructor2

def enke(df):
    tmp = df['Col'].str.get_dummies(sep=',')
    return tmp * tmp.columns

def mozway(df):
    return pd.concat([pd.Series((idx:=x.split(',')), index=idx) 
                      for x in df['Col']], axis=1).T.fillna('')

def piRSquared(df):
    n = df.shape[0]
    i = np.repeat(np.arange(n), df.Col.str.count(',')+1)
    c, j = np.unique(df.Col.str.cat(sep=',').split(','), return_inverse=True)
    m = c.shape[0]
    a = np.full((n, m), '')
    a[i, j] = c[j]
    return pd.DataFrame(a, df.index, c)

def piRSquared2(df):
    n = df.shape[0]
    base = df.Col.to_numpy().astype(str)
    commas = np.char.count(base, ',')
    sepped = ','.join(base).split(',')
    i = np.repeat(np.arange(n), commas+1)
    c, j = np.unique(sepped, return_inverse=True)
    m = c.shape[0]
    a = np.full((n, m), '')
    a[i, j] = c[j]
    return pd.DataFrame(a, df.index, c)

def constructor1(n):
    df = pd.DataFrame({'Col': ['A,B,C', 'B,A,D', 'C', 'A,D,E,F', 'B,C,F']})
    return pd.concat([df]*n, ignore_index=True)

def constructor2(n):
    uc = np.array([*ascii_uppercase])
    k = [','.join(np.random.choice(uc, x, replace=False))
         for x in np.random.randint(1, 10, size=n)]
    return pd.DataFrame({'Col': k})

kernels = [enke, piRSquared, piRSquared2, mozway]
df = pd.DataFrame({'Col': ['A,B,C', 'B,A,D', 'C', 'A,D,E,F', 'B,C,F']})

perfplot.plot(
    setup=constructor1,
    kernels=kernels,
    labels=[func.__name__ for func in kernels],
    n_range=[2**k for k in range(15)],
    xlabel='len(df)',
    logx=True, 
    logy=True, 
    relative_to=0,
    equality_check=pd.DataFrame.equals)

Answer 2

Using pandas.concat:

pd.concat([pd.Series((idx:=x.split(',')), index=idx)
           for x in df['Col']], axis=1).T

For Python < 3.8:

pd.concat([pd.Series(val, index=val)
           for x in df['Col']
           for val in [x.split(',')]], axis=1).T

Output:

     A    B    C    D    E    F
0    A    B    C  NaN  NaN  NaN
1    A    B  NaN    D  NaN  NaN
2  NaN  NaN    C  NaN  NaN  NaN
3    A  NaN  NaN    D    E    F
4  NaN    B    C  NaN  NaN    F

Note: append fillna('') to replace the missing values with empty strings:

   A  B  C  D  E  F
0  A  B  C         
1  A  B     D      
2        C         
3  A        D  E  F
4     B  C        F
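
Putting the pieces of this answer together, here is a self-contained sketch that spells the comprehension out as a loop, so it needs neither the walrus operator nor the nested `for`. The variable names rows and labels are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Col': ['A,B,C', 'B,A,D', 'C', 'A,D,E,F', 'B,C,F']})

rows = []
for x in df['Col']:
    labels = x.split(',')
    # Each row becomes a Series whose index and values are both its labels
    rows.append(pd.Series(labels, index=labels))

# Concatenating along axis=1 aligns the Series on their labels; transposing
# turns labels into columns, and fillna('') blanks out the gaps.
out = pd.concat(rows, axis=1).T.fillna('')
print(out)
```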

Answer 3

This comes from my Project Overkill bag of tricks.

I'll use Numpy to work out where each label belongs in a two-dimensional array.

n = df.shape[0]                                # Get number of rows
base = df.Col.to_numpy().astype(str)           # Turn `'Col'` to Numpy array
commas = np.char.count(base, ',')              # Count commas in each row
sepped = ','.join(base).split(',')             # Flat array of each element
i = np.repeat(np.arange(n), commas+1)          # Row indices for flat array

# Note that I could've used `pd.factorize` here but I actually wanted
# a sorted array of labels so `np.unique` was the way to go.
# Otherwise I'd have used `j, c = pd.factorize(sepped)`
c, j = np.unique(sepped, return_inverse=True)  # `j` col indices for flat array
                                               # `c` will be the column labels
m = c.shape[0]                                 # Get number of unique labels
a = np.full((n, m), '')                        # Default array of empty strings
a[i, j] = c[j]                                 # Use row/col indices to insert
                                               #  the column labels in right spots

pd.DataFrame(a, df.index, c)                   # Construct new dataframe

   A  B  C  D  E  F
0  A  B  C         
1  A  B     D      
2        C         
3  A        D  E  F
4     B  C        F
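
The code comment above mentions pd.factorize as an alternative to np.unique. The difference is ordering: np.unique returns the labels sorted, while pd.factorize keeps them in order of first appearance. A short comparison on a made-up flat list (note that the two functions also return their results in opposite order):

```python
import numpy as np
import pandas as pd

sepped = np.array(['B', 'A', 'D', 'A'])   # toy flat array of labels

# np.unique: sorted labels, plus the position of each element in that sorted array
c, j = np.unique(sepped, return_inverse=True)

# pd.factorize: integer codes first, then labels in order of first appearance
codes, uniques = pd.factorize(sepped)

print(c, j)
print(uniques, codes)
```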

Timing

Functions

import pandas as pd
import numpy as np
from string import ascii_uppercase

def pir(s):
    n = s.shape[0]
    base = s.to_numpy().astype(str)
    commas = np.char.count(base, ',')
    sepped = ','.join(base).split(',')
    i = np.repeat(np.arange(n), commas+1)
    c, j = np.unique(sepped, return_inverse=True)
    m = c.shape[0]
    a = np.full((n, m), '')
    a[i, j] = c[j]
    return pd.DataFrame(a, s.index, c)

def pir2(s):
    n = s.shape[0]
    sepped = s.str.cat(sep=',').split(',')
    commas = s.str.count(',')
    i = np.repeat(np.arange(n), commas+1)
    c, j = np.unique(sepped, return_inverse=True)
    m = c.shape[0]
    a = np.full((n, m), '')
    a[i, j] = c[j]
    return pd.DataFrame(a, s.index, c)

def mozway(s):
    return pd.concat([
        pd.Series((idx:=x.split(',')), index=idx)
        for x in s
    ], axis=1).T.fillna('')

def enke(s):
    return s.str.get_dummies(',').pipe(lambda d: d*d.columns)
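
One caveat for pir and pir2: `np.full((n, m), '')` creates an array of dtype `<U1`, so any label longer than one character would be silently truncated on assignment. That does not matter for the single-letter labels used here. For longer labels, a possible adjustment is to reuse the dtype of the unique labels. This is a sketch, and split_unique is a hypothetical name, not part of the original answer.

```python
import numpy as np
import pandas as pd

def split_unique(s):
    """Like pir, but sizes the string dtype from the labels themselves."""
    n = s.shape[0]
    base = s.to_numpy().astype(str)
    commas = np.char.count(base, ',')
    sepped = ','.join(base).split(',')
    i = np.repeat(np.arange(n), commas + 1)
    c, j = np.unique(sepped, return_inverse=True)
    # dtype=c.dtype is wide enough for the longest label
    a = np.full((n, c.shape[0]), '', dtype=c.dtype)
    a[i, j] = c[j]
    return pd.DataFrame(a, s.index, c)

s = pd.Series(['apple,kiwi', 'kiwi', 'fig,apple'])
out = split_unique(s)
print(out)
```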

Test data constructor

def constructor(n, m):
    uc = np.array([*ascii_uppercase])
    m = min(26, m)
    k = [','.join(np.random.choice(uc, x, replace=False))
         for x in np.random.randint(1, m, size=n)]
    return pd.Series(k)

Results dataframe

res = pd.DataFrame(
    index=['enke', 'mozway', 'pir', 'pir2'],
    columns=[10, 30, 100, 300, 1000, 3000, 10000, 30000],
    dtype=float
)

Run the tests

from timeit import timeit  # used in the loop below
from IPython.display import clear_output

for j in res.columns:
    s = constructor(j, 10)
    for i in res.index:
        stmt = f'{i}(s)'
        setp = f'from __main__ import s, {i}'
        res.at[i, j] = timeit(stmt, setp, number=50)
        print(res)
        clear_output(True)
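
The loop above builds statement and setup strings and imports the functions from `__main__`, which assumes the code runs at the top level of a script or notebook. timeit also accepts a callable, which avoids that assumption. A minimal sketch, where count_labels is a hypothetical stand-in for any of the functions being timed:

```python
from timeit import timeit

import pandas as pd

def count_labels(s):
    # Hypothetical stand-in kernel: number of labels in each row
    return s.str.count(',') + 1

s = pd.Series(['A,B,C', 'B,A,D', 'C', 'A,D,E,F', 'B,C,F'])

# Passing a callable means no setup string or __main__ import is needed
elapsed = timeit(lambda: count_labels(s), number=50)
print(f'{elapsed:.6f} seconds for 50 runs')
```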

Show the results

res.T.plot(loglog=True)

res.div(res.min()).T

           enke     mozway       pir      pir2
10     8.634105  19.416376  1.000000  2.300573
30     7.626107  32.741218  1.000000  2.028423
100    5.071308  50.539772  1.000000  1.533791
300    3.475711  66.638151  1.000000  1.184982
1000   2.616885  79.032159  1.012205  1.000000
3000   2.518983  91.521389  1.094863  1.000000
10000  2.536735  98.172680  1.131758  1.000000
30000  2.603489  99.756007  1.149734  1.000000
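
In the table above, each row is one input size and 1.000000 marks the fastest function for that size; the other entries are multiples of that time. The normalization comes from `res.div(res.min())`, which divides every column by its own minimum. A toy illustration with made-up timings (the function names f and g and the numbers are not measurements):

```python
import pandas as pd

# Made-up timings: rows are functions, columns are input sizes
res = pd.DataFrame({10: [2.0, 4.0], 100: [9.0, 3.0]}, index=['f', 'g'])

# res.min() is the per-column minimum; div aligns it with the columns
rel = res.div(res.min())
print(rel.T)
```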

3 answers

Answer 1: 5
Answer 2: 2, accepted, 2022-04-13 18:39:55
Answer 3: 2, 2022-04-13 19:27:47
