Transform pandas Series with variable length comma separated values to Dataframe

I have a pandas Series "A" containing comma-separated values, like this:

index    A

1        null
2        5,6
3        3
4        null
5        5,18,22
...      ...

I need a DataFrame like this:

index    A_5    A_6    A_18    A_20

1        0      0      0       ...
2        1      1      0       ...
3        0      0      0       ...
4        0      0      0       ...
5        1      0      1       ...
...      ...    ...    ...     ...

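Values that do not occur at least MIN_OBS times should be ignored and not get their own column, because there are so many distinct values that the df would become too large without this threshold.

I devised the solution below. It works, but it is too slow (because of iterating over the rows, I suppose). Can anyone suggest a faster approach?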

from collections import defaultdict
import pandas as pd

# Count how often each value occurs across the whole Series.
temp_dict = defaultdict(int)
for k, v in A.items():
    temp_list = v.split(',')
    for item in temp_list:
        temp_dict[item] += 1

# Only values seen more than MIN_OBS times get their own column.
cols_to_make = []
for k, v in temp_dict.items():
    if v > MIN_OBS:
        cols_to_make.append('A_' + k)

# Fill in the indicator cells one at a time.
result_df = pd.DataFrame(0, index = A.index, columns = cols_to_make)
for k, v in A.items():
    temp_list = v.split(',')
    for item in temp_list:
        if ('A_' + item) in cols_to_make:
            result_df['A_' + item][k] = 1

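You can use get_dummies to create indicator variables, then convert the column labels to numbers with to_numeric, and finally keep only the columns selected by the threshold variable TRESH: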

print(df)
             A
index         
1         null
2          5,6
3            3
4         null
5      5,18,22

df = df.A.str.get_dummies(sep=",")
print(df)
       18  22  3  5  6  null
index                       
1       0   0  0  0  0     1
2       0   0  0  1  1     0
3       0   0  1  0  0     0
4       0   0  0  0  0     1
5       1   1  0  1  0     0

df.columns = pd.to_numeric(df.columns, errors='coerce')
df = df.sort_index(axis=1)

TRESH = 5
cols = [col for col in df.columns if col > TRESH]
print(cols)
[6.0, 18.0, 22.0]
# .ix has been removed from pandas; .loc does the same label-based selection here
df = df.loc[:, cols]
print(df)
       6   18  22
index            
1       0   0   0
2       1   0   0
3       0   0   0
4       0   0   0
5       0   1   1

df.columns = ["A_" + str(int(col)) for col in df.columns]
print(df)
       A_6  A_18  A_22
index                 
1        0     0     0
2        1     0     0
3        0     0     0
4        0     0     0
5        0     1     1
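
Note that the question's MIN_OBS is a threshold on how often a value occurs, not on the value itself. A minimal sketch of filtering the get_dummies columns by occurrence count instead, assuming a small hypothetical sample Series and `MIN_OBS = 2` (neither is from the original post):

```python
import pandas as pd

# Hypothetical sample data and threshold, chosen only for illustration.
A = pd.Series(['null', '5,6', '3', 'null', '5,18,22', '3,5'], name='A')
MIN_OBS = 2

# One indicator column per distinct value, as in the answer above.
dummies = A.str.get_dummies(sep=',')

# The sum of an indicator column is the number of rows containing that value.
counts = dummies.sum()

# Keep values seen at least MIN_OBS times and drop the 'null' placeholder.
keep = counts[counts >= MIN_OBS].index.difference(['null'])
result = dummies[keep].add_prefix('A_')
print(result)
```

This still builds the full indicator matrix before filtering, so it may not suit data with a very large number of distinct values.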

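EDIT:

I tried modifying unutbu's excellent original answer: I changed how the stacked Series is created, removed the entry for null from the counts, and added the prefix parameter to get_dummies: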

import numpy as np
import pandas as pd

s = pd.Series(['null', '5,6', '3', 'null', '5,18,22', '3,4'])
print(s)

# original: result = s.str.split(',').apply(pd.Series).stack()
# replaced with the faster list comprehension:
result = pd.DataFrame([ x.split(',') for x in s ]).stack()
count = result.value_counts()

min_obs = 2

# keep values seen at least min_obs times, and drop the 'null' placeholder
count = count[(count >= min_obs) & ~(count.index.isin(['null'])) ]

result = result.loc[result.isin(count.index)]
#add prefix to function get_dummies
result = pd.get_dummies(result, prefix="A")

result.index = result.index.droplevel(1)
result = result.reindex(s.index)

print(result)
   A_3  A_5
0  NaN  NaN
1    0    1
2    1    0
3  NaN  NaN
4    0    1
5    1    0
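
The output above leaves NaN in the rows that had no surviving value, and `droplevel` followed by `reindex` relies on each row keeping at most one value, since otherwise the index contains duplicates. The following is only a sketch of one way to handle both cases, not part of the original answer; the sample Series (with a hypothetical last row `'3,5'`) and the use of `groupby(level=0).max()` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample: the last row keeps two values ('3' and '5').
s = pd.Series(['null', '5,6', '3', 'null', '5,18,22', '3,5'])
min_obs = 2

# One entry per (row, position) pair, as in the answer above.
tokens = pd.DataFrame([x.split(',') for x in s]).stack()
count = tokens.value_counts()
count = count[(count >= min_obs) & ~count.index.isin(['null'])]
tokens = tokens[tokens.isin(count.index)]

dummies = pd.get_dummies(tokens, prefix='A')
# Collapse the (row, position) MultiIndex to one line per original row;
# max() acts as a logical OR over the positions of a row.
result = dummies.groupby(level=0).max().astype(int)
# Rows with no surviving value get 0 instead of NaN.
result = result.reindex(s.index, fill_value=0)
print(result)
```

Whether 0 or NaN is the right fill for the null rows depends on how the result is used; the desired output in the question shows 0.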

Timings:

In [143]: %timeit pd.DataFrame([ x.split(',') for x in s ]).stack()
1000 loops, best of 3: 866 µs per loop

In [144]: %timeit s.str.split(',').apply(pd.Series).stack()
100 loops, best of 3: 2.46 ms per loop

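Since memory is a concern, we have to be careful not to build large intermediate data structures if possible.

Let's start with the OP's posted code, which works: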

def orig(A, MIN_OBS):
    # .items() replaces the Python 2 / old-pandas .iteritems()
    temp_dict = collections.defaultdict(int)
    for k, v in A.items():
        temp_list = v.split(',')
        for item in temp_list:
            temp_dict[item] += 1
    cols_to_make = []
    for k, v in temp_dict.items():
        if v > MIN_OBS:
            cols_to_make.append('A_' + k)

    result_df = pd.DataFrame(0, index=A.index, columns=cols_to_make)
    for k, v in A.items():
        temp_list = v.split(',')
        for item in temp_list:
            if ('A_' + item) in cols_to_make:
                result_df['A_' + item][k] = 1
    return result_df

and extract the first loop into its own function:

def count(A, MIN_OBS):
    # Count every value, then keep those seen more than MIN_OBS times.
    temp_dict = collections.Counter()
    for k, v in A.items():
        temp_list = v.split(',')
        for item in temp_list:
            temp_dict[item] += 1
    temp_dict = {k:v for k, v in temp_dict.items() if v > MIN_OBS}
    return temp_dict

By experimenting in an interactive session, we can see that this is not the bottleneck; even for a "large" DataFrame, count(A, MIN_OBS) finishes quickly.
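
The counting step can also be written more compactly with `Counter.update`. This is only a sketch under the same `v > MIN_OBS` rule; the name `count_tokens` and the sample Series are hypothetical and not from the original answer.

```python
import collections
import pandas as pd

def count_tokens(A, min_obs):
    """Return {value: occurrences} for values seen more than min_obs times."""
    counter = collections.Counter()
    for v in A:
        # update() adds one to the counter for every element of the list
        counter.update(v.split(','))
    return {k: n for k, n in counter.items() if n > min_obs}

# Hypothetical sample data for illustration.
A = pd.Series(['null', '5,6', '3', 'null', '5,18,22'])
kept = count_tokens(A, 1)
print(kept)
```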

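The slowness of orig comes from the double for-loop at the end of orig, which modifies the cells of the DataFrame one value at a time (e.g. result_df['A_' + item][k] = 1).

We can replace that double for-loop with a single for-loop over the columns of the DataFrame, using the vectorized string method A.str.contains to search for each value in the strings. Since we never split the original strings into Python lists of strings (or Pandas DataFrames containing fragments of strings), we may save some memory. Since orig and alt use similar data structures, their memory footprints are about the same.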

def alt(A, MIN_OBS):
    temp_dict = count(A, MIN_OBS)
    df = pd.DataFrame(0, index=A.index, columns=list(temp_dict))
    for col in df:
        # Match col only as a whole comma-separated value, so '5' does not match '55'.
        pattern = r'(?:^|,){v}(?:,|$)'.format(v=col)
        df[col] = A.str.contains(pattern).astype(int)
    df.columns = ['A_{}'.format(col) for col in df]
    return df
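
A pattern used for this kind of search has to match a value only as a whole comma-separated item. The small sketch below compares a pattern whose first alternative is anchored only at the start of the string with one that requires a delimiter (or a string boundary) on both sides; the sample Series is hypothetical.

```python
import pandas as pd

# Hypothetical sample: '55' and '15,2' contain the digit 5 but not the value 5.
A = pd.Series(['5,6', '55', '15,2', '3,5', '5', 'null'])
v = '5'

# '^5' alone also matches any string that merely starts with 5, such as '55'.
loose = A.str.contains(r'^{v}|,{v},|,{v}$'.format(v=v))

# Require start-of-string or a comma before, and a comma or end-of-string after.
strict = A.str.contains(r'(?:^|,){v}(?:,|$)'.format(v=v))

print(pd.DataFrame({'A': A, 'loose': loose, 'strict': strict}))
```

The two patterns differ only on the row '55'. Non-capturing groups `(?:...)` are used so that `str.contains` does not warn about match groups.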

Here is an example on a 200K-row DataFrame with 40K different possible values:

import numpy as np
import pandas as pd
import collections

np.random.seed(2016)
ncols = 5
nrows = 200000
nvals = 40000
MIN_OBS = 200

# nrows = 20
# nvals = 4
# MIN_OBS = 2

idx = np.random.randint(ncols, size=nrows).cumsum()
data = np.random.choice(np.arange(nvals), size=idx[-1])
data = np.array_split(data, idx[:-1])
# a list comprehension, so this also yields a list under Python 3
data = [','.join(map(str, arr)) for arr in data]
A = pd.Series(data)
A.loc[A == ''] = 'null'

def orig(A, MIN_OBS):
    # .items() replaces the Python 2 / old-pandas .iteritems()
    temp_dict = collections.defaultdict(int)
    for k, v in A.items():
        temp_list = v.split(',')
        for item in temp_list:
            temp_dict[item] += 1
    cols_to_make = []
    for k, v in temp_dict.items():
        if v > MIN_OBS:
            cols_to_make.append('A_' + k)

    result_df = pd.DataFrame(0, index=A.index, columns=cols_to_make)
    for k, v in A.items():
        temp_list = v.split(',')
        for item in temp_list:
            if ('A_' + item) in cols_to_make:
                result_df['A_' + item][k] = 1
    return result_df

def count(A, MIN_OBS):
    # Count every value, then keep those seen more than MIN_OBS times.
    temp_dict = collections.Counter()
    for k, v in A.items():
        temp_list = v.split(',')
        for item in temp_list:
            temp_dict[item] += 1
    temp_dict = {k:v for k, v in temp_dict.items() if v > MIN_OBS}
    return temp_dict

def alt(A, MIN_OBS):
    temp_dict = count(A, MIN_OBS)
    df = pd.DataFrame(0, index=A.index, columns=list(temp_dict))
    for col in df:
        # Match col only as a whole comma-separated value, so '5' does not match '55'.
        pattern = r'(?:^|,){v}(?:,|$)'.format(v=col)
        df[col] = A.str.contains(pattern).astype(int)
    df.columns = ['A_{}'.format(col) for col in df]
    return df

Here is a benchmark:

In [48]: %timeit expected = orig(A, MIN_OBS)
1 loops, best of 3: 3.03 s per loop

In [49]: %timeit expected = alt(A, MIN_OBS)
1 loops, best of 3: 483 ms per loop

Note that most of the time alt takes to finish is spent in count:

In [60]: %timeit count(A, MIN_OBS)
1 loops, best of 3: 304 ms per loop

Would something like this work, or could it be modified to fit your needs?

df = pd.DataFrame({'A': ['null', '5,6', '3', 'null', '5,18,22']}, columns=['A'])

         A
0     null
1      5,6
2        3
3     null
4  5,18,22

Then use get_dummies():

pd.get_dummies(df['A'].str.split(',').apply(pd.Series), prefix=df.columns[0])

Result:

       A_3  A_5  A_null  A_18  A_6  A_22
index                                   
1        0    0       1     0    0     0
2        0    1       0     0    1     0
3        1    0       0     0    0     0
4        0    0       1     0    0     0
5        0    1       0     1    0     1
