Pandas DataFrame: splitting a string into multiple columns with conditions and missing data

Question

I have a DataFrame that looks like this:

import pandas as pd
df = pd.DataFrame({'feature1':[34,45,52],'feature2':[1,0,1],'unparsed_features':["neoclassical, heavy, $2, old, bronze", "romanticism, gold, $5", "baroque, xs, $3, new"]})

df
       feature1  feature2                     unparsed_features
    0        34         1  neoclassical, heavy, $2, old, bronze
    1        45         0                 romanticism, gold, $5
    2        52         1                  baroque, xs, $3, new

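I am trying to split the unparsed_features column into 6 columns (weight, age, colour, size, price and period), but as you can see the order is jumbled and, on top of that, some fields are missing.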

I have a rough idea of what each column can contain, as shown below:

main_dict = {
 'weight': ['heavy','light'],
 'age': ['new','old'],
 'colour': ['gold','silver','bronze'],
 'size': ['xs','s','m','l','xl','xxl','xxxl'],
 'price': ['$'],
 'period': ['renaissance','baroque','rococo','neoclassical','romanticism']
}

Ideally, I would like my DataFrame to look like this:

df
   feature1  feature2                     unparsed_features weight price  age  \
0        34         1  neoclassical, heavy, $2, old, bronze  heavy    $2  old   
1        45         0                 romanticism, gold, $5           $5        
2        52         1                  baroque, xs, $3, new           $3  new   

  size  colour        period  
0       bronze  neoclassical  
1         gold   romanticism  
2   xs               baroque

I know the first step is to split the string on commas, but after that I am lost.

df['unparsed_features'].str.split(',')

Thanks for your help.

Answer 1

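Because the data in 'unparsed_features' is not structured the same way in every row, I am not sure there is an easy way. One approach is to take the main_dict dictionary you defined, loop over its items, and call str.extract with a pat argument that differs for price: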

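for key, list_item in main_dict.items():
    if key == 'price':
        # a literal "$" followed by one or more digits
        pattern = r'(\$\d+)'
    else:
        # any listed word, delimited by a non-word character or a string boundary
        pattern = r'(?:^|\W)(' + '|'.join(list_item) + r')(?:\W|$)'
    # a single capture group with expand=False returns one Series
    df[key] = df.unparsed_features.str.extract(pattern, expand=False).fillna('')

\$\d+ finds the $ sign followed by one or more digits, while (?:^|\W) and (?:\W|$) require a non-word character (such as a space or a comma) or the start or end of the string on either side of a word from list_item. Only the word itself is captured, so short entries such as 's' or 'l' are not picked up inside longer words, and no leading space ends up in the result.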

You will get the expected result:

   feature1  feature2                     unparsed_features  weight   age  \
0        34         1  neoclassical, heavy, $2, old, bronze   heavy   old   
1        45         0                 romanticism, gold, $5                 
2        52         1                  baroque, xs, $3, new           new   

    colour size price        period  
0   bronze         $2  neoclassical  
1     gold         $5   romanticism  
2            xs    $3       baroque
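
To see why the delimiters around each word matter, here is a minimal sketch using only the standard re module. The helper name build_pattern and the use of re.escape are assumptions added for illustration, not part of the answer above; the size list is copied from main_dict in the question.

```python
import re

# Hypothetical helper, not part of the original answer.
def build_pattern(words):
    """Match any of `words` only when it stands alone as a token."""
    alternatives = '|'.join(re.escape(w) for w in words)
    return re.compile(r'(?:^|\W)(' + alternatives + r')(?:\W|$)')

sizes = ['xs', 's', 'm', 'l', 'xl', 'xxl', 'xxxl']
bounded = build_pattern(sizes)
# Same alternatives, but with no delimiters on either side.
unbounded = re.compile('(' + '|'.join(sizes) + ')')

text = "neoclassical, heavy, $2, old, bronze"

# Without delimiters, a single letter inside "neoclassical" counts as a size.
print(unbounded.search(text).group(1))
# With delimiters, this row has no standalone size token, so search() finds nothing.
print(bounded.search(text))

# A row that really contains a size still matches.
match = bounded.search("baroque, xs, $3, new")
print(match.group(1))
```

The same pattern string can be passed to str.extract; the delimiters are non-capturing groups so that only the word itself is returned.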

Answer 2

Frankly, WB is right that you need to revise your dictionary, but to work with the data as given, here is my approach:

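for keys in main_dict:
    data_list = []
    for value in df.unparsed_features:  # for every row
        # strip each token so the results carry no leading whitespace
        tokens = [v.strip() for v in value.split(',')]
        matching = []
        for l_data in main_dict[keys]:
            if keys == 'price':
                # a price token only needs to contain the marker, e.g. '$'
                matching = [v for v in tokens if l_data in v]
            else:
                # every other key requires an exact match
                matching = [v for v in tokens if l_data == v]
            if matching:
                break
        # the first match wins; None marks a missing field
        data_list.append(matching[0] if matching else None)
    df[keys] = data_list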

Output

   feature1  feature2                     unparsed_features  weight   age  \
0        34         1  neoclassical, heavy, $2, old, bronze   heavy   old   
1        45         0                 romanticism, gold, $5    None  None   
2        52         1                  baroque, xs, $3, new    None   new   

    colour  size price        period  
0   bronze  None    $2  neoclassical  
1     gold  None    $5   romanticism  
2     None    xs    $3       baroque
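
The nested loops above scan every row once per key. As a sketch of a variation (not the method from either answer), each row can instead be parsed in a single pass by inverting main_dict into a token-to-column lookup. The function name parse_features is hypothetical, and the sketch assumes that price tokens start with one of the markers listed under 'price'.

```python
# main_dict is copied from the question; parse_features is a hypothetical helper.
main_dict = {
    'weight': ['heavy', 'light'],
    'age': ['new', 'old'],
    'colour': ['gold', 'silver', 'bronze'],
    'size': ['xs', 's', 'm', 'l', 'xl', 'xxl', 'xxxl'],
    'price': ['$'],
    'period': ['renaissance', 'baroque', 'rococo', 'neoclassical', 'romanticism'],
}

def parse_features(text, mapping=main_dict):
    """Return {column: token or None} for one comma-separated feature string."""
    # Invert the dictionary: exact token -> column name (price is handled by prefix).
    lookup = {word: key
              for key, words in mapping.items() if key != 'price'
              for word in words}
    price_markers = tuple(mapping.get('price', []))
    row = {key: None for key in mapping}
    for token in (part.strip() for part in text.split(',')):
        if token.startswith(price_markers):
            row['price'] = token          # assumption: prices begin with a marker such as '$'
        elif token in lookup:
            row[lookup[token]] = token    # exact match against a known word
    return row

print(parse_features("neoclassical, heavy, $2, old, bronze"))
print(parse_features("romanticism, gold, $5"))
```

Assuming pandas is imported as pd, the per-row dictionaries could then be attached with something like df.join(pd.DataFrame([parse_features(s) for s in df['unparsed_features']], index=df.index)). Tokens that appear in no list are silently ignored in this sketch.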

Answer 1
0 2018-11-13 15:40:19

Answer 2
0 2018-11-13 15:42:16
