
Structure dataset from rows to columns pandas python

I have a dataframe like the one below, with many feature columns; only 3 of them are shown here:

productid   |feature1   |value1   |feature2    |value2   |feature3     |value3
100001      |weight     |130g     |            |         |price        |$140.50
100002      |weight     |200g     |pieces      |12 pcs   |dimensions   |150X75cm
100003      |dimensions |70X30cm  |price       |$22.90   |             |
100004      |price      |$12.90   |manufacturer|ABC      |calories     |556Kcal
100005      |calories   |1320Kcal |dimensions  |20X20cm  |manufacturer |XYZ

I want to restructure it with pandas into the following form:

productid   weight   dimensions   price     calories   no. of pieces   manufacturer
100001      130g                  $140.50
100002      200g     150X75cm                          12 pcs
100003               70X30cm      $22.90
100004                            $12.90    556Kcal                    ABC
100005               20X20cm                1320Kcal                   XYZ

I have looked into various pandas methods such as reset_index, stack, etc., but none of them produced the transformation I need.

You are looking for code that unpivots the dataframe. The straightforward approach (which handles many features and possibly repeated product ids) is:

import pandas as pd
import numpy as np

def expand(frame):
    df = pd.DataFrame()
    for row in frame.iterrows():
        data = row[1]
        # pair up the feature columns (odd positions) with the value columns (even positions)
        for feature_name, feature_value in zip(data[1::2], data[2::2]):
            if feature_name:
                df.loc[data.productid, feature_name] = feature_value
    return df.replace(np.nan, '')


df = pd.DataFrame([("100001", "weight", "130g", None, None, "price", "$140.50"),
                   ("100002", "weight", "200g", "pieces", "12 pcs", "dimensions", "150X75cm"),
                   ("100003", "dimensions", "70X30cm", "price", "$22.90", None, None),
                   ("100004", "price", "$12.90", "manufacturer", "ABC", "calories", "556Kcal"),
                   ("100005", "calories", "1320Kcal", "dimensions", "20X20cm", "manufacturer", "XYZ")],
                  columns=["productid", "feature1", "value1", "feature2", "value2", "feature3", "value3"])

xdf = expand(df)
print(xdf)

Output:

       weight    price  pieces dimensions manufacturer  calories
100001   130g  $140.50                                          
100002   200g           12 pcs   150X75cm                       
100003          $22.90            70X30cm                       
100004          $12.90                             ABC   556Kcal
100005                            20X20cm          XYZ  1320Kcal

EDIT1: A slightly more compact form (slow!):

def expand2(frame):
    return pd.DataFrame.from_dict(
        {data.productid: {f: v for f, v in zip(data[1::2], data[2::2]) if f} for _, data in frame.iterrows()},
        orient='index')

EDIT2: Using a generator expression:

import itertools

def expand3(frame):
    return pd.DataFrame.from_records(
        ({f: v for f, v in itertools.chain((('productid', data.productid),), zip(data[1::2], data[2::2])) if f}
         for _, data
         in frame.iterrows()), index='productid').replace(np.nan, '')

Some tests (with the functions decorated with @timeit):

import functools
import sys
import time

def timeit(f):
    @functools.wraps(f)
    def timed(*args, **kwargs):
        try:
            start_time = time.time()
            return f(*args, **kwargs)
        finally:
            end_time = time.time()
            function_invocation = "x"
            sys.stdout.flush()
            print(f'Function {f.__name__}({function_invocation}), took: {end_time - start_time:2.4f} seconds.',
                  flush=True, file=sys.stderr)

    return timed

import random

def generate_wide_df(n_rows, n_features):
    possible_labels = [f'label_{i}' for i in range(n_features)]
    columns = ['productid']
    for i in range(1, n_features):
        columns.append(f'feature_{i}')
        columns.append(f'value_{i}')

    df = pd.DataFrame(columns=columns)
    for row_n in range(n_rows):
        df.loc[row_n, 'productid'] = int(1000000 + row_n)
        for _ in range(n_features):
            feature_num = random.randint(1, n_features)
            df.loc[row_n, f'feature_{feature_num}'] = random.choice(possible_labels)
            df.loc[row_n, f'value_{feature_num}'] = random.randint(1, 10000)
    return df.where(df.notnull(), None)


df = generate_wide_df(4000, 30)


expand(df)
expand3(df)
expand2(df)

Results:

Function expand(x), took: 1.1576 seconds.
Function expand3(x), took: 1.1185 seconds.
Function expand2(x), took: 16.3055 seconds.
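For comparison, here is a vectorized sketch (my addition, not part of the answer above) that avoids the per-row Python loop entirely by letting pd.wide_to_long pair up the feature{n}/value{n} columns:

```python
import pandas as pd

# rebuild the example frame from above
df = pd.DataFrame(
    [("100001", "weight", "130g", None, None, "price", "$140.50"),
     ("100002", "weight", "200g", "pieces", "12 pcs", "dimensions", "150X75cm"),
     ("100003", "dimensions", "70X30cm", "price", "$22.90", None, None),
     ("100004", "price", "$12.90", "manufacturer", "ABC", "calories", "556Kcal"),
     ("100005", "calories", "1320Kcal", "dimensions", "20X20cm", "manufacturer", "XYZ")],
    columns=["productid", "feature1", "value1", "feature2", "value2", "feature3", "value3"])

# collapse every feature{n}/value{n} pair into one long (productid, feature, value) frame
long = (pd.wide_to_long(df, stubnames=["feature", "value"],
                        i="productid", j="num")
          .dropna(subset=["feature"]))

# a single pivot then produces the wide layout
wide = (long.reset_index()
            .pivot(index="productid", columns="feature", values="value")
            .fillna(""))
print(wide)
```

This assumes the columns follow the exact `feature1`/`value1` naming convention and that each (productid, feature) pair appears only once.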

Here is a reproducible example; see the comments in the code for details.

import pandas as pd
from io import StringIO

data = """
productid|feature1|value1|feature2|value2|feature3|value3
100001|weight|130g|||price|$140.50
100002|weight|200g|pieces|12pcs|dimensions|150X75cm
100003|dimensions|70X30cm|price|$22.90||
100004|price|$12.90|manufacturer|ABC|calories|556Kcal
100005|calories|1320Kcal|dimensions|20X20cm|manufacturer|XYZ
"""
# simulate reading from a csv file
df = pd.read_csv(StringIO(data), sep="|")

# pivot all (productid, feature{x}, value{x}) tuples into a tabular dataframe 
# and append them to the following list 
converted = []

# you can construct this programmatically (out of scope for now) 
mapping = {"feature1": "value1", "feature2": "value2", "feature3": "value3"}

# in Python 2 this was mapping.iteritems()
for feature, values in mapping.items():
        # pivot  (productid, feature{x}, value{x}) into a tabular dataframe 
        # columns names : feature{x} 
        # values: value{x}  
        df1 = pd.pivot_table(df, values=values, index=["productid"], columns=[feature], aggfunc=lambda x: x.iloc[0]) 
        # remove the name from the pivoted dataframe to get a standard dataframe 
        df1.columns.name = None
        # keep productid in the dataframe as a column 
        df1.reset_index(inplace=True)
        converted.append(df1)

# merge all dataframe in the list converted into one dataframe 
final_df1 = converted[0] 
for df_ in converted[1:]:
        final_df1 = pd.merge(final_df1, df_, how="outer")

import numpy as np 
# replace None with np.nan so that groupby().first() takes the first non-NaN values
final_df1.fillna(value=np.nan, inplace=True)
# format the data to match what the OP wants
final_df1 = final_df1.groupby("productid", as_index=False).first()

print(final_df1)

Output:

   productid dimensions manufacturer pieces    price  calories weight
0     100001        NaN          NaN    NaN  $140.50       NaN   130g
1     100002   150X75cm          NaN  12pcs      NaN       NaN   200g
2     100003    70X30cm          NaN    NaN   $22.90       NaN    NaN
3     100004        NaN          ABC    NaN   $12.90   556Kcal    NaN
4     100005    20X20cm          XYZ    NaN      NaN  1320Kcal    NaN
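As a side note, the merge loop above can be compressed with functools.reduce; this is a sketch on two toy frames standing in for the `converted` list, assuming the same outer-merge semantics:

```python
import functools
import pandas as pd

# two toy pivoted frames standing in for the `converted` list above
frames = [
    pd.DataFrame({"productid": [100001, 100002], "weight": ["130g", "200g"]}),
    pd.DataFrame({"productid": [100001], "price": ["$140.50"]}),
]

# outer-merge them pairwise, matching on the shared productid column
merged = functools.reduce(
    lambda left, right: pd.merge(left, right, how="outer"), frames)
print(merged)
```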

The tricky part here is that you have multiple feature columns and multiple value columns. It is hard for pandas to deal with that without some help. For example, if you take just a sub-part of the DataFrame with one feature and one value column,

subdf = df[['productid', 'feature1', 'value1']].copy()    
print(subdf)
   productid    feature1    value1
0     100001      weight      130g
1     100002      weight      200g
2     100003  dimensions   70X30cm
3     100004       price    $12.90
4     100005    calories  1320Kcal

...you can use .pivot:

print(subdf.pivot(index='productid', columns='feature1', 
      values='value1'))
feature1   calories dimensions   price weight
productid                                    
100001         None       None    None   130g
100002         None       None    None   200g
100003         None    70X30cm    None   None
100004         None       None  $12.90   None
100005     1320Kcal       None    None   None

In more complicated cases, one way to get started is to first stack all the feature columns and all the value columns, so that your intermediate result is a single DataFrame with one feature column and one value column. That puts the data into a form pivot can accept, and it avoids building a messy function that involves further iteration.

features = pd.concat([df[col] for col in df.filter(like='feature')])
values = pd.concat([df[col] for col in df.filter(like='value')])
res = pd.concat((features, values), axis=1)

# unfortunately, `res` has lost its product ids but we can map them
# back from their index ids from the original df
ids = df.productid.to_dict()
res.index = res.index.map(lambda x: ids[x])
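A possible variation (my sketch, not the answer's code): setting productid as the index before concatenating keeps the ids attached to each stacked row, so no remapping is needed afterwards:

```python
import pandas as pd

df = pd.DataFrame(
    [("100001", "weight", "130g", None, None, "price", "$140.50"),
     ("100002", "weight", "200g", "pieces", "12 pcs", "dimensions", "150X75cm")],
    columns=["productid", "feature1", "value1",
             "feature2", "value2", "feature3", "value3"])

indexed = df.set_index("productid")
# stack all feature columns into one Series, all value columns into another;
# both end up with the same (repeated) productid index, in the same order
features = pd.concat([indexed[c] for c in indexed.filter(like="feature")])
values = pd.concat([indexed[c] for c in indexed.filter(like="value")])
res = pd.concat((features, values), axis=1, keys=["feature", "value"])

# drop the empty feature/value pairs and pivot on the existing productid index
wide = res.dropna().pivot(columns="feature", values="value")
print(wide)
```

Because the index already carries the name `productid`, the final `index.name` assignment becomes unnecessary.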

Now calling pivot on res is straightforward:

res = res.dropna().pivot(columns=0, values=1)
res.index.name = 'productid'

print(res)
           calories dimensions manufacturer pieces    price weight
productid                                                         
100001         None       None         None   None  $140.50   130g
100002         None   150X75cm         None  12pcs     None   200g
100003         None    70X30cm         None   None   $22.90   None
100004      556Kcal       None          ABC   None   $12.90   None
100005     1320Kcal    20X20cm          XYZ   None     None   None

The advantage of this solution is that you call pivot only once, rather than on every sub-frame. The only iteration involved is inside pd.concat, which should give a noticeable speedup on large datasets.
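One caveat (my addition): .pivot raises ValueError when an (index, column) pair is duplicated, so if a product id can repeat with the same feature, pivot_table with an aggregation such as 'first' is a sketch of a workaround:

```python
import pandas as pd

# long-form data where productid 100001 lists `weight` twice
long = pd.DataFrame({
    "productid": ["100001", "100001", "100002"],
    "feature":   ["weight", "weight", "price"],
    "value":     ["130g", "135g", "$9.90"],
})

# pivot_table tolerates duplicates by aggregating; 'first' keeps the first value seen
wide = long.pivot_table(index="productid", columns="feature",
                        values="value", aggfunc="first")
print(wide)
```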
