In Python, convert a pandas "one to many" dataset by transposing rows to columns
I have a dataframe like the following, with many feature columns, only three of which are shown below:
productid |feature1   |value1   |feature2     |value2  |feature3     |value3
100001    |weight     |130g     |             |        |price        |$140.50
100002    |weight     |200g     |pieces       |12 pcs  |dimensions   |150X75cm
100003    |dimensions |70X30cm  |price        |$22.90  |             |
100004    |price      |$12.90   |manufacturer |ABC     |calories     |556Kcal
100005    |calories   |1320Kcal |dimensions   |20X20cm |manufacturer |XYZ
I want to restructure it with pandas as follows:
productid  weight  dimensions  price    calories  no. of pieces  manufacturer
100001     130g                $140.50
100002     200g    150X75cm                       12 pcs
100003             70X30cm     $22.90
100004                         $12.90   556Kcal                  ABC
100005             20X20cm              1320Kcal                 XYZ
I have looked into various pandas methods such as reset_index, stack, etc., but none of them transform the data the way I need.
You are looking for code that un-pivots the dataframe. A straightforward approach (which handles many features and possibly repeated product ids) is:
import pandas as pd
import numpy as np

def expand(frame):
    df = pd.DataFrame()
    for row in frame.iterrows():
        data = row[1]
        # columns 1, 3, 5, ... hold feature names; columns 2, 4, 6, ... hold values
        for feature_name, feature_value in zip(data[1::2], data[2::2]):
            if feature_name:
                df.loc[data.productid, feature_name] = feature_value
    return df.replace(np.nan, '')
df = pd.DataFrame([("100001", "weight", "130g", None, None, "price", "$140.50"),
                   ("100002", "weight", "200g", "pieces", "12 pcs", "dimensions", "150X75cm"),
                   ("100003", "dimensions", "70X30cm", "price", "$22.90", None, None),
                   ("100004", "price", "$12.90", "manufacturer", "ABC", "calories", "556Kcal"),
                   ("100005", "calories", "1320Kcal", "dimensions", "20X20cm", "manufacturer", "XYZ")],
                  columns=["productid", "feature1", "value1", "feature2", "value2", "feature3", "value3"])
xdf = expand(df)
print(xdf)
Output:
        weight    price  pieces dimensions manufacturer  calories
100001    130g  $140.50
100002    200g           12 pcs   150X75cm
100003           $22.90            70X30cm
100004          $12.90                              ABC   556Kcal
100005                            20X20cm           XYZ  1320Kcal
EDIT1: A slightly more compact form (slow!):
def expand2(frame):
    return pd.DataFrame.from_dict(
        {data.productid: {f: v for f, v in zip(data[1::2], data[2::2]) if f}
         for _, data in frame.iterrows()},
        orient='index')
EDIT2: Using a generator expression:
import itertools

def expand3(frame):
    return pd.DataFrame.from_records(
        ({f: v for f, v in itertools.chain((('productid', data.productid),),
                                           zip(data[1::2], data[2::2])) if f}
         for _, data in frame.iterrows()),
        index='productid').replace(np.nan, '')
Some tests (with the functions decorated with @timeit):
import functools
import sys
import time

def timeit(f):
    @functools.wraps(f)
    def timed(*args, **kwargs):
        try:
            start_time = time.time()
            return f(*args, **kwargs)
        finally:
            end_time = time.time()
            function_invocation = "x"
            sys.stdout.flush()
            print(f'Function {f.__name__}({function_invocation}), took: {end_time - start_time:2.4f} seconds.',
                  flush=True, file=sys.stderr)
    return timed
import random

def generate_wide_df(n_rows, n_features):
    possible_labels = [f'label_{i}' for i in range(n_features)]
    columns = ['productid']
    for i in range(1, n_features):
        columns.append(f'feature_{i}')
        columns.append(f'value_{i}')
    df = pd.DataFrame(columns=columns)
    for row_n in range(n_rows):
        df.loc[row_n, 'productid'] = int(1000000 + row_n)
        for _ in range(n_features):
            # feature columns run from 1 to n_features - 1
            feature_num = random.randint(1, n_features - 1)
            df.loc[row_n, f'feature_{feature_num}'] = random.choice(possible_labels)
            df.loc[row_n, f'value_{feature_num}'] = random.randint(1, 10000)
    return df.where(df.notnull(), None)
df = generate_wide_df(4000, 30)
expand(df)
expand3(df)
expand2(df)
Results:
Function expand(x), took: 1.1576 seconds.
Function expand3(x), took: 1.1185 seconds.
Function expand2(x), took: 16.3055 seconds.
Here is a reproducible example; see the comments for details.
import pandas as pd
from io import StringIO  # Python 2: from StringIO import StringIO

data = """
productid|feature1|value1|feature2|value2|feature3|value3
100001|weight|130g|||price|$140.50
100002|weight|200g|pieces|12pcs|dimensions|150X75cm
100003|dimensions|70X30cm|price|$22.90||
100004|price|$12.90|manufacturer|ABC|calories|556Kcal
100005|calories|1320Kcal|dimensions|20X20cm|manufacturer|XYZ
"""

# simulate reading from a csv file
df = pd.read_csv(StringIO(data), sep="|")

# pivot all (productid, feature{x}, value{x}) tuples into a tabular dataframe
# and append them to the following list
converted = []
# you can construct this programmatically (out of scope for now)
mapping = {"feature1": "value1", "feature2": "value2", "feature3": "value3"}

# items() was iteritems() in Python 2
for feature, values in mapping.items():
    # pivot (productid, feature{x}, value{x}) into a tabular dataframe
    # column names: feature{x}
    # values: value{x}
    df1 = pd.pivot_table(df, values=values, index=["productid"], columns=[feature],
                         aggfunc=lambda x: x.iloc[0])
    # remove the name from the pivoted dataframe to get a standard dataframe
    df1.columns.name = None
    # keep productid in the dataframe as a column
    df1.reset_index(inplace=True)
    converted.append(df1)

# merge all dataframes in the list converted into one dataframe
final_df1 = converted[0]
for index, df_ in enumerate(converted[1:]):
    final_df1 = pd.merge(final_df1, df_, how="outer")

import numpy as np
# replace None with np.nan so groupby().first() takes the first non-NaN values
final_df1.fillna(value=np.nan, inplace=True)
# format the data to match what the OP wants
final_df1 = final_df1.groupby("productid", as_index=False).first()
print(final_df1)
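The comment above notes that mapping can be built programmatically (out of scope there). A minimal sketch, assuming the columns keep the feature{n}/value{n} naming pattern (build_mapping is a hypothetical helper name):

```python
import re

def build_mapping(columns):
    """Pair each feature{n} column with its value{n} column by the shared index n."""
    mapping = {}
    for col in columns:
        m = re.fullmatch(r"feature(\d+)", col)
        if m:
            mapping[col] = f"value{m.group(1)}"
    return mapping

columns = ["productid", "feature1", "value1", "feature2", "value2", "feature3", "value3"]
print(build_mapping(columns))
# {'feature1': 'value1', 'feature2': 'value2', 'feature3': 'value3'}
```

This also scales to any number of feature/value pairs without hand-editing the dict.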
Output:
   productid dimensions manufacturer pieces    price calories weight
0     100001        NaN          NaN    NaN  $140.50      NaN   130g
1     100002   150X75cm          NaN  12pcs      NaN      NaN   200g
2     100003    70X30cm          NaN    NaN   $22.90      NaN    NaN
3     100004        NaN          ABC    NaN   $12.90  556Kcal    NaN
4     100005    20X20cm          XYZ    NaN      NaN 1320Kcal    NaN
The difficulty here is that you have multiple feature columns and multiple value columns. Pandas has a hard time recognizing this without some help. For example, if you take a sub-part of the DataFrame with just one feature and one value column,
subdf = df[['productid', 'feature1', 'value1']].copy()
print(subdf)
productid feature1 value1
0 100001 weight 130g
1 100002 weight 200g
2 100003 dimensions 70X30cm
3 100004 price $12.90
4 100005 calories 1320Kcal
print(subdf.pivot(index='productid', columns='feature1', values='value1'))
feature1 calories dimensions price weight
productid
100001 None None None 130g
100002 None None None 200g
100003 None 70X30cm None None
100004 None None $12.90 None
100005 1320Kcal None None None
In more complex cases, one way to get started is to first stack all the feature columns and all the value columns. That way the intermediate result is a single DataFrame with one feature column and one value column, which puts it in a form pivot can accept. It also avoids having to build messy functions involving further iteration.
features = pd.concat([df[col] for col in df.filter(like='feature')])
values = pd.concat([df[col] for col in df.filter(like='value')])
res = pd.concat((features, values), axis=1)
# unfortunately, `res` has lost its product ids but we can map them
# back from their index ids from the original df
ids = df.productid.to_dict()
res.index = res.index.map(lambda x: ids[x])
Now calling pivot on res is straightforward:
res = res.dropna().pivot(columns=0, values=1)
res.index.name = 'productid'
print(res)
calories dimensions manufacturer pieces price weight
productid
100001 None None None None $140.50 130g
100002 None 150X75cm None 12pcs None 200g
100003 None 70X30cm None None $22.90 None
100004 556Kcal None ABC None $12.90 None
100005 1320Kcal 20X20cm XYZ None None None
The advantage of this solution is that you call pivot only once, rather than once per sub-frame. The only iteration involved is in pd.concat, which should give a noticeable speedup on large datasets.
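For comparison (not used in the answers above), pandas also ships pd.wide_to_long, which directly understands the feature{n}/value{n} column pattern; a minimal sketch on the same data, following the same melt-then-pivot-once idea:

```python
import pandas as pd

df = pd.DataFrame([("100001", "weight", "130g", None, None, "price", "$140.50"),
                   ("100002", "weight", "200g", "pieces", "12 pcs", "dimensions", "150X75cm"),
                   ("100003", "dimensions", "70X30cm", "price", "$22.90", None, None),
                   ("100004", "price", "$12.90", "manufacturer", "ABC", "calories", "556Kcal"),
                   ("100005", "calories", "1320Kcal", "dimensions", "20X20cm", "manufacturer", "XYZ")],
                  columns=["productid", "feature1", "value1", "feature2", "value2",
                           "feature3", "value3"])

# melt feature1/value1, feature2/value2, ... into long (feature, value) pairs,
# keyed by productid and the pair number n
long = pd.wide_to_long(df, stubnames=["feature", "value"], i="productid", j="num")
long = long.dropna().reset_index()

# a single pivot then produces the wide table
res = long.pivot(index="productid", columns="feature", values="value")
print(res)
```

This requires productid to be unique per row (it is here); for duplicated ids you would fall back on pivot_table with an aggregation function.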