
Pandas: expanding DataFrame by number of observations in column

Stata has the function expand, which adds rows to a dataset corresponding to the values in a particular column. For example:

I have:

df = pd.DataFrame({"A":[1, 2, 3], 
                   "B":[3,4,5]})

   A  B
0  1  3
1  2  4
2  3  5

What I need:

df2 = pd.DataFrame({"A":[1, 2, 3, 2, 3, 3], 
                    "B":[3,4,5, 4, 5, 5]})

   A  B
0  1  3
1  2  4
2  3  5
3  2  4
4  3  5
5  3  5

The value in df.loc[0,'A'] is 1, so no additional row is added to the end of the DataFrame, since B=3 is only supposed to occur once.

The value in df.loc[1,'A'] is 2, so one observation is added to the end of the DataFrame, bringing the total occurrences of B=4 to 2.

The value in df.loc[2,'A'] is 3, so two observations are added to the end of the DataFrame, bringing the total occurrences of B=5 to 3.

I've scoured prior questions for something to get me started, but no luck. Any help is appreciated.

There are a number of possibilities, all built around np.repeat:

def using_reindex(df):
    return df.reindex(np.repeat(df.index, df['A'])).reset_index(drop=True)

def using_dictcomp(df):
    return pd.DataFrame({col: np.repeat(df[col].values, df['A'], axis=0)
                         for col in df})

def using_df_values(df):
    return pd.DataFrame(np.repeat(df.values, df['A'], axis=0), columns=df.columns)

def using_loc(df):
    return df.loc[np.repeat(df.index.values, df['A'])].reset_index(drop=True)

For example,

In [219]: df = pd.DataFrame({"A":[1, 2, 3], "B":[3,4,5]})
In [220]: df.reindex(np.repeat(df.index, df['A'])).reset_index(drop=True)
Out[220]: 
   A  B
0  1  3
1  2  4
2  2  4
3  3  5
4  3  5
5  3  5
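
The output above keeps the copies of each row together, whereas the frame requested in the question appends the extra copies after the original rows. If that ordering matters, one possible sketch is to concatenate the original frame with only the additional copies. The name expand_append is hypothetical, and treating counts below 1 as 1 is an assumption meant to mimic Stata's expand, which, as far as I know, keeps such rows once rather than dropping them.

```python
import numpy as np
import pandas as pd

def expand_append(df, col):
    # Hypothetical helper: keep the original rows, then append the extra copies.
    # Assumption: counts below 1 are treated as 1, so the row is kept once.
    extra = df[col].clip(lower=1).to_numpy() - 1
    # Repeat integer positions so the result does not depend on index labels.
    positions = np.repeat(np.arange(len(df)), extra)
    return pd.concat([df, df.iloc[positions]], ignore_index=True)

df = pd.DataFrame({"A": [1, 2, 3], "B": [3, 4, 5]})
expanded = expand_append(df, "A")
print(expanded)
```

For the three-row frame from the question, this should reproduce the ordering of df2: the original three rows first, followed by the additional copies.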

Here is a benchmark on a 1000-row DataFrame; the result is a roughly 500K-row DataFrame:

In [208]: df = make_dataframe(1000)

In [210]: %timeit using_dictcomp(df)
10 loops, best of 3: 23.6 ms per loop

In [218]: %timeit using_reindex(df)
10 loops, best of 3: 35.8 ms per loop

In [211]: %timeit using_df_values(df)
10 loops, best of 3: 31.3 ms per loop

In [212]: %timeit using_loc(df)
1 loop, best of 3: 275 ms per loop

This is the code I used to generate df:

import numpy as np
import pandas as pd

def make_dataframe(nrows=100):
    df = pd.DataFrame(
        {'A': np.arange(nrows),
         'float': np.random.randn(nrows),
         'str': np.random.choice('Lorem ipsum dolor sit'.split(), size=nrows),
         'datetime64': pd.date_range('20000101', periods=nrows)},
        index=pd.date_range('20000101', periods=nrows))
    return df

df = make_dataframe(1000)

If there are only a few columns, using_dictcomp is the fastest. But note that using_dictcomp assumes df has unique column names: a dict cannot hold duplicate keys, so the dict comprehension in using_dictcomp won't reproduce duplicated column names. The other alternatives will work with repeated column names, however.

Both using_reindex and using_loc assume df has a unique index.
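
If the index is not unique, one way around that assumption is to repeat integer positions instead of index labels and select with df.iloc. This is only a sketch: the name using_iloc is hypothetical, it is not one of the benchmarked functions, and its speed relative to them has not been measured here.

```python
import numpy as np
import pandas as pd

def using_iloc(df):
    # Repeat row positions 0..n-1 rather than labels, so duplicate
    # index labels do not matter.
    positions = np.repeat(np.arange(len(df)), df['A'].to_numpy())
    return df.iloc[positions].reset_index(drop=True)

# Example frame with a deliberately duplicated index label.
dup = pd.DataFrame({"A": [1, 2, 3], "B": [3, 4, 5]}, index=["x", "x", "y"])
result = using_iloc(dup)
print(result)
```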


using_reindex came from cᴏʟᴅsᴘᴇᴇᴅ's using_loc, in an (unfortunately) now-deleted post. cᴏʟᴅsᴘᴇᴇᴅ showed that it isn't necessary to repeat all the values manually: you only need to repeat the index and then let df.loc (or df.reindex) repeat all the rows for you. This also avoids accessing df.values, which could generate an intermediate NumPy array of object dtype if df contains columns of multiple dtypes.
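
The dtype point can be illustrated with a small frame that mixes an integer column and a string column. This is a minimal sketch with made-up data; the names mixed, via_values and via_reindex are chosen only for illustration.

```python
import numpy as np
import pandas as pd

mixed = pd.DataFrame({"A": [1, 2], "s": ["x", "y"]})

# With an integer column and a string column, .values has to fall back
# to a single object-dtype array.
print(mixed.values.dtype)

# Going through .values: the integer column comes back as object dtype.
via_values = pd.DataFrame(np.repeat(mixed.values, mixed["A"], axis=0),
                          columns=mixed.columns)

# Repeating only the index: each column keeps its own dtype.
via_reindex = mixed.reindex(np.repeat(mixed.index, mixed["A"])).reset_index(drop=True)

print(via_values.dtypes)
print(via_reindex.dtypes)
```

Both results contain the same three rows, but only the reindex route preserves the integer dtype of column A.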
