Pandas: expanding DataFrame by number of observations in column
Stata has the function expand, which adds rows to a dataset according to the values in a particular column. For example:
I have:
df = pd.DataFrame({"A": [1, 2, 3],
                   "B": [3, 4, 5]})
A B
0 1 3
1 2 4
2 3 5
What I need:
df2 = pd.DataFrame({"A": [1, 2, 3, 2, 3, 3],
                    "B": [3, 4, 5, 4, 5, 5]})
A B
0 1 3
1 2 4
2 3 5
3 2 4
4 3 5
5 3 5
The value in df.loc[0, 'A'] is 1, so no additional row is added to the end of the DataFrame, since B=3 is supposed to occur only once.
The value in df.loc[1, 'A'] is 2, so one observation is added to the end of the DataFrame, bringing the total occurrences of B=4 to 2.
The value in df.loc[2, 'A'] is 3, so two observations are added to the end of the DataFrame, bringing the total occurrences of B=5 to 3.
I've scoured prior questions for something to get me started, but no luck. Any help is appreciated.
There are a number of possibilities, all built around np.repeat:
def using_reindex(df):
    return df.reindex(np.repeat(df.index, df['A'])).reset_index(drop=True)

def using_dictcomp(df):
    return pd.DataFrame({col: np.repeat(df[col].values, df['A'], axis=0)
                         for col in df})

def using_df_values(df):
    return pd.DataFrame(np.repeat(df.values, df['A'], axis=0), columns=df.columns)

def using_loc(df):
    return df.loc[np.repeat(df.index.values, df['A'])].reset_index(drop=True)
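For reference, newer pandas versions also expose Index.repeat, so the same idea can be written without calling np.repeat directly (a small sketch, not part of the original benchmark):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [3, 4, 5]})

# Repeat each index label by the count in column 'A', then let .loc
# materialize the corresponding rows; this mirrors using_loc above.
df2 = df.loc[df.index.repeat(df["A"])].reset_index(drop=True)
```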
For example,
In [219]: df = pd.DataFrame({"A":[1, 2, 3], "B":[3,4,5]})
In [220]: df.reindex(np.repeat(df.index, df['A'])).reset_index(drop=True)
Out[220]:
A B
0 1 3
1 2 4
2 2 4
3 3 5
4 3 5
5 3 5
Here is a benchmark on a 1000-row DataFrame; the result is a roughly 500K-row DataFrame:
In [208]: df = make_dataframe(1000)
In [210]: %timeit using_dictcomp(df)
10 loops, best of 3: 23.6 ms per loop
In [218]: %timeit using_reindex(df)
10 loops, best of 3: 35.8 ms per loop
In [211]: %timeit using_df_values(df)
10 loops, best of 3: 31.3 ms per loop
In [212]: %timeit using_loc(df)
1 loop, best of 3: 275 ms per loop
This is the code I used to generate df:
import numpy as np
import pandas as pd
def make_dataframe(nrows=100):
    df = pd.DataFrame(
        {'A': np.arange(nrows),
         'float': np.random.randn(nrows),
         'str': np.random.choice('Lorem ipsum dolor sit'.split(), size=nrows),
         'datetime64': pd.date_range('20000101', periods=nrows)},
        index=pd.date_range('20000101', periods=nrows))
    return df
df = make_dataframe(1000)
If there are only a few columns, using_dictcomp is the fastest. But note that using_dictcomp assumes df has unique column names: the dict comprehension in using_dictcomp won't repeat duplicated column names. The other alternatives do work with repeated column names, however.
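The duplicated-name caveat can be seen concretely (a small illustration using a hypothetical frame `dup`, not part of the original benchmark):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a duplicated column name 'B'.
dup = pd.DataFrame([[1, 3, 30], [2, 4, 40]], columns=['A', 'B', 'B'])

# Iterating a DataFrame yields its column *names*, so a dict
# comprehension keyed on them silently merges the duplicates:
names_seen_by_dictcomp = {col for col in dup}   # only {'A', 'B'}

# using_df_values, by contrast, goes through the raw 2-D array and
# keeps both 'B' columns:
out = pd.DataFrame(np.repeat(dup.values, dup['A'], axis=0),
                   columns=dup.columns)
```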
Both using_reindex and using_loc assume df has a unique index.
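To see why the unique-index assumption matters: with duplicate labels, df.loc[label] already matches several rows, so label-based repetition over-selects. A safe workaround (a sketch, using a hypothetical frame with repeated index labels) is to fall back to a fresh RangeIndex first:

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose index labels are NOT unique.
df = pd.DataFrame({"A": [1, 2, 3], "B": [3, 4, 5]}, index=[0, 0, 1])

# Reset to a unique RangeIndex before repeating, then repeat as usual.
clean = df.reset_index(drop=True)
out = clean.loc[np.repeat(clean.index.values, clean['A'])].reset_index(drop=True)
```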
using_reindex came from cᴏʟᴅsᴘᴇᴇᴅ's using_loc, in an (unfortunately) now-deleted post. cᴏʟᴅsᴘᴇᴇᴅ showed that it isn't necessary to manually repeat all the values -- you only need to repeat the index and then let df.loc (or df.reindex) repeat all the rows for you. It also avoids accessing df.values, which could generate an intermediate NumPy array of object dtype if df contains columns of multiple dtypes.
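The object-dtype point is easy to verify on a small mixed-dtype frame (an illustration, assuming an int column plus a string column):

```python
import numpy as np
import pandas as pd

# .values must find a single common NumPy dtype for all columns;
# for int + string that common dtype is object.
df = pd.DataFrame({"A": [1, 2], "s": ["x", "y"]})
values_dtype = df.values.dtype            # object

# Repeating via the index instead keeps each column's own dtype.
out = df.reindex(np.repeat(df.index, df["A"])).reset_index(drop=True)
```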