

How to efficiently columnize (=pivoting) pandas DataFrame (with groupby)?

To give you the context of the question:

I have a decently sized SQL table (72M rows, 6 GB) with data which can be understood as "column-based", e.g.:

------------------------------
| fk_id | date       | field |
------------------------------
|     1 | 2001-01-02 |    24 |
|     1 | 2001-01-03 |    25 |
|     1 | 2001-01-04 |    21 |
|     1 | 2001-01-05 |    20 |
|     1 | 2001-01-06 |    30 |
|     1 | 2001-01-07 |    33 |
|            ....            |
|     2 | 2001-01-02 |    10 |
|     2 | 2001-01-03 |    15 |
|     2 | 2001-01-04 |    12 |
|     2 | 2001-01-05 |    11 |
|     2 | 2001-01-06 |    10 |
|     2 | 2001-01-07 |    12 |
|            ....            |
|            ....            |
| 12455 | 2015-01-01 |    99 |
| 12456 | 2005-10-10 |    10 |
| 12456 | 2005-10-11 |    10 |
|            ....            |
------------------------------

The desired end result in Python, as a pandas.DataFrame, should look like this, where date becomes the index, the foreign keys become the column names, and the values of the column field fill the matrix:

------------------------------------------------------
| date       |     1 |     2 |  .... | 12455 | 12456 | 
------------------------------------------------------
| 2001-01-02 |    24 |    10 |  .... |   NaN |   NaN |
| 2001-01-03 |    25 |    15 |  .... |   NaN |   NaN |
| 2001-01-04 |    21 |    12 |  .... |   NaN |   NaN |
| 2001-01-05 |    20 |    11 |  .... |   NaN |   NaN |
| 2001-01-06 |    30 |    10 |  .... |   NaN |   NaN |
| 2001-01-07 |    33 |    12 |  .... |   NaN |   NaN |
|       .... |    .. |    .. |  .... |  .... |  .... |
| 2005-10-10 |    50 |     4 |  .... |   NaN |    10 |
| 2005-10-11 |    51 |     3 |  .... |   NaN |    10 |
|       .... |    .. |    .. |  .... |  .... |  .... |
| 2015-01-01 |    40 |   NaN |  .... |    50 |    99 |
------------------------------------------------------

Until now, I have accomplished this with the following code:

import pandas as pd
from sqlalchemy import and_, select

def _split_by_fk(self, df):
    """
    :param df: pandas.DataFrame
    :return: pandas.DataFrame
    """
    data = dict()
    res = df.groupby('fk_id')
    for r in res:
        fk_id = r[0]
        data[fk_id] = r[1]['field']  # one Series of 'field' values per fk_id, indexed by date
    return pd.DataFrame(data)

def get_data(self, start, end):
    s = select([daily_data.c.date, daily_data.c.fk_id, daily_data.c.field])\
        .where(and_(end >= daily_data.c.date, daily_data.c.date >= start))\
        .order_by(daily_data.c.fk_id, daily_data.c.date)
    data = pd.read_sql(s, con=db_engine, index_col='date')
    return self._split_by_fk(data)


>>> get_data('1960-01-01', '1989-12-31')

which basically does the following (a standalone toy sketch of steps 2-4 follows after the list):

  1. Query the SQL DB via sqlalchemy, directly through the pandas.read_sql function.
  2. groupby the received DataFrame.
  3. Iterate over the group result object and put the groups into a dictionary.
  4. Convert the dict into a DataFrame.
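
For illustration, a minimal, standalone sketch of steps 2-4 on made-up toy data (only pandas is needed; the column names mirror the table above):

import pandas as pd

# Toy long-format data mimicking the SQL table above (values are made up).
df = pd.DataFrame({'fk_id': [1, 1, 2, 2],
                   'date': pd.to_datetime(['2001-01-02', '2001-01-03',
                                           '2001-01-02', '2001-01-03']),
                   'field': [24, 25, 10, 15]}).set_index('date')

# Steps 2-4: group by fk_id, collect each group's 'field' Series, rebuild a DataFrame.
data = {fk_id: group['field'] for fk_id, group in df.groupby('fk_id')}
wide = pd.DataFrame(data)

#              1   2
# date
# 2001-01-02  24  10
# 2001-01-03  25  15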

Querying 29 years of daily data with 13'813 columns takes 4 min 38 s with the above approach (the whole DataFrame takes up 796.5 MB in memory), where %lprun shows that most of the time is spent in the read_sql function and the rest in _split_by_fk (excerpt of the output):

% Time   Line Contents
===============================================================
83.8     data = pd.read_sql(s, con=db_engine, index_col='date')
16.2     return self._split_by_fk(data)
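
For reference, this kind of excerpt comes from the line_profiler extension in IPython; a sketch of the invocation that produces it (assuming the extension is installed, with a hypothetical instance name obj holding these methods):

# IPython session
%load_ext line_profiler
%lprun -f obj.get_data obj.get_data('1960-01-01', '1989-12-31')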

My code does not feel very elegant, as I am collecting all groups in a dictionary only to transform them into a DataFrame again.

Now to my actual question: is there a (more) efficient/pythonic way to "columnize" a pandas.DataFrame in the manner shown above?


PS: I would not be too happy about pointers and hints in more general directions regarding the handling of such data structures and amounts of data, though; I think that it should be possible to solve everything "small data"-style.

If I understand you right, you can do df.pivot(index='date', columns='fk_id', values='field').
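
A minimal sketch of that call on toy data (made-up values, matching the layout of the table in the question):

import pandas as pd

long_df = pd.DataFrame({'fk_id': [1, 1, 2, 2],
                        'date': pd.to_datetime(['2001-01-02', '2001-01-03',
                                                '2001-01-02', '2001-01-03']),
                        'field': [24, 25, 10, 15]})

# pivot reshapes the long format into the wide "columnized" layout in one call.
wide = long_df.pivot(index='date', columns='fk_id', values='field')

# fk_id        1   2
# date
# 2001-01-02  24  10
# 2001-01-03  25  15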

I think that it should be possible to solve everything "small data"-style.

Good luck with that. A DataFrame with 12000 columns is unlikely to perform well.

If the combination of fk_id and date is always unique, you can do something like:

import pandas as pd

df = pd.DataFrame({'fk_id': [1, 2, 3],
                   'date': pd.date_range('1/1/2015', periods=3),
                   'field': [25, 24, 1]})


#         date  field  fk_id
# 0 2015-01-01     25      1
# 1 2015-01-02     24      2
# 2 2015-01-03      1      3

# aggregate each (date, fk_id) group to its single unique value, then unstack fk_id into columns
df.groupby(['date', 'fk_id']).agg(lambda x: x.unique()).unstack()


#            field        
# fk_id          1   2   3
# date                    
# 2015-01-01    25 NaN NaN
# 2015-01-02   NaN  24 NaN
# 2015-01-03   NaN NaN   1

If they're not always unique, you may need some more complicated strategy for aggregating values.
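
One option in that case is pivot_table, which takes an aggregation function for duplicate (date, fk_id) pairs; a minimal sketch (the choice of 'mean' is only an assumption, use whatever aggregation fits the data):

import pandas as pd

dup = pd.DataFrame({'fk_id': [1, 1, 2],
                    'date': pd.to_datetime(['2015-01-01', '2015-01-01', '2015-01-01']),
                    'field': [10, 20, 5]})

# pivot would raise on the duplicated (date, fk_id) pair; pivot_table aggregates instead.
wide = dup.pivot_table(index='date', columns='fk_id', values='field', aggfunc='mean')

# fk_id          1    2
# date
# 2015-01-01  15.0  5.0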
