
Splitting a List inside a Pandas DataFrame

I have a CSV file that contains a number of columns. Using pandas, I read this file into a DataFrame with a datetime index and five or six other columns.

One of the columns is a list of timestamps (example below, shown with the index):

CreateDate     TimeStamps
4/1/11         [Timestamp('2012-02-29 00:00:00'), Timestamp('2012-03-31 00:00:00'), Timestamp('2012-04-25 00:00:00'), Timestamp('2012-06-30 00:00:00')]
4/2/11         [Timestamp('2014-01-31 00:00:00')]
6/8/11         [Timestamp('2012-08-31 00:00:00'), Timestamp('2012-09-30 00:00:00'), Timestamp('2012-11-07 00:00:00'), Timestamp('2013-01-10 00:00:00'), Timestamp('2013-07-25 00:00:00')]

What I'd like to do is convert the timestamp column into a separate row for each timestamp listed. For example, row 1 would convert to 4 rows and row 2 would convert to 1 row. I realize I'd need to reset the index to be able to do this, which is fine.

Everything I've tried ends up out in left field (taking the values and creating a list outside of pandas, etc.).

Any suggestions appreciated.

If you want to stay in pure pandas, you can throw in a tricky groupby and apply, which boils down to a one-liner if you don't count the column rename.

In [1]: import pandas as pd

In [2]: d = {'date': ['4/1/11', '4/2/11'], 'ts': [[pd.Timestamp('2012-02-29 00:00:00'), pd.Timestamp('2012-03-31 00:00:00'), pd.Timestamp('2012-04-25 00:00:00'), pd.Timestamp('2012-06-30 00:00:00')], [pd.Timestamp('2014-01-31 00:00:00')]]}

In [3]: df = pd.DataFrame(d)

In [4]: df.head()
Out[4]: 
     date                                                 ts
0  4/1/11  [2012-02-29 00:00:00, 2012-03-31 00:00:00, 201...
1  4/2/11                              [2014-01-31 00:00:00]

In [5]: df_new = df.groupby('date').ts.apply(lambda x: pd.DataFrame(x.values[0])).reset_index().drop('level_1', axis = 1)

In [6]: df_new.columns = ['date','ts']

In [7]: df_new.head()
Out[7]: 
     date         ts
0  4/1/11 2012-02-29
1  4/1/11 2012-03-31
2  4/1/11 2012-04-25
3  4/1/11 2012-06-30
4  4/2/11 2014-01-31

Since the goal is to take the value of a column (in this case date) and repeat it for each of the rows you intend to create from the list, it's useful to think in terms of pandas indexing.

We want the date to become the single index for the new rows, so we use groupby, which puts the desired row value into an index. Then, inside that operation, I want to split only the list for this date, which is what apply will do for us.

I'm passing apply a pandas Series that consists of a single list, and I can access that list via .values[0]: .values turns the Series into an array with a single entry, and [0] picks out that entry.

To turn the list into a set of rows that will be passed back to the indexed date, I can just make it a DataFrame. This incurs the penalty of picking up an extra index, but we end up dropping that. We could make this an index itself, but that would preclude duplicate values.

Once this is passed back out I have a multi-index, but I can force it into the row format we want with reset_index. Then we simply drop the unwanted index.
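The steps described above can be traced one at a time. This is only a sketch on a shortened version of the sample data; the intermediate names stacked, flat and tidy are illustrative and not part of the original one-liner.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["4/1/11", "4/2/11"],
    "ts": [
        [pd.Timestamp("2012-02-29"), pd.Timestamp("2012-03-31")],
        [pd.Timestamp("2014-01-31")],
    ],
})

# Step 1: build one small DataFrame per date. pandas stacks them under a
# two-level index: the date, plus the position inside each list.
stacked = df.groupby("date")["ts"].apply(lambda s: pd.DataFrame(s.values[0]))

# Step 2: move both index levels into ordinary columns.
flat = stacked.reset_index()

# Step 3: drop the positional level and restore the column names.
tidy = flat.drop(columns="level_1")
tidy.columns = ["date", "ts"]
print(tidy)
```

Printing stacked and flat along the way shows where the extra index comes from and why it has to be dropped.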

It sounds involved, but really we're just leveraging the natural behaviors of pandas functions to avoid explicitly iterating or looping.

Speed-wise this tends to be pretty good, and since it relies on apply, any parallelization tricks that work with apply work here.

Optionally, if you want it to be robust to a date that appears on several rows, each with its own list:

df_new = df.groupby('date').ts.apply(lambda x: pd.DataFrame([item for sublist in x.values for item in sublist]))

at which point the one-liner is getting dense and you should probably move it into a function.
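One possible shape for such a function is sketched below. The name explode_lists and its key and col parameters are hypothetical, and the sample data adds a repeated date to exercise the flattening.

```python
import pandas as pd

def explode_lists(df, key="date", col="ts"):
    """Return one row per list item, repeating `key` for each item."""
    def flatten(group):
        # `group` is the Series of lists that share one key value.
        items = [item for sublist in group for item in sublist]
        return pd.DataFrame({col: items})

    # sort=False keeps the keys in order of first appearance.
    stacked = df.groupby(key, sort=False)[col].apply(flatten)
    # Move the key back into a column and discard the positional level.
    return stacked.reset_index(level=0).reset_index(drop=True)

df = pd.DataFrame({
    "date": ["4/1/11", "4/2/11", "4/1/11"],  # '4/1/11' appears twice
    "ts": [
        [pd.Timestamp("2012-02-29"), pd.Timestamp("2012-03-31")],
        [pd.Timestamp("2014-01-31")],
        [pd.Timestamp("2012-04-25")],
    ],
})
df_new = explode_lists(df)
print(df_new)
```

Rows for a repeated date end up grouped together, so the output follows group order rather than the original row order.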

The way I did it was to split the list into separate columns and then melt it to put each timestamp in a separate row.

In [48]: df = pd.DataFrame([[1,2,[1,2,4]],[4,5,[1,3]],],columns=['a','b','TimeStamp'])
    ...: df
Out[48]: 
   a  b  TimeStamp
0  1  2  [1, 2, 4]
1  4  5     [1, 3]

You can convert the column to a list and then back to a DataFrame to split it into columns:

In [53]: TScolumns = pd.DataFrame(df.TimeStamp.tolist(), )
    ...: TScolumns
Out[53]: 
   0  1   2
0  1  2   4
1  1  3 NaN

And then splice it onto the original DataFrame:

In [90]: df = df.drop('TimeStamp',axis=1)
In [58]: split = pd.concat([df, TScolumns], axis=1)
    ...: split
Out[58]: 
   a  b  0  1   2
0  1  2  1  2   4
1  4  5  1  3 NaN

Finally, use melt to get it into the shape you want:

In [89]: pd.melt(split, id_vars=['a', 'b'], value_name='TimeStamp')
Out[89]: 
   a  b variable  TimeStamp
0  1  2        0          1
1  4  5        0          1
2  1  2        1          2
3  4  5        1          3
4  1  2        2          4
5  4  5        2        NaN
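The melted frame still carries the NaN row created by padding the shorter list, plus the helper variable column. A minimal sketch of one way to tidy that up, assuming the same toy data as in this answer; the names ts_columns and tidy are only illustrative:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, [1, 2, 4]], [4, 5, [1, 3]]],
                  columns=["a", "b", "TimeStamp"])

# Spread each list across columns; shorter lists are padded with NaN.
ts_columns = pd.DataFrame(df["TimeStamp"].tolist(), index=df.index)
split = pd.concat([df.drop(columns="TimeStamp"), ts_columns], axis=1)

melted = pd.melt(split, id_vars=["a", "b"], value_name="TimeStamp")

# Drop the padding rows and the helper column, then restore a row-wise order.
tidy = (
    melted.dropna(subset=["TimeStamp"])
          .drop(columns="variable")
          .sort_values(["a", "b", "TimeStamp"])
          .reset_index(drop=True)
)
print(tidy)
```

Because of the NaN padding the TimeStamp column comes back as float here; with real timestamps the padding would be NaT instead and dropna removes it the same way.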

This doesn't feel very Pythonic, but it works (provided your CreateDate is unique!)

apply can only return more rows than it receives when it is used with a groupby, so we're going to use groupby artificially (i.e. group by a column of unique values, so each group is one row).

def splitRows(x):

    # Extract the actual list of time-stamps. 
    theList = x.TimeStamps.iloc[0]

    # Each row will be a dictionary in this list.
    listOfNewRows = list()

    # Iterate over items in list of timestamps, 
    # putting each one in a dictionary to later convert to a row, 
    # then adding the dictionary to a list. 

    for i in theList:
        newRow = dict()
        newRow['CreateDate'] = x.CreateDate.iloc[0]
        newRow['TimeStamps'] = i
        listOfNewRows.append(newRow)

    # Now convert these dictionaries into rows in a new dataframe and return it. 
    return pd.DataFrame(listOfNewRows)


df.groupby('CreateDate', as_index = False, group_keys = False).apply(splitRows)

Follow-up: If CreateDate is NOT unique, you can just reset the index to a new column and group by that.
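A rough sketch of that follow-up, assuming the frame has a default integer index so that reset_index creates a column named index; split_row is a shortened, hypothetical stand-in for the splitRows function above:

```python
import pandas as pd

df = pd.DataFrame({
    "CreateDate": ["4/1/11", "4/1/11", "4/2/11"],  # note the duplicate date
    "TimeStamps": [
        [pd.Timestamp("2012-02-29"), pd.Timestamp("2012-03-31")],
        [pd.Timestamp("2012-04-25")],
        [pd.Timestamp("2014-01-31")],
    ],
})

def split_row(group):
    # Each group holds exactly one original row.
    row = group.iloc[0]
    # The scalar CreateDate is broadcast against the list of timestamps.
    return pd.DataFrame({"CreateDate": row["CreateDate"],
                         "TimeStamps": row["TimeStamps"]})

result = (
    df.reset_index()                        # old index becomes column 'index'
      .groupby("index", group_keys=False)[["CreateDate", "TimeStamps"]]
      .apply(split_row)
      .reset_index(drop=True)
)
print(result)
```

Grouping on the unique index column means two rows with the same CreateDate are never merged into one group.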

Probably not the best way from a performance perspective, but you can still leverage the itertools package:

from pandas import DataFrame, Timestamp
import itertools

d = {'date': ['4/1/11', '4/2/11'], 'ts': [[Timestamp('2012-02-29 00:00:00'), Timestamp('2012-03-31 00:00:00'), Timestamp('2012-04-25 00:00:00'), Timestamp('2012-06-30 00:00:00')], [Timestamp('2014-01-31 00:00:00')]]}
df = DataFrame(d)

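res = df.to_dict()
data = []
for x in res['date'].keys():
  # Pair the date with every timestamp in its list; zip_longest pads the
  # one-element date list with the date itself.
  # (Python 3 name; in Python 2 this function was itertools.izip_longest.)
  data.append(itertools.zip_longest([res['date'][x]], res['ts'][x], fillvalue=res['date'][x]))

new_data = list(itertools.chain.from_iterable(data))
df2 = DataFrame(new_data, columns=['date', 'timestamp'])
print(df2)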

Will print:

     date  timestamp
0  4/1/11 2012-02-29
1  4/1/11 2012-03-31
2  4/1/11 2012-04-25
3  4/1/11 2012-06-30
4  4/2/11 2014-01-31
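One caveat with the zip_longest approach: for a row whose list is empty, the padding produces a single (date, date) pair. A possible variant, sketched here with itertools.repeat and the built-in zip (Python 3 assumed; the third sample row is added only to show the empty-list case), emits no row at all in that situation:

```python
from itertools import chain, repeat

import pandas as pd

df = pd.DataFrame({
    "date": ["4/1/11", "4/2/11", "4/3/11"],
    "ts": [
        [pd.Timestamp("2012-02-29"), pd.Timestamp("2012-03-31")],
        [pd.Timestamp("2014-01-31")],
        [],  # empty list: contributes no rows
    ],
})

# repeat(date) is an endless stream of the same date; zip stops as soon as
# the list of timestamps is exhausted.
pairs = chain.from_iterable(
    zip(repeat(date), stamps) for date, stamps in zip(df["date"], df["ts"])
)
df2 = pd.DataFrame(list(pairs), columns=["date", "timestamp"])
print(df2)
```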

A newer way to do this is to use explode (see the pandas documentation; available since pandas 0.25):

import pandas as pd

d = {'date': ['4/1/11', '4/2/11'], 'ts': [[pd.Timestamp('2012-02-29 00:00:00'), pd.Timestamp('2012-03-31 00:00:00'), pd.Timestamp('2012-04-25 00:00:00'), pd.Timestamp('2012-06-30 00:00:00')], [pd.Timestamp('2014-01-31 00:00:00')]]}

test_df = pd.DataFrame(d)

result_df = test_df.explode('ts')

result_df.head()

Output

    date    ts
0   4/1/11  2012-02-29
0   4/1/11  2012-03-31
0   4/1/11  2012-04-25
0   4/1/11  2012-06-30
1   4/2/11  2014-01-31
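Note that explode repeats the original index labels (0, 0, 0, 0, 1) and leaves the exploded column with object dtype. If a fresh index and a proper datetime column are wanted, something along these lines should work; this is a sketch on the same sample data rather than part of the original answer:

```python
import pandas as pd

d = {
    "date": ["4/1/11", "4/2/11"],
    "ts": [
        [pd.Timestamp("2012-02-29"), pd.Timestamp("2012-03-31"),
         pd.Timestamp("2012-04-25"), pd.Timestamp("2012-06-30")],
        [pd.Timestamp("2014-01-31")],
    ],
}
test_df = pd.DataFrame(d)

# One row per list item, renumbered from 0.
result_df = test_df.explode("ts").reset_index(drop=True)

# Convert the exploded object column back to datetime64.
result_df["ts"] = pd.to_datetime(result_df["ts"])
print(result_df)
```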
