Python Pandas dataframe: For each month of the year, add the date with last day in the month to an index if month not present, or remove duplicates

First of all, my apologies for the somewhat convoluted title.

I could not find a succinct way to describe what I have been trying to achieve for a few hours. Allow me to explain the problem more clearly (FYI I'm using Python 3.6 and Pandas 0.20.3).

I have a MultiIndex DataFrame that currently looks like this:

                            d   p
name            paymentDate

Rib Smoth       2011-01-01  0   0
                2011-02-01  0   0
                2011-03-01  0   0
                2011-04-01  0   0
                2011-05-01  0   0
                2011-06-01  0   0
                2011-07-01  0   0
                2011-08-01  0   0
                2011-09-01  0   0
                2011-10-01  0   0
                2011-11-01  0   0
                2011-12-01  0   0
Balrud Big      2011-01-02  1   1
                2011-01-12  2   1
                2011-02-13  2   1
                2011-03-28  3   1
                2011-04-16  2   1
                2011-06-09  1   1
                2011-06-27  3   1
                2011-07-17  2   1
                2011-09-05  1   1
                2011-09-16  2   1
                2011-10-29  3   1
                2011-11-06  1   0
Mr. Bean        2011-01-01  0   0
                2011-02-02  1   0
                        .
                        .
                        .

As you can see, the second level is a series of dates on which people paid their rent. Some renters missed payments in some months, or paid more than once in others. I need to "homogenise" paymentDate; in other words, I want exactly 12 entries in the second level for every renter in the dataframe.

I believe the steps below should take care of it, but I have no idea how to implement them:

  1. For each renter, if they have no paymentDate for a given month, insert a row whose paymentDate is the last day of that month, with d=3 and p=1. In the example above, this would entail adding a row for May to Balrud Big: 2011-05-31 3 1.

  2. For each renter, I also need to remove cases where there are two or more paymentDate entries in the same month. Looking at Balrud Big again, we see two entries for January. Wherever there are duplicates like this, I wish to keep only the most recent entry, which in this case is 2011-01-12 2 1.

If the above were applied to the example shown, noting that Balrud Big has several cases of both missing entries and duplicates, I'd hope to end up with:

                            d   p
name            paymentDate

Rib Smoth       2011-01-01  0   0
                2011-02-01  0   0
                2011-03-01  0   0
                2011-04-01  0   0
                2011-05-01  0   0
                2011-06-01  0   0
                2011-07-01  0   0
                2011-08-01  0   0
                2011-09-01  0   0
                2011-10-01  0   0
                2011-11-01  0   0
                2011-12-01  0   0
Balrud Big      2011-01-12  2   1
                2011-02-13  2   1
                2011-03-28  3   1
                2011-04-16  2   1
                2011-05-31  3   1
                2011-06-27  3   1
                2011-07-17  2   1
                2011-08-31  3   1
                2011-09-16  2   1
                2011-10-29  3   1
                2011-11-06  1   0
                2011-12-31  3   1
Mr. Bean        2011-01-01  0   0
                2011-02-02  1   0
                        .
                        .
                        .

Finally, I could then reindex the second level with integers 1-12 (for the 12 months), safe in the knowledge that every renter has exactly 12 months of history. Then, using DataFrame.pivot or otherwise, I would transform the dataframe to end up with something like:

                d1  p1  d2  p2  d3  p3  d4  p4  d5  p5  d6  p6  d7  p7  d8  p8  d9  p9  d10  p10  d11  p11  d12  p12
name

Rib Smoth       0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0    0    0    0    0    0
Balrud Big      2   1   2   1   3   1   2   1   3   1   3   1   2   1   3   1   2   1   3    1    1    0    3    1
Mr. Bean        0   0   1   0   ...(and so on)

It seems like quite a complex task, but I imagine there may be some clever tricks using datetime or Pandas' extensive date/time functionality. I've been trying for a while and am still stumped.

Any help on this is greatly appreciated, thank you in advance!

EDIT: I have a solution, but it needs a bit of tidying up before I share.

First, create the sample data

import pandas as pd
import numpy as np

arrays = [
    np.array(['Rib Smoth']*12 + ['Balrud Big']*12 + ['Mr. Bean']*2),
    pd.to_datetime([
        '2011-01-01', '2011-02-01', '2011-03-01', '2011-04-01', '2011-05-01',
        '2011-06-01', '2011-07-01', '2011-08-01', '2011-09-01', '2011-10-01',
        '2011-11-01', '2011-12-01', '2011-01-02', '2011-01-12', '2011-02-13',
        '2011-03-28', '2011-04-16', '2011-06-09', '2011-06-27', '2011-07-17',
        '2011-09-05', '2011-09-16', '2011-10-29', '2011-11-06', '2011-01-01',
        '2011-02-02'])
]
df = pd.DataFrame(
    index=pd.MultiIndex.from_tuples(list(zip(*arrays)),
                                    names=['name', 'paymentDate'])
)
df['d'] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 3, 2, 1, 3, 2, 1, 2, 3, 1, 0, 1]
df['p'] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
# print(df.head(3))
#                        d  p
# name      paymentDate      
# Rib Smoth 2011-01-01   0  0
#           2011-02-01   0  0
#           2011-03-01   0  0

Move paymentDate from index level to a column

df = df.reset_index(level='paymentDate')
# print(df.head(3))
#           paymentDate  d  p
# name                       
# Rib Smoth  2011-01-01  0  0
# Rib Smoth  2011-02-01  0  0
# Rib Smoth  2011-03-01  0  0

Create a series to be used when grouping by name and month

payment_month = df['paymentDate'].dt.to_period('M').rename('month')
# print(payment_month.head(3))
# name
# Rib Smoth    2011-01
# Rib Smoth    2011-02
# Rib Smoth    2011-03
# Name: month, dtype: period[M]

Group, keeping only the last payment in each month

df = df.groupby(['name', payment_month])[['paymentDate', 'd', 'p']].last()
# print(df.head(3))
#                    paymentDate  d  p
# name       month                    
# Balrud Big 2011-01  2011-01-12  2  1  # Note: last payment in 2011-01
#            2011-02  2011-02-13  2  1
#            2011-03  2011-03-28  3  1
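
One caveat about this step: `last()` returns the final row of each group in the frame's current row order, not necessarily the row with the latest date. The sample data is already chronological within each renter, so the result is the same here. A minimal sketch, assuming the payments could arrive unsorted, is to sort by paymentDate before grouping. The small frame `demo` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical, deliberately unsorted payments for a single renter.
demo = pd.DataFrame({
    'name': ['Balrud Big'] * 3,
    'paymentDate': pd.to_datetime(['2011-01-12', '2011-01-02', '2011-02-13']),
    'd': [2, 1, 2],
    'p': [1, 1, 1],
})

# Sort chronologically so that "last" really means "most recent".
demo = demo.sort_values('paymentDate')

# Group by renter and calendar month, keeping the last row of each group.
month = demo['paymentDate'].dt.to_period('M').rename('month')
latest = demo.groupby(['name', month])[['paymentDate', 'd', 'p']].last()
print(latest)
```

Without the `sort_values` call, the January group would keep the 2011-01-02 row, because that row comes later in the unsorted input.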

Set the index to the last day of each month, for later use with months for which there is no payment

df.index = df.index.set_levels(df.index.levels[-1].to_timestamp('M'), level='month')
# print(df.head(3))
#                       paymentDate  d  p
# name       month                       
# Balrud Big 2011-01-31  2011-01-12  2  1
#            2011-02-28  2011-02-13  2  1
#            2011-03-31  2011-03-28  3  1

Fill in the dataframe with rows for missing months, by combining each name with all months

all_names = df.index.get_level_values('name').unique()
all_months = pd.date_range('2011-01-01', '2011-12-31', freq='M')
df = df.reindex(pd.MultiIndex.from_product(
    [all_names, all_months],
    names=['name', 'all_months']
))
# print(df.head())
#                       paymentDate    d    p
# name       all_months                      
# Balrud Big 2011-01-31  2011-01-12  2.0  1.0
#            2011-02-28  2011-02-13  2.0  1.0
#            2011-03-31  2011-03-28  3.0  1.0
#            2011-04-30  2011-04-16  2.0  1.0
#            2011-05-31         NaT  NaN  NaN # This row is new!
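
The `d` and `p` values now print as 2.0 and 1.0 because reindexing introduces NaN for the new rows, and NaN forces an integer column to float. A tiny sketch of the same mechanism, using a made-up frame `small` with an integer `slot` level instead of dates:

```python
import pandas as pd

# Hypothetical frame: renter 'B' has no entry for slot 2.
small = pd.DataFrame(
    {'d': [1, 2, 3]},
    index=pd.MultiIndex.from_tuples(
        [('A', 1), ('A', 2), ('B', 1)], names=['name', 'slot']),
)

# Cartesian product of every name with every slot.
full_index = pd.MultiIndex.from_product(
    [['A', 'B'], [1, 2]], names=['name', 'slot'])

# Rows missing from `small` are created and filled with NaN.
filled = small.reindex(full_index)
print(filled)
print(filled['d'].dtype)
```

This is why the integer dtypes have to be restored once the missing values are filled in.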

Complete the data with the desired values

no_payment = df['paymentDate'].isnull()
df.loc[no_payment, ['d', 'p']] = [3, 1]
df.loc[no_payment, ['paymentDate']] = df.index.get_level_values(-1)[no_payment]
# print(df.head())
#                       paymentDate    d    p
# name       all_months                      
# Balrud Big 2011-01-31  2011-01-12  2.0  1.0
#            2011-02-28  2011-02-13  2.0  1.0
#            2011-03-31  2011-03-28  3.0  1.0
#            2011-04-30  2011-04-16  2.0  1.0
#            2011-05-31  2011-05-31  3.0  1.0 # The column values are fixed!

Finally, replace the temporary index level with the column that holds the correct values

df = df.set_index([df.index.get_level_values('name'), 'paymentDate'])
# print(df.head(3))
#                           d    p
# name       paymentDate          
# Balrud Big 2011-01-12   2.0  1.0
#            2011-02-13   2.0  1.0
#            2011-03-28   3.0  1.0

Restore the correct data types

df['d'] = df['d'].astype(int)
df['p'] = df['p'].astype(int)
# print(df.head(3))
#                         d  p
# name       paymentDate      
# Balrud Big 2011-01-12   2  1
#            2011-02-13   2  1
#            2011-03-28   3  1

Run some basic tests:

assert (df.loc[('Rib Smoth', slice(None))] == 0).all().all()
assert ('Balrud Big', '2011-01-02') not in df.index
assert ('Balrud Big', '2011-06-09') not in df.index
assert ('Balrud Big', '2011-09-05') not in df.index
assert (df.loc[('Balrud Big', '2011-01-12')] == [2, 1]).all()
assert (df.loc[('Balrud Big', '2011-12-31')] == [3, 1]).all()
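
The question also asks for a wide layout with one row per renter and columns d1, p1, ..., d12, p12, which the steps above stop short of. A minimal sketch of one way to do it, assuming each renter already has exactly one row per calendar month within a single year (as the result above does). The helper name `to_wide` and the two-month toy frame are made up for illustration.

```python
import pandas as pd


def to_wide(frame):
    """Reshape (name, paymentDate)-indexed rows into one row per name.

    Assumes one row per renter per calendar month within a single year.
    """
    out = frame.reset_index(level='paymentDate')
    # Replace each date by its month number (1-12) and index on it.
    out['month'] = out['paymentDate'].dt.month
    out = out.drop(columns='paymentDate').set_index('month', append=True)
    # Move the month level into the columns: labels become (column, month).
    wide = out.unstack('month')
    # Order the columns month by month, then flatten to d1, p1, d2, p2, ...
    wide = wide.swaplevel(axis=1).sort_index(axis=1)
    wide.columns = ['{}{}'.format(col, m) for m, col in wide.columns]
    return wide


# Toy data: two renters, two months each (hypothetical values).
toy = pd.DataFrame(
    {'d': [0, 0, 2, 3], 'p': [0, 0, 1, 1]},
    index=pd.MultiIndex.from_tuples(
        [('Rib Smoth', pd.Timestamp('2011-01-01')),
         ('Rib Smoth', pd.Timestamp('2011-02-01')),
         ('Balrud Big', pd.Timestamp('2011-01-12')),
         ('Balrud Big', pd.Timestamp('2011-02-13'))],
        names=['name', 'paymentDate']),
)
wide = to_wide(toy)
print(wide)
```

Applied to the final `df` from the steps above, `to_wide(df)` should give one row per renter with 24 columns. Note that `unstack` sorts the rows by name, so the renters appear in alphabetical order rather than in their original order.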
