Python Pandas DataFrame: for each month of the year, add the last day of the month to the index if the month is missing, or remove duplicates

Question

First of all, my apologies for the somewhat convoluted title.

I struggled to find a succinct way to describe what I have been trying to achieve for a few hours. Allow me to explain the problem more clearly (FYI, I'm using Python 3.6 and pandas 0.20.3).

I have a MultiIndex DataFrame that currently looks like this:

                            d   p
name            paymentDate

Rib Smoth       2011-01-01  0   0
                2011-02-01  0   0
                2011-03-01  0   0
                2011-04-01  0   0
                2011-05-01  0   0
                2011-06-01  0   0
                2011-07-01  0   0
                2011-08-01  0   0
                2011-09-01  0   0
                2011-10-01  0   0
                2011-11-01  0   0
                2011-12-01  0   0
Balrud Big      2011-01-02  1   1
                2011-01-12  2   1
                2011-02-13  2   1
                2011-03-28  3   1
                2011-04-16  2   1
                2011-06-09  1   1
                2011-06-27  3   1
                2011-07-17  2   1
                2011-09-05  1   1
                2011-09-16  2   1
                2011-10-29  3   1
                2011-11-06  1   0
Mr. Bean        2011-01-01  0   0
                2011-02-02  1   0
                        .
                        .
                        .

As you can see, the second level is a series of dates: the dates on which people paid their rent. Some renters missed payments in some months, or paid more than once in others. I need to "homogenise" paymentDate. In other words, I want exactly 12 entries in the second level for every renter in the dataframe.

I believe the following two steps would take care of it, but I have no idea how to implement them:

For each renter, if there is no paymentDate in a given month, insert a row whose paymentDate is the last day of that month, with d=3 and p=1. In the example above, this would mean adding a row for May to Balrud Big: 2011-05-31 3 1.
For each renter, I also need to remove cases where there are two or more paymentDate entries in the same month. Looking at Balrud Big again, there are two entries for January. Wherever there are duplicates like this, I want to keep only the most recent entry, which in this case is 2011-01-12 2 1.

If the above were applied to the example shown, noting that Balrud Big has several cases of both missing entries and duplicates, I'd hope to end up with:

                            d   p
name            paymentDate

Rib Smoth       2011-01-01  0   0
                2011-02-01  0   0
                2011-03-01  0   0
                2011-04-01  0   0
                2011-05-01  0   0
                2011-06-01  0   0
                2011-07-01  0   0
                2011-08-01  0   0
                2011-09-01  0   0
                2011-10-01  0   0
                2011-11-01  0   0
                2011-12-01  0   0
Balrud Big      2011-01-12  2   1
                2011-02-13  2   1
                2011-03-28  3   1
                2011-04-16  2   1
                2011-05-31  3   1
                2011-06-27  3   1
                2011-07-17  2   1
                2011-08-31  3   1
                2011-09-16  2   1
                2011-10-29  3   1
                2011-11-06  1   0
                2011-12-31  3   1
Mr. Bean        2011-01-01  0   0
                2011-02-02  1   0
                        .
                        .
                        .

Finally, I could then reindex the second level with the integers 1-12 (for the 12 months), safe in the knowledge that every renter has exactly 12 months of history. Then, using DataFrame.pivot or otherwise, I would transform the dataframe to end up with something like:

                d1  p1  d2  p2  d3  p3  d4  p4  d5  p5  d6  p6  d7  p7  d8  p8  d9  p9  d10  p10  d11  p11  d12  p12
name

Rib Smoth       0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0    0    0    0    0    0
Balrud Big      2   1   2   1   3   1   2   1   3   1   3   1   2   1   3   1   2   1   3    1    1    0    3    1
Mr. Bean        0   0   1   0   ...(and so on)

It seems like quite a complex task, but I imagine there may be some clever tricks using datetime or pandas' extensive date/time functionality. I've been trying for a while and am still stumped.

Any help on this is greatly appreciated. Thank you in advance!

EDIT: I have a solution, but it needs a bit of tidying up before I share it.

Answer 1

First, create the sample data:

import pandas as pd
import numpy as np

arrays = [
    np.array(['Rib Smoth']*12 + ['Balrud Big']*12 + ['Mr. Bean']*2),
    pd.to_datetime([
        '2011-01-01', '2011-02-01', '2011-03-01', '2011-04-01', '2011-05-01',
        '2011-06-01', '2011-07-01', '2011-08-01', '2011-09-01', '2011-10-01',
        '2011-11-01', '2011-12-01', '2011-01-02', '2011-01-12', '2011-02-13',
        '2011-03-28', '2011-04-16', '2011-06-09', '2011-06-27', '2011-07-17',
        '2011-09-05', '2011-09-16', '2011-10-29', '2011-11-06', '2011-01-01',
        '2011-02-02'])
]
df = pd.DataFrame(
    index=pd.MultiIndex.from_tuples(list(zip(*arrays)),
                                    names=['name', 'paymentDate'])
)
df['d'] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 3, 2, 1, 3, 2, 1, 2, 3, 1, 0, 1]
df['p'] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
# print(df.head(3))
#                        d  p
# name      paymentDate      
# Rib Smoth 2011-01-01   0  0
#           2011-02-01   0  0
#           2011-03-01   0  0

Move paymentDate from an index level to a column:

df = df.reset_index(level='paymentDate')
# print(df.head(3))
#           paymentDate  d  p
# name                       
# Rib Smoth  2011-01-01  0  0
# Rib Smoth  2011-02-01  0  0
# Rib Smoth  2011-03-01  0  0

Create a series of monthly periods to use when grouping by name and month:

payment_month = df['paymentDate'].dt.to_period('M').rename('month')
# print(payment_month.head(3))
# name
# Rib Smoth    2011-01
# Rib Smoth    2011-02
# Rib Smoth    2011-03
# Name: month, dtype: period[M]

Group by name and month, keeping only the last payment in each month:

df = df.groupby(['name', payment_month])[['paymentDate', 'd', 'p']].last()
# print(df.head(3))
#                    paymentDate  d  p
# name       month                    
# Balrud Big 2011-01  2011-01-12  2  1  # Note: last payment in 2011-01
#            2011-02  2011-02-13  2  1
#            2011-03  2011-03-28  3  1
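One caveat on the step above: .last() returns the last row of each group in row order, not the row with the latest date. The sample data happens to be sorted by date already. If your real data might not be, a minimal sketch (using hypothetical toy data, deliberately out of order) would sort first:

import pandas as pd

# Hypothetical toy data, deliberately NOT in chronological order.
payments = pd.DataFrame({
    "name": ["Balrud Big", "Balrud Big", "Balrud Big"],
    "paymentDate": pd.to_datetime(["2011-01-12", "2011-01-02", "2011-02-13"]),
    "d": [2, 1, 2],
    "p": [1, 1, 1],
})

# Sort so that "last row in the group" also means "latest payment date".
payments = payments.sort_values(["name", "paymentDate"])

# Same grouping idea as above: one group per renter per calendar month.
month = payments["paymentDate"].dt.to_period("M").rename("month")
latest = payments.groupby(["name", month])[["paymentDate", "d", "p"]].last()
print(latest)

Without the sort_values call, January would keep the 2011-01-02 row, because that row comes last in the unsorted input.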

Set the month level of the index to the last day of each month, for later use with the months in which there was no payment:

df.index = df.index.set_levels(df.index.levels[-1].to_timestamp('M'), level='month')
# print(df.head(3))
#                       paymentDate  d  p
# name       month                       
# Balrud Big 2011-01-31  2011-01-12  2  1
#            2011-02-28  2011-02-13  2  1
#            2011-03-31  2011-03-28  3  1

Add rows for the missing months by reindexing on every combination of name and month:

all_names = df.index.get_level_values('name').unique()
all_months = pd.date_range('2011-01-01', '2011-12-31', freq='M')
df = df.reindex(pd.MultiIndex.from_product(
    [all_names, all_months],
    names=['name', 'all_months']
))
# print(df.head())
#                       paymentDate    d    p
# name       all_months                      
# Balrud Big 2011-01-31  2011-01-12  2.0  1.0
#            2011-02-28  2011-02-13  2.0  1.0
#            2011-03-31  2011-03-28  3.0  1.0
#            2011-04-30  2011-04-16  2.0  1.0
#            2011-05-31         NaT  NaN  NaN # This row is new!
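The freq='M' alias used with pd.date_range above worked on the pandas version in the question; as far as I'm aware, newer releases prefer 'ME' for month end, so which alias is accepted depends on your version. A sketch that sidesteps the alias by building the month ends from monthly periods, on hypothetical toy data (one renter, three months), might look like this:

import pandas as pd

# Month-end dates derived from monthly periods rather than a date_range alias.
periods = pd.period_range("2011-01", "2011-03", freq="M")
month_ends = periods.to_timestamp(how="end").normalize()

# Hypothetical toy data: the renter paid in January and March, not February.
paid = pd.DataFrame(
    {"d": [2, 3], "p": [1, 1]},
    index=pd.MultiIndex.from_product(
        [["Balrud Big"], month_ends[[0, 2]]], names=["name", "month"]
    ),
)

# Reindexing on every (name, month end) pair adds the missing month as NaN.
full_index = pd.MultiIndex.from_product(
    [["Balrud Big"], month_ends], names=["name", "month"]
)
filled = paid.reindex(full_index)
print(filled)

The February row comes back as NaN in both columns, which is what the next step relies on to find the months without a payment.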

Fill the new rows with the desired values:

no_payment = df['paymentDate'].isnull()
df.loc[no_payment, ['d', 'p']] = [3, 1]
df.loc[no_payment, ['paymentDate']] = df.index.get_level_values(-1)[no_payment]
# print(df.head())
#                       paymentDate    d    p
# name       all_months                      
# Balrud Big 2011-01-31  2011-01-12  2.0  1.0
#            2011-02-28  2011-02-13  2.0  1.0
#            2011-03-31  2011-03-28  3.0  1.0
#            2011-04-30  2011-04-16  2.0  1.0
#            2011-05-31  2011-05-31  3.0  1.0 # The column values are fixed!

Finally, replace the temporary index level with the paymentDate column, which now holds the correct values:

df = df.set_index([df.index.get_level_values('name'), 'paymentDate'])
# print(df.head(3))
#                           d    p
# name       paymentDate          
# Balrud Big 2011-01-12   2.0  1.0
#            2011-02-13   2.0  1.0
#            2011-03-28   3.0  1.0

Restore the integer data types (the NaN values introduced by reindexing turned the columns into floats):

df['d'] = df['d'].astype(int)
df['p'] = df['p'].astype(int)
# print(df.head(3))
#                         d  p
# name       paymentDate      
# Balrud Big 2011-01-12   2  1
#            2011-02-13   2  1
#            2011-03-28   3  1

Run some basic tests:

assert (df.loc[('Rib Smoth', slice(None))] == 0).all().all()
assert ('Balrud Big', '2011-01-02') not in df.index
assert ('Balrud Big', '2011-06-09') not in df.index
assert ('Balrud Big', '2011-09-05') not in df.index
assert (df.loc[('Balrud Big', '2011-01-12')] == [2, 1]).all()
assert (df.loc[('Balrud Big', '2011-12-31')] == [3, 1]).all()
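The steps above stop at the cleaned long format. For the wide d1 p1 ... d12 p12 layout the question asks for at the end, one possible sketch is below. It assumes each renter has exactly one row per month (which is what the cleaning guarantees) and uses a hypothetical three-month stand-in for the cleaned frame to keep the output short:

import pandas as pd

# Hypothetical stand-in for the cleaned frame: one row per renter per month.
long_df = pd.DataFrame({
    "name": ["Rib Smoth"] * 3 + ["Balrud Big"] * 3,
    "paymentDate": pd.to_datetime([
        "2011-01-01", "2011-02-01", "2011-03-01",
        "2011-01-12", "2011-02-13", "2011-03-28",
    ]),
    "d": [0, 0, 0, 2, 2, 3],
    "p": [0, 0, 0, 1, 1, 1],
}).set_index(["name", "paymentDate"])

# Replace each date with its month number (1-12).
flat = long_df.reset_index()
flat["month"] = flat["paymentDate"].dt.month

# Pivot to one row per renter; the columns become (value, month) pairs.
wide = flat.pivot(index="name", columns="month", values=["d", "p"])

# Reorder to d1, p1, d2, p2, ... and flatten the column names.
wide = wide.swaplevel(axis=1).sort_index(axis=1)
wide.columns = [f"{col}{month}" for month, col in wide.columns]
print(wide)

Note that pivot sorts the renters alphabetically, and it raises an error if any renter still has two rows in the same month, so it also acts as a check that the de-duplication worked.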

Solution 1
0 2021-01-02 17:48:35
