
Python function for count of date of month number within a dataframe

Edited to add payload example and complete script. Edited again to modify the script and format my problem better.

I am creating a script to analyse payment cycles from a bank statement of multiple payments. I am working out the most frequent day of week and date of month, then selecting whichever is higher: either a day of week (along with its position and frequency of payments within a given month) or a specific date.

Where it's a specific date, I expect the second and third most frequent dates to sit on either side of the top date, covering the months where the top date falls on a weekend or a public holiday.

I created a function to count the days of the week without needing to sort the dataframe, and to let me add that value as a column in the dataframe rather than it ending up as a list.

What do I need help with?

That was fine for working through 7 days and counting based on filters but, when doing it by the date of the month, my function has 31 if statements. How can I make this more concise and get the same outcome?

Also, I have a problem with this warning, which I simply can't get to go away: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead. Copy vs view: I really don't mind whether it uses a copy or a view, I just want to get rid of the warning one way or the other.

Script and Example Data

Below is the part of the script I need to make tidier:

# function to filter by date of month to perform a count of each

import pandas as pd
from datetime import datetime

df = pd.read_csv(r"C:\Users\mattl\OneDrive\Desktop\netflix - only.csv")

# Convert to a Date format here
df['new_date']=df['date'].apply(lambda x: datetime.strptime(x, '%d/%m/%Y'))

# Extend data frame with month, day of month and week day
df['month'] = df['new_date'].apply(lambda x: x.month)
df['dom'] = df['new_date'].apply(lambda x: x.day)
df['dow']=df['new_date'].apply(lambda x: x.strftime("%A"))

# function to filter by weekday to perform a count of each
def totalForWeekDay(weekDay):
    filter = df.dow.value_counts();
    #print(filter);
    if weekDay == 'Sunday':
        return filter['Sunday'];
    if weekDay == 'Monday':
        return filter['Monday'];
    if weekDay == 'Tuesday':
        return filter['Tuesday'];
    if weekDay == 'Wednesday':
        return filter['Wednesday'];
    if weekDay == 'Thursday':
        return filter['Thursday'];
    if weekDay == 'Friday':
        return filter['Friday'];
    if weekDay == 'Saturday':
        return filter['Saturday'];

# function to filter by date of month to perform a count of each
def totalForMonthDate(monthDay):
    filter = df.dom.value_counts();
    #print(filter);
    if monthDay == 31:
        return filter[31];
    if monthDay == 30:
        return filter[30];
    if monthDay == 29:
        return filter[29];
    if monthDay == 28:
        return filter[28];
    if monthDay == 27:
        return filter[27];
    if monthDay == 26:
        return filter[26];
    if monthDay == 25:
        return filter[25];
    if monthDay == 24:
        return filter[24];
    if monthDay == 23:
        return filter[23];
    if monthDay == 22:
        return filter[22];
    if monthDay == 21:
        return filter[21];
    if monthDay == 20:
        return filter[20];
    if monthDay == 19:
        return filter[19];
    if monthDay == 18:
        return filter[18];
    if monthDay == 17:
        return filter[17];
    if monthDay == 16:
        return filter[16];
    if monthDay == 15:
        return filter[15];
    if monthDay == 14:
        return filter[14];
    if monthDay == 13:
        return filter[13];
    if monthDay == 12:
        return filter[12];
    if monthDay == 11:
        return filter[11];
    if monthDay == 10:
        return filter[10];
    if monthDay == 9:
        return filter[9];
    if monthDay == 8:
        return filter[8];
    if monthDay == 7:
        return filter[7];
    if monthDay == 6:
        return filter[6];
    if monthDay == 5:
        return filter[5];
    if monthDay == 4:
        return filter[4];
    if monthDay == 3:
        return filter[3];
    if monthDay == 2:
        return filter[2];
    if monthDay == 1:
        return filter[1];

# Add column which calls the function resulting it total count of week_day
df['dow_total'] = df['dow'].apply(lambda row: totalForWeekDay(row));

# Add formula and column to dataframe which counts the month number
df['dom_total'] = df['dom'].apply(lambda row: totalForMonthDate(row));

# Show results
print(df)

if df["dom_total"].max() >= df["dow_total"].max():
    # Determine the top day of month result
    top_dom_tot = df.loc[df['dom_total'] == df['dom_total'].max()]

    # isolate the top day of month
    top_day_of_month = (top_dom_tot['dom'][0])
    print('Top day of month is:')
    print(top_day_of_month)

    # find dates in list where the date of month is NOT the highest number
    dfa = df.loc[df['dom'] != top_day_of_month]

    # Determine number of days forwards (positive) or back (negative)
    dfa['days_diff'] = df['dom'] - top_day_of_month
    print('Payments that are not related to the top day per month')
    print(dfa)

Now for the example CSV payload:

type,party,date, debit , credit 
payment,Netflix,22/01/2021,-$19.99, $-   
payment,Netflix,22/02/2021,-$19.99, $-   
payment,Netflix,22/03/2021,-$19.99, $-   
payment,Netflix,22/04/2021,-$19.99, $-   
payment,Netflix,24/05/2021,-$19.99, $-   
payment,Netflix,22/06/2021,-$19.99, $-   
payment,Netflix,22/07/2021,-$19.99, $-   
payment,Netflix,23/08/2021,-$19.99, $-   
payment,Netflix,22/09/2021,-$19.99, $-   
payment,Netflix,22/10/2021,-$19.99, $-   

Thanks for your advice!

  1. In totalForMonthDate(), you can replace this series of if statements with two lines:

     def totalForMonthDate(monthDay):
         filter = df.dom.value_counts()
         return filter[monthDay]

    Of course, you're also running value_counts() once for every row in your dataframe, even though the result is the same for the whole dataframe. That's inefficient. You can replace this by doing value_counts() once and using map to translate the values:

     df['dom_total'] = df['dom'].map(df['dom'].value_counts())

    Not only is this shorter (1 line vs 4 lines) but it's faster too.

  2. You're getting a SettingWithCopyWarning because you're using .loc to filter down the dataframe, then modifying that filtered subset. The simplest way to fix this is to throw in a copy when you're subsetting the dataframe.

     dfa = df.loc[df['dom'] != top_day_of_month].copy()

    Note: the code afterward which adds a new column won't affect the original dataframe.
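
Both points can be tried on a small frame. The following is a minimal sketch, assuming a hypothetical five-row `dom` column in place of the real statement; the `groupby(...).transform("count")` line is an alternative I've added for comparison, not something the original script uses.

```python
import pandas as pd

# Hypothetical mini-frame standing in for the parsed bank statement.
df = pd.DataFrame({"dom": [22, 22, 24, 22, 23]})

# Point 1: count each day of month once, then map the counts back onto the rows.
dom_counts = df["dom"].value_counts()
df["dom_total"] = df["dom"].map(dom_counts)

# Alternative (an assumption, not from the original script): a grouped
# transform produces the same per-row counts.
df["dom_total_alt"] = df.groupby("dom")["dom"].transform("count")

# Point 2: take an explicit copy of the filtered rows before adding a column,
# so the assignment targets an independent DataFrame.
top_day_of_month = dom_counts.idxmax()
dfa = df.loc[df["dom"] != top_day_of_month].copy()
dfa["days_diff"] = dfa["dom"] - top_day_of_month

print(df)
print(dfa)
```

Because `dfa` is an independent copy, `days_diff` appears only in `dfa`, and `df` is left unchanged.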

Here is the full source code:

import pandas as pd
import io
from datetime import datetime

s = """type,party,date, debit , credit 
payment,Netflix,22/01/2021,-$19.99, $-   
payment,Netflix,22/02/2021,-$19.99, $-   
payment,Netflix,22/03/2021,-$19.99, $-   
payment,Netflix,22/04/2021,-$19.99, $-   
payment,Netflix,24/05/2021,-$19.99, $-   
payment,Netflix,22/06/2021,-$19.99, $-   
payment,Netflix,22/07/2021,-$19.99, $-   
payment,Netflix,23/08/2021,-$19.99, $-   
payment,Netflix,22/09/2021,-$19.99, $-   
payment,Netflix,22/10/2021,-$19.99, $-   """

df = pd.read_csv(io.StringIO(s))

# Convert to a Date format here
df['new_date'] = pd.to_datetime(df['date'], format='%d/%m/%Y')

# Extend data frame with month, day of month and week day
df['month'] = df['new_date'].dt.month
df['dom'] = df['new_date'].dt.day
df['dow'] = df['new_date'].dt.strftime("%A")

# Add column holding the total count for each row's week day
df['dow_total'] = df['dow'].map(df['dow'].value_counts())

# Add column holding the total count for each row's day of month
df['dom_total'] = df['dom'].map(df['dom'].value_counts())

# Show results
print(df)

if df["dom_total"].max() >= df["dow_total"].max():
    # Determine the top day of month result
    top_dom_tot = df.loc[df['dom_total'] == df['dom_total'].max()]

    # isolate the top day of month
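    # use .iloc[0] (first row by position); [0] looks up index label 0,
    # which fails if the first payment is not on the top day
    top_day_of_month = top_dom_tot['dom'].iloc[0]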
    print('Top day of month is:')
    print(top_day_of_month)

    # find dates in list where the date of month is NOT the highest number
    dfa = df.loc[df['dom'] != top_day_of_month].copy()

    # Determine number of days forwards (positive) or back (negative)
    dfa['days_diff'] = dfa['dom'] - top_day_of_month
    print('Payments that are not related to the top day per month')
    print(dfa)
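
As a follow-up to the weekend expectation in the question, here is a rough sketch of how the off-cycle payments could be checked against the calendar. It assumes a trimmed five-row version of the sample data, and the column names `nominal_date`, `nominal_dow` and `nominal_on_weekend` are made up for illustration. Public holidays are not handled; that would need a holiday calendar for the relevant country.

```python
import io
import pandas as pd

# Trimmed version of the sample payload (dates only), for illustration.
s = """date
22/01/2021
22/04/2021
24/05/2021
23/08/2021
22/10/2021"""

df = pd.read_csv(io.StringIO(s))
df["new_date"] = pd.to_datetime(df["date"], format="%d/%m/%Y")
df["dom"] = df["new_date"].dt.day

# Most frequent day of month across all payments.
top_day_of_month = df["dom"].value_counts().idxmax()

# Date the payment would nominally have fallen on in the same month.
# errors="coerce" yields NaT where that day does not exist (e.g. 31 in April).
nominal = pd.DataFrame({
    "year": df["new_date"].dt.year,
    "month": df["new_date"].dt.month,
    "day": top_day_of_month,
})
df["nominal_date"] = pd.to_datetime(nominal, errors="coerce")

# Was the nominal date a Saturday (5) or Sunday (6)?
df["nominal_dow"] = df["nominal_date"].dt.day_name()
df["nominal_on_weekend"] = df["nominal_date"].dt.dayofweek >= 5

# Days forwards (positive) or back (negative) from the nominal date.
df["days_diff"] = (df["new_date"] - df["nominal_date"]).dt.days

shifted = df.loc[
    df["dom"] != top_day_of_month,
    ["date", "nominal_dow", "nominal_on_weekend", "days_diff"],
]
print(shifted)
```

For this sample, the two payments that are not on the 22nd both have a nominal date on a weekend and were taken on the following Monday.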

Anyways, hope that helped. Cool project!
