
Finding a substring then using digits within that string to calculate a new column in pandas

I have a dataframe with 5 columns scraped from a site. I want to create an additional column based on the contents of the first two columns. For example, say the data looks like this:

Duration                                                               Issues in 1 year
Pay by Annual Recurring Payment                                         51
Pay every 3 months by Recurring Payment                                 51
Pay every 6 months by Recurring Payment                                 51
First 3 issues for £3, then £15 recurring every 6 months thereafter     14
One off payment - Pay for 1 year                                        14
First 6 issues for £10, then £15 recurring every 6 months thereafter     9
One-Off Payment – Pay for 9 issues                                      12
One-Off Payment – Pay for 20 issues                                     51
First year for £29.99, then £20 recurring every 6 months thereafter     13

I want an additional column containing the number of months in the deal, based on the 'Duration' string and, when necessary, calculated using the 'Issues in 1 year' column as well.

I've managed to get what I want for most of them by copying Duration to a new column and using 'str.contains':

df1['Months'] = df1['Duration']
df1.loc[df1['Months'].str.contains('1 year|annual', case=False), 'Months'] = 12
df1.loc[df1['Months'].str.contains('6 months by', case=False), 'Months'] = 6
df1.loc[df1['Months'].str.contains('3 months by', case=False), 'Months'] = 3

The above does seem a little clunky and I feel like there could be a slicker solution, but it works.
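One possible tidier form of the three assignments above, offered as a sketch rather than the definitive approach: test the patterns against the untouched 'Duration' column and pick the value with `numpy.select`. The small sample frame is made up from the rows shown in the question, and rows that match none of the patterns are left as NaN.

```python
import numpy as np
import pandas as pd

# Sample rows taken from the question's table (illustrative subset).
df1 = pd.DataFrame({
    "Duration": [
        "Pay by Annual Recurring Payment",
        "Pay every 3 months by Recurring Payment",
        "Pay every 6 months by Recurring Payment",
        "One off payment - Pay for 1 year",
        "One-Off Payment – Pay for 9 issues",
    ],
})

duration = df1["Duration"]

# One boolean mask per rule; the first matching rule wins.
conditions = [
    duration.str.contains("1 year|annual", case=False),
    duration.str.contains("6 months by", case=False),
    duration.str.contains("3 months by", case=False),
]
choices = [12, 6, 3]

# Rows that match no rule stay NaN so they can be filled in later.
df1["Months"] = np.select(conditions, choices, default=np.nan)
print(df1)
```

Matching on 'Duration' instead of the partly overwritten 'Months' column keeps every mask purely boolean, because the column being searched always contains strings.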

For the Durations that have a fixed cost for the first 3 or 6 issues, I'm only interested in the number of months covered by the initial payment, so I have used:

df1.loc[df1['Months'].str.contains('first 3', case=False), 'Months'] = round((12 / df1.Issues) * 3,0)

The above does appear to be working, but it could be more efficient.
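The 'first 3' rule can probably be generalised so that the issue count is read from the string instead of being hard-coded. A minimal sketch, assuming the column names 'Duration' and 'Issues' used above and a made-up three-row sample:

```python
import re
import pandas as pd

# Illustrative rows from the question; 'Issues' is issues per year.
df1 = pd.DataFrame({
    "Duration": [
        "First 3 issues for £3, then £15 recurring every 6 months thereafter",
        "First 6 issues for £10, then £15 recurring every 6 months thereafter",
        "Pay every 3 months by Recurring Payment",
    ],
    "Issues": [14, 9, 51],
})

# Capture the number after "First"; rows without a match become NaN.
first_n = pd.to_numeric(
    df1["Duration"].str.extract(r"first (\d+) issues", flags=re.IGNORECASE, expand=False)
)

# Months covered by the initial payment = months per issue * number of issues.
df1["Months"] = (12 / df1["Issues"] * first_n).round()
print(df1)
```

This handles both the 3-issue and 6-issue offers with one expression, and rows that do not start with a 'First N issues' offer are simply left as NaN.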

I'm now stuck on the 'Pay for x issues' type. I need to identify the strings with that pattern and then use the number within them to calculate the answer. I tried applying the same methodology as before, but with extract, and I get an unexpected keyword argument 'case' error:

df1.loc[df1['Months'].str.contains('Pay for (.+?) issues', case=False), 'Months'] = round((12 / df1.Issues) * df1.loc[df1['Months'].str.extract('Pay for (.+?) issues', case=False), 'Months'],0)

I'm not sure whether my regex logic is correct, as I'm still getting to grips with it, but I copied it from this post.
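The error arises because `Series.str.extract` has no `case` parameter; it takes `flags` instead, so case-insensitive matching is requested with `flags=re.IGNORECASE`. The extracted text also has to be converted to a number before it can be multiplied. A sketch of one way to write it, assuming the 'Duration' and 'Issues' column names and a small made-up sample:

```python
import re
import pandas as pd

# Illustrative rows from the question; 'Issues' is issues per year.
df1 = pd.DataFrame({
    "Duration": [
        "One-Off Payment – Pay for 9 issues",
        "One-Off Payment – Pay for 20 issues",
        "One off payment - Pay for 1 year",
    ],
    "Issues": [12, 51, 14],
})

# (\d+) captures only digits, unlike (.+?), so the result is safe to convert.
n_issues = pd.to_numeric(
    df1["Duration"].str.extract(r"Pay for (\d+) issues", flags=re.IGNORECASE, expand=False)
)

# Only rows where the pattern matched are updated.
mask = n_issues.notna()
df1["Months"] = float("nan")
df1.loc[mask, "Months"] = (12 / df1.loc[mask, "Issues"] * n_issues[mask]).round()
print(df1)
```

Using the mask on both sides of the assignment means rows filled by earlier rules would not be overwritten; here the 'Pay for 1 year' row stays NaN because it does not mention issues.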

To (try and) simplify, I am trying to achieve:

If 'One-Off Payment – Pay for 20 issues' contains '...Pay for x issues...', then Months = 12 / Issues (51) * 20

Which would give an end result of:

Duration                                  Issues in 1 year      Months
One-Off Payment – Pay for 20 issues       51                    5

Also, if there is a simple way of doing the above, I assume the logic could be applied to the 'Pay every x months...' strings.
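The same extraction idea does seem to carry over to the 'Pay every x months' strings. A small sketch, assuming the phrase always begins with 'Pay every'; anchoring on that phrase is what keeps the 'recurring every 6 months thereafter' offers from being picked up by mistake:

```python
import re
import pandas as pd

# Illustrative 'Duration' values from the question.
duration = pd.Series([
    "Pay every 3 months by Recurring Payment",
    "Pay every 6 months by Recurring Payment",
    "First 3 issues for £3, then £15 recurring every 6 months thereafter",
])

# The captured number is already a month count, so no further arithmetic is needed.
every_n_months = pd.to_numeric(
    duration.str.extract(r"pay every (\d+) months", flags=re.IGNORECASE, expand=False)
)
print(every_n_months)
```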

Any help would be super appreciated. I am new and have tried to find an answer for days, without results.

Assuming the 'Pay for x issues' statements don't contain any other number, you can try this.

import pandas as pd

## sample data frame
df = pd.DataFrame({
    'Duration': [
        'Pay by Annual Recurring Payment',
        'Pay every 3 months by Recurring Payment',
        'Pay every 6 months by Recurring Payment',
        'First 3 issues for £3, then £15 recurring every 6 months thereafter',
        'One off payment - Pay for 1 year',
        'First 6 issues for £10, then £15 recurring every 6 months thereafter',
        'One-Off Payment – Pay for 9 issues',
        'One-Off Payment – Pay for 20 issues',
        'First year for £29.99, then £20 recurring every 6 months thereafter',
    ],
    'Issues_in_1_year': [51, 51, 51, 14, 14, 9, 12, 51, 13],
})

## extract month and pay value in separate columns (-1 marks "no match")
## raw strings avoid invalid-escape warnings; expand=False returns a Series
df['Months'] = pd.to_numeric(df['Duration'].str.extract(r'(\d+) months by', expand=False)).fillna(-1).astype(int)
## no capture groups here: str.contains warns when the pattern contains them
df.loc[df['Duration'].str.contains(r'\d+ year| \d+ annual | Annual'), 'Months'] = 12
## require "issues" so that 'Pay for 1 year' is not read as a pay value
df['Pay_Value'] = pd.to_numeric(df['Duration'].str.extract(r'Pay for (\d+) issues', expand=False)).fillna(-1).astype(int)

## calculate solution
def get_sol(row):
    if row.Months == -1 and row.Pay_Value == -1:
        return 0
    elif row.Months != -1 and row.Pay_Value == -1:
        return round((12 / row.Issues_in_1_year) * row.Months)
    elif row.Months == -1 and row.Pay_Value != -1:
        return round((12 / row.Issues_in_1_year) * row.Pay_Value)

df['solution'] = df.apply(get_sol, axis=1)
print(df)

The output looks like this, where solution is the column we have calculated (only a few rows shown):

    Duration                                 Issues_in_1_year   Months  Pay_Value   solution
0   Pay by Annual Recurring Payment                 51           12        -1       3
1   Pay every 3 months by Recurring Payment         51            3        -1       1
2   Pay every 6 months by Recurring Payment         51            6        -1       1
7   One-Off Payment – Pay for 20 issues             51           -1        20       5
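If the goal is the Months column as described in the question (12 for yearly deals, the stated number for 'Pay every x months', and 12 / issues-per-year times the issue count for the issue-based offers), the extract calls can be combined without `apply`. This is only a sketch: the helper `extract_number`, the order in which the rules are applied, and treating 'First year' as 12 months are assumptions that the question does not confirm.

```python
import re
import pandas as pd

df = pd.DataFrame({
    "Duration": [
        "Pay by Annual Recurring Payment",
        "Pay every 3 months by Recurring Payment",
        "Pay every 6 months by Recurring Payment",
        "First 3 issues for £3, then £15 recurring every 6 months thereafter",
        "One off payment - Pay for 1 year",
        "First 6 issues for £10, then £15 recurring every 6 months thereafter",
        "One-Off Payment – Pay for 9 issues",
        "One-Off Payment – Pay for 20 issues",
        "First year for £29.99, then £20 recurring every 6 months thereafter",
    ],
    "Issues_in_1_year": [51, 51, 51, 14, 14, 9, 12, 51, 13],
})


def extract_number(series, pattern):
    """Hypothetical helper: first captured number as float, NaN where no match."""
    return pd.to_numeric(series.str.extract(pattern, flags=re.IGNORECASE, expand=False))


duration = df["Duration"]
months_per_issue = 12 / df["Issues_in_1_year"]

# Each rule yields a float Series that is NaN wherever the rule does not apply.
every_n_months = extract_number(duration, r"pay every (\d+) months")
first_n_issues = extract_number(duration, r"first (\d+) issues") * months_per_issue
pay_for_issues = extract_number(duration, r"pay for (\d+) issues") * months_per_issue

# Assumption: "1 year", "first year" and "annual" all mean a 12-month deal.
is_yearly = duration.str.contains(r"1 year|first year|annual", case=False)

# Take the first rule that produced a value, then fill the yearly deals.
df["Months"] = every_n_months.fillna(first_n_issues).fillna(pay_for_issues).round()
df.loc[is_yearly & df["Months"].isna(), "Months"] = 12
print(df)
```

Chaining `fillna` gives the rules an explicit precedence, which matters for rows such as the 'First 3 issues ... every 6 months thereafter' offers that mention more than one number.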

