
Python - Create a new DF column by copying - partial string match from existing column values

I have a dataframe with 50k records, and one of its columns has values like the ones below.

DF

Index    COLUMN
0        ABC-1M-Deliveryorder
1        KGF-ORDERDelivery-2Y
2        DEFGHIABC1M-OPEN
3        KGFABC
4        ABC-3Y-ORDER

I am looking for the keywords 3Y, 3M, 2Y and 1Y in COLUMN. If one is found, it needs to be copied to a new DF column named TENOR, with values such as 3Y, 3M, 1M etc. If none is found, the new column can show FALSE or NaN.

I tried the code below:

df['Tenor'] = ""

df['Tenor'] = df['COLUMN'].apply(lambda x: x in ['3Y', '3M', '1Y', '1M'])

This returns FALSE in every row of the new column. Can you please advise on the best way to meet my requirement?

You can use pandas.Series.str.contains with a regex:

import pandas as pd

df = pd.DataFrame(dict(
    COLUMN = [
        'ABC-1M-Deliveryorder','KGF-ORDERDelivery-2Y',
        'DEFGHIABC1M-OPEN', 'KGFABC', 'ABC-3Y-ORDER'
    ]
))

df['Tenor'] = df['COLUMN'].str.contains('3Y|3M|2Y|1Y|1M', regex=True)
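For context, the attempt in the question most likely returns FALSE everywhere because `x in [...]` tests whether the whole cell value equals one of the list items, not whether a keyword occurs inside it. Here is a minimal sketch contrasting the two checks on the sample data. The names `keywords`, `whole_match` and `partial_match` are illustrative and not from the original post.

```python
import pandas as pd

df = pd.DataFrame({'COLUMN': [
    'ABC-1M-Deliveryorder', 'KGF-ORDERDelivery-2Y',
    'DEFGHIABC1M-OPEN', 'KGFABC', 'ABC-3Y-ORDER',
]})

# Illustrative keyword list (same keywords as the regex above).
keywords = ['3Y', '3M', '2Y', '1Y', '1M']

# List membership: True only if the entire string equals a keyword.
whole_match = df['COLUMN'].apply(lambda x: x in keywords)

# Regex search: True if any keyword occurs anywhere in the string.
partial_match = df['COLUMN'].str.contains('|'.join(keywords), regex=True)

print(whole_match.tolist())    # [False, False, False, False, False]
print(partial_match.tolist())  # [True, True, True, False, True]
```

Note that `str.contains` only says whether a keyword is present; it does not say which one matched.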

Edit: OP asked a follow-up question:

The above code snippet returns TRUE wherever the column contains the string 2Y, 3Y etc. But I need the output as below:

Index    Column                  NEW
0        ABC-1M-Deliveryorder    1M
1        KGF-ORDERDelivery-2Y    2Y
2        DEFGHIABC1M-OPEN        1M
3        KGFABC                  NaN
4        ABC-3Y-ORDER            3Y

If that is the case, then you may want to use a custom function with pandas.Series.apply, like so:

import pandas as pd

df = pd.DataFrame(dict(
    COLUMN = [
        'ABC-1M-Deliveryorder','KGF-ORDERDelivery-2Y',
        'DEFGHIABC1M-OPEN', 'KGFABC', 'ABC-3Y-ORDER'
    ]
))

def find_substring(x):
    for y in ('3Y','3M','2Y','1Y','1M'):
        if y in x:
            return y

df['Tenor'] = df['COLUMN'].apply(find_substring)

print(df)

Output:

                 COLUMN Tenor
0  ABC-1M-Deliveryorder    1M
1  KGF-ORDERDelivery-2Y    2Y
2      DEFGHIABC1M-OPEN    1M
3                KGFABC  None
4          ABC-3Y-ORDER    3Y

Python Tutor link to example
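As a possible vectorized alternative to `apply`, `pandas.Series.str.extract` may also fit here. This is a sketch that is not part of the original answer. It returns the first match of the capture group in each string, and NaN (rather than None) where nothing matches. With `expand=False` and a single capture group, the result is a Series.

```python
import pandas as pd

df = pd.DataFrame({'COLUMN': [
    'ABC-1M-Deliveryorder', 'KGF-ORDERDelivery-2Y',
    'DEFGHIABC1M-OPEN', 'KGFABC', 'ABC-3Y-ORDER',
]})

# One capture group listing the tenor keywords; the leftmost match
# in each string is returned, NaN if there is no match.
df['Tenor'] = df['COLUMN'].str.extract(r'(3Y|3M|2Y|1Y|1M)', expand=False)

print(df)
```

One difference to keep in mind: if a string contains several keywords, the custom function returns the first keyword in the order of its tuple, while `str.extract` returns whichever keyword appears first in the string. On the sample data both approaches pick the same values.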

