Extract regex in pandas column

Hi, I'm looking to extract the number of pieces for different products from a DataFrame column into a new column. For now, the number comes right before the type of product (e.g. 30 TAB).

The data looks like this:

PRODUCTS
PULSAR AT 20 MG ORAL 30 TAB RECUB
LIPITOR 40 MG 1+1 ORAL 15 TAB
LOFTYL 150 MG ORAL 30 TAB
SOMAZINA 500 MG ORAL 10 COMP RECUB
LOFTYL 30 TAB 150 MG ORAL 
*Keeps going more entries...*

My function looks like this:

df['PZ'] = df['PRODUCTS'].str.extract('([\d]*\.*[\d]+)\s*[tab|cap|grag|past|sob]',flags=re.IGNORECASE)

Products could be [TAB, COMP, AMP, SOB, PAST, GRAG ... and others].

And I want to get something like this:

PRODUCTS                              PZ
PULSAR AT 20 MG ORAL 30 TAB RECUB     30
LIPITOR 40 MG 1+1 ORAL 15 TAB         15
LOFTYL 150 MG ORAL 30 TAB             30
SOMAZINA 500 MG ORAL 10 COMP RECUB    10
LOFTYL 30 TAB 150 MG ORAL             30

What can I change in my line to get the result above?

Thank you for reading and for your help.

You can use

import pandas as pd
df = pd.DataFrame({'PRODUCTS':['PULSAR AT 20 MG ORAL 30 TAB RECUB','LIPITOR 40 MG 1+1 ORAL 15 TAB','LOFTYL 150 MG ORAL 30 TAB','SOMAZINA 500 MG ORAL 10 COMP RECUB','LOFTYL 30 TAB 150 MG ORAL']})
rx = r'(?i)(\d*\.?\d+)\s*(?:tab|cap|grag|past|sob|comp)'
df['PZ'] = df['PRODUCTS'].str.extract(rx)
>>> df
                             PRODUCTS  PZ
0   PULSAR AT 20 MG ORAL 30 TAB RECUB  30
1       LIPITOR 40 MG 1+1 ORAL 15 TAB  15
2           LOFTYL 150 MG ORAL 30 TAB  30
3  SOMAZINA 500 MG ORAL 10 COMP RECUB  10
4           LOFTYL 30 TAB 150 MG ORAL  30
>>> 

If words such as tab, cap, etc. are whole words and cannot be part of longer words, add a word boundary at the end of the pattern, i.e. rx = r'(?i)(\d*\.?\d+)\s*(?:tab|cap|grag|past|sob|comp)\b'.
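
As a minimal sketch of what the word boundary changes, the snippet below runs both variants of the pattern with the standard re module. The input string is a made-up example, not one of the rows from the question; it is chosen so that CAP appears as the start of a longer word.

```python
import re

# Pattern from the answer, without and with the trailing word boundary
loose = r"(?i)(\d*\.?\d+)\s*(?:tab|cap|grag|past|sob|comp)"
strict = loose + r"\b"

# Hypothetical product string (an assumption for illustration only)
text = "DEMO 5 CAPAS 20 TAB"

# Without \b, "5 CAP" matches inside "5 CAPAS"
loose_hit = re.search(loose, text).group(1)

# With \b, "CAP" followed by "AS" is rejected, so the first match is "20 TAB"
strict_hit = re.search(strict, text).group(1)

print(loose_hit, strict_hit)
```

Whether the stricter variant is the right one depends on the data: if an abbreviation can be followed directly by more letters (a plural such as TABS, for instance), the boundary would skip it.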

See the regex demo. Details:

  • (?i) - a case-insensitive inline modifier
  • (\d*\.?\d+) - Group 1: zero or more digits, an optional ., and then one or more digits
  • \s* - zero or more whitespace chars
  • (?:tab|cap|grag|past|sob|comp) - a non-capturing group (so as not to interfere with Series.str.extract output) matching any of the alternative substrings inside it
  • \b - a word boundary.
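
The non-capturing group is the main difference from the pattern in the question, which put the alternatives inside square brackets. Square brackets form a character class that matches a single character from the set, so [tab|cap|grag|past|sob] accepts any one of the letters t, a, b, c, p, g, r, s, o or a literal |. A small sketch with the standard re module, using the LIPITOR row from the question, shows the effect:

```python
import re

row = "LIPITOR 40 MG 1+1 ORAL 15 TAB"

# Character class: matches ONE character out of the listed letters (or "|")
char_class = r"(\d*\.?\d+)\s*[tab|cap|grag|past|sob]"

# Non-capturing group: matches one of the whole alternatives
group = r"(\d*\.?\d+)\s*(?:tab|cap|grag|past|sob|comp)"

# "1 O" satisfies the character class because "o" is in the set (case-insensitive)
class_hit = re.search(char_class, row, flags=re.IGNORECASE).group(1)

# The group only accepts a full abbreviation, so the first match is "15 TAB"
group_hit = re.search(group, row, flags=re.IGNORECASE).group(1)

print(class_hit, group_hit)
```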

Maybe...

Given a dataframe like the one below (note: I made a product appear twice in one row as an example, in case this may happen)...

    PRODUCTS
0   PULSAR AT 20 MG ORAL 30 GRAG RECUB
1   LIPITOR 40 MG 1+1 ORAL 15 TAB
2   LOFTYL 150 GRAG ORAL 30 TAB
3   SOMAZINA 500 MG ORAL 10 COMP RECUB
4   LOFTYL 30 TAB 150 MG ORAL
5   *Keeps going more entries...*

Code:

import pandas as pd
import re

data = {'PRODUCTS' : ["PULSAR AT 20 MG ORAL 30 GRAG RECUB", "LIPITOR 40 MG 1+1 ORAL 15 TAB", \
                      "LOFTYL 150 GRAG ORAL 30 TAB", "SOMAZINA 500 MG ORAL 10 COMP RECUB", \
                      "LOFTYL 30 TAB 150 MG ORAL" , "*Keeps going more entries...*"]}

df = pd.DataFrame(data)

# maintain a list of products to find
products = ['TAB', 'COMP', 'AMP', 'SOB', 'PAST', 'GRAG']

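def getProduct(x):
    # collect the numbers that directly precede each product keyword,
    # visiting the keywords in the order of the `products` list
    found = list()
    for product in products:
        pattern = r'(\d+)' + ' ' + str(product)
        found.append(re.findall(pattern, x))
    # drop empty results, flatten the nested lists, and join into one string
    found = list(filter(None, found))
    found = [item for sublist in found for item in sublist]
    found = ", ".join(str(item) for item in found)
    return found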

df['PZ'] = [getProduct(row) for row in df['PRODUCTS']]

print(df)

Outputs:

    PRODUCTS                            PZ
0   PULSAR AT 20 MG ORAL 30 GRAG RECUB  30
1   LIPITOR 40 MG 1+1 ORAL 15 TAB       15
2   LOFTYL 150 GRAG ORAL 30 TAB         30, 150
3   SOMAZINA 500 MG ORAL 10 COMP RECUB  10
4   LOFTYL 30 TAB 150 MG ORAL           30
5   *Keeps going more entries...*   
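
The same idea can probably be expressed without a Python-level loop. The sketch below is an alternative, not the code from the answer above: it builds one alternation from the products list and uses Series.str.findall followed by Series.str.join. It also assumes a word boundary after the keyword and case-insensitive matching, which the loop version does not apply.

```python
import re
import pandas as pd

df = pd.DataFrame({"PRODUCTS": [
    "PULSAR AT 20 MG ORAL 30 GRAG RECUB",
    "LOFTYL 150 GRAG ORAL 30 TAB",
    "LOFTYL 30 TAB 150 MG ORAL",
]})

# keywords to look for, as in the answer above
products = ["TAB", "COMP", "AMP", "SOB", "PAST", "GRAG"]

# one pattern covering all keywords; re.escape guards against special characters
pattern = r"(\d+)\s*(?:" + "|".join(map(re.escape, products)) + r")\b"

# findall returns a list of captured numbers per row; join turns it into a string
df["PZ"] = df["PRODUCTS"].str.findall(pattern, flags=re.IGNORECASE).str.join(", ")

print(df)
```

Note that this version lists matches in the order they appear in the string, so the row with two products yields 150, 30 rather than 30, 150.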
