Extracting with a regex from a pandas column

Question

Hi, I'm looking to extract the number of pieces for different products from a DataFrame column into one new column. For now, the number comes right before the product type.

The data looks like this:

PRODUCTS
PULSAR AT 20 MG ORAL 30 TAB RECUB
LIPITOR 40 MG 1+1 ORAL 15 TAB
LOFTYL 150 MG ORAL 30 TAB
SOMAZINA 500 MG ORAL 10 COMP RECUB
LOFTYL 30 TAB 150 MG ORAL 
*Keeps going more entries...*

My code looks like this:

df['PZ'] = df['PRODUCTS'].str.extract('([\d]*\.*[\d]+)\s*[tab|cap|grag|past|sob]',flags=re.IGNORECASE)

Products could be [TAB, COMP, AMP, SOB, PAST, GRAG ... and others]

And I want to get something like this:

PRODUCTS                              PZ
PULSAR AT 20 MG ORAL 30 TAB RECUB     30
LIPITOR 40 MG 1+1 ORAL 15 TAB         15
LOFTYL 150 MG ORAL 30 TAB             30
SOMAZINA 500 MG ORAL 10 COMP RECUB    10
LOFTYL 30 TAB 150 MG ORAL             30

What can I change in my line to get the result above?

Thank you for reading and for your help.

Answer 1

You can use

import pandas as pd
df = pd.DataFrame({'PRODUCTS':['PULSAR AT 20 MG ORAL 30 TAB RECUB','LIPITOR 40 MG 1+1 ORAL 15 TAB','LOFTYL 150 MG ORAL 30 TAB','SOMAZINA 500 MG ORAL 10 COMP RECUB','LOFTYL 30 TAB 150 MG ORAL']})
rx = r'(?i)(\d*\.?\d+)\s*(?:tab|cap|grag|past|sob|comp)\b'
df['PZ'] = df['PRODUCTS'].str.extract(rx)
>>> df
                             PRODUCTS  PZ
0   PULSAR AT 20 MG ORAL 30 TAB RECUB  30
1       LIPITOR 40 MG 1+1 ORAL 15 TAB  15
2           LOFTYL 150 MG ORAL 30 TAB  30
3  SOMAZINA 500 MG ORAL 10 COMP RECUB  10
4           LOFTYL 30 TAB 150 MG ORAL  30
>>>

See the regex demo. Details:

(?i) - a case-insensitive inline modifier
(\d*\.?\d+) - Group 1: zero or more digits, an optional . and then one or more digits
\s* - zero or more whitespace chars
(?:tab|cap|grag|past|sob|comp) - a non-capturing group (so as not to interfere with Series.str.extract output) matching any of the alternative substrings inside it
\b - a word boundary.
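
The attempt in the question misfires because square brackets form a character class: `[tab|cap|grag|past|sob]` matches a single character out of `t`, `a`, `b`, `|`, `c`, `p`, `g`, `r`, `s`, `o`, not one of the words. The sketch below uses only the standard `re` module to show the difference on the LIPITOR row. The helper name `first_count` is made up for illustration.

```python
import re

# Pattern from the question: [...] is a character class, so it matches ONE
# character out of "tab|cpgrso" (case-insensitively), not one of the words.
char_class = re.compile(r'([\d]*\.*[\d]+)\s*[tab|cap|grag|past|sob]', re.IGNORECASE)

# A non-capturing group with real alternatives: one of the listed words must follow.
alternation = re.compile(r'(\d*\.?\d+)\s*(?:tab|cap|grag|past|sob|comp)', re.IGNORECASE)


def first_count(pattern, text):
    """Return group 1 of the first match, or None (hypothetical helper)."""
    match = pattern.search(text)
    return match.group(1) if match else None


row = 'LIPITOR 40 MG 1+1 ORAL 15 TAB'
print(first_count(char_class, row))   # prints 1: the "O" of ORAL is in the class
print(first_count(alternation, row))  # prints 15
```

On the other sample rows both patterns happen to agree, which is why the bug is easy to miss.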

Answer 2

Maybe...

Given a dataframe like the following (note: I made a product appear twice in one row as an example, in case this may happen)...

    PRODUCTS
0   PULSAR AT 20 MG ORAL 30 GRAG RECUB
1   LIPITOR 40 MG 1+1 ORAL 15 TAB
2   LOFTYL 150 GRAG ORAL 30 TAB
3   SOMAZINA 500 MG ORAL 10 COMP RECUB
4   LOFTYL 30 TAB 150 MG ORAL
5   *Keeps going more entries...*

Code:

import pandas as pd
import re

data = {'PRODUCTS': ["PULSAR AT 20 MG ORAL 30 GRAG RECUB", "LIPITOR 40 MG 1+1 ORAL 15 TAB",
                     "LOFTYL 150 GRAG ORAL 30 TAB", "SOMAZINA 500 MG ORAL 10 COMP RECUB",
                     "LOFTYL 30 TAB 150 MG ORAL", "*Keeps going more entries...*"]}

df = pd.DataFrame(data)

# maintain a list of products to find
products = ['TAB', 'COMP', 'AMP', 'SOB', 'PAST', 'GRAG']

def getProduct(x):
    # collect every number directly followed by a product name,
    # grouped in the order the names appear in `products`
    found = []
    for product in products:
        pattern = r'(\d+) ' + re.escape(product)
        found.extend(re.findall(pattern, x))
    # join multiple hits into one string; empty string when nothing matched
    return ", ".join(found)

df['PZ'] = [getProduct(row) for row in df['PRODUCTS']]

print(df)

Outputs:

    PRODUCTS                            PZ
0   PULSAR AT 20 MG ORAL 30 GRAG RECUB  30
1   LIPITOR 40 MG 1+1 ORAL 15 TAB       15
2   LOFTYL 150 GRAG ORAL 30 TAB         30, 150
3   SOMAZINA 500 MG ORAL 10 COMP RECUB  10
4   LOFTYL 30 TAB 150 MG ORAL           30
5   *Keeps going more entries...*
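
If a row can carry several counts, a vectorized alternative to the loop could look like the sketch below, assuming pandas is installed. It builds one alternation from the product list with `re.escape` and applies it with `Series.str.findall`. The names `rx` and `counts` are illustrative. Matches come back in order of appearance, so the row with two products reads `150, 30` rather than `30, 150`.

```python
import re
import pandas as pd

df = pd.DataFrame({'PRODUCTS': [
    'PULSAR AT 20 MG ORAL 30 GRAG RECUB',
    'LIPITOR 40 MG 1+1 ORAL 15 TAB',
    'LOFTYL 150 GRAG ORAL 30 TAB',
    'SOMAZINA 500 MG ORAL 10 COMP RECUB',
    'LOFTYL 30 TAB 150 MG ORAL',
]})

products = ['TAB', 'COMP', 'AMP', 'SOB', 'PAST', 'GRAG']

# One pattern for all products: a number, optional whitespace, then a product
# name ending at a word boundary.
rx = r'(\d+)\s*(?:' + '|'.join(re.escape(p) for p in products) + r')\b'

# findall returns, per row, the list of captured numbers in order of appearance.
counts = df['PRODUCTS'].str.findall(rx, flags=re.IGNORECASE)

# Join each list into a single string; rows without a match become ''.
df['PZ'] = counts.str.join(', ')

print(df)
```

Unlike the loop above, this variant is case-insensitive and tolerates any amount of whitespace between the number and the product name.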

Answer 1: 0, accepted, 2021-07-29 21:56:21
Answer 2: 0, 2021-07-29 22:26:30