
Extracting specific information from strings with varying patterns

import pandas as pd
df = pd.DataFrame({'Reference':["PO: TK42-8", 
                                "PO GQ5-42", 
                                "PO:HEA-238/239", 
                                "PO: 4501005609  Purchaser: Mariana Toledo Blanco", 
                                "FITN7-26", 
                                "PO#CP4-62",
                                "PO 4501004752  Purchaser Yang Gao / Split from S94964",
                                "GUANGDONG YOULONG ELECTRICAL APPLIANCES CO.,LTD // PO#GQY6-17"]
                   })

From the df above, I have been trying for a while, with minimal success, to extract two pieces of information where they are available, creating two new columns as shown in the desired df below.

df2 = pd.DataFrame({'Reference':["PO: TK42-8", 
                                "PO GQ5-42", 
                                "PO:HEA-238/239", 
                                "PO: 4501005609  Purchaser: Mariana Toledo Blanco", 
                                "FITN7-26", 
                                "PO#CP4-62",
                                "PO 4501004752  Purchaser Yang Gao / Split from S94964",
                                "GUANGDONG YOULONG ELECTRICAL APPLIANCES CO.,LTD // PO#GQY6-17"],
                    
                    "PO":["TK42-8", "GQ5-42", "HEA-238/239", "4501005609", "FITN7-26","CP4-62", "4501004752", "GQY6-17" ],
                    "Purchaser":["", "", "", "Mariana Toledo Blanco", "","", "Yang Gao", "" ],
                   })

So far, I have had some success with:

df['PO'] = df['Reference'].str.extract(r"PO:.*?([ \w.\S-]+)")
df['Purchaser'] = df['Reference'].str.extract(r"Purchaser.*?([ \w.*]+)")

However, I can't work out how to write the pattern inside each extract call so that it covers all the subtle variations in each case.

Extract POs with:

>>> df['Reference'].str.extract(r"(?:^(?=[A-Z\d/-]+$)|\bPO\W*)([A-Z\d/-]+)")
             0
0       TK42-8
1       GQ5-42
2  HEA-238/239
3   4501005609
4     FITN7-26
5       CP4-62
6   4501004752
7      GQY6-17

EXPLANATION

--------------------------------------------------------------------------------
  (?:                      group, but do not capture:
--------------------------------------------------------------------------------
    ^                        the beginning of the string
--------------------------------------------------------------------------------
    (?=                      look ahead to see if there is:
--------------------------------------------------------------------------------
      [A-Z\d/-]+               any character of: 'A' to 'Z', digits
                               (0-9), '/', '-' (1 or more times
                               (matching the most amount possible))
--------------------------------------------------------------------------------
      $                        before an optional \n, and the end of
                               the string
--------------------------------------------------------------------------------
    )                        end of look-ahead
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    \b                       the boundary between a word char (\w)
                             and something that is not a word char
--------------------------------------------------------------------------------
    PO                       'PO'
--------------------------------------------------------------------------------
    \W*                      non-word characters (all but a-z, A-Z, 0-
                             9, _) (0 or more times (matching the
                             most amount possible))
--------------------------------------------------------------------------------
  )                        end of grouping
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    [A-Z\d/-]+               any character of: 'A' to 'Z', digits (0-
                             9), '/', '-' (1 or more times (matching
                             the most amount possible))
--------------------------------------------------------------------------------
  )                        end of \1
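The breakdown above can also be written straight into the pattern with re.VERBOSE. This is only a sketch using the standard re module; the names PO_PATTERN and extract_po are made up for illustration and are not part of the answer. The last call shows a caveat: because nothing is required between 'PO' and the code, a word that merely starts with "PO" is also treated as the label.

```python
import re

# The same PO pattern, spread out with re.VERBOSE so each part carries a comment.
# PO_PATTERN and extract_po are hypothetical names used only for this sketch.
PO_PATTERN = re.compile(
    r"""
    (?:
        ^(?=[A-Z\d/-]+$)   # branch 1: the whole string is a bare code, e.g. FITN7-26
      |
        \bPO\W*            # branch 2: the label PO plus any separators after it
    )
    ([A-Z\d/-]+)           # group 1: the PO code itself
    """,
    re.VERBOSE,
)


def extract_po(text):
    """Return the first PO code found in text, or None if there is none."""
    match = PO_PATTERN.search(text)
    return match.group(1) if match else None


print(extract_po("PO: TK42-8"))         # label followed by ': '
print(extract_po("PO#CP4-62"))          # label followed by '#'
print(extract_po("FITN7-26"))           # bare code, matched by branch 1
print(extract_po("no code here"))       # nothing to extract
print(extract_po("Ship to POLAND 12"))  # caveat: 'PO' inside POLAND is taken as the label
```

If false positives like the last one matter for your data, requiring at least one separator (\W+ instead of \W*) might help, but that would need checking against every row format you expect.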

Extract purchasers with:

>>> df['Reference'].str.extract(r"\bPurchaser\W+(\w(?:[\s\w]*\w)?)").fillna("")
                       0
0                       
1                       
2                       
3  Mariana Toledo Blanco
4                       
5                       
6               Yang Gao
7                       

EXPLANATION

--------------------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
--------------------------------------------------------------------------------
  Purchaser                'Purchaser'
--------------------------------------------------------------------------------
  \W+                      non-word characters (all but a-z, A-Z, 0-
                           9, _) (1 or more times (matching the most
                           amount possible))
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    \w                       word characters (a-z, A-Z, 0-9, _)
--------------------------------------------------------------------------------
    (?:                      group, but do not capture (optional
                             (matching the most amount possible)):
--------------------------------------------------------------------------------
      [\s\w]*                  any character of: whitespace (\n, \r,
                               \t, \f, and " "), word characters (a-
                               z, A-Z, 0-9, _) (0 or more times
                               (matching the most amount possible))
--------------------------------------------------------------------------------
      \w                       word characters (a-z, A-Z, 0-9, _)
--------------------------------------------------------------------------------
    )?                       end of grouping
--------------------------------------------------------------------------------
  )                        end of \1
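Putting the two patterns together, the new columns from the question can be built roughly like this. It is a sketch against the sample data only; the variable names po_pattern and purchaser_pattern are mine, and expand=False is used so that each extract call returns a Series rather than a one-column DataFrame.

```python
import pandas as pd

df = pd.DataFrame({'Reference': ["PO: TK42-8",
                                 "PO GQ5-42",
                                 "PO:HEA-238/239",
                                 "PO: 4501005609  Purchaser: Mariana Toledo Blanco",
                                 "FITN7-26",
                                 "PO#CP4-62",
                                 "PO 4501004752  Purchaser Yang Gao / Split from S94964",
                                 "GUANGDONG YOULONG ELECTRICAL APPLIANCES CO.,LTD // PO#GQY6-17"]})

# The two patterns from this answer (variable names are only for this sketch)
po_pattern = r"(?:^(?=[A-Z\d/-]+$)|\bPO\W*)([A-Z\d/-]+)"
purchaser_pattern = r"\bPurchaser\W+(\w(?:[\s\w]*\w)?)"

# With a single capture group, expand=False gives back a Series
df['PO'] = df['Reference'].str.extract(po_pattern, expand=False)

# Rows without a purchaser come back as NaN; replace them with empty strings
df['Purchaser'] = df['Reference'].str.extract(purchaser_pattern, expand=False).fillna("")

print(df[['PO', 'Purchaser']])
```

Both patterns are case-sensitive and were only checked against the eight sample rows, so other reference formats may need the character classes adjusted.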

