Extracting specific information from strings with varying patterns
import pandas as pd
df = pd.DataFrame({'Reference':["PO: TK42-8",
"PO GQ5-42",
"PO:HEA-238/239",
"PO: 4501005609 Purchaser: Mariana Toledo Blanco",
"FITN7-26",
"PO#CP4-62",
"PO 4501004752 Purchaser Yang Gao / Split from S94964",
"GUANGDONG YOULONG ELECTRICAL APPLIANCES CO.,LTD // PO#GQY6-17"]
})
From the above df, I've been trying for a while, with minimal success, to extract two pieces of information where available, thereby creating two new columns as shown in the desired df below.
df2 = pd.DataFrame({'Reference':["PO: TK42-8",
"PO GQ5-42",
"PO:HEA-238/239",
"PO: 4501005609 Purchaser: Mariana Toledo Blanco",
"FITN7-26",
"PO#CP4-62",
"PO 4501004752 Purchaser Yang Gao / Split from S94964",
"GUANGDONG YOULONG ELECTRICAL APPLIANCES CO.,LTD // PO#GQY6-17"],
"PO":["TK42-8", "GQ5-42", "HEA-238/239", "4501005609", "FITN7-26","CP4-62", "4501004752", "GQY6-17" ],
"Purchaser":["", "", "", "Mariana Toledo Blanco", "","", "Yang Gao", "" ],
})
So far, I've had some success with:
df['PO'] = df['Reference'].str.extract(r"PO:.*?([ \w.\S-]+)")
df['Purchaser'] = df['Reference'].str.extract(r"Purchaser.*?([ \w.*]+)")
However, I can't work out how to write each pattern so that it handles all the subtle variations in each case.
Extract POs with:
>>> df['Reference'].str.extract(r"(?:^(?=[A-Z\d/-]+$)|\bPO\W*)([A-Z\d/-]+)")
0
0 TK42-8
1 GQ5-42
2 HEA-238/239
3 4501005609
4 FITN7-26
5 CP4-62
6 4501004752
7 GQY6-17
EXPLANATION
--------------------------------------------------------------------------------
(?: group, but do not capture:
--------------------------------------------------------------------------------
^ the beginning of the string
--------------------------------------------------------------------------------
(?= look ahead to see if there is:
--------------------------------------------------------------------------------
[A-Z\d/-]+ any character of: 'A' to 'Z', digits
(0-9), '/', '-' (1 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
$ before an optional \n, and the end of
the string
--------------------------------------------------------------------------------
) end of look-ahead
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
\b the boundary between a word char (\w)
and something that is not a word char
--------------------------------------------------------------------------------
PO 'PO'
--------------------------------------------------------------------------------
\W* non-word characters (all but a-z, A-Z, 0-
9, _) (0 or more times (matching the
most amount possible))
--------------------------------------------------------------------------------
) end of grouping
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
[A-Z\d/-]+ any character of: 'A' to 'Z', digits (0-
9), '/', '-' (1 or more times (matching
the most amount possible))
--------------------------------------------------------------------------------
) end of \1
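The breakdown above can also be kept next to the pattern itself. The following is a minimal sketch, not part of the original answer, that writes the same PO pattern with `re.VERBOSE` so each piece carries its own comment. The names `PO_RE` and `extract_po` are hypothetical helpers introduced only for illustration.

```python
import re

# Same PO pattern as above, spelled out with re.VERBOSE so the
# explanation lives beside each part. PO_RE is a hypothetical name.
PO_RE = re.compile(
    r"""
    (?:                     # non-capturing group with two alternatives
        ^(?=[A-Z\d/-]+$)    # either: the whole string is a bare PO code
      |                     # or
        \bPO\W*             # 'PO' followed by optional separators
    )
    ([A-Z\d/-]+)            # capture the PO code itself
    """,
    re.VERBOSE,
)


def extract_po(text):
    """Return the first PO code found in text, or None if there is none."""
    match = PO_RE.search(text)
    return match.group(1) if match else None
```

Note that `\bPO` has no boundary after it, so a word that merely starts with "PO" followed by capitals would also match. That does not occur in the sample data, but it may be worth tightening for messier input.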
Extract purchasers with:
>>> df['Reference'].str.extract(r"\bPurchaser\W+(\w(?:[\s\w]*\w)?)").fillna("")
0
0
1
2
3 Mariana Toledo Blanco
4
5
6 Yang Gao
7
EXPLANATION
--------------------------------------------------------------------------------
\b the boundary between a word char (\w) and
something that is not a word char
--------------------------------------------------------------------------------
Purchaser 'Purchaser'
--------------------------------------------------------------------------------
\W+ non-word characters (all but a-z, A-Z, 0-
9, _) (1 or more times (matching the most
amount possible))
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
\w word characters (a-z, A-Z, 0-9, _)
--------------------------------------------------------------------------------
(?: group, but do not capture (optional
(matching the most amount possible)):
--------------------------------------------------------------------------------
[\s\w]* any character of: whitespace (\n, \r,
\t, \f, and " "), word characters (a-
z, A-Z, 0-9, _) (0 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
\w word characters (a-z, A-Z, 0-9, _)
--------------------------------------------------------------------------------
)? end of grouping
--------------------------------------------------------------------------------
) end of \1
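Putting the two patterns together, a minimal sketch of how the desired columns could be built is shown below. It assumes the sample data from the question. The constant names `PO_PATTERN` and `PURCHASER_PATTERN` are introduced here for illustration, and `expand=False` is used so that each call returns a Series that can be assigned directly to a column.

```python
import pandas as pd

df = pd.DataFrame({"Reference": [
    "PO: TK42-8",
    "PO GQ5-42",
    "PO:HEA-238/239",
    "PO: 4501005609 Purchaser: Mariana Toledo Blanco",
    "FITN7-26",
    "PO#CP4-62",
    "PO 4501004752 Purchaser Yang Gao / Split from S94964",
    "GUANGDONG YOULONG ELECTRICAL APPLIANCES CO.,LTD // PO#GQY6-17",
]})

# Hypothetical constant names; the patterns are the ones explained above.
PO_PATTERN = r"(?:^(?=[A-Z\d/-]+$)|\bPO\W*)([A-Z\d/-]+)"
PURCHASER_PATTERN = r"\bPurchaser\W+(\w(?:[\s\w]*\w)?)"

# expand=False returns a Series when the pattern has a single capture group.
df["PO"] = df["Reference"].str.extract(PO_PATTERN, expand=False)

# Rows without a purchaser come back as NaN; replace them with "".
df["Purchaser"] = (
    df["Reference"].str.extract(PURCHASER_PATTERN, expand=False).fillna("")
)

print(df[["PO", "Purchaser"]])
```

Rows where the PO pattern finds nothing would be left as NaN in the `PO` column. With the sample data every row matches.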