Pattern Match in List of Strings, Create New Column in pandas
I have a pandas dataframe with the following general format:
id,product_name_extract
1,00012CDN
2,14311121NDC
3,NDC37ba
4,47CD27
I also have a list of product codes I would like to match (unfortunately, the names come from NLP extraction, so it will not be a clean match). I then want to create a new column holding the matching list value:
product_name = ['12CDN','21NDC','37ba','7CD2']
id,product_name_extract,product_name_mapped
1,00012CDN,12CDN
2,14311121NDC,21NDC
3,NDC37ba,37ba
4,47CD27,7CD2
I am not too worried about there being collisions.
This would be easy enough if I just needed a True/False indicator: I could use contains with the list values joined together with "|" for alternation. But I am a bit stumped on how to create a column that holds the exact matched value.
Any tips or tricks appreciated!
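For concreteness, the True/False version I have in mind looks roughly like the sketch below. The `has_match` column name is only an illustration, and the frame is rebuilt from the sample data above so the snippet runs on its own.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "product_name_extract": ["00012CDN", "14311121NDC", "NDC37ba", "47CD27"],
})
product_name = ["12CDN", "21NDC", "37ba", "7CD2"]

# True where any code in the list appears as a substring of the extracted name.
# "has_match" is a hypothetical column name used only for this sketch.
df["has_match"] = df["product_name_extract"].str.contains("|".join(product_name))
```

This tells me whether a row matched, but not which code matched.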
Since you're not worried about collisions, you can join your product_name list with the | operator, and use that as a regex:
df['product_name_mapped'] = (df.product_name_extract.str
                             .findall('|'.join(product_name))
                             .str[0])
Result:
>>> df
   id product_name_extract product_name_mapped
0   1             00012CDN               12CDN
1   2          14311121NDC               21NDC
2   3              NDC37ba                37ba
3   4               47CD27                7CD2
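If the codes might ever contain regex metacharacters (such as "." or "+"), a slightly more defensive variant is possible. The sketch below is an alternative rather than a correction: it assumes the same sample data, escapes each code with re.escape, tries longer codes first, and uses str.extract with a single capture group in place of findall. Rows with no match end up as NaN.

```python
import re
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "product_name_extract": ["00012CDN", "14311121NDC", "NDC37ba", "47CD27"],
})
product_name = ["12CDN", "21NDC", "37ba", "7CD2"]

# Escape each code so it is matched literally, and put longer codes first
# so that a short code cannot win over a longer one starting at the same position.
codes = sorted(product_name, key=len, reverse=True)
pattern = "(" + "|".join(re.escape(code) for code in codes) + ")"

# expand=False returns a Series holding the first match per row (NaN if none).
df["product_name_mapped"] = df["product_name_extract"].str.extract(pattern, expand=False)
```

Like the findall version, this keeps only the first match in each string, so it relies on the same "collisions don't matter" assumption.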