
Python regex stripping punctuation with conditions

I have a dataframe of various company names, and I need to be able to perform a groupby on them. However, the companies are often law firms, whose names can be written in a variety of ways (e.g. "Akin Gump", "Akin, Gump", "Akin,Gump", "Akin Gump Strauss Hauer & Feld LLP", "Akin Gump Strauss Hauer Feld", you get the idea).

My current code, below, works well in most situations, except where the spacing is wrong in the original text: "Akin,Gump" becomes "AkinGump", and "Akin Gump Strauss Hauer & Feld LLP" becomes "Akin Gump Strauss Hauer  Feld" (two spaces between Hauer and Feld).

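import string

# Python 2 only: translate(table, deletechars) deletes every punctuation character
table = string.maketrans("", "")
company_name = company_name.translate(table, string.punctuation)
# Drop common legal suffixes (exact, case-sensitive match)
stopwords = ['LLC', 'INC', 'PLLC', 'LP', 'LTD', 'PLC', 'LLP']
company_name = ' '.join(filter(lambda x: x not in stopwords, company_name.split()))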

I assume there is a regex solution, but I am not good at regex at all.

I'd make a first pass with a regex to replace the offending characters so that they don't cause issues in the rest of the code:

import re

# re.sub returns a new string, so assign the result back
company_name = re.sub(r" *[&,] *", " ", company_name)  # add any other special characters you might want

This replaces each listed special character, together with any spaces around it, with a single space, so the rest of your code then runs without issue.
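
For reference, here is a rough sketch of how the regex pass might be combined with the question's punctuation stripping and suffix filtering. The function name normalize_company_name is just a placeholder of mine, and I'm assuming Python 3, where str.maketrans with three arguments takes the place of the two-argument translate call from Python 2. Treat it as one possible arrangement, not a tested drop-in:

```python
import re
import string

# Same suffix list as in the question; matching is exact and case-sensitive.
STOPWORDS = {'LLC', 'INC', 'PLLC', 'LP', 'LTD', 'PLC', 'LLP'}


def normalize_company_name(company_name):
    """Hypothetical helper: regex pass first, then the original cleanup."""
    # Turn '&' or ',' plus any surrounding spaces into a single space.
    name = re.sub(r" *[&,] *", " ", company_name)
    # Python 3 equivalent of the question's translate call:
    # delete every remaining punctuation character.
    name = name.translate(str.maketrans("", "", string.punctuation))
    # split() with no arguments also collapses any leftover runs of whitespace.
    return " ".join(word for word in name.split() if word not in STOPWORDS)
```

With this, "Akin,Gump" and "Akin, Gump" should both come out as "Akin Gump", so they end up in the same group. Note that a suffix written as "Llp" or "Inc." would not be dropped, since the suffix check only matches the upper-case forms in the list.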
