Python regex to strip punctuation with conditions

Question

I have a dataframe of various company names, and I need to be able to perform a groupby on them. However, the companies are often law firms, whose names can be written in a variety of different ways (e.g. "Akin Gump", "Akin, Gump", "Akin,Gump", "Akin Gump Strauss Hauer & Feld LLP", "Akin Gump Strauss Hauer Feld"; you get the idea).

My current code, below, works well in most situations, except where the spacing is wrong in the original text, like "Akin,Gump" (which becomes "AkinGump") or "Akin Gump Strauss Hauer & Feld LLP", which becomes "Akin Gump Strauss Hauer  Feld" (two spaces between Hauer and Feld).

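import string

# Python 2 idiom: an identity table plus a delete-characters argument strips all punctuation
table = string.maketrans("", "")
company_name = company_name.translate(table, string.punctuation)
# Drop common legal-entity suffixes
stopwords = ['LLC', 'INC', 'PLLC', 'LP', 'LTD', 'PLC', 'LLP']
company_name = ' '.join(filter(lambda x: x not in stopwords, company_name.split()))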

I assume there is a regex solution, but I am not good with regex at all.

Answer 1

I'd make a first pass with a regex to replace the offending characters so that they don't cause issues in the rest of the code:

import re

company_name = re.sub(r" *[&,] *", " ", company_name)  # add any other special characters you might want to the character class

This replaces each of the listed special characters, together with any spaces around it, with a single space, so the result passes through the rest of your code without issue.
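For anyone on Python 3, here is a rough sketch of how the regex pass and the question's cleanup could be combined into one function. The function name `normalize_company_name` is made up for illustration, and the three-argument `str.maketrans` call is swapped in for the Python 2 `translate(table, string.punctuation)` idiom, which no longer works in Python 3. The suffix list is copied from the question and is matched exactly, as in the original code.

```python
import re
import string

# Suffixes copied from the question's stopword list
STOPWORDS = {'LLC', 'INC', 'PLLC', 'LP', 'LTD', 'PLC', 'LLP'}

def normalize_company_name(name):
    # Hypothetical helper, not part of the original post.
    # Replace "&" or "," plus any surrounding spaces with a single space
    name = re.sub(r" *[&,] *", " ", name)
    # Python 3 way to delete all remaining punctuation characters
    name = name.translate(str.maketrans("", "", string.punctuation))
    # split() with no arguments also collapses leftover runs of whitespace
    return " ".join(word for word in name.split() if word not in STOPWORDS)
```

With this, "Akin,Gump" and "Akin, Gump" should both come out as "Akin Gump", so they land in the same group. Applied to a dataframe column (e.g. via `.apply`) before the groupby, it would give one key per firm spelling, though short and long forms of the same firm name would still need separate handling.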

Answer 1 posted 2016-07-25 19:27:26
