How to extract specific numbers from a pandas column using regex?

Question

Given the following column in a pandas dataframe:


Name: Hockey Canada; NAICS: 711211

Name: Hockey Canada; NAICS: 711211

Name: International AIDS Society; NAICS: 813212

Name: Rogers Communications Inc; NAICS: 517112, 551112; Name: Hockey Canada; NAICS: 711211

Name: Health Benefits Trust; NAICS: 524114; Name: Hockey Canada; NAICS: 711211; Name: National Equity Fund; NAICS: 523999, 531110

I'd like to extract the NAICS codes from each row (where they exist) in the pandas column. The desired result is shown in the column "expected_result".

711211
711211
813212

517112; 551112; 711211

524114; 711211; 523999; 531110

Some rows contain NaN. Any suggestion using regex and Python would be very helpful. I tried the regex findall function but got an error.

I wrote this function:

def find_number(text):
    num = re.findall(r'[0-9]+',text)
    return " ".join(num)

I used it with apply like this:

df['NAICS']=df['Company'].apply(lambda x: find_number(x))

I got this error:

KeyError Traceback (most recent call last) Input In [81], in <cell line: 1>() ----> 1 df['NAICS']=df['Company'].apply(lambda x: find_number(x))

Answer 1

There's likely a more code-golfy or more dataframe-friendly way to pull this off, but the overall logic will look something like:

import pandas as pd
import re

NAICSdf = pd.DataFrame(['Name: Hockey Canada; NAICS: 711211','Name: Hockey Canada; NAICS: 711211','Name: International AIDS Society; NAICS: 813212','Name: Rogers Communications Inc; NAICS: 517112, 551112; Name: Hockey Canada; NAICS: 711211','Name: Health Benefits Trust; NAICS: 524114; Name: Hockey Canada; NAICS: 711211; Name: National Equity Fund; NAICS: 523999, 531110'], columns=['organization'], )

def findNAICS(organization):
    NAICSList = []
    for found in re.findall(r'NAICS:\s[0-9, ]*', organization):
        for NAICS in found.split(': ')[1].split(', '):
            NAICSList.append(NAICS)
    return '; '.join(NAICSList)

NAICSdf['NAICS'] = NAICSdf['organization'].apply(findNAICS)
print(NAICSdf)

That will create a new column in your dataframe with a semicolon-delimited list of the NAICS codes found in each string.
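The question mentions NaN in some rows, and `re.findall` raises a `TypeError` when it is handed a float such as NaN. As a minimal sketch, assuming missing cells should simply yield an empty string, the same idea can be guarded with a type check. The name `find_naics_safe` and the two-row sample frame are made up for illustration.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical sample data: one well-formed row and one missing value.
df = pd.DataFrame(
    {"organization": ["Name: Rogers Communications Inc; NAICS: 517112, 551112", np.nan]}
)


def find_naics_safe(value):
    # Non-string cells (NaN, None) return an empty result instead of raising.
    if not isinstance(value, str):
        return ""
    codes = []
    # Capture the digits/commas/spaces that follow each "NAICS:" label.
    for found in re.findall(r"NAICS:\s*([0-9, ]*)", value):
        # Pull the individual codes out of the captured chunk.
        codes.extend(re.findall(r"\d+", found))
    return "; ".join(codes)


df["NAICS"] = df["organization"].apply(find_naics_safe)
print(df["NAICS"].tolist())
```

Whether an empty string or NaN is the better placeholder for missing rows depends on how the column is used afterwards.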

Answer 2

You can use

df['expected_result'] = df['organization'].astype(str).str.findall(r'\bNAICS:\s*(\d+(?:\s*,\s*\d+)*)').str.join(' ').str.findall(r'\d+').str.join("; ")

Details:

.str.findall(r'\bNAICS:\s*(\d+(?:\s*,\s*\d+)*)') - finds all comma-separated numbers after NAICS:
.str.join(' ') - joins the found matches with a space
.str.findall(r'\d+') - extracts the numbers separately
.str.join("; ") - joins them with a semicolon and a space
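To make the four steps easier to follow, here is a small sketch that splits the chain into intermediate variables and prints each one. The single-row Series and the names `step1` to `step4` are illustrative only.

```python
import pandas as pd

# Hypothetical one-row sample with two NAICS groups.
s = pd.Series(["Name: A; NAICS: 517112, 551112; Name: B; NAICS: 711211"])

# Step 1: one captured string per "NAICS:" label.
step1 = s.str.findall(r"\bNAICS:\s*(\d+(?:\s*,\s*\d+)*)")
# Step 2: glue the captured strings together with a space.
step2 = step1.str.join(" ")
# Step 3: pull every run of digits out of the glued string.
step3 = step2.str.findall(r"\d+")
# Step 4: join the individual codes with "; ".
step4 = step3.str.join("; ")

print(step1[0])  # ['517112, 551112', '711211']
print(step2[0])  # 517112, 551112 711211
print(step3[0])  # ['517112', '551112', '711211']
print(step4[0])  # 517112; 551112; 711211
```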

See a Pandas test:

import pandas as pd
df = pd.DataFrame({'organization':['NAICS: 12342; NAICS: 55555, 66667', 'NAICS:9999']})
df['expected_result'] = df['organization'].astype(str).str.findall(r'\bNAICS:\s*(\d+(?:\s*,\s*\d+)*)').str.join(' ').str.findall(r'\d+').str.join("; ")

Output:

>>> df
                        organization      expected_result
0  NAICS: 12342; NAICS: 55555, 66667  12342; 55555; 66667
1                         NAICS:9999                 9999
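The test above has no missing values. As a hedged sketch for the NaN rows mentioned in the question, one option is to replace missing cells with an empty string before matching, so they end up as empty results. Using `fillna("")` here is a substitution for illustration, not part of the original answer, and the sample frame is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one normal row and one missing value.
df = pd.DataFrame({"organization": ["Name: Hockey Canada; NAICS: 711211", np.nan]})

pattern = r"\bNAICS:\s*(\d+(?:\s*,\s*\d+)*)"
df["expected_result"] = (
    df["organization"]
    .fillna("")            # missing cells become empty strings
    .str.findall(pattern)  # list of captured code groups per row
    .str.join(" ")
    .str.findall(r"\d+")   # individual codes
    .str.join("; ")
)
print(df["expected_result"].tolist())
```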

Answer 3

If you want to solve this with a regex, you can do the following. It simply looks for runs of six digits. Since some NAICS entries list several codes in a row, I did not make the pattern more precise. That might cause some inaccuracy if the data contains other six-digit groups.

import re

str1 = 'Name: Hockey Canada; NAICS: 711211'
str2 = 'Name: Rogers Communications Inc; NAICS: 517112, 551112; Name: Hockey Canada; NAICS: 711211'

data = [str1, str2]
# Raw string so that \d is passed to the regex engine unchanged
results = [re.findall(r'\d{6}', entry) for entry in data]

print(results)

Output:

[['711211'], ['517112', '551112', '711211']]

You might also want to change the delimiter, depending on how you intend to process the data before entering it into the records. The result stores a list of hits per row, so it can be processed as you see fit.
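As a rough sketch of that last step, the per-row lists can be joined into the "; "-delimited strings the question asks for. The word boundaries (`\b`) are an added assumption: they stop the pattern from matching six digits inside a longer number, which narrows the inaccuracy mentioned above but does not remove it.

```python
import re

rows = [
    "Name: Hockey Canada; NAICS: 711211",
    "Name: Rogers Communications Inc; NAICS: 517112, 551112; Name: Hockey Canada; NAICS: 711211",
]

# \b on both sides: match only stand-alone six-digit groups.
hits = [re.findall(r"\b\d{6}\b", row) for row in rows]
# Join each row's hits with the delimiter used in the question.
joined = ["; ".join(row_hits) for row_hits in hits]

print(joined)  # ['711211', '517112; 551112; 711211']
```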

3 answers

Answer 1: 1, accepted, 2022-07-27 15:25:44

Answer 2: 1, 2022-07-27 15:27:18

Answer 3: 0, 2022-07-27 15:39:27
