How can I extract specific numbers from a pandas column using regular expressions?

Question

Given the following column in a pandas dataframe:


Name: Hockey Canada; NAICS: 711211

Name: Hockey Canada; NAICS: 711211

Name: International AIDS Society; NAICS: 813212

Name: Rogers Communications Inc; NAICS: 517112, 551112; Name: Hockey Canada; NAICS: 711211

Name: Health Benefits Trust; NAICS: 524114; Name: Hockey Canada; NAICS: 711211; Name: National Equity Fund; NAICS: 523999, 531110

I want to extract the NAICS codes from each row of the pandas column (wherever they exist). The desired result is shown in the "expected_result" column.

711211
711211
813212

517112; 551112; 711211

524114; 711211; 523999; 531110

Some rows contain NaN. Any suggestions using regular expressions in Python would be very helpful. I tried the regex findall function but got an error.

I wrote this function:

def find_number(text):
    num = re.findall(r'[0-9]+',text)
    return " ".join(num)

I use it in the apply function, like this:

df['NAICS']=df['Company'].apply(lambda x: find_number(x))

I got this error:

KeyError Traceback (most recent call last) Input In [81], in <cell line: 1>() ----> 1 df['NAICS']=df['Company'].apply(lambda x: find_number(x))

Answer 1

There may be a more code-golfed or friendlier way to accomplish this, but the overall logic looks like:

import pandas as pd
import re

NAICSdf = pd.DataFrame(['Name: Hockey Canada; NAICS: 711211','Name: Hockey Canada; NAICS: 711211','Name: International AIDS Society; NAICS: 813212','Name: Rogers Communications Inc; NAICS: 517112, 551112; Name: Hockey Canada; NAICS: 711211','Name: Health Benefits Trust; NAICS: 524114; Name: Hockey Canada; NAICS: 711211; Name: National Equity Fund; NAICS: 523999, 531110'], columns=['organization'], )

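def findNAICS(organization):
    # NaN cells are floats, not strings; return an empty result for them.
    if not isinstance(organization, str):
        return ''
    NAICSList = []
    # Each match looks like 'NAICS: 517112, 551112'.
    for found in re.findall(r'NAICS:\s[0-9, ]*', organization):
        # Drop the 'NAICS: ' label, then split the comma-separated codes.
        for NAICS in found.split(': ')[1].split(', '):
            NAICSList.append(NAICS.strip())
    return '; '.join(NAICSList)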

NAICSdf['NAICS'] = NAICSdf['organization'].apply(findNAICS)
print(NAICSdf)

This creates a new column in your dataframe containing a semicolon-separated list of the NAICS codes from your strings.
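The question mentions that some rows are NaN, and `re.findall` raises a `TypeError` when it is handed a float instead of a string. Below is a minimal sketch of one way to guard against that, using only the standard library. The name `extract_naics` and the sample rows are assumptions for illustration, not part of the answer above.

```python
import re

def extract_naics(value):
    """Return the NAICS codes in one cell as a '; '-separated string (hypothetical helper)."""
    # NaN cells arrive as floats, so anything that is not a string yields ''.
    if not isinstance(value, str):
        return ''
    codes = []
    # Capture the comma-separated digits that follow each 'NAICS:' label.
    for group in re.findall(r'NAICS:\s*([0-9][0-9,\s]*)', value):
        codes.extend(re.findall(r'[0-9]+', group))
    return '; '.join(codes)

rows = [
    'Name: Hockey Canada; NAICS: 711211',
    float('nan'),  # stands in for a missing cell
    'Name: Rogers Communications Inc; NAICS: 517112, 551112; Name: Hockey Canada; NAICS: 711211',
]
results = [extract_naics(row) for row in rows]
print(results)
```

A function like this could be passed to `Series.apply` in the same way as `findNAICS`.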

Answer 2

You can use

df['expected_result'] = df['organization'].astype(str).str.findall(r'\bNAICS:\s*(\d+(?:\s*,\s*\d+)*)').str.join(' ').str.findall(r'\d+').str.join("; ")

Details:

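.str.findall(r'\bNAICS:\s*(\d+(?:\s*,\s*\d+)*)') - finds every comma-separated run of numbers after NAICS:
.str.join(' ') - joins the matches found with a space
.str.findall(r'\d+') - extracts the numbers individually
.str.join("; ") - joins them with ; and a space.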
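To make these four steps concrete, here is a sketch that applies the same two patterns to a single string with the standard `re` module instead of the pandas `.str` accessor. The sample string and variable names are assumptions chosen for illustration.

```python
import re

text = ('Name: Rogers Communications Inc; NAICS: 517112, 551112; '
        'Name: Hockey Canada; NAICS: 711211')

# Step 1: capture each comma-separated run of numbers that follows 'NAICS:'.
groups = re.findall(r'\bNAICS:\s*(\d+(?:\s*,\s*\d+)*)', text)
# Step 2: join the captured runs with a space.
joined = ' '.join(groups)
# Step 3: pull the individual numbers back out.
numbers = re.findall(r'\d+', joined)
# Step 4: join them with '; '.
result = '; '.join(numbers)

print(groups)   # ['517112, 551112', '711211']
print(joined)   # 517112, 551112 711211
print(numbers)  # ['517112', '551112', '711211']
print(result)   # 517112; 551112; 711211
```

Because the first pattern is anchored on the `NAICS:` label, digits elsewhere in the string are ignored.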

See the pandas test:

import pandas as pd
df = pd.DataFrame({'organization':['NAICS: 12342; NAICS: 55555, 66667', 'NAICS:9999']})
df['expected_result'] = df['organization'].astype(str).str.findall(r'\bNAICS:\s*(\d+(?:\s*,\s*\d+)*)').str.join(' ').str.findall(r'\d+').str.join("; ")

Output:

>>> df
                        organization      expected_result
0  NAICS: 12342; NAICS: 55555, 66667  12342; 55555; 66667
1                         NAICS:9999                 9999

Answer 3

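If you want to sort this out with a regular expression, you can do it like this: it simply looks for recurring runs of 6 digits grouped together. There appear to be cases where NAICS has several records in a row, and I did not try to be any more precise than that. If the data contains other records with 6-digit groupings, this may lead to some inaccuracies.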

import re

str1 = 'Name: Hockey Canada; NAICS: 711211'
str2 = 'Name: Rogers Communications Inc; NAICS: 517112, 551112; Name: Hockey Canada; NAICS: 711211'

data = [str1, str2]
# Raw string so that \d reaches the regex engine unchanged.
results = [re.findall(r'\d{6}', entry) for entry in data]

print(results)

Output:

[['711211'], ['517112', '551112', '711211']]

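You may also want to change the delimiter if needed, depending on how you intend to process the data before feeding it into your records. The list stores a list of hits for each row, so it can be sorted in whatever way you see fit.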
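As a sketch of the delimiter point and of the earlier caveat about other 6-digit groupings: the per-row lists can be joined with any delimiter, and adding `\b` word boundaries keeps the pattern from matching six digits inside a longer number. The second sample row, with its seven-digit ID, is made up for illustration.

```python
import re

data = [
    'Name: Hockey Canada; NAICS: 711211',
    'Name: Example Org; ID: 1234567; NAICS: 517112, 551112',  # made-up row
]

# Without boundaries, the first six digits of the 7-digit ID also match.
loose = [re.findall(r'\d{6}', entry) for entry in data]
# With \b on both sides, only standalone 6-digit numbers match.
strict = [re.findall(r'\b\d{6}\b', entry) for entry in data]
# Join each row's hits with a delimiter of your choice.
joined = ['; '.join(hits) for hits in strict]

print(loose)   # [['711211'], ['123456', '517112', '551112']]
print(strict)  # [['711211'], ['517112', '551112']]
print(joined)  # ['711211', '517112; 551112']
```

Word boundaries still cannot tell a NAICS code from some other standalone 6-digit number, so anchoring on the `NAICS:` label remains the safer choice when the data is mixed.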

3 answers

Answer 1
1 Accepted 2022-07-27 15:25:44

Answer 2
1 2022-07-27 15:27:18

Answer 3
0 2022-07-27 15:39:27
