Pandas: replace values in dataframe with another dataframes values based on condition
Split and replace in one dataframe based on a condition with another dataframe in pandas
I have two dataframes, both loaded from SQL tables.
This is my first dataframe:
Original_Input          Cleansed_Input       Core_Input   Type_input
TECHNOLOGIES S.A        TECHNOLOGIES SA
A & J INDUSTRIES, LLC   A J INDUSTRIES LLC
A&S DENTAL SERVICES     AS DENTAL SERVICES
A.M.G Médicale Inc      AMG Mdicale Inc
AAREN SCIENTIFIC        AAREN SCIENTIFIC
My second dataframe is:
Name_Extension   Company_Type      Priority
co llc           Company LLC       2
Pvt ltd          Private Limited   8
Corp             Corporation       4
CO Ltd           Company Limited   3
inc              Incorporated      5
CO               Company           1
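I removed punctuation, non-ASCII characters, and digits, and stored the result in the Cleansed_Input column of df1.
The Cleansed_Input column of df1 needs to be checked against the Name_Extension column of df2. If a Cleansed_Input value ends with any Name_Extension value, that ending should be split off and placed in the Type_input column of df1, not as it stands but in its written-out form.
For example, if CO is present at the end of Cleansed_Input, it should be written out as Company and placed in the Type_input column, and the remaining text should go in the Core_Input column of df1. A priority is also given; I am not sure whether it is needed.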
Expected output:
Original_Input          Cleansed_Input       Core_Input       Type_input
TECHNOLOGIES S.A        TECHNOLOGIES SA      TECHNOLOGIES     SA
A & J INDUSTRIES, LLC   A J INDUSTRIES LLC   A J INDUSTRIES   LLC
A&S DENTAL SERVICES     AS DENTAL SERVICES
A.M.G Médicale Inc      AMG Mdicale Inc      AMG Mdicale      Incorporated
AAREN SCIENTIFIC        AAREN SCIENTIFIC
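I have tried many approaches, such as isin, mask, and contains, but I am not sure what to put where.
I got an error message saying "Series are mutable, they cannot be hashed". I am not sure why that error appears when I work with dataframes.
I no longer have that code. I am using Jupyter Notebook and SQL Server, and isin does not seem to work in Jupyter.
There is another split to do in the same way: the Original_Input column needs to be split into a parent_company name and an alias.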
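For reference, here is a minimal sketch of one common way this kind of error arises, together with a way around it. The failing code is not available, so this is an assumption about the cause rather than a diagnosis; the `lookup` dict and the `ext` column are hypothetical names used only for illustration.

```python
import pandas as pd

# Hypothetical data: one column of extensions and a dict that expands them.
df = pd.DataFrame({"ext": ["co", "inc", "xyz"]})
lookup = {"co": "Company", "inc": "Incorporated"}

# Using a whole Series as a dict key requires hashing it, and pandas
# refuses to hash a Series because it is mutable.
try:
    lookup[df["ext"]]
    raised = False
except TypeError as err:
    raised = True
    print("TypeError:", err)

# Element-wise lookup instead: Series.map applies the dict to each value
# and yields NaN where there is no matching key.
df["Company_Type"] = df["ext"].map(lookup)
print(df)
```

The exact wording of the TypeError differs between pandas versions, so the sketch prints it instead of assuming a particular message.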
Here is my code:
import pyodbc
import pandas as pd
import string
from string import digits
import sqlalchemy
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from sqlalchemy.types import String
from io import StringIO
from itertools import chain
import re
#Connecting SQL with Python
server = '172.16.15.9'
database = 'Database Demo'
username = '**'
password = '******'
engine = create_engine('mssql+pyodbc://**:******@'+server+'/'+database+'?driver=SQL+server')
#Reading SQL table and grouping by columns
data=pd.read_sql('select * from [dbo].[TempCompanyName]',engine)
#df1=pd.read_sql('Select * from company_Extension',engine)
#print(df1)
#gp = df.groupby(["CustomerName", "Quantity"]).size()
#print(gp)
#1.Removing control and non-ASCII characters (keeps only printable ASCII, codes 32 to 126)
data['Cleansed_Input'] = data['Original_Input'].apply(lambda x: ''.join(['' if ord(i) < 32 or ord(i) > 126 else i for i in x]))
#2.Removing punctuation
data['Cleansed_Input'] = data['Cleansed_Input'].apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))
#df['Cleansed_Input'] = df['Cleansed_Input'].apply(lambda x:''.join([i for i in x if i not in string.punctuation]))
#3.Removing numbers in a table.
data['Cleansed_Input'] = data['Cleansed_Input'].apply(lambda x: x.translate(str.maketrans('', '', string.digits)))
#df['Cleansed_Input'] = df['Cleansed_Input'].apply(lambda x:''.join([i for i in x if i not in string.digits]))
#4.Removing trailing and leading spaces
# (reads from data, because df is not defined until the next line)
data['Cleansed_Input'] = data['Cleansed_Input'].apply(lambda x: x.strip())
df=pd.DataFrame(data)
#data1=pd.DataFrame(df1)
df2 = pd.DataFrame({
"Name_Extension": ["llc",
"Pvt ltd",
"Corp",
"CO Ltd",
"inc",
"CO",
"SA"],
"Company_Type": ["Company LLC",
"Private Limited",
"Corporation",
"Company Limited",
"Incorporated",
"Company",
"Anonymous Company"],
"Priority": [2, 8, 4, 3, 5, 1, 9]
})
data.to_sql('TempCompanyName', con=engine, if_exists='replace',index= False)
Here is a possible solution you can implement:
df = pd.DataFrame({
"Original_Input": ["TECHNOLOGIES S.A",
"A & J INDUSTRIES, LLC",
"A&S DENTAL SERVICES",
"A.M.G Médicale Inc",
"AAREN SCIENTIFIC"],
"Cleansed_Input": ["TECHNOLOGIES SA",
"A J INDUSTRIES LLC",
"AS DENTAL SERVICES",
"AMG Mdicale Inc",
"AAREN SCIENTIFIC"]
})
df_2 = pd.DataFrame({
"Name_Extension": ["llc",
"Pvt ltd",
"Corp",
"CO Ltd",
"inc",
"CO",
"SA"],
"Company_Type": ["Company LLC",
"Private Limited",
"Corporation",
"Company Limited",
"Incorporated",
"Company",
"Anonymous Company"],
"Priority": [2, 8, 4, 3, 5, 1, 9]
})
# Preprocessing text
df["lower_input"] = df["Cleansed_Input"].str.lower()
df_2["lower_extension"] = df_2["Name_Extension"].str.lower()
# Getting the lowest priority matching the end of the string
extensions_list = [
    (priority, extension.lower_extension.values[0])
    for priority, extension in df_2.groupby("Priority")
]
df["extension_priority"] = df["lower_input"].apply(
    lambda p: next(
        (priority for priority, extension in extensions_list if p.endswith(extension)),
        None,
    )
)
# Merging both dataframes based on priority. This step can be ignored if you only need
# one column from the df_2. In that case, just give the column you require instead of
# `priority` in the previous step.
df = df.merge(df_2, "left", left_on="extension_priority", right_on="Priority")
# Removing the matched extensions from the `Cleansed_Input` string
df["aux"] = df["lower_extension"].apply(lambda p: -len(p) if isinstance(p, str) else 0)
df["Core_Input"] = df.apply(
    lambda p: p["Cleansed_Input"]
    if p["aux"] == 0
    else p["Cleansed_Input"][:p["aux"]].strip(),
    axis=1
)
# Selecting required columns
df[["Original_Input", "Core_Input", "Company_Type", "Name_Extension"]]
I assumed that the "Priority" column has unique values. If that is not the case, just sort by priority and create an index based on that order, like this:
df_2.sort_values("Priority").assign(index = range(df_2.shape[0]))
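A small sketch of that tie-breaking step, using a hypothetical `df_2` in which two extensions share priority 1. The stable sort is an assumption about the desired tie-break: it keeps tied rows in their original order, so the new `index` column is unique and can stand in for `Priority` in the matching step.

```python
import pandas as pd

# Hypothetical extension table where "CO" and "inc" share the same priority.
df_2 = pd.DataFrame({
    "Name_Extension": ["llc", "CO", "inc"],
    "Priority": [2, 1, 1],
})

# Sort by priority (stable, so ties keep their original order) and
# number the rows 0..n-1 to get a unique ranking column.
ranked = df_2.sort_values("Priority", kind="stable").assign(
    index=range(df_2.shape[0])
)
print(ranked)
```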
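Also, next time please provide the sample data in a format that anyone can load easily. The format you posted was cumbersome to work with.
Edit: not related to the question, but it might help. You can simplify steps 1 to 4 with the following: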
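# Parentheses replace the backslash continuations, which cannot be followed by comments.
data['Cleansed_Input'] = (
    data["Original_Input"]
    .str.replace(r"[^\w ]+", "", regex=True)  # removes everything except word characters and spaces (\w keeps digits, underscores, and accented letters)
    .str.replace(r" +", " ", regex=True)      # removes duplicated spaces
    .str.strip()                              # removes spaces before or after the string
)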
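If the goal is to reproduce steps 1 to 4 exactly (dropping non-ASCII characters, punctuation, and digits), a stricter character class is one option. The sketch below is a variant, not the answer's original pattern: it assumes only ASCII letters and spaces should survive, and it uses the sample names from the question.

```python
import pandas as pd

data = pd.DataFrame({
    "Original_Input": [
        "TECHNOLOGIES S.A",
        "A & J INDUSTRIES, LLC",
        "A&S DENTAL SERVICES",
        "A.M.G Médicale Inc",
        "AAREN SCIENTIFIC",
    ]
})

data["Cleansed_Input"] = (
    data["Original_Input"]
    .str.replace(r"[^A-Za-z ]+", "", regex=True)  # keep ASCII letters and spaces only
    .str.replace(r" +", " ", regex=True)          # collapse repeated spaces
    .str.strip()                                  # trim leading and trailing spaces
)
print(data["Cleansed_Input"].tolist())
```

With this class, "A.M.G Médicale Inc" becomes "AMG Mdicale Inc", which matches the Cleansed_Input values shown in the question.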
Edit 2: a SQL version of the solution (I am using PostgreSQL, but I used standard SQL operators, so the differences should not be large).
SELECT t.Original_Name,
t.Cleansed_Input,
t.Name_Extension,
t.Company_Type,
t.Priority
FROM (
SELECT df.Original_Name,
df.Cleansed_Input,
df_2.Name_Extension,
df_2.Company_Type,
df_2.Priority,
ROW_NUMBER() OVER (PARTITION BY df.Original_Name ORDER BY df_2.Priority) AS rn
FROM (VALUES ('TECHNOLOGIES S.A', 'TECHNOLOGIES SA'), ('A & J INDUSTRIES, LLC', 'A J INDUSTRIES LLC'),
('A&S DENTAL SERVICES', 'AS DENTAL SERVICES'), ('A.M.G Médicale Inc', 'AMG Mdicale Inc'),
('AAREN SCIENTIFIC', 'AAREN SCIENTIFIC')) df(Original_Name, Cleansed_Input)
LEFT JOIN (VALUES ('llc', 'Company LLC', '2'), ('Pvt ltd', 'Private Limited', '8'), ('Corp', 'Corporation', '4'),
('CO Ltd', 'Company Limited', '3'), ('inc', 'Incorporated', '5'), ('CO', 'Company', '1'),
('SA', 'Anonymous Company', '9')) df_2(Name_Extension, Company_Type, Priority)
ON lower(df.Cleansed_Input) like ( '%' || lower(df_2.Name_Extension) )
) t
WHERE rn = 1
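IIUC, we can use some basic regex:
First, we strip all leading and trailing whitespace and split on whitespace, which returns a list of lists that we can flatten with chain.from_iterable.
Then we use a regex with the pandas methods str.findall and str.replace to match your input.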
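import re
from itertools import chain
# Split every extension into single words and flatten them into one list
ext = df2['Name_Extension'].str.strip().str.split(r'\s+')
ext = list(chain.from_iterable(ext))
# First extension word found in each name (case-insensitive), NaN if none
df['Type_Input'] = df['Cleansed_Input'].str.findall('|'.join(ext), flags=re.IGNORECASE).str[0]
# Name with every extension word removed
s = df['Cleansed_Input'].str.replace('|'.join(ext), '', regex=True, case=False).str.strip()
# Fill Core_Input only for the rows where an extension was found
df.loc[df['Type_Input'].notna(), 'Core_Input'] = s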
print(df)
   Original_Input          Cleansed_Input       type_input   core_input
0  TECHNOLOGIES S.A        TECHNOLOGIES SA      NaN          NaN
1  A & J INDUSTRIES, LLC   A J INDUSTRIES LLC   LLC          A J INDUSTRIES
2  A&S DENTAL SERVICES     AS DENTAL SERVICES   NaN          NaN
3  A.M.G Médicale Inc      AMG Mdicale Inc      Inc          AMG Mdicale
4  AAREN SCIENTIFIC        AAREN SCIENTIFIC     NaN          NaN
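The pattern above matches an extension word anywhere in the name and returns the raw word, not the written-out company type. A possible variant is sketched below: it anchors the match at the end of the string, requires whitespace before the extension, and maps the result to Company_Type. It is a sketch under assumptions: `df2` is taken from the asker's code (which includes SA), and the `lookup`, `pattern`, and `parts` names are hypothetical.

```python
import re
import pandas as pd

df1 = pd.DataFrame({
    "Cleansed_Input": ["TECHNOLOGIES SA", "A J INDUSTRIES LLC",
                       "AS DENTAL SERVICES", "AMG Mdicale Inc",
                       "AAREN SCIENTIFIC"]
})
df2 = pd.DataFrame({
    "Name_Extension": ["llc", "Pvt ltd", "Corp", "CO Ltd", "inc", "CO", "SA"],
    "Company_Type": ["Company LLC", "Private Limited", "Corporation",
                     "Company Limited", "Incorporated", "Company",
                     "Anonymous Company"],
})

# Lower-cased extension -> written-out company type
lookup = dict(zip(df2["Name_Extension"].str.strip().str.lower(),
                  df2["Company_Type"]))

# The lazy "core" group stops at the earliest point where the rest of the
# string is whitespace plus a known extension, so the longest suffix wins
# (e.g. "CO Ltd" rather than just "Ltd" or "CO").
alternatives = "|".join(re.escape(key) for key in lookup)
pattern = rf"^(?P<core>.*?)\s+(?P<ext>{alternatives})$"

parts = df1["Cleansed_Input"].str.extract(pattern, flags=re.IGNORECASE)
# Rows without a match keep the full name as Core_Input and get NaN as type.
df1["Core_Input"] = parts["core"].fillna(df1["Cleansed_Input"])
df1["Type_Input"] = parts["ext"].str.lower().map(lookup)
print(df1)
```

Unlike the expected output in the question, this sketch fills Core_Input with the full name when no extension is found; drop the `fillna` call if those cells should stay empty.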
Assuming you have read the dataframes as df1 and df2, the first step is to create two lists, one for Name_Extension (keys) and one for Company_Type (values), as shown below:
keys = list(df2['Name_Extension'])
keys = [key.strip().lower() for key in keys]
print (keys)
>>> ['co llc', 'pvt ltd', 'corp', 'co ltd', 'inc', 'co']
values = list(df2['Company_Type'])
values = [value.strip().lower() for value in values]
print (values)
>>> ['company llc', 'private limited', 'corporation', 'company limited', 'incorporated', 'company']
The next step is to map each value in Cleansed_Input to Core_Input and Type_Input. We can use the pandas apply method on the Cleansed_Input column to get Core_Input:
def get_core_input(data):
    # preprocess
    data = str(data).strip().lower()
    # check if the data end with any of the keys
    for key in keys:
        if data.endswith(key):
            # drop the key from the end only; split(key)[0] would cut at an earlier occurrence
            return data[:-len(key)].strip()
    return None
df1['Core_Input'] = df1['Cleansed_Input'].apply(get_core_input)
print (df1)
>>>
   Original_Input          Cleansed_Input       Core_Input    Type_input
0  TECHNOLOGIES S.A        TECHNOLOGIES SA      None          NaN
1  A & J INDUSTRIES, LLC   A J INDUSTRIES LLC   None          NaN
2  A&S DENTAL SERVICES     AS DENTAL SERVICES   None          NaN
3  A.M.G Médicale Inc      AMG Mdicale Inc      amg mdicale   NaN
4  AAREN SCIENTIFIC        AAREN SCIENTIFIC     None          NaN
To get Type_input:
def get_type_input(data):
    # preprocess
    data = str(data).strip().lower()
    # check if the data end with any of the keys
    for idx in range(len(keys)):
        if data.endswith(keys[idx]):
            return values[idx].strip()  # return the value of the corresponding matched key
    return None
df1['Type_input'] = df1['Cleansed_Input'].apply(get_type_input)
print (df1)
>>>
   Original_Input          Cleansed_Input       Core_Input    Type_input
0  TECHNOLOGIES S.A        TECHNOLOGIES SA      None          None
1  A & J INDUSTRIES, LLC   A J INDUSTRIES LLC   None          None
2  A&S DENTAL SERVICES     AS DENTAL SERVICES   None          None
3  A.M.G Médicale Inc      AMG Mdicale Inc      amg mdicale   incorporated
4  AAREN SCIENTIFIC        AAREN SCIENTIFIC     None          None
This is a very easy-to-follow solution, though I am sure it is not the most efficient way to solve the problem... Hopefully it works for your use case.
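The two helper functions can also be folded into one pass. The sketch below is an illustration under stated assumptions, not part of the answer: the `split_company_name` name is hypothetical, the original casing is kept, and a key only counts as a match when it is the whole string or is preceded by a space, so a name such as "DISCO" is not treated as ending in "co".

```python
import pandas as pd

# Keys and values as in the answer above, with the values kept in their original case.
keys = ["co llc", "pvt ltd", "corp", "co ltd", "inc", "co"]
values = ["Company LLC", "Private Limited", "Corporation",
          "Company Limited", "Incorporated", "Company"]

def split_company_name(name):
    """Return Core_Input and Type_input for one cleansed name."""
    text = str(name).strip()
    lowered = text.lower()
    for key, value in zip(keys, values):
        # Match only a whole trailing word (or words), not a partial word.
        if lowered == key or lowered.endswith(" " + key):
            core = text[:len(text) - len(key)].strip()
            return pd.Series({"Core_Input": core, "Type_input": value})
    return pd.Series({"Core_Input": text, "Type_input": None})

df1 = pd.DataFrame({
    "Cleansed_Input": ["AMG Mdicale Inc", "DISCO", "ACME CO LLC",
                       "AAREN SCIENTIFIC"]
})
df1[["Core_Input", "Type_input"]] = df1["Cleansed_Input"].apply(split_company_name)
print(df1)
```

Because "co llc" is listed before "co", the longer extension is tried first; if the keys were in a different order, sorting them by length (longest first) would preserve that behaviour.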