Split and replace in one dataframe based on a condition with another dataframe in pandas

I have two dataframes, both loaded from SQL tables.

This is my first dataframe:

Original_Input           Cleansed_Input        Core_Input    Type_input
TECHNOLOGIES S.A         TECHNOLOGIES SA        
A & J INDUSTRIES, LLC    A J INDUSTRIES LLC     
A&S DENTAL SERVICES      AS DENTAL SERVICES     
A.M.G Médicale Inc       AMG Mdicale Inc        
AAREN SCIENTIFIC         AAREN SCIENTIFIC   

My second dataframe is:

Name_Extension     Company_Type     Priority
co llc             Company LLC       2
Pvt ltd            Private Limited   8
Corp               Corporation       4
CO Ltd             Company Limited   3
inc                Incorporated      5
CO                 Company           1

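I removed punctuation, non-ASCII characters, and digits, and stored the result in the Cleansed_Input column of df1.

Cleansed_Input in df1 needs to be checked against Name_Extension in df2. If a value in Cleansed_Input ends with any value from Name_Extension, that ending should be split off and placed in the Type_input column of df1, not as-is but expanded from its abbreviation.

For example, if CO is present in Cleansed_Input, it should be expanded to Company and placed in the Type_input column, and the remaining text should go in the Core_Input column of df1. There is also a Priority column; I am not sure whether it is needed.

Expected output: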

Original_Input          Cleansed_Input        Core_Input       Type_input
TECHNOLOGIES S.A        TECHNOLOGIES SA       TECHNOLOGIES      SA
A & J INDUSTRIES, LLC   A J INDUSTRIES LLC    A J INDUSTRIES    LLC
A&S DENTAL SERVICES     AS DENTAL SERVICES      
A.M.G Médicale Inc      AMG Mdicale Inc       AMG Mdicale       Incorporated
AAREN SCIENTIFIC        AAREN SCIENTIFIC        

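I tried many approaches, such as isin, mask, and contains, but I am not sure what to put where.

I got an error saying "Series are mutable, they cannot be hashed" when I tried something with the dataframe, and I am not sure why.

I no longer have that code. I am using Jupyter Notebook and SQL Server, and isin does not seem to work in Jupyter.

There is another split to do in the same way: the Original_Input column needs to be split into a parent company name and an alias name.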

Here is my code:

import pyodbc
import pandas as pd
import string
from string import digits
import sqlalchemy
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from sqlalchemy.types import String
from io import StringIO
from itertools import chain
import re

#Connecting SQL with Python

server = '172.16.15.9'
database = 'Database Demo'
username = '**'
password = '******'


engine = create_engine('mssql+pyodbc://**:******@'+server+'/'+database+'?driver=SQL+server')

#Reading SQL table and grouping by columns
data=pd.read_sql('select * from [dbo].[TempCompanyName]',engine)
#df1=pd.read_sql('Select * from company_Extension',engine)
#print(df1)
#gp = df.groupby(["CustomerName", "Quantity"]).size() 
#print(gp)

#1.Removing non-ASCII characters
data['Cleansed_Input'] = data['Original_Input'].apply(lambda x:''.join(['' if ord(i) < 32 or ord(i) > 126 else i for i in x]))

#2.Removing punctuations
data['Cleansed_Input']= data['Cleansed_Input'].apply(lambda x:''.join([x.translate(str.maketrans('', '', string.punctuation))]))
#df['Cleansed_Input'] = df['Cleansed_Input'].apply(lambda x:''.join([i for i in x if i not in string.punctuation]))

#3.Removing numbers in a table.
data['Cleansed_Input']= data['Cleansed_Input'].apply(lambda x:x.translate(str.maketrans('', '', string.digits)))
#df['Cleansed_Input'] = df['Cleansed_Input'].apply(lambda x:''.join([i for i in x if i not in string.digits]))

#4.Removing trailing and leading spaces 
data['Cleansed_Input']=data['Cleansed_Input'].apply(lambda x: x.strip())

df=pd.DataFrame(data)
#data1=pd.DataFrame(df1)


df2 = pd.DataFrame({ 
"Name_Extension": ["llc",
                   "Pvt ltd",
                   "Corp",
                   "CO Ltd",
                   "inc", 
                   "CO",
                   "SA"],
"Company_Type": ["Company LLC",
                 "Private Limited",
                 "Corporation",
                 "Company Limited",
                 "Incorporated",
                 "Company",
                 "Anonymous Company"],
"Priority": [2, 8, 4, 3, 5, 1, 9]
})

data.to_sql('TempCompanyName', con=engine, if_exists='replace',index= False)

Here is a possible solution you could implement:

df = pd.DataFrame({
    "Original_Input": ["TECHNOLOGIES S.A", 
                       "A & J INDUSTRIES, LLC", 
                       "A&S DENTAL SERVICES", 
                       "A.M.G Médicale Inc", 
                       "AAREN SCIENTIFIC"],
    "Cleansed_Input": ["TECHNOLOGIES SA", 
                       "A J INDUSTRIES LLC", 
                       "AS DENTAL SERVICES", 
                       "AMG Mdicale Inc", 
                       "AAREN SCIENTIFIC"]
})

df_2 = pd.DataFrame({ 
    "Name_Extension": ["llc",
                       "Pvt ltd",
                       "Corp",
                       "CO Ltd",
                       "inc", 
                       "CO",
                       "SA"],
    "Company_Type": ["Company LLC",
                     "Private Limited",
                     "Corporation",
                     "Company Limited",
                     "Incorporated",
                     "Company",
                     "Anonymous Company"],
    "Priority": [2, 8, 4, 3, 5, 1, 9]
})

# Preprocessing text
df["lower_input"] = df["Cleansed_Input"].str.lower()
df_2["lower_extension"] = df_2["Name_Extension"].str.lower()

# Getting the lowest priority matching the end of the string
extensions_list = [ (priority, extension.lower_extension.values[0]) 
                    for priority, extension in df_2.groupby("Priority") ]
df["extension_priority"] = df["lower_input"] \
    .apply(lambda p: next(( priority 
                            for priority, extension in extensions_list 
                            if p.endswith(extension)), None))

# Merging both dataframes based on priority. This step can be ignored if you only need
# one column from the df_2. In that case, just give the column you require instead of 
# `priority` in the previous step.
df = df.merge(df_2, "left", left_on="extension_priority", right_on="Priority")

# Removing the matched extensions from the `Cleansed_Input` string
df["aux"] = df["lower_extension"].apply(lambda p: -len(p) if isinstance(p, str) else 0)
df["Core_Input"] = df.apply(
    lambda p: p["Cleansed_Input"] 
              if p["aux"] == 0 
              else p["Cleansed_Input"][:p["aux"]].strip(), 
    axis=1
)

# Selecting required columns
df[[ "Original_Input", "Core_Input", "Company_Type", "Name_Extension" ]]

I am assuming the Priority column has unique values. If that is not the case, just sort by priority and create an index based on that order, like this:

df_2.sort_values("Priority").assign(index = range(df_2.shape[0]))
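
A minimal sketch of how such an index could stand in for Priority when priorities repeat. The duplicated priority values and the column name `match_order` are made up for illustration (the snippet above calls the column `index`); the lookup reuses the same `endswith` idea as the main solution.

```python
import pandas as pd

# Hypothetical lookup table in which two extensions share priority 1.
df_2 = pd.DataFrame({
    "Name_Extension": ["inc", "CO", "llc"],
    "Company_Type": ["Incorporated", "Company", "Company LLC"],
    "Priority": [1, 1, 2],
})

# Sort by priority (stable, so ties keep their original order),
# then number the rows so every extension gets a unique position.
ranked = (
    df_2.sort_values("Priority", kind="stable")
        .assign(match_order=range(df_2.shape[0]))
)
ranked["lower_extension"] = ranked["Name_Extension"].str.lower()

# (position, extension) pairs in matching order.
extensions_list = list(zip(ranked["match_order"], ranked["lower_extension"]))

def first_matching_order(name):
    """Return the position of the first extension that ends `name`, or None."""
    lowered = name.lower()
    return next((pos for pos, ext in extensions_list if lowered.endswith(ext)), None)

matched = first_matching_order("AMG Mdicale Inc")
unmatched = first_matching_order("AAREN SCIENTIFIC")
```

The resulting `match_order` column can then be used as the merge key in place of `Priority`.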

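Also, next time please provide the data sample in a format that anyone can load easily. The format you posted is cumbersome to work with.

Edit: unrelated to the question, but it may help. You can simplify steps 1 to 4 with the following: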

data['Cleansed_Input'] = (
    data["Original_Input"]
    .str.replace(r"[^\w ]+", "", regex=True)  # removes everything except word characters and spaces
    .str.replace(r" +", " ", regex=True)      # collapses duplicated spaces
    .str.strip()                              # removes spaces before or after the string
)
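
Note that `\w` also matches digits, underscores, and accented letters, so the chain above is not an exact replacement for steps 1 to 3 in the question. A sketch of a stricter variant, assuming only ASCII letters and spaces should survive (the character class `[^A-Za-z ]` is an assumption made here, not part of the original snippet):

```python
import pandas as pd

data = pd.DataFrame({
    "Original_Input": [
        "TECHNOLOGIES S.A",
        "A & J INDUSTRIES, LLC",
        "A&S DENTAL SERVICES",
        "A.M.G Médicale Inc",
        "AAREN SCIENTIFIC",
    ]
})

data["Cleansed_Input"] = (
    data["Original_Input"]
    .str.replace(r"[^A-Za-z ]+", "", regex=True)  # keep ASCII letters and spaces only
    .str.replace(r" +", " ", regex=True)          # collapse runs of spaces
    .str.strip()                                  # trim both ends
)
cleansed = data["Cleansed_Input"].tolist()
```

On the sample rows this yields the same Cleansed_Input values as the table in the question.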

Edit 2: SQL version of the solution (I am using PostgreSQL, but I stick to standard SQL operators, so the differences should not be large).

SELECT t.Original_Name,
       t.Cleansed_Input,
       t.Name_Extension,
       t.Company_Type,
       t.Priority
FROM (
    SELECT df.Original_Name,
           df.Cleansed_Input,
           df_2.Name_Extension,
           df_2.Company_Type,
           df_2.Priority,
           ROW_NUMBER() OVER (PARTITION BY df.Original_Name ORDER BY df_2.Priority) AS rn
    FROM (VALUES ('TECHNOLOGIES S.A', 'TECHNOLOGIES SA'), ('A & J INDUSTRIES, LLC', 'A J INDUSTRIES LLC'),
                 ('A&S DENTAL SERVICES', 'AS DENTAL SERVICES'), ('A.M.G Médicale Inc', 'AMG Mdicale Inc'),
                 ('AAREN SCIENTIFIC', 'AAREN SCIENTIFIC')) df(Original_Name, Cleansed_Input)
         LEFT JOIN (VALUES ('llc', 'Company LLC', '2'), ('Pvt ltd', 'Private Limited', '8'), ('Corp', 'Corporation', '4'),
                           ('CO Ltd', 'Company Limited', '3'), ('inc', 'Incorporated', '5'), ('CO', 'Company', '1'),
                           ('SA', 'Anonymous Company', '9')) df_2(Name_Extension, Company_Type, Priority)
            ON  lower(df.Cleansed_Input) like ( '%' || lower(df_2.Name_Extension) )
) t
WHERE rn = 1

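IIUC, we can use some basic regex:

First, we strip all trailing and leading whitespace and split on whitespace. This returns a list of lists, which we can flatten with chain.from_iterable.

Then we use some regex with the pandas methods str.findall and str.replace to match your inputs.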

import re
from itertools import chain

ext = df2['Name_Extension'].str.strip().str.split(r'\s+')

ext = list(chain.from_iterable(ext))

df['Type_Input'] = df['Cleansed_Input'].str.findall('|'.join(ext),flags=re.IGNORECASE).str[0]

s = df['Cleansed_Input'].str.replace('|'.join(ext),'',regex=True,case=False).str.strip()

df.loc[df['Type_Input'].notna(),'Core_Input'] = s

print(df)

          Original_Input      Cleansed_Input type_input      core_input
0       TECHNOLOGIES S.A     TECHNOLOGIES SA        NaN             NaN
1  A & J INDUSTRIES, LLC  A J INDUSTRIES LLC        LLC  A J INDUSTRIES
2    A&S DENTAL SERVICES  AS DENTAL SERVICES        NaN             NaN
3     A.M.G Médicale Inc     AMG Mdicale Inc        Inc     AMG Mdicale
4       AAREN SCIENTIFIC    AAREN SCIENTIFIC        NaN             NaN
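
The pattern above matches an extension anywhere in the string and treats each word of a multi-word extension separately. A hedged sketch of a stricter variant, assuming the extension must appear as whole words at the very end of the name. The sample frames and the variable name `suffix_pattern` are illustrative only.

```python
import re
import pandas as pd

df = pd.DataFrame({"Cleansed_Input": [
    "TECHNOLOGIES SA", "A J INDUSTRIES LLC", "AS DENTAL SERVICES",
    "AMG Mdicale Inc", "AAREN SCIENTIFIC",
]})
df2 = pd.DataFrame({"Name_Extension": [
    "llc", "Pvt ltd", "Corp", "CO Ltd", "inc", "CO", "SA",
]})

# Longest extensions first, so "CO Ltd" is tried before "CO".
exts = sorted(df2["Name_Extension"].str.strip(), key=len, reverse=True)

# Group 1 = core name, group 2 = extension anchored at the end of the string.
suffix_pattern = (
    r"(?i)^(.*?)\s+(" + "|".join(re.escape(e) for e in exts) + r")$"
)

# str.extract returns one column per capture group; non-matching rows are NaN.
parts = df["Cleansed_Input"].str.extract(suffix_pattern)
df["Core_Input"] = parts[0]
df["Type_Input"] = parts[1]
```

Rows without a trailing extension keep NaN in both new columns, and the matched extension is returned exactly as it was written in the input.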

Assuming you have read the dataframes as df1 and df2, the first step is to create 2 lists, one for Name_Extension (the keys) and one for Company_Type (the values), as shown below:

keys = list(df2['Name_Extension'])
keys = [key.strip().lower() for key in keys]
print (keys)
>>> ['co llc', 'pvt ltd', 'corp', 'co ltd', 'inc', 'co']
values = list(df2['Company_Type']) 
values = [value.strip().lower() for value in values]
print (values)
>>> ['company llc', 'private limited', 'corporation', 'company limited', 'incorporated', 'company']

The next step is to map each value in Cleansed_Input to Core_Input and Type_input. We can use the pandas apply method on the Cleansed_Input column to get Core_Input:

def get_core_input(data):
    # preprocess
    data = str(data).strip().lower()
    # check if the data end with any of the keys
    for key in keys:
        if data.endswith(key):
            return data[:-len(key)].strip() # drop the matched key from the end and return the rest
    return None

df1['Core_Input'] = df1['Cleansed_Input'].apply(get_core_input)
print (df1)
>>>
 Original_Input      Cleansed_Input   Core_Input  Type_input
0       TECHNOLOGIES S.A     TECHNOLOGIES SA         None         NaN
1  A & J INDUSTRIES, LLC  A J INDUSTRIES LLC         None         NaN
2    A&S DENTAL SERVICES  AS DENTAL SERVICES         None         NaN
3     A.M.G Médicale Inc     AMG Mdicale Inc  amg mdicale         NaN
4       AAREN SCIENTIFIC   AAREN SCIENTIFIC          None         NaN

To get Type_input:

def get_type_input(data):
    # preprocess
    data = str(data).strip().lower()
    # check if the data end with any of the keys
    for idx in range(len(keys)):
        if data.endswith(keys[idx]):
            return values[idx].strip() # return the value of the corresponding matched key
    return None

df1['Type_input'] = df1['Cleansed_Input'].apply(get_type_input)
print (df1)
>>>
Original_Input      Cleansed_Input   Core_Input    Type_input
0       TECHNOLOGIES S.A     TECHNOLOGIES SA         None          None
1  A & J INDUSTRIES, LLC  A J INDUSTRIES LLC         None          None
2    A&S DENTAL SERVICES  AS DENTAL SERVICES         None          None
3     A.M.G Médicale Inc     AMG Mdicale Inc  amg mdicale  incorporated
4       AAREN SCIENTIFIC   AAREN SCIENTIFIC          None          None

This is a very easy-to-follow solution, but I am sure it is not the most efficient way to solve the problem. Hopefully it covers your use case.
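
One way to avoid scanning the keys twice per row is to return both pieces from a single function. A minimal sketch, assuming the keys and values are combined into a dictionary; the helper name `split_name`, the longest-key-first ordering, and the requirement of a space before the extension are assumptions for illustration. Unlike the functions above, it keeps the original letter case.

```python
import pandas as pd

df1 = pd.DataFrame({"Cleansed_Input": [
    "TECHNOLOGIES SA", "A J INDUSTRIES LLC", "AS DENTAL SERVICES",
    "AMG Mdicale Inc", "AAREN SCIENTIFIC",
]})
df2 = pd.DataFrame({
    "Name_Extension": ["co llc", "Pvt ltd", "Corp", "CO Ltd", "inc", "CO"],
    "Company_Type": ["Company LLC", "Private Limited", "Corporation",
                     "Company Limited", "Incorporated", "Company"],
})

# Lower-cased extension -> company type.
lookup = dict(zip(df2["Name_Extension"].str.strip().str.lower(),
                  df2["Company_Type"].str.strip()))
# Check longer extensions before shorter ones.
ordered_keys = sorted(lookup, key=len, reverse=True)

def split_name(name):
    """Return (core, company_type); (None, None) when no extension matches."""
    text = str(name).strip()
    lowered = text.lower()
    for key in ordered_keys:
        # Require a space before the extension so "CO" does not match "DISCO".
        if lowered.endswith(" " + key):
            return text[:-len(key)].strip(), lookup[key]
    return None, None

# One pass over the column, then unzip the pairs into two columns.
cores, types = zip(*(split_name(n) for n in df1["Cleansed_Input"]))
df1["Core_Input"] = list(cores)
df1["Type_input"] = list(types)
```

With the df2 shown in the question (which has `co llc` but no plain `llc` or `SA`), only the `Inc` row is split, which matches the output printed above.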
