简体   繁体   中英

Based on Partial string Match fill one data frame column from another dataframe

i'm new to python programming. I have two dataframes df1 containing tags (180k rows) and df2 containing equipment names (1600 rows)

df1:

 Line TagName CLASS 187877 PT_WOA.ZS01_LA120_T05.SB.S2384_LesSwL 10 187878 PT_WOA.ZS01_RB2202_T05.SB.S2385_FLOK 10 187879 PT_WOA.ZS01_LA120_T05.SB._CBAbsHy 10 187880 PT_WOA.ZS01_LA120_T05.SB.S3110_CBAPV 10 187881 PT_WOA.ZS01_LARB2204.SB.S3111_CBRelHy 10

df2:

EquipmentNo EquipmentDescription    Equipment
1311256        Lifting table         LA120
1311257        Roller bed            RB2200
1311258        Lifting table         LT2202
1311259        Roller bed            RB2202
1311260        Roller bed            RB2204

The df2.Equipment is somewhere in between the strings in df1.TagName. i need to match based on if df2 Equipment is in df1 Tagname then df2 (Equipment description and Equipment no ) has to be matched with df1.

Final output should be

        Line                TagName                quipmentdescription   EquipmentNo 
187877  PT_WOA  .ZS01_LA120_T05.SB.S2384_LesSwL     Lifting table        1311256
187878  PT_WOA  .ZS01_RB2202_T05.SB.S2385_FLOK      Roller bed           1311259  
187879  PT_WOA  .ZS01_LA120_T05.SB._CBAbsHy         Lifting table        1311256 
187880  PT_WOA  .ZS01_LA120_T05.SB.S3110_CBAPV      Lifting table        1311256
 187881 PT_WOA  .ZS01_LARB2204.SB.S3111_CBRelHy     Roller bed           1311260

I have tried this now

cols= df2['Equipment'].tolist()
Xs=[]
for i in cols:
    Test = df1.loc[df1.TagName.str.contains(i)] 
    Test['Equip']=i
    Xs.append(Test)

then merge xs and df2 based on "Equipment"

But i am getting this error

first argument must be string or compiled pattern

I would do something like this:

  1. Create a new column indexes where for every Equipment in df2 find a list of Indexes in df1 where df1.TagName contains the Equipment .

  2. Flatten the indexes by creating one row for each item using stack() and reset_index()

  3. Join the flatten df2 with df1 to get all information you want
from io import StringIO
import numpy as np
import pandas as pd
df1=StringIO("""Line;TagName;CLASS
187877;PT_WOA;.ZS01_LA120_T05.SB.S2384_LesSwL;10
187878;PT_WOA;.ZS01_RB2202_T05.SB.S2385_FLOK;10
187879;PT_WOA;.ZS01_LA120_T05.SB._CBAbsHy;10
187880;PT_WOA;.ZS01_LA120_T05.SB.S3110_CBAPV;10
187881;PT_WOA;.ZS01_LARB2204.SB.S3111_CBRelHy;10""")
df2=StringIO("""EquipmentNo;EquipmentDescription;Equipment
1311256;Lifting table;LA120
1311257;Roller bed;RB2200
1311258;Lifting table;LT2202
1311259;Roller bed;RB2202
1311260;Roller bed;RB2204""")
df1=pd.read_csv(df1,sep=";")
df2=pd.read_csv(df2,sep=";")

df2['indexes'] = df2['Equipment'].apply(lambda x: df1.index[df1.TagName.str.contains(str(x)).tolist()].tolist())
indexes = df2.apply(lambda x: pd.Series(x['indexes']),axis=1).stack().reset_index(level=1, drop=True)
indexes.name = 'indexes'
df2 = df2.drop('indexes', axis=1).join(indexes).dropna()
df2.index = df2['indexes']
matches = df2.join(df1, how='inner')
print(matches[['Line','TagName','EquipmentDescription','EquipmentNo']])

OUTPUT:

          Line                          TagName EquipmentDescription  EquipmentNo
187877  PT_WOA  .ZS01_LA120_T05.SB.S2384_LesSwL        Lifting table      1311256
187879  PT_WOA      .ZS01_LA120_T05.SB._CBAbsHy        Lifting table      1311256
187880  PT_WOA   .ZS01_LA120_T05.SB.S3110_CBAPV        Lifting table      1311256
187878  PT_WOA   .ZS01_RB2202_T05.SB.S2385_FLOK           Roller bed      1311259
187881  PT_WOA  .ZS01_LARB2204.SB.S3111_CBRelHy           Roller bed      1311260

Initializing the provided dataframes:

import numpy as np
import pandas as pd

df1 = pd.DataFrame([['PT_WOA', '.ZS01_LA120_T05.SB.S2384_LesSwL', 10],
                    ['PT_WOA', '.ZS01_RB2202_T05.SB.S2385_FLOK', 10],
                    ['PT_WOA', '.ZS01_LA120_T05.SB._CBAbsHy', 10],
                    ['PT_WOA', '.ZS01_LA120_T05.SB.S3110_CBAPV', 10],
                    ['PT_WOA', '.ZS01_LARB2204.SB.S3111_CBRelHy', 10]],
                   columns = ['Line', 'TagName', 'CLASS'],
                   index = [187877, 187878, 187879, 187880, 187881])

df2 = pd.DataFrame([[1311256, 'Lifting table', 'LA120'],
                    [1311257, 'Roller bed', 'RB2200'],
                    [1311258, 'Lifting table', 'LT2202'],
                    [1311259, 'Roller bed', 'RB2202'],
                    [1311260, 'Roller bed', 'RB2204']],
                  columns = ['EquipmentNo', 'EquipmentDescription', 'Equipment'])

I would suggest the following:

# create a copy of df1, dropping the 'CLASS' column
df3 = df1.drop(columns=['CLASS'])

# add the columns 'EquipmentDescription' and 'Equipment' filled with numpy NaN's
df3['EquipmentDescription'] = np.nan
df3['EquipmentNo'] = np.nan

# for each row in df3, iterate over each row in df2
for index_df3, row_df3 in df3.iterrows():
    for index_df2, row_df2 in df2.iterrows():

        # check if 'Equipment' is in 'TagName'
        if df2.loc[index_df2, 'Equipment'] in df3.loc[index_df3, 'TagName']:

            # set 'EquipmentDescription' and 'EquipmentNo'
            df3.loc[index_df3, 'EquipmentDescription'] = df2.loc[index_df2, 'EquipmentDescription']
            df3.loc[index_df3, 'EquipmentNo'] = df2.loc[index_df2, 'EquipmentNo']


# conver the 'EquipmentNo' to type int
df3['EquipmentNo'] = df3['EquipmentNo'].astype(int)

This yields the following dataframe:

        Line    TagName                         EquipmentDescription EquipmentNo
187877  PT_WOA  .ZS01_LA120_T05.SB.S2384_LesSwL Lifting table        1311256
187878  PT_WOA  .ZS01_RB2202_T05.SB.S2385_FLOK  Roller bed           1311259
187879  PT_WOA  .ZS01_LA120_T05.SB._CBAbsHy     Lifting table        1311256
187880  PT_WOA  .ZS01_LA120_T05.SB.S3110_CBAPV  Lifting table        1311256
187881  PT_WOA  .ZS01_LARB2204.SB.S3111_CBRelHy Roller bed           1311260

Let me know if this helps.

  • Given df1 and df2 as follows:

df1

|    | Line   | TagName                         |   CLASS |
|---:|:-------|:--------------------------------|--------:|
|  0 | PT_WOA | .ZS01_LA120_T05.SB.S2384_LesSwL |      10 |
|  1 | PT_WOA | .ZS01_RB2202_T05.SB.S2385_FLOK  |      10 |
|  2 | PT_WOA | .ZS01_LA120_T05.SB._CBAbsHy     |      10 |
|  3 | PT_WOA | .ZS01_LA120_T05.SB.S3110_CBAPV  |      10 |
|  4 | PT_WOA | .ZS01_LARB2204.SB.S3111_CBRelHy |      10 |

df2

|    |   EquipmentNo | EquipmentDescription   | Equipment   |
|---:|--------------:|:-----------------------|:------------|
|  0 |       1311256 | Lifting table          | LA120       |
|  1 |       1311257 | Roller bed             | RB2200      |
|  2 |       1311258 | Lifting table          | LT2202      |
|  3 |       1311259 | Roller bed             | RB2202      |
|  4 |       1311260 | Roller bed             | RB2204      |
  1. Find the unique equipment from Equipment in df2
equipment = df2.Equipment.unique().tolist()
  1. Create an Equipment column in df1 by finding a match in equipment
df1['Equipment'] = df1['TagName'].apply(lambda x: ''.join([part for part in equipment if part in x]))
  1. Merge on Equipment into final form
    • If you don't want the Equipment column in df_final , add .drop(columns=['Equipment']) to the end of the next line of code.
df_final = df1[['Line', 'TagName', 'Equipment']].merge(df2, on='Equipment')

df_final

|    | Line   | TagName                         | Equipment   |   EquipmentNo | EquipmentDescription   |
|---:|:-------|:--------------------------------|:------------|--------------:|:-----------------------|
|  0 | PT_WOA | .ZS01_LA120_T05.SB.S2384_LesSwL | LA120       |       1311256 | Lifting table          |
|  1 | PT_WOA | .ZS01_LA120_T05.SB._CBAbsHy     | LA120       |       1311256 | Lifting table          |
|  2 | PT_WOA | .ZS01_LA120_T05.SB.S3110_CBAPV  | LA120       |       1311256 | Lifting table          |
|  3 | PT_WOA | .ZS01_RB2202_T05.SB.S2385_FLOK  | RB2202      |       1311259 | Roller bed             |
|  4 | PT_WOA | .ZS01_LARB2204.SB.S3111_CBRelHy | RB2204      |       1311260 | Roller bed             |

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM