
How to add a column to a pandas dataframe that labels rows as 1 or 0 based on matching values in a column of another dataframe

I'm working on labeling some Medicare datasets as fraudulent or non-fraudulent for a machine learning algorithm, using pandas dataframes. The labeling involves matching the NPI numbers in the DMEPOS dataset to the NPI numbers in the LEIE dataset. Each dataset includes a column named "NPI". I need to find out whether each row in the DMEPOS dataframe has a matching NPI in the LEIE dataset. Next, I need to add a column to the DMEPOS dataset (maybe named "Fraudulent") that denotes whether or not that row is fraudulent, using 1 for fraudulent and 0 for not fraudulent. Here is the code that I have written (it isn't much, but it should give the general direction I'm taking with pandas):

import pandas as pd
import numpy as np

#Read files into df
dmepos = pd.read_csv('dmpoes.csv')
leie = pd.read_csv('leie.csv')

Here are links for downloading the datasets. The NPI columns are labeled differently in each dataset, so I went in and changed them so that the column names match; I suggest doing that too. I also renamed the files to make them simpler to code with.
DMEPOS: https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/DME2018
LEIE: https://oig.hhs.gov/exclusions/exclusions_list.asp

You can use merge. In my opinion it is actually cleaner if you don't rename the columns, because otherwise you will have to deal with suffixes after the merge. Once you have merged, you can use np.where to fill the Fraudulent column based on the presence of NaN values where the two merge columns didn't have a match. I'm not totally sure that is the logic you wanted for the Fraudulent column; if not, post a comment and I will update as needed.

import pandas as pd
import numpy as np

#Read files into df
dmepos = pd.read_csv('dmpoes.csv')
leie = pd.read_csv('leie.csv')

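#Left merge keeps every DMEPOS row; LEIE columns are NaN where no NPI matched
df_m = dmepos.merge(leie, left_on='REFERRING_NPI', right_on='NPI', how='left')

#1 = matching NPI found in LEIE (fraudulent), 0 = no match
df_m['Fraudulent'] = np.where(df_m['NPI'].notnull(), 1, 0)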

Rows that didn't have a match in the join columns can be recognized because they contain NaN values in the columns that came from the LEIE dataframe.
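One thing to watch with the merge approach: a left merge repeats a DMEPOS row once for every LEIE row with the same NPI, so duplicate NPIs in LEIE would change the row count. If all you need is the 0/1 label, a membership test with Series.isin may be simpler. The following is only a sketch on made-up toy data: the NPI values and the SUPPLIER_CLAIMS column are hypothetical, and treating 0 as a placeholder for a missing NPI is an assumption you should check against the real LEIE file.

```python
import pandas as pd

# Toy stand-ins for the two CSV files; all values are invented for illustration.
dmepos = pd.DataFrame({
    "REFERRING_NPI": [1001, 1002, 1003, 1004],
    "SUPPLIER_CLAIMS": [12, 7, 30, 5],  # hypothetical extra column
})
leie = pd.DataFrame({
    "NPI": [1002, 1004, 1004, 0],  # includes a duplicate and a placeholder 0
})

# Assumption: 0 marks "no NPI on record", so it should never count as a match.
excluded_npis = leie.loc[leie["NPI"] != 0, "NPI"].unique()

# isin returns True/False per row; astype(int) turns that into 1/0.
dmepos["Fraudulent"] = dmepos["REFERRING_NPI"].isin(excluded_npis).astype(int)

print(dmepos)
```

Because nothing is merged, the dataframe keeps exactly one row per original DMEPOS row and no LEIE columns are added. If you also want the LEIE details next to each row, the merge above is still the way to go.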

