so I have 2 dataframes df1 and df2.
df1
names = ['Bob', 'Joe', '', 'Bob', '0000', 'Alice', 'Joe', 'Alice', 'Bob', '']
df1 = pd.DataFrame({'names': names,'ages': ages})
df2
names_in_student_db = [' Bob', ' Joe ', '', ' Bob ', 'Chris', 'Alice', 'Joe ', 'Alice ', ' Bob ', 'Daniel']
df2 = pd.DataFrame({'student_names': names_in_student_db,'grades': grades})
Now, I want to merge these 2 dataframes but obviously, there are 2 problems:
So how do I write a general code (not tailored for these specific dataframes) to solve this? I'm looking for outer join btw.
My guess is I could create another column in each dataframe, let's call it 'order' column which basically would be basically integers from 0 to 9. And then if I could merge dataframes based on 2 columns (I mean matching 'order1' column with 'order2' and 'names' with 'student_names'). Is that possible? I think that still throws a duplicate-related error though.
You could clean up the student names, and assign a sequence number to all repeated names (on both DFs), assuming the order is the same. In reality, you would rather use a last name and optionally some other identifier, just to make sure you are joining the right people with their grades:-)
z = df2.assign(names=df2['student_names'].str.strip())
out = (
df1
.assign(seq=df1.groupby('names').cumcount())
.merge(
z
.assign(seq=z.groupby('names').cumcount()),
on=['names', 'seq'],
how='left',
)
)
>>> out
names ages seq student_names grades
0 Bob 33 0 Bob A
1 Joe 45 0 Joe F
2 21 0 B
3 Bob 38 1 Bob F
4 0000 44 0 NaN NaN
5 Alice 10 0 Alice C
6 Joe 10 1 Joe C
7 Alice 46 1 Alice A
8 Bob 15 2 Bob B
9 48 1 NaN NaN
The setup in the question was incomplete (missing ages
and grades
), so I improvised:
names = ['Bob', 'Joe', '', 'Bob', '0000', 'Alice', 'Joe', 'Alice', 'Bob', '']
ages = np.random.randint(10, 50, len(names))
df1 = pd.DataFrame({'names': names,'ages': ages})
names_in_student_db = [' Bob', ' Joe ', '', ' Bob ', 'Chris', 'Alice', 'Joe ', 'Alice ', ' Bob ', 'Daniel']
grades = np.random.choice(list('ABCDF'), len(names_in_student_db))
df2 = pd.DataFrame({'student_names': names_in_student_db,'grades': grades})
If they always match up based on index, you can just concat them together and then drop columns you no longer want.
pd.concat([df1, df2], axis=1)
Specify the column name like this
pd.merge(left=df1, right=df2, left_on='names', right_on='student_names', how='left')
depending on your expected result.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.