
How to fix Join between two datasets

Problem: when I try to merge or join two datasets after setting the same index on both, the result is a dataset with duplicates.

Create the first dataframe (UNI):

import csv
import pandas as pd
import os
import os.path

fullName=os.getcwd()
full_filename = os.path.join(fullName,'Rankings.csv')
file_stream = open(full_filename, mode='r', newline='')

reader = csv.reader(file_stream, delimiter=",")

# read and ignore the first line
header = next(reader)
data = []
# read the remaining part of the file
for i in range(2000):
    info = next(reader)
    data += [info]
file_stream.close()

dfUNI = pd.DataFrame(data)
dfUNI.columns = header
# I renamed column 1 to be able to merge the two datasets on the same "Name" column
# (Index.get_values() was removed in pandas 1.0; tolist() gives an editable copy)
cols = dfUNI.columns.tolist()
cols[1] = 'Name'
dfUNI.columns = cols

Create the second dataframe (Fees):

full_filename = os.path.join(fullName,'Fees.csv')
file_stream = open(full_filename, mode='r', newline='', encoding="ISO-8859-1")
# I used encoding to avoid problems reading the file
reader = csv.reader(file_stream, delimiter=",")
# read and ignore the first line
header = next(reader)
data = []
# read the remaining part of the file
for i in range(200):
    info = next(reader)
    data += [info]
file_stream.close()

dfFees = pd.DataFrame(data)
dfFees.columns = header
del dfUNI["international"]
del dfUNI["income"]
del dfUNI["female_male_ratio"]
del dfUNI["student_staff_ratio"]
del dfUNI["year"]
dfUNI.set_index("Name")
dfFees.set_index("Name")
dfFees

Join them together:

df=dfUNI.set_index("Name")
df2=dfFees.set_index("Name")
df.join(df2,how="outer")

I expected a dataset with the information from the second dataset (dfFees / df2) added to the matching rows (by "Name") of the first dataset (dfUNI / df).

First things first: since you're using pandas, you may want to simplify the way you're reading in those CSVs by using pd.read_csv (see the pandas documentation). You could also use pathlib.Path for easier path manipulation, but I focused on pandas:

# Starting from scratch:

import os

import pandas as pd

fullName = os.getcwd()
full_filename_UNI = os.path.join(fullName, "Rankings.csv")
full_filename_Fees = os.path.join(fullName, "Fees.csv")

dfUNI = pd.read_csv(full_filename_UNI, delimiter=",")
# Fees.csv needs the explicit encoding, as in the question
dfFees = pd.read_csv(full_filename_Fees, delimiter=",", encoding="ISO-8859-1")
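For the pathlib route mentioned above, here is a minimal sketch. The helper name load_csv and the tiny demo file are made up for illustration; the only thing it relies on is that pd.read_csv accepts a Path object as well as a string.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_csv(folder, filename, encoding=None):
    """Read folder/filename into a DataFrame (hypothetical helper)."""
    return pd.read_csv(Path(folder) / filename, encoding=encoding)


# Demo with a throwaway file so the sketch runs anywhere (made-up data).
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "Fees.csv"
    sample.write_text("Name,Fee\nAlpha,100\nBeta,200\n", encoding="ISO-8859-1")
    demo = load_csv(tmp, "Fees.csv", encoding="ISO-8859-1")

print(demo)
```

With the real files you would call something like load_csv(Path.cwd(), "Rankings.csv") instead of building the path with os.path.join.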

Then you can use .rename to rename that column and .drop instead of del dfUNI["something"] (both are covered in the pandas documentation). Don't forget the " inplace " argument for either, so that you don't have to reassign the variable every time like dfUNI = dfUNI.rename(...) .

# Start of cleanup for dfUNI ->
# Rename the second column (position 1), the one the question renames to "Name"
dfUNI.rename(columns={dfUNI.columns[1]: "Name"}, inplace=True)

# Drop the columns that the question removes from dfUNI with del
colNameDropList = ["international", "income", "female_male_ratio", "student_staff_ratio", "year"]
dfUNI.drop(columns=colNameDropList, inplace=True)

# Set the index for both (use inplace!):
dfUNI.set_index("Name", inplace=True)
dfFees.set_index("Name", inplace=True)
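To see why the inplace argument (or a reassignment) matters here, consider this small sketch with made-up data. It shows that a bare set_index call, like the two standalone ones in the question, returns a new dataframe and leaves the original untouched.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Alpha", "Beta"], "Rank": [1, 2]})

# The returned frame is thrown away, so df itself does not change.
df.set_index("Name")
columns_after_bare_call = list(df.columns)

# Keeping the returned frame (or passing inplace=True) is what has an effect.
indexed = df.set_index("Name")

print(columns_after_bare_call)
print(indexed.index.name)
```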

Now comes the part that you're really looking for: you need to use a left join. Pandas uses a lot of SQL-esque methods for its dataframes.

dfFINAL = dfUNI.join(dfFees, how="left") # "left" is the default btw

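OR, instead of setting the index on dfUNI beforehand, you could use the " on " argument of the .join method. Note that " on " names a column of the calling dataframe and matches it against the index of the other one, so this variant assumes "Name" is still a regular column in both dataframes and sets it as the index of dfFees only:

dfFINAL = dfUNI.join(dfFees.set_index("Name"), how="left", on="Name")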
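If you would rather not touch the index at all, DataFrame.merge matches on a shared column directly. This is a sketch with made-up data, not your actual files; it assumes both dataframes carry a regular "Name" column.

```python
import pandas as pd

uni = pd.DataFrame({"Name": ["Alpha", "Beta", "Gamma"], "Rank": [1, 2, 3]})
fees = pd.DataFrame({"Name": ["Alpha", "Beta", "Delta"], "Fee": [100, 200, 400]})

# Left merge: every row of uni is kept, fee columns are filled where names match.
merged = uni.merge(fees, on="Name", how="left")

print(merged)
```

"Gamma" has no entry in fees, so its Fee is NaN, and "Delta" does not appear because it only exists on the right-hand side.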

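You were getting extra rows because you were doing an "outer join", which keeps every row from both dataframes, including the names in dfFees that have no match in dfUNI. A left join keeps only the rows of dfUNI and fills in the fee columns where the name matches. Keep in mind that if a name appears more than once in either dataframe, any kind of join will still repeat rows for that name.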
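The difference is easy to check on a toy example. The data below is made up, and the repeated "Alpha" rows are only an assumption about what Rankings.csv might look like (it has a "year" column, so a university may well appear once per year). Keeping the latest year per name is just one possible way to end up with one row per university.

```python
import pandas as pd

uni = pd.DataFrame(
    {"Name": ["Alpha", "Alpha", "Beta"], "year": [2015, 2016, 2016], "Rank": [1, 2, 3]}
).set_index("Name")
fees = pd.DataFrame({"Name": ["Alpha", "Delta"], "Fee": [100, 400]}).set_index("Name")

left = uni.join(fees, how="left")    # only names present in uni
outer = uni.join(fees, how="outer")  # also brings in "Delta" from fees

# Repeated names survive any join, so reduce uni to one row per name first
# (here: the most recent year) if that is what you want.
latest = uni.sort_values("year").groupby(level="Name").tail(1)
one_per_name = latest.join(fees, how="left")

print(len(left), len(outer), len(one_per_name))
```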
