
Merging DataFrames of different lengths when the join column does not have unique values

I have the Titanic dataset with the data split across several CSV files. I need to combine all the files into one DataFrame to use the data, but one of the files has no column with unique values. I am trying to merge the data using the merge command, but the number of records increases.


df1

    Ticket  Fare    Cabin   Embarked
0   110152  86.50   B79       S
1   110152  92.50   B77       S
2   110413  79.65   E67       S
3   110413  79.65   E68       S
4   110465  52.00   C110      S
5   110465  52.00   A14       S
6   110564  26.55   C52       S
7   110813  75.25   D37       C
8   111240  33.50   B19       S
9   111320  38.50   E63       S

df2 

        Survived    Ticket
PassengerId     
1         0         A/5 21171
2         1         PC 17599
3         1         STON/O2. 3101282
4         1         113803
5         0         373450
6         0         330877
7         0         17463
8         0         349909
9         1         347742
10        1         237736

Some ticket numbers appear in df1 with more than one fare. The merge therefore adds one record per fare for every passenger holding such a ticket.

E.g. ticket 110152 has two fares, so any passenger who bought this ticket ends up with two records after the merge, one for each fare.
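To make the row multiplication concrete, here is a minimal sketch. The passenger rows are made up for illustration (they are not from the real files), and the Ticket values are stored as strings on both sides, which is an assumption about how the CSVs are loaded. It also shows the `validate` argument of `merge`, which raises an error instead of silently duplicating rows when the right-hand key is not unique.

```python
import pandas as pd

# Fares copied from df1 above; ticket 110152 appears twice with different fares.
fares = pd.DataFrame({
    "Ticket": ["110152", "110152", "110564"],
    "Fare": [86.50, 92.50, 26.55],
})

# Hypothetical passengers, one row each (illustration only).
passengers = pd.DataFrame(
    {"Survived": [1, 0], "Ticket": ["110152", "110564"]},
    index=pd.Index([1, 2], name="PassengerId"),
)

# A plain left merge returns one output row per matching fare row,
# so passenger 1 shows up twice.
merged = passengers.reset_index().merge(fares, on="Ticket", how="left")
print(merged)

# validate="many_to_one" asks pandas to check that Ticket is unique
# in the right-hand frame and to raise MergeError if it is not.
try:
    passengers.reset_index().merge(
        fares, on="Ticket", how="left", validate="many_to_one"
    )
    duplicate_keys_detected = False
except pd.errors.MergeError:
    duplicate_keys_detected = True

print(duplicate_keys_detected)
```

Running the merge with `validate` first is a cheap way to find out which file has the non-unique key before deciding how to collapse it.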

 PassengerId  Survived  Ticket   Fare     Cabin  Embarked
 0    0       110152        86.50      NaN      S
 0    1       110152        90.50      C85      C
 1    1     STON/O2.3101   7.9250      NaN      S
 2    1      113803        53.1000     C123     S
 3    0      113803        53.1000     C123     S
 4    0       373450       8.0500       NaN     S

Here passenger 0 has two records with different prices, but it should have only one record after the merge.

If I understand correctly, the problem is that the merge returns multiple records per passenger.

You can drop the extra records for each ticket number in df1 and keep only one per ticket, for example the one with the highest fare. Something like this:

In [298]: df1['rank'] = df1.groupby('Ticket')['Fare'].rank(method='first', ascending=False)

In [299]: df1
Out[299]: 
   Ticket   Fare Cabin Embarked  rank
0  110152  86.50   B79        S   2.0
1  110152  92.50   B77        S   1.0
2  110413  79.65   E67        S   1.0
3  110413  79.65   E68        S   2.0
4  110465  52.00  C110        S   1.0
5  110465  52.00   A14        S   2.0
6  110564  26.55   C52        S   1.0
7  110813  75.25   D37        C   1.0
8  111240  33.50   B19        S   1.0
9  111320  38.50   E63        S   1.0

In [303]: df1 = df1.query('rank == 1.0').drop(columns='rank')

In [304]: df1
Out[304]: 

   Ticket   Fare Cabin Embarked
1  110152  92.50   B77        S
2  110413  79.65   E67        S
4  110465  52.00  C110        S
6  110564  26.55   C52        S
7  110813  75.25   D37        C
8  111240  33.50   B19        S
9  111320  38.50   E63        S

As you can see, df1 now has only one record per ticket number, so your merge statement will no longer produce duplicates.
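For reference, here is a self-contained sketch of the whole flow, written as a script rather than an IPython session. It uses `sort_values` plus `drop_duplicates` instead of the rank column, which is a swapped-in alternative that also keeps the highest fare per ticket. The df2 rows and the string Ticket dtype are assumptions made for the example, not the real data.

```python
import pandas as pd

# A few rows from df1 above, with Ticket kept as a string.
df1 = pd.DataFrame({
    "Ticket": ["110152", "110152", "110413", "110413", "110564"],
    "Fare": [86.50, 92.50, 79.65, 79.65, 26.55],
    "Cabin": ["B79", "B77", "E67", "E68", "C52"],
    "Embarked": ["S", "S", "S", "S", "S"],
})

# Hypothetical passenger rows (illustration only).
df2 = pd.DataFrame(
    {"Survived": [0, 1, 1], "Ticket": ["110152", "110413", "PC 17599"]},
    index=pd.Index([1, 2, 3], name="PassengerId"),
)

# Keep one row per ticket: sort by fare, highest first, then drop
# every later row that repeats a ticket number.
df1_unique = (
    df1.sort_values("Fare", ascending=False, kind="stable")
       .drop_duplicates("Ticket", keep="first")
)

# Left merge keeps every passenger; validate guards against any
# duplicate ticket numbers still present in df1_unique.
result = df2.reset_index().merge(
    df1_unique, on="Ticket", how="left", validate="many_to_one"
)
print(result)
```

Passengers whose ticket is missing from df1 keep their row and get NaN in the Fare, Cabin and Embarked columns. Note that whichever rule you use, the Cabin and Embarked values of the dropped rows are discarded, so it is worth checking that this is acceptable for your analysis.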

Let me know if this helps.

