I have the Titanic dataset split across several CSV files, and I need to combine all of them into one DataFrame to use the data. One of the files does not have any column with unique values. I am trying to combine the data using merge, but the number of records increases.
df1
Ticket Fare Cabin Embarked
0 110152 86.50 B79 S
1 110152 92.50 B77 S
2 110413 79.65 E67 S
3 110413 79.65 E68 S
4 110465 52.00 C110 S
5 110465 52.00 A14 S
6 110564 26.55 C52 S
7 110813 75.25 D37 C
8 111240 33.50 B19 S
9 111320 38.50 E63 S
df2
Survived Ticket
PassengerId
1 0 A/5 21171
2 1 PC 17599
3 1 STON/O2. 3101282
4 1 113803
5 0 373450
6 0 330877
7 0 17463
8 0 349909
9 1 347742
10 1 237736
Some ticket numbers appear more than once with different fares. After the merge, this produces two records for the same passenger, one for each fare.
E.g. ticket 110152 has two fares, so any passenger holding this ticket ends up with two records after the merge, each with a different fare.
PassengerId Survived Ticket Fare Cabin Embarked
0 0 110152 86.50 NaN S
0 1 110152 90.50 C85 C
1 1 STON/O2.3101 7.9250 NaN S
2 1 113803 53.1000 C123 S
3 0 113803 53.1000 C123 S
4 0 373450 8.0500 NaN S
Here passenger 0 has two records with different fares, but there should be only one record after the merge.
If I understand correctly, the issue is with multiple records coming after the merge statement.
You can eliminate multiple records for the same Ticket number and keep only 1 record. Something like this:
In [298]: df1['rank'] = df1.groupby('Ticket')['Fare'].rank(method='first', ascending=False)
In [299]: df1
Out[299]:
Ticket Fare Cabin Embarked rank
0 110152 86.50 B79 S 2.0
1 110152 92.50 B77 S 1.0
2 110413 79.65 E67 S 1.0
3 110413 79.65 E68 S 2.0
4 110465 52.00 C110 S 1.0
5 110465 52.00 A14 S 2.0
6 110564 26.55 C52 S 1.0
7 110813 75.25 D37 C 1.0
8 111240 33.50 B19 S 1.0
9 111320 38.50 E63 S 1.0
In [303]: df1 = df1.query('rank == 1.0').drop(columns='rank')
In [304]: df1
Out[304]:
Ticket Fare Cabin Embarked
1 110152 92.50 B77 S
2 110413 79.65 E67 S
4 110465 52.00 C110 S
6 110564 26.55 C52 S
7 110813 75.25 D37 C
8 111240 33.50 B19 S
9 111320 38.50 E63 S
As you can see, df1 now has only one record per ticket number, so your merge statement will not produce duplicates.
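For completeness, here is a minimal end-to-end sketch of the same idea, using sort_values plus drop_duplicates in place of the rank column. The df1 rows are the first five from the question; the df2 rows are hypothetical passengers I made up so that the tickets actually match, and the names df1_unique and merged are mine. I am also assuming you want to keep the highest fare per ticket, as in the rank example above, and that Ticket is stored as a string in both frames.

```python
import pandas as pd

# First five rows of df1 from the question (Ticket kept as a string).
df1 = pd.DataFrame({
    "Ticket": ["110152", "110152", "110413", "110413", "110465"],
    "Fare": [86.50, 92.50, 79.65, 79.65, 52.00],
    "Cabin": ["B79", "B77", "E67", "E68", "C110"],
    "Embarked": ["S", "S", "S", "S", "S"],
})

# Hypothetical passengers (not from the original data), one per ticket.
df2 = pd.DataFrame(
    {"Survived": [1, 0, 1], "Ticket": ["110152", "110413", "110465"]},
    index=pd.Index([1, 2, 3], name="PassengerId"),
)

# Keep one row per ticket: sort by fare (highest first), then drop
# every later row that repeats a ticket number.
df1_unique = (
    df1.sort_values("Fare", ascending=False)
       .drop_duplicates(subset="Ticket", keep="first")
)

# validate="many_to_one" makes pandas raise a MergeError if df1_unique
# still contains a repeated Ticket, instead of silently adding rows.
merged = (
    df2.reset_index()
       .merge(df1_unique, on="Ticket", how="left", validate="many_to_one")
)

print(merged)
```

With how="left" every passenger in df2 is kept exactly once, and the validate argument acts as a guard in case the de-duplication step is skipped or changed later.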
Let me know if this helps.