Merge DataFrames of different lengths when the join column does not have unique values
I have the Titanic dataset split across several CSV files, and I need to combine all the files into one DataFrame to use the data. However, one of the files does not have any column with unique values. I am trying to merge the data using the merge command, but the number of records increases.
df1
Ticket Fare Cabin Embarked
0 110152 86.50 B79 S
1 110152 92.50 B77 S
2 110413 79.65 E67 S
3 110413 79.65 E68 S
4 110465 52.00 C110 S
5 110465 52.00 A14 S
6 110564 26.55 C52 S
7 110813 75.25 D37 C
8 111240 33.50 B19 S
9 111320 38.50 E63 S
df2
Survived Ticket
PassengerId
1 0 A/5 21171
2 1 PC 17599
3 1 STON/O2. 3101282
4 1 113803
5 0 373450
6 0 330877
7 0 17463
8 0 349909
9 1 347742
10 1 237736
Some ticket numbers appear with different prices. After the merge, this adds two records for the passenger holding that ticket, one for each price.
For example, ticket 110152 has two prices, so whichever passenger bought this ticket ends up with two records after the merge, each with a different price.
passengerID Survived Ticket Fare Cabin Embarked
0 0 110152 86.50 NaN S
0 1 110152 90.50 C85 C
1 1 STON/O2.3101 7.9250 NaN S
2 1 113803 53.1000 C123 S
3 0 113803 53.1000 C123 S
4 0 373450 8.0500 NaN S
Here passenger 0 has two records with different prices, but it should have only one record after the merge.
If I understand correctly, the issue is that the merge statement produces multiple records. You can eliminate the extra records for the same Ticket number in df1 and keep only one record per ticket. Something like this:
In [298]: # Rank fares within each ticket, highest first; ties are broken by row order
     ...: df1['rank'] = df1.groupby('Ticket')['Fare'].rank(method='first', ascending=False)
In [299]: df1
Out[299]:
Ticket Fare Cabin Embarked rank
0 110152 86.50 B79 S 2.0
1 110152 92.50 B77 S 1.0
2 110413 79.65 E67 S 1.0
3 110413 79.65 E68 S 2.0
4 110465 52.00 C110 S 1.0
5 110465 52.00 A14 S 2.0
6 110564 26.55 C52 S 1.0
7 110813 75.25 D37 C 1.0
8 111240 33.50 B19 S 1.0
9 111320 38.50 E63 S 1.0
In [303]: # Keep only the top-ranked row per ticket, then drop the helper column
     ...: df1 = df1.query('rank == 1.0').drop(columns='rank')
In [304]: df1
Out[304]:
Ticket Fare Cabin Embarked
1 110152 92.50 B77 S
2 110413 79.65 E67 S
4 110465 52.00 C110 S
6 110564 26.55 C52 S
7 110813 75.25 D37 C
8 111240 33.50 B19 S
9 111320 38.50 E63 S
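The same result can also be sketched without a helper column. This is an alternative technique (sort, then `drop_duplicates`), not the rank approach used above, and it assumes that keeping the highest fare per ticket is what you want. The rows are copied from the first five rows of df1 in the question; storing Ticket as strings is an assumption, made because df2 contains values such as "A/5 21171".

```python
import pandas as pd

# First five rows of df1 from the question (Ticket stored as strings by assumption).
df1 = pd.DataFrame({
    "Ticket": ["110152", "110152", "110413", "110413", "110465"],
    "Fare": [86.50, 92.50, 79.65, 79.65, 52.00],
    "Cabin": ["B79", "B77", "E67", "E68", "C110"],
    "Embarked": ["S", "S", "S", "S", "S"],
})

# Sort so the highest fare comes first, then keep the first row seen per ticket.
deduped = (
    df1.sort_values("Fare", ascending=False, kind="mergesort")
       .drop_duplicates(subset="Ticket", keep="first")
       .sort_index()  # restore the original row order
)
print(deduped)
```

If a different rule fits your data better (for example the lowest fare, or the first row as it appears in the file), change the sort column or direction accordingly.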
Now, as you can see, df1 has only one record per ticket number, so your merge statement will not produce duplicates. Note that this keeps the row with the highest Fare for each ticket.
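As a rough sketch of the merge step itself: the frames below are small hypothetical stand-ins (the df2 rows are invented so that the tickets overlap with df1, which the sample in the question does not show). The `validate="many_to_one"` argument asks pandas to raise an error if Ticket is still duplicated on the df1 side, so an unexpected row increase fails loudly rather than silently.

```python
import pandas as pd

# Hypothetical passenger table; two passengers share ticket 110152.
df2 = pd.DataFrame(
    {"Survived": [0, 1, 1], "Ticket": ["110152", "110413", "110152"]},
    index=pd.Index([1, 2, 3], name="PassengerId"),
)

# df1 after de-duplication: exactly one row per ticket.
df1 = pd.DataFrame({
    "Ticket": ["110152", "110413"],
    "Fare": [92.50, 79.65],
    "Cabin": ["B77", "E67"],
    "Embarked": ["S", "S"],
})

# Left merge keeps every passenger; validate checks df1's Ticket is unique.
merged = (
    df2.reset_index()
       .merge(df1, on="Ticket", how="left", validate="many_to_one")
       .set_index("PassengerId")
)
print(merged)
print(len(merged) == len(df2))
```

A left merge is used here on the assumption that every passenger in df2 should survive the merge even when no ticket row matches; those passengers simply get NaN in the df1 columns.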
Let me know if this helps.