Merging dataframes of different lengths where the join column has no unique values

Question

I have the Titanic dataset split across several CSV files, and I need to combine all the files into one dataframe to use the data. However, one of the files does not have any column with unique values. I am trying to merge the data using the merge command, but the number of records increases.


df1

    Ticket  Fare    Cabin   Embarked
0   110152  86.50   B79       S
1   110152  92.50   B77       S
2   110413  79.65   E67       S
3   110413  79.65   E68       S
4   110465  52.00   C110      S
5   110465  52.00   A14       S
6   110564  26.55   C52       S
7   110813  75.25   D37       C
8   111240  33.50   B19       S
9   111320  38.50   E63       S

df2 

        Survived    Ticket
PassengerId     
1         0         A/5 21171
2         1         PC 17599
3         1         STON/O2. 3101282
4         1         113803
5         0         373450
6         0         330877
7         0         17463
8         0         349909
9         1         347742
10        1         237736

Some ticket numbers appear with different prices. As a result, the merge adds two records for a passenger holding such a ticket, one per price.

For example, ticket 110152 has two prices. Whichever passenger bought this ticket ends up with two records after the merge, each with a different price.

    PassengerId  Survived  Ticket        Fare     Cabin  Embarked
    0            0         110152        86.50    NaN    S
    0            1         110152        90.50    C85    C
    1            1         STON/O2.3101  7.9250   NaN    S
    2            1         113803        53.1000  C123   S
    3            0         113803        53.1000  C123   S
    4            0         373450        8.0500   NaN    S

Here passenger 0 has two records with different prices, but it should have only one record after the merge.

Answer 1

If I understand correctly, the issue is that the merge statement produces multiple records for the same passenger.
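To see why the row count grows, here is a minimal sketch with made-up miniature frames (the names `fares` and `passengers` and their values are illustrative, not your actual data). Each row on the left is paired with every matching row on the right, so a ticket listed twice in the fare table yields two output rows for its passenger.

```python
import pandas as pd

# Hypothetical fare table: ticket 110152 appears twice with different fares.
fares = pd.DataFrame({
    "Ticket": ["110152", "110152", "110413"],
    "Fare": [86.50, 92.50, 79.65],
})

# Hypothetical passenger table: one row per passenger.
passengers = pd.DataFrame({
    "PassengerId": [1, 2],
    "Survived": [0, 1],
    "Ticket": ["110152", "110413"],
})

# Passenger 1 matches two fare rows, so the result has more rows than `passengers`.
merged = passengers.merge(fares, on="Ticket", how="left")
print(len(passengers), len(merged))

# Counting repeated keys on the lookup side shows where the extra rows come from.
n_repeated = int(fares["Ticket"].duplicated().sum())
print(n_repeated)
```

Checking `duplicated()` on the join column before merging is a quick way to confirm whether this is what is happening in your data.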

You can eliminate the extra records for each Ticket number and keep only one. Something like this:

In [298]: df1['rank'] = df1.groupby('Ticket')['Fare'].rank(method='first', ascending=False)

In [299]: df1
Out[299]: 
   Ticket   Fare Cabin Embarked  rank
0  110152  86.50   B79        S   2.0
1  110152  92.50   B77        S   1.0
2  110413  79.65   E67        S   1.0
3  110413  79.65   E68        S   2.0
4  110465  52.00  C110        S   1.0
5  110465  52.00   A14        S   2.0
6  110564  26.55   C52        S   1.0
7  110813  75.25   D37        C   1.0
8  111240  33.50   B19        S   1.0
9  111320  38.50   E63        S   1.0

In [303]: df1 = df1.query('rank == 1.0').drop(columns='rank')

In [304]: df1
Out[304]: 

   Ticket   Fare Cabin Embarked
1  110152  92.50   B77        S
2  110413  79.65   E67        S
4  110465  52.00  C110        S
6  110564  26.55   C52        S
7  110813  75.25   D37        C
8  111240  33.50   B19        S
9  111320  38.50   E63        S

Now df1 has only one record per ticket number, so your merge statement will not produce duplicates.
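As a possible alternative to the rank-and-filter step, the sketch below picks the highest-fare row per ticket with `idxmax` and then merges with `validate='many_to_one'`, which makes pandas raise a `MergeError` if the lookup side still has duplicate keys. The frame names and values are hypothetical, and `Ticket` is assumed to be stored as strings in both frames; if one CSV parses it as integers, the keys would likely need casting to a common type first.

```python
import pandas as pd

# Hypothetical fare table with repeated ticket numbers.
fares = pd.DataFrame({
    "Ticket": ["110152", "110152", "110413", "110413"],
    "Fare": [86.50, 92.50, 79.65, 79.65],
    "Cabin": ["B79", "B77", "E67", "E68"],
})

# Hypothetical passenger table; ticket 999999 has no fare record.
passengers = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Ticket": ["110152", "110413", "999999"],
})

# idxmax returns the index label of the first maximum fare in each ticket group,
# so ties are resolved by first occurrence, like rank(method='first').
keep = fares.groupby("Ticket")["Fare"].idxmax()
fares_unique = fares.loc[keep]

# A left merge keeps every passenger; validate guards against duplicate keys
# remaining on the right-hand side.
merged = passengers.merge(
    fares_unique, on="Ticket", how="left", validate="many_to_one"
)
print(merged)
```

Keeping the highest fare is only one choice of rule; whichever rule you use, the point is that the fare table must have one row per ticket before the merge.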

Let me know if this helps.
