
Merge dataframes of different lengths when the join column does not have unique values

I have the Titanic dataset, with the data split across several CSV files. I need to combine all the files into one dataframe to use the data. But one of the files does not have any column that holds unique values. I am trying to combine the data using the merge command, but the number of records increases.


df1

    Ticket  Fare    Cabin   Embarked
0   110152  86.50   B79       S
1   110152  92.50   B77       S
2   110413  79.65   E67       S
3   110413  79.65   E68       S
4   110465  52.00   C110      S
5   110465  52.00   A14       S
6   110564  26.55   C52       S
7   110813  75.25   D37       C
8   111240  33.50   B19       S
9   111320  38.50   E63       S

df2 

        Survived    Ticket
PassengerId     
1         0         A/5 21171
2         1         PC 17599
3         1         STON/O2. 3101282
4         1         113803
5         0         373450
6         0         330877
7         0         17463
8         0         349909
9         1         347742
10        1         237736

Some ticket numbers are listed with different prices. After the merge, this adds two records for the passenger holding that ticket, one for each price.

E.g. ticket 110152 has two prices. Whichever passenger bought this ticket ends up with two records after the merge, with two different prices.

passengerID  Survived  Ticket        Fare     Cabin  Embarked
0            0         110152        86.50    NaN    S
0            1         110152        90.50    C85    C
1            1         STON/O2.3101  7.9250   NaN    S
2            1         113803        53.1000  C123   S
3            0         113803        53.1000  C123   S
4            0         373450        8.0500   NaN    S

Here passenger 0 has two records with different prices, but there should be only one record after the merge.
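The row growth described above can be reproduced on a tiny example. This is only a sketch: the frames `fares` and `passengers` and their values are hypothetical, modelled on the samples in the question, and the ticket numbers are assumed to be stored as strings in both frames.

```python
import pandas as pd

# Hypothetical miniature version of df1: ticket 110152 appears twice,
# once per price.
fares = pd.DataFrame({
    "Ticket": ["110152", "110152", "110413"],
    "Fare": [86.50, 92.50, 79.65],
})

# Hypothetical miniature version of df2: one row per passenger.
passengers = pd.DataFrame({
    "PassengerId": [1, 2],
    "Survived": [1, 0],
    "Ticket": ["110152", "110413"],
})

# Every passenger row is paired with every fare row sharing its Ticket,
# so the passenger holding ticket 110152 comes out twice.
merged = passengers.merge(fares, on="Ticket", how="left")

print(len(passengers), len(merged))
print(fares["Ticket"].duplicated().any())
```

The merged frame has more rows than `passengers` because the join key is not unique on the fares side.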

If I understand correctly, the issue is the multiple records that appear after the merge statement.

You can eliminate the multiple records for the same Ticket number and keep only one record per ticket (in the code here, the one with the highest Fare). Something like this:

In [298]: df1['rank'] = df1.groupby('Ticket')['Fare'].rank(method='first', ascending=False)

In [299]: df1
Out[299]: 
   Ticket   Fare Cabin Embarked  rank
0  110152  86.50   B79        S   2.0
1  110152  92.50   B77        S   1.0
2  110413  79.65   E67        S   1.0
3  110413  79.65   E68        S   2.0
4  110465  52.00  C110        S   1.0
5  110465  52.00   A14        S   2.0
6  110564  26.55   C52        S   1.0
7  110813  75.25   D37        C   1.0
8  111240  33.50   B19        S   1.0
9  111320  38.50   E63        S   1.0

In [303]: df1 = df1.query('rank == 1.0').drop(columns='rank')

In [304]: df1
Out[304]: 

   Ticket   Fare Cabin Embarked
1  110152  92.50   B77        S
2  110413  79.65   E67        S
4  110465  52.00  C110        S
6  110564  26.55   C52        S
7  110813  75.25   D37        C
8  111240  33.50   B19        S
9  111320  38.50   E63        S

Now, as you can see, df1 has only one record per ticket number, so your merge statement will not produce duplicates.

Let me know if this helps.
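For completeness, here is a minimal end-to-end sketch of deduplicating first and merging afterwards. It swaps the rank-based filter for `sort_values` plus `drop_duplicates`, which also keeps the highest Fare per ticket. The sample values, the name `df1_unique`, and the assumption that Ticket is stored as a string in both frames are illustrative and not taken from the question. Passing `validate="many_to_one"` makes pandas raise an error if the right-hand keys are still not unique.

```python
import pandas as pd

# Hypothetical sample data shaped like df1 and df2 from the question.
df1 = pd.DataFrame({
    "Ticket": ["110152", "110152", "110413", "110413", "110564"],
    "Fare": [86.50, 92.50, 79.65, 79.65, 26.55],
    "Cabin": ["B79", "B77", "E67", "E68", "C52"],
    "Embarked": ["S", "S", "S", "S", "S"],
})
df2 = pd.DataFrame(
    {"Survived": [0, 1, 1], "Ticket": ["110152", "110413", "110564"]},
    index=pd.Index([1, 2, 3], name="PassengerId"),
)

# Keep one row per ticket: sort so the highest fare comes first,
# then drop every later row with the same Ticket.
df1_unique = (
    df1.sort_values("Fare", ascending=False)
       .drop_duplicates(subset="Ticket", keep="first")
)

# Left merge keeps exactly one row per passenger; validate checks that
# Ticket is unique in df1_unique and raises MergeError otherwise.
merged = (
    df2.reset_index()
       .merge(df1_unique, on="Ticket", how="left", validate="many_to_one")
       .set_index("PassengerId")
)

print(merged)
```

Which fare to keep (highest, lowest, first seen) is a modelling decision; the sort order above is just one choice.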

