Merge pandas DataFrame on column of float values

I have two data frames that I am trying to merge.

Dataframe A:

    col1    col2    sub    grade
0   1       34.32   x       a 
1   1       34.32   x       b
2   1       34.33   y       c
3   2       10.14   z       b
4   3       33.01   z       a

Dataframe B:

    col1    col2    group   ID
0   1       34.32   t       z 
1   1       54.32   s       w
2   1       34.33   r       z
3   2       10.14   q       z
4   3       33.01   q       e

I want to merge on col1 and col2. I've been using pd.merge with the following syntax:

pd.merge(A, B, how = 'outer', on = ['col1', 'col2'])

However, I think I am running into issues joining on the float values of col2, since many rows are being dropped. Is there any way to use np.isclose to match the values of col2? When I look up a particular value of col2 by index in either dataframe, the value has many more decimal places than what is displayed in the dataframe.

I would like the result to be:

    col1   col2   sub   grade   group    ID
0   1      34.32  x     a       t        z
1   1      34.32  x     b       t        z
2   1      54.32  NaN   NaN     s        w
3   1      34.33  y     c       r        z
4   2      10.14  z     b       q        z
5   3      33.01  z     a       q        e

You can use a small hack: multiply the float column by a constant such as 100 or 1000, round and convert it to int, merge, and finally divide by the constant:

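import numpy as np
import pandas as pd

N = 100  # scale factor: 10**k for k decimal places
# round before casting (thanks to koalo for the comment): astype(int) alone truncates
A.col2 = np.round(A.col2*N).astype(int)
B.col2 = np.round(B.col2*N).astype(int)
df = pd.merge(A, B, how='outer', on=['col1', 'col2'])
df.col2 = df.col2 / N  # restore the original scale
print(df)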
   col1   col2  sub grade group ID
0     1  34.32    x     a     t  z
1     1  34.32    x     b     t  z
2     1  34.33    y     c     r  z
3     2  10.14    z     b     q  z
4     3  33.01    z     a     q  e
5     1  54.32  NaN   NaN     s  w
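
The question also asks whether np.isclose can be used. As an alternative to the integer scaling above, one way that idea could be sketched is to merge on the exact key (col1) only and then keep the row pairs whose col2 values are close. The sample frames, the 1e-6 tolerance, and the `_B` suffix below are assumptions for illustration, not part of the original thread.

```python
import numpy as np
import pandas as pd

# Hypothetical sample frames; col2 in B carries tiny floating-point noise.
A = pd.DataFrame({
    "col1": [1, 1, 2],
    "col2": [34.32, 34.33, 10.14],
    "sub": ["x", "y", "z"],
})
B = pd.DataFrame({
    "col1": [1, 1, 2],
    "col2": [34.32 + 1e-9, 54.32, 10.14 - 1e-9],
    "group": ["t", "s", "q"],
})

# An exact merge finds no matches because the float keys are not bit-identical.
exact = pd.merge(A, B, how="inner", on=["col1", "col2"])

# Pair rows on the exact key only, then keep pairs whose col2 values are close.
pairs = pd.merge(A, B, on="col1", suffixes=("", "_B"))
mask = np.isclose(pairs["col2"], pairs["col2_B"], rtol=0, atol=1e-6)
close = pairs[mask].drop(columns="col2_B").reset_index(drop=True)
print(close)
```

Note that this sketch has inner-join semantics (rows without a close partner are dropped), and the intermediate `pairs` frame grows with the number of rows that share a col1 value, so it is best suited to small groups.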

I had a similar problem where I needed to identify matching rows with thousands of float columns and no identifier. This case is difficult because values can vary slightly due to rounding.

In this case, I used scipy.spatial.distance.cosine, which returns the cosine distance, and took 1 minus that value to get the cosine similarity between rows.

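from scipy.spatial import distance

threshold = 0.99999
# row1 and row2 are 1-D numeric arrays, one row from each frame
similarity = 1 - distance.cosine(row1, row2)  # cosine() returns a distance

if similarity >= threshold:
    pass  # it's a match
else:
    pass  # loop and check another row pair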

This won't work if you have duplicate or very similar rows, but when you have a large number of float columns and not too many rows, it works well.
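
The row-pair loop described above could be sketched for whole arrays as follows. This is a minimal sketch that computes the same quantity as 1 - scipy.spatial.distance.cosine using plain NumPy; the helper name `cosine_matches` and the sample arrays are hypothetical, and it assumes no row is all zeros.

```python
import numpy as np

def cosine_matches(X, Y, threshold=0.99999):
    """Return (i, j) pairs where row i of X and row j of Y have
    cosine similarity >= threshold. Assumes no all-zero rows."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    # Normalise each row to unit length; the dot product of two unit
    # vectors is their cosine similarity.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    sim = Xn @ Yn.T  # sim[i, j] = similarity of X[i] and Y[j]
    rows, cols = np.nonzero(sim >= threshold)
    return [(int(i), int(j)) for i, j in zip(rows, cols)]

# Hypothetical data: Y holds slightly perturbed copies of X's rows, in a
# different order, plus one unrelated row.
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 0.5, 9.0]])
Y = np.array([[4.0000001, 0.5, 8.9999999],
              [1.0, 2.0000001, 3.0],
              [7.0, 7.0, 1.0]])

matches = cosine_matches(X, Y)
print(matches)
```

Because cosine similarity ignores magnitude, rows that are scalar multiples of each other would also count as matches, which is another case where this approach needs care.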

Assuming that the column (col2) has n decimal places:

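import numpy as np

# round both key columns to the same n decimal places so equal values compare equal
A.col2 = np.round(A.col2, decimals=n)
B.col2 = np.round(B.col2, decimals=n)
df = A.merge(B, on=['col1', 'col2'])  # how='inner' by default; pass how='outer' to keep unmatched rows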
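
A self-contained sketch of the rounding approach, using an outer merge as in the question, might look like this. The sample frames, the choice n = 2, and the use of `indicator=True` to check which rows matched are assumptions added for illustration.

```python
import pandas as pd

n = 2  # assumed number of meaningful decimal places in col2

# Hypothetical sample frames; B's col2 values carry tiny floating-point noise.
A = pd.DataFrame({"col1": [1, 2],
                  "col2": [34.32, 10.14],
                  "sub": ["x", "z"]})
B = pd.DataFrame({"col1": [1, 2, 1],
                  "col2": [34.32 + 1e-9, 10.14 - 1e-9, 54.32],
                  "group": ["t", "q", "s"]})

# Round copies of the key column so the original frames are left untouched.
A_r = A.assign(col2=A["col2"].round(n))
B_r = B.assign(col2=B["col2"].round(n))

# indicator=True adds a _merge column: 'both', 'left_only' or 'right_only'.
df = A_r.merge(B_r, how="outer", on=["col1", "col2"], indicator=True)
# Sort explicitly so the row order does not depend on the pandas version.
df = df.sort_values(["col1", "col2"]).reset_index(drop=True)
print(df)
```

The `_merge` column makes it easy to spot keys that still fail to match after rounding, which can happen when two values sit on opposite sides of a rounding boundary.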
