merge.ffdf incorrect result when multiple columns match in R

I have an ff dataframe outgoing_windows_ff:

        edge      ipaddr port protocol windowed_qd class
1 1182430570 41.2.194.42 1299        1           0   WEB
2 1182430570 41.2.194.42 1302        1           0   WEB

I want to find relations among its rows, so I made a copy of that dataframe with some columns renamed:

outgoing_windows_ff_1 <- ffdf(edge=outgoing_windows_ff$edge,
                              ipaddr=outgoing_windows_ff$ipaddr,
                              influencing_port=outgoing_windows_ff$port,
                              influencing_proto=outgoing_windows_ff$protocol,
                              influencing_class=outgoing_windows_ff$class)

and then merge the 2 dataframes:

merged <- merge(x=outgoing_windows_ff, y=outgoing_windows_ff_1, 
                        by.x=c('edge','ipaddr'),by.y=c('edge','ipaddr') )

The result is:

        edge      ipaddr port protocol windowed_qd class influencing_port
1 1182430570 41.2.194.42 1299        1           0   WEB             1299
2 1182430570 41.2.194.42 1302        1           0   WEB             1299

But this is WRONG: I would expect 4 rows in the result, one for each pairing of rows that share the same edge and ipaddr.

Doing the merge between normal dataframes:

merged <- merge(x=as.data.frame(outgoing_windows_ff), 
                        y=as.data.frame(outgoing_windows_ff_1), 
                        by.x=c('edge','ipaddr'),by.y=c('edge','ipaddr') )

I get the correct result:

        edge      ipaddr port protocol windowed_qd class influencing_port influencing_proto
1 1182430570 41.2.194.42 1299        1           0   WEB             1299                 1
2 1182430570 41.2.194.42 1299        1           0   WEB             1302                 1
3 1182430570 41.2.194.42 1302        1           0   WEB             1299                 1
4 1182430570 41.2.194.42 1302        1           0   WEB             1302                 1

I think it is really DANGEROUS that the same operation gives two different results depending on whether ff dataframes or "normal" dataframes are used. This can silently poison results without the experimenter knowing about it. My worry is that other results I obtained with the ff package may also be wrong without my having noticed.
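One way to catch this kind of silent discrepancy is to compute, from the key frequencies alone, how many rows an inner join should return, and compare that with the row count of the merged result. The following is a minimal base R sketch on plain data frames that mirror the example; `expected_inner_rows` is a hypothetical helper written for illustration, not part of ff or ffbase.

```r
# Toy data frames mirroring the example above (plain data.frames, not ffdf).
x <- data.frame(edge = c(1182430570, 1182430570),
                ipaddr = c("41.2.194.42", "41.2.194.42"),
                port = c(1299, 1302),
                stringsAsFactors = FALSE)
y <- data.frame(edge = x$edge,
                ipaddr = x$ipaddr,
                influencing_port = x$port,
                stringsAsFactors = FALSE)

# Hypothetical helper: number of rows an inner join on `by` should return.
# For every key present in both tables, the join yields
# (occurrences in x) * (occurrences in y) rows.
expected_inner_rows <- function(x, y, by) {
  kx <- table(do.call(paste, c(x[by], sep = "\r")))
  ky <- table(do.call(paste, c(y[by], sep = "\r")))
  common <- intersect(names(kx), names(ky))
  sum(as.numeric(kx[common]) * as.numeric(ky[common]))
}

n_expected <- expected_inner_rows(x, y, by = c("edge", "ipaddr"))
n_base <- nrow(merge(x, y, by = c("edge", "ipaddr")))
print(c(expected = n_expected, base_merge = n_base))
```

If a merge result has fewer rows than `n_expected`, some matches were dropped. Note that the helper itself builds its keys with paste, so it is only meant for keys that survive conversion to character unchanged.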

Have you read the documentation of merge.ffdf from package ffbase, which is the function you are using?

It says:

This method is similar as merge in the base package but only allows inner and left outer joins. Mark that joining is done based on ffmatch or ffdfmatch, meaning that only the *first* element in y will be added to x, and ffdfmatch works on paste-ing together a key. So this might not be suited if your key contains columns of vmode double.

Mark the word 'first' in the documentation. Your merge needs every matching row of y for each row of x (a many-to-many join on duplicated keys), which merge.ffdf does not support: it only performs inner and left outer joins that take the first match in y. Also mark that it pastes together a key.

If you need a join that returns all matches, feel free to contribute code that does this on ff objects to the GitHub repository of ffbase: https://github.com/edwindj/ffbase
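The "first match only" behaviour can be imitated in base R with paste and match. The sketch below is an emulation on plain data frames under that assumption, not the actual ffbase implementation, and it reproduces the two-row result from the question.

```r
# Plain data.frames standing in for the two ffdf objects.
x <- data.frame(edge = c(1182430570, 1182430570),
                ipaddr = c("41.2.194.42", "41.2.194.42"),
                port = c(1299, 1302),
                stringsAsFactors = FALSE)
y <- data.frame(edge = x$edge,
                ipaddr = x$ipaddr,
                influencing_port = x$port,
                stringsAsFactors = FALSE)

# Build one character key per row by pasting the key columns together.
key_x <- paste(x$edge, x$ipaddr)
key_y <- paste(y$edge, y$ipaddr)

# match() returns, for each key in x, the position of its FIRST
# occurrence in y. Both rows of x share a key, so both point at row 1.
first_match <- match(key_x, key_y)

# Attaching the matched y column keeps one row per row of x.
first_only <- cbind(x, influencing_port = y$influencing_port[first_match])
print(first_only)
```

Because each row of x receives at most one row of y, the result can never have more rows than x, whereas base merge pairs every row of x with every matching row of y.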
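Until such code exists, one possible workaround is to process x in chunks and run base merge on each chunk, which is sketched below in base R. It assumes that y fits in RAM as an ordinary data frame; `chunked_merge` and `chunk_size` are hypothetical names, and the step of pulling each chunk out of an ffdf and appending the pieces back to disk is left out.

```r
# Hypothetical helper: merge x against y one block of rows at a time.
# Each block is joined with base merge(), so all matches are kept.
chunked_merge <- function(x, y, by, chunk_size = 1000L) {
  if (nrow(x) == 0L) {
    return(merge(x, y, by = by))
  }
  starts <- seq(1L, nrow(x), by = chunk_size)
  pieces <- lapply(starts, function(s) {
    rows <- s:min(s + chunk_size - 1L, nrow(x))
    merge(x[rows, , drop = FALSE], y, by = by)
  })
  do.call(rbind, pieces)
}

x <- data.frame(edge = c(1182430570, 1182430570),
                ipaddr = c("41.2.194.42", "41.2.194.42"),
                port = c(1299, 1302),
                stringsAsFactors = FALSE)
y <- data.frame(edge = x$edge,
                ipaddr = x$ipaddr,
                influencing_port = x$port,
                stringsAsFactors = FALSE)

# chunk_size = 1 forces two separate merges on this toy input.
merged <- chunked_merge(x, y, by = c("edge", "ipaddr"), chunk_size = 1L)
print(merged)
```

An inner join distributes over row blocks of x, so the stacked pieces contain the same rows as a single merge, although possibly in a different order.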
