
Find matching and similar rows between two files in R

I have two large files whose contents look like this:

df1

[image: screenshot of df1]

df2

[image: screenshot of df2]

Output of dput:

df1

structure(list(X00.00.location.long. = structure(c(1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 
6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 
6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 
6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 7L, 7L, 7L, 7L, 
7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 
7L, 7L, 7L, 7L), .Label = c("00:00,location,long|", "00:00,location,long|00:00,location,same|", 
"00:00,location,long|00:00,runapps,com.sol.sviewcall|00:00,screen,OFF|", 
"00:00,location,long|00:00,wifi,dlink, PATECH-AP|00:00,runapps,com.kakao.talk|00:00,screen,OFF|", 
"00:00,location,long|00:00,wifi,dlink, PATECH-AP|00:00,wifi,dlink, iptime|00:00,wifi,dlink|", 
"00:00,location,long|00:00,wifi,dlink|", "00:00,location,long|00:00,wifi,dlink|00:00,location,same|00:00,wifi,dlink, iptime|"
), class = "factor")), .Names = "X00.00.location.long.", class = "data.frame", row.names = c(NA, 
-183L))

df2

structure(list(X00.00.location.long. = structure(c(1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 
5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 
6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 7L, 7L, 7L, 7L, 7L, 
7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 
7L, 7L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 
8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 
9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 10L, 
10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 
10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L), .Label = c("00:00,location,long|", 
"00:00,location,long|00:00,bluetooth,SCH-W860(35**)|00:00,wifi,dlink, iptime|", 
"00:00,location,long|00:00,bluetooth,SCH-W860(35**)|00:00,wifi,dlink, SK_WiFi26C4, U+zone, U+Net642B|", 
"00:00,location,long|00:00,wifi,dlink, SK_WiFi26C4|", "01:00,location,long|", 
"01:00,location,long|01:00,bluetooth,SCH-W860(35**)|01:00,screen,OFF|01:00,runapps,com.kakao.talk|", 
"01:00,location,long|01:00,bluetooth,SCH-W860(35**)|01:00,wifi,dlink, iptime, SK_WiFi26C4|01:00,wifi,dlink, iptime, PISnet_4D9740|01:00,wifi,dlink, iptime, SK_WiFi26C4, KT_WLAN_BBE3|01:00,runapps,com.buzzpia.aqua.launcher|01:00,screen,OFF|", 
"01:00,location,long|01:00,screen,OFF|", "02:00,location,long|02:00,wifi,dlink, iptime, SK_WiFi26C4|02:00,wifi,dlink, iptime, SK_WiFi26C4, KT_WLAN_BBE3|02:00,wifi,dlink, iptime, KT_WLAN_BBE3|02:00,runapps,com.kakao.talk|02:00,screen,OFF|", 
"02:00,location,long|02:00,wifi,dlink, iptime|02:00,runapps,com.buzzpia.aqua.launcher|02:00,runapps,com.android.mms|02:00,screen,OFF|"
), class = "factor")), .Names = "X00.00.location.long.", class = "data.frame", row.names = c(NA, 
-232L))

My questions are:

  1. I want to know the percentage of matching rows, that is, how many rows contain exactly the same data in df1 and df2.

  2. I want to know the percentage of similar rows. Each row consists of several items such as "00:,location,long", separated by the delimiter "|". If a row in df1 and a row in df2 are at least 75% similar, I consider them similar. For example, if a row contains three items, two of which are the same and one of which is different, the rows are similar.

  3. I want to know the percentage of rows that differ between df1 and df2.

So I want to calculate the percentage of matching rows (how many rows in df1 match rows in df2), the percentage of similar rows (how many rows in df1 are similar to rows in df2), and the percentage of different rows (how many rows in df1 differ from rows in df2).

The base data is df1; that is, I want to know how many rows of df2 match, are similar to, or differ from the rows of df1.

I am using R. I have tried, but I am stuck. I hope someone can point me in the right direction.

I guess your question is how to find all rows in df2 that are not in df1, or all rows in df2 that are also in df1. If that is what you mean, you can use the sqldf library:

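library(sqldf)

# Rows of df2 that do not occur in df1 (EXCEPT returns distinct rows)
df2NotIndf1 <- sqldf('SELECT * FROM df2 EXCEPT SELECT * FROM df1')
# Rows of df2 that also occur in df1 (INTERSECT returns distinct rows)
df2Indf1 <- sqldf('SELECT * FROM df2 INTERSECT SELECT * FROM df1')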

Alternatively, you can use dplyr:

library(dplyr)

# Rows of df2 without a matching row in df1
anti_join(df2, df1)
# Rows of df2 with a matching row in df1
semi_join(df2, df1)
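Neither approach returns a percentage directly, and the SQL set operators drop duplicate rows, so counting their output would undercount repeated rows. As a rough sketch in base R, assuming each data frame has a single column and that the percentage is taken over the rows of df2, something like the following could work. The toy data and the names base_rows, other_rows, is_match, pct_match and pct_different are made up for illustration.

```r
# Toy stand-ins for df1 and df2 (hypothetical data, same one-column shape).
df1 <- data.frame(x = c("a|", "a|b|", "a|c|"), stringsAsFactors = FALSE)
df2 <- data.frame(x = c("a|", "a|", "a|d|", "a|b|"), stringsAsFactors = FALSE)

# Compare as character so that differing factor levels do not matter.
base_rows  <- as.character(df1[[1]])
other_rows <- as.character(df2[[1]])

# TRUE for each row of df2 whose full string also occurs somewhere in df1.
is_match <- other_rows %in% base_rows

# Share of df2 rows with an exact counterpart in df1, and the remainder.
pct_match     <- 100 * mean(is_match)
pct_different <- 100 - pct_match

pct_match
pct_different
```

To take df1 as the denominator instead, swap the two vectors in the %in% call.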

For similarity, if you want to measure a similarity score between two strings, you can use the Levenshtein distance (see the details in this link). You can apply this to your data frame.
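Here is a minimal sketch of that idea using adist() from base R (package utils), which computes a generalized Levenshtein distance. The normalization by the longer string, the rule of keeping the best score against any df1 row, and the toy vectors base_rows and other_rows are my own assumptions, not something from the question; only the 75% threshold comes from the question.

```r
# Hypothetical toy data in the same "|"-delimited shape as df1 and df2.
base_rows  <- c("00:00,location,long|",
                "00:00,location,long|00:00,wifi,dlink|")
other_rows <- c("00:00,location,long|",
                "00:00,location,long|00:00,wifi,dlinc|",
                "02:00,screen,OFF|02:00,runapps,com.android.mms|")

# Edit-distance matrix: one row per element of other_rows,
# one column per element of base_rows.
d <- adist(other_rows, base_rows)

# Normalize each distance by the longer of the two strings,
# so that a similarity of 1 means the strings are identical.
max_len <- outer(nchar(other_rows), nchar(base_rows), pmax)
sim <- 1 - d / max_len

# Best similarity each row of other_rows reaches against any base row.
best <- apply(sim, 1, max)

# Classify with the 75% threshold from the question.
label <- ifelse(best == 1, "match",
         ifelse(best >= 0.75, "similar", "different"))

# Percentage of rows in each class.
round(100 * prop.table(table(factor(label,
      levels = c("match", "similar", "different")))), 1)
```

On the real data you would pass as.character(df2[[1]]) and unique(as.character(df1[[1]])) instead of the toy vectors. Note that this scores character-level edits; if similarity should be counted per "|"-separated item, as the question describes, you would split each row with strsplit(x, "|", fixed = TRUE) and compare the resulting items instead.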
