Group a data frame by a column and match by n elements

Question

So here is my problem. I have two data frames. Simplified versions of them are shown below.

df1
ID         String
1.1        a
1.1        a
1.1        b
1.1        c
...
1.2        a 
1.2        a
1.2        c
1.2        c
...
2.1        a
2.1        n
2.1        o
2.1        o
...
2.2        a
2.2        n
2.2        n
2.2        o
...
3.1        a
3.1        a
3.1        x
3.1        x
...
3.2        a
3.2        x
3.2        a
3.2        x
...
4.1        a
4.1        b
4.1        o
4.1        o
... 
4.2        a
4.2        b
4.2        b
4.2        o

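Imagine that each ID (e.g. 1.1) has more than 1000 rows. Another thing to note is that IDs sharing the same leading number (e.g. 1.1 and 1.2) are very similar to each other, but they are not exact matches.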

df2
string2
a
b
a
c

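df2 is the test df.

I want to see which df1 ID is the closest to df2. But I have one very important condition: I want to match n elements at a time, not the whole data frame against the other.

My pseudocode for this: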

df2-elements-to-match <- df2$string2[1:n] #only the first n elements

group df1 by ID

df1-elements-to-match <- df1$String[1:n of every ID] #only the first n elements of each ID

Output a column with score of how many matches. 

Filter df1 to remove ID groups with < m score. #m here could be any number. 

Filtered df1 becomes new df1. 

n <- n+1 

df2-elements-to-match and df1-elements-to-match both slide down to the next n elements. Overlap is optional. (ex: if first was 1:2, then 3:4 or even 2:3 and then 3:4)

Reiterate loop with updated variables

If one ID remains stop loop.

The idea here is to get a predicted match without having to match the entire test data frame.

Answer 1

## minimal dfs
df1 <- data.frame(ID=c(rep(1.1, 5),
                       rep(1.2, 6),
                       rep(1.3, 3)),
                  str=unlist(strsplit("aabaaaabcababc", "")), stringsAsFactors=FALSE)

df2 <- data.frame(str=c("a", "b", "a", "b"), stringsAsFactors=FALSE)


## functions

# count the positions at which the two columns differ
distance <- function(df, query.df, df.col, query.df.col) {
  deviating <- df[, df.col] != query.df[, query.df.col]
  sum(deviating, na.rm=TRUE) # if too few rows, there will be NA, ignore NA
}

distances <- function(dfs, query.df, dfs.col, query.df.col) {
  sapply(dfs, function(df) distance(df, query.df, dfs.col, query.df.col))
}

orderedDistances <- function(dfs, query.df, dfs.col, query.df.col) {
  dists <- distances(dfs, query.df, dfs.col, query.df.col)
  sort(dists)
}

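orderByDistance <- function(dfs, query.df, dfs.col, query.df.col, dfs.split.col) {
  # one sub-df per value of the split column (here: per ID)
  dfs.split <- split(dfs, dfs[, dfs.split.col])
  # keep the first N rows of each sub-df, N = nrow(query.df); short sub-dfs are padded with NA rows
  dfs.split.N <- lapply(dfs.split, function(df) df[seq_len(nrow(query.df)), ])
  orderedDistances(dfs.split.N, query.df, dfs.col, query.df.col)
}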


orderByDistance(df1, df2, "str", "str", "ID")
# 1.3 1.1 1.2 
#   1   3   3 

# 1.3 is the closest to df2!

Your problem is something like a distance problem. Minimizing the distance = finding the most similar sequence.

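The kind of distance I show here compares equivalent positions in df2 and in a sub-df of df1: a deviation counts as 1 and equality counts as 0. The sum gives a dissimilarity score between the compared data frames (sequences of strings).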
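The position-wise count described above is essentially a Hamming distance between two sequences. As a minimal sketch (the helper name mismatch_count and its na_as_mismatch argument are made up for illustration and are not part of the answer's code), the following also shows what ignoring NA does when one sequence is shorter than the other:

```r
# Hypothetical helper: position-wise mismatch count between two character vectors.
mismatch_count <- function(x, y, na_as_mismatch = FALSE) {
  n <- max(length(x), length(y))
  # Pad the shorter vector with NA so that positions line up.
  length(x) <- n
  length(y) <- n
  deviating <- x != y
  if (na_as_mismatch) {
    # Assumption: treat a missing position as a deviation instead of ignoring it.
    deviating[is.na(deviating)] <- TRUE
  }
  sum(deviating, na.rm = TRUE)
}

query <- c("a", "b", "a", "b")

mismatch_count(c("a", "a", "b", "a"), query)                    # 3
mismatch_count(c("a", "b", "c"), query)                         # 1 (4th position ignored)
mismatch_count(c("a", "b", "c"), query, na_as_mismatch = TRUE)  # 2
```

Ignoring NA favors short groups, because missing positions cost nothing. Whether that is desirable depends on the data.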

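orderByDistance takes dfs (df1) and the query df (df2), together with the columns to be compared and the column by which dfs should be split (here "ID"). First it splits dfs, then it collects the first N rows of each sub-df (in preparation for the comparison), and finally it applies orderedDistances to the sub-dfs, each of which is guaranteed to have N rows (N = the number of rows of the query df).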
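The question also asks for matching in windows of n elements, dropping IDs that score too low after each window. One way this might be layered on top of the same position-wise comparison is sketched below. The function name narrow_candidates and the parameters n, step and min_matches are assumptions for illustration. The score is the number of matching positions in the window (window length minus the distance), the window size stays fixed, and positions missing from a short group count as non-matches.

```r
# Minimal data, same as in the answer above.
df1 <- data.frame(ID = c(rep(1.1, 5), rep(1.2, 6), rep(1.3, 3)),
                  str = unlist(strsplit("aabaaaabcababc", "")),
                  stringsAsFactors = FALSE)
df2 <- data.frame(str = c("a", "b", "a", "b"), stringsAsFactors = FALSE)

# Hypothetical helper: narrow down the candidate IDs window by window.
narrow_candidates <- function(df, query, n = 2, step = n, min_matches = n,
                              id_col = "ID", str_col = "str") {
  # One character vector per ID.
  groups <- split(as.character(df[[str_col]]), df[[id_col]])
  start <- 1
  while (length(groups) > 1 && start <= length(query)) {
    idx <- start:min(start + n - 1, length(query))
    # Number of matching positions per ID inside the current window;
    # positions past the end of a group are NA and do not count as matches.
    score <- vapply(groups,
                    function(s) sum(s[idx] == query[idx], na.rm = TRUE),
                    integer(1))
    keep <- score >= min_matches
    if (!any(keep)) break      # nothing would survive: keep the current candidates
    groups <- groups[keep]
    start <- start + step      # step < n gives overlapping windows
  }
  names(groups)
}

result <- narrow_candidates(df1, df2$str, n = 2, min_matches = 2)
print(result)
# [1] "1.3"
```

Here only 1.3 matches both elements of the first window ("a", "b"), so the loop stops after one pass without looking at the rest of df2.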

Solution 1
0 2018-10-18 09:13:08
