Mapping over multiple arguments

Question

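I have a large dataset of about 20 million observations. For each row, I want to compute the Jaccard index between TitleAbstract.x1 and TitleAbstract.y1.

Here is a sample of 2 observations: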

    structure(list(Patent = c(6326004L, 6514936L), TitleAbstract.x = c("mechanical multiplier purpose speed steering control hydrostatic system invention concerned improvement control system hydrostatic drive vehicle comprising pair hydrostatic pumps output adjustable moving arm attached servo valve controlling displacement said pumps, pump powering respective hydraulic motor drives respective ground engaging means said vehicle. improvement present invention mechanically controls speed steering functions system. comprises pair adjusting means, one communicating pumps, comprising frame adjacent pump, first crank mounted centrally frame, first end first crank drivingly linked arm; second crank mounted centrally frame, first end second crank drivingly linked second end first crank third crank mounted centrally frame, first end third crank drivingly linked second end first crank second end third crank drivingly linked steering linkage means. improved arrangement includes tying means drivingly mounted adjacent second end second cranks linking movement thereof.", 
"mechanical multiplier purpose speed steering control hydrostatic system invention concerned improvement control system hydrostatic drive vehicle comprising pair hydrostatic pumps output adjustable moving arm attached servo valve controlling displacement said pumps, pump powering respective hydraulic motor drives respective ground engaging means said vehicle. improvement present invention mechanically controls speed steering functions system. comprises pair adjusting means, one communicating pumps, comprising frame adjacent pump, first crank mounted centrally frame, first end first crank drivingly linked arm; second crank mounted centrally frame, first end second crank drivingly linked second end first crank third crank mounted centrally frame, first end third crank drivingly linked second end first crank second end third crank drivingly linked steering linkage means. improved arrangement includes tying means drivingly mounted adjacent second end second cranks linking movement thereof."
), cited = c(4261928L, 4261928L), TitleAbstract.y = c("antiviral methods using fragments human rhinovirus receptor (icam-1) ", 
"antiviral methods using human rhinovirus receptor (icam-1) method substantially inhibiting initiation spread infection rhinovirus coxsackie virus host cells expressing major human rhinovirus receptor (icam-1), comprising step contacting virus soluble polypeptide comprising hrv binding site domains ii icam-1; polypeptide capable binding virus reducing infectivity thereof; contact conditions permit virus bind polypeptide."
), Jaccard = c(0, 0.00909090909090909)), row.names = c(NA, -2L
), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x7f9c8f801778>, sorted = "cited", .Names = c("Patent", 
"TitleAbstract.x", "cited", "TitleAbstract.y", "Jaccard"))

In a previous post I used a hand-written formula to compute the Jaccard index. I wrapped it in a function and ran it with mapply, but I get the error 'this is not a function'.

Jaccard_Index <- function(x,y)
{
  return(mapply(length(intersect(unlist(strsplit(df$TitleAbstract.x1, "\\s+")),unlist(strsplit(df$TitleAbstract.y1, "\\s+")))) / length(union(unlist(strsplit(df$TitleAbstract.x1, "\\s+")),unlist(strsplit(df$TitleAbstract.y1, "\\s+"))))))
}

mapply(Jaccard_Index,df$TitleAbstract.x1,df$TitleAbstract.y1)

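I tried replacing TitleAbstract.x1 and TitleAbstract.y1 with x and y, but I still get the same error.

This may be a newbie question, but could someone help me write the function correctly?

I also have two further questions:

Q2: How can I use parallel and mcapply to speed this up?

Q3: What are R's limits in terms of memory and speed? Would you recommend a different approach (for example, Python run from bash) for long-running, memory-hungry jobs?

Edit

I have uploaded the correct dataset. I had to update RStudio to stop the dataset from being truncated.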

Answer 1

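I simplified your dataset. You could use stringdist() from the package of the same name, but it does not compute a word-level Jaccard index, so I rewrote Jaccard_Index(). The code below uses mapply(); to parallelise it, simply replace that call with mcmapply().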
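For context on the 'not a function' error in the question: the function there calls mapply() again internally, and the first argument of that inner call is the result of a division (a number), not a function. mapply() expects a function as its first argument, followed by the vectors to iterate over. The following is a minimal sketch of that behaviour using made-up inputs; the names err and ok are purely illustrative.

```r
# Passing a number where mapply() expects a function raises an error.
# tryCatch() captures the message so the script keeps running.
err <- tryCatch(
  mapply(1 / 2),
  error = function(e) conditionMessage(e)
)
print(err)

# Passing a function first, then the vectors to walk over in parallel, works.
ok <- mapply(function(x, y) x + y, 1:3, 4:6)
print(ok)
```

In other words, the element-wise function should only handle a single pair (x, y), and the one outer mapply() call does the iteration.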

df <- data.frame(
Patent=1:3, 
TitleAbstract.x1=c(
"methods testing oligonucleotide arrays methods testing oligonucleotide",
"isolation cellular material microscopic visualization method microdissection",
"support method determining analyte method producing support method producing"), 
TitleAbstract.y1=c(
"support method determining analyte method producing support method producing",
"method utilizing convex geometry laser capture microdissection process",
"methods testing oligonucleotide arrays methods testing oligonucleotide"),
stringsAsFactors=FALSE)


Jaccard_Index <- function(x, y) {
    if (length(x) == 1) {
        x <- strsplit(x, "\\s+")[[1]]
    }
    if (length(y) == 1) {
        y <- strsplit(y, "\\s+")[[1]]
    }
    length(intersect(x, y)) / length(union(x, y))
}

# Splitting the strings once, outside the loop, appears to be quicker
df$TitleAbstract.x1 <- strsplit(df$TitleAbstract.x1, "\\s+")
df$TitleAbstract.y1 <- strsplit(df$TitleAbstract.y1, "\\s+")

mapply(Jaccard_Index, df$TitleAbstract.x1, df$TitleAbstract.y1, USE.NAMES=FALSE)
# [1] 0.0000000 0.1538462 0.0000000
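As a rough sketch of the mcmapply() swap mentioned above, the snippet below runs the same kind of word-level Jaccard computation on forked workers. The toy strings, the helper name jaccard_tokens, and the choice of 2 cores are assumptions for illustration only; they are not taken from the question. Note that mc.cores greater than 1 is not supported on Windows, so the sketch falls back to a single core there.

```r
library(parallel)  # ships with base R

# Word-level Jaccard index for two character vectors of tokens
# (hypothetical helper name, same formula as Jaccard_Index above)
jaccard_tokens <- function(x, y) {
  length(intersect(x, y)) / length(union(x, y))
}

# Toy data, made up for this sketch; split once up front
x_tokens <- strsplit(c("a b c", "a b", "d e f"), "\\s+")
y_tokens <- strsplit(c("b c d", "a b", "g h"), "\\s+")

# Forking is unavailable on Windows, so use one core there
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

res <- mcmapply(jaccard_tokens, x_tokens, y_tokens,
                mc.cores = n_cores, USE.NAMES = FALSE)
print(res)
# [1] 0.5 1.0 0.0
```

With 20 million rows, the right number of cores and whether the forking overhead pays off will depend on the machine, so it is worth timing both versions on a subset first.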
