Speeding up the calculation of the proportion of the most frequently occurring digits in a string in R

Question

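I need help speeding up a function that calculates the proportion of repeated digits in a string, ignoring any non-digits. The function helps identify bogus entries from users before running any check-digit validation (when one is available). Think of fake phone numbers, fake student numbers, fake check account numbers, fake credit card numbers, fake identifiers of any kind, and so on.

The function is a generalization of the one in this post.

Here is what it does. For a specified number of most frequently occurring digits, it calculates the ratio of the count of those top digits to the count of all digits in the string, ignoring all non-digits. If the string contains no digits, it returns 1.0. All calculations are done on a vector of strings.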

library(microbenchmark)
V = c('(12) 1221-12121,one-twoooooooooo', 'twos:22-222222222', '34-11111111, ext.123', 
        '01012', '123-456-789 valid', 'no digits', '', NaN, NA)

Fake_Similarity = function(V, TopNDigits) {
    vapply(V, function(v) {
        # count each byte value, keep the counts for '0'..'9' (ASCII 48..57),
        # and sort them from most to least frequent
        freq = sort(tabulate(as.integer(charToRaw(v)))[48:57], decreasing = TRUE)
        # share of all digits taken up by the TopNDigits most frequent ones
        ratio = sum(freq[1:TopNDigits], na.rm = TRUE) / sum(freq, na.rm = TRUE)
        # 0 / 0 means the string has no digits
        if (is.nan(ratio)) ratio = 1
        ratio
    },
    double(1))
}

t(rbind(Top1Digit = Fake_Similarity(V, 1), Top2Digits = Fake_Similarity(V, 2), Top3Digits = Fake_Similarity(V, 3)))

microbenchmark(Fake_Similarity(V, 2))

And the output. The labels are not important, but the order of the ratios must match the original order of the corresponding strings.

                                 Top1Digit Top2Digits Top3Digits
(12) 1221-12121,one-twoooooooooo 0.5454545  1.0000000  1.0000000
twos:22-222222222                1.0000000  1.0000000  1.0000000
34-11111111, ext.123             0.6923077  0.8461538  0.9230769
01012                            0.4000000  0.8000000  1.0000000
123-456-789 valid                0.1111111  0.2222222  0.3333333
no digits                        1.0000000  1.0000000  1.0000000
                                 1.0000000  1.0000000  1.0000000
NaN                              1.0000000  1.0000000  1.0000000
<NA>                             1.0000000  1.0000000  1.0000000
Unit: milliseconds
                  expr      min       lq     mean   median       uq      max neval
 Fake_Similarity(V, 2) 1.225418 1.283113 1.305139 1.292755 1.304262 1.769703   100

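For example, twos:22-222222222 has 11 digits, and all of them are the same. So for Top1Digit we have 11 / 11 = 1, for Top2Digits we have (11 + 0) / 11 = 1, and so on. In other words, this is a fake number by any measure. It is very unlikely that someone's phone number, area code included, consists of one repeated digit.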

Answer 1

You can use this Rcpp function:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
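double prop_top_digit(const RawVector& x, int top_n_digits) {

  // keep top_n_digits within the ten digit slots, otherwise the
  // iterator arithmetic below would run past the digit range
  top_n_digits = std::max(0, std::min(top_n_digits, 10));

  // count occurrences of each byte value (stored as negative numbers)
  IntegerVector counts(256);
  RawVector::const_iterator it;
  for(it = x.begin(); it != x.end(); ++it) counts[*it]--;

  // partially sort first top_n_digits (negative -> decreasing)
  IntegerVector::iterator it2 = counts.begin() + 48, it3;
  std::partial_sort(it2, it2 + top_n_digits, it2 + 10);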

  // sum the first digits
  int top = 0;
  for(it3 = it2; it3 != (it2 + top_n_digits); ++it3) top += *it3;

  // add the rest -> sum all
  int div = top;
  for(; it3 != (it2 + 10); ++it3) div += *it3;

  // return the proportion
  return div == 0 ? 1 : top / (double)div;
}

Verification:

Fake_Similarity2 <- function(V, TopNDigits) {
  vapply(V, function(v) prop_top_digit(charToRaw(v), TopNDigits), 1)
}
t(rbind(Top1Digit = Fake_Similarity2(V, 1), 
        Top2Digits = Fake_Similarity2(V, 2), 
        Top3Digits = Fake_Similarity2(V, 3)))
                                 Top1Digit Top2Digits Top3Digits
(12) 1221-12121,one-twoooooooooo 0.5454545  1.0000000  1.0000000
twos:22-222222222                1.0000000  1.0000000  1.0000000
34-11111111, ext.123             0.6923077  0.8461538  0.9230769
01012                            0.4000000  0.8000000  1.0000000
123-456-789 valid                0.1111111  0.2222222  0.3333333
no digits                        1.0000000  1.0000000  1.0000000
                                 1.0000000  1.0000000  1.0000000
NaN                              1.0000000  1.0000000  1.0000000
<NA>                             1.0000000  1.0000000  1.0000000

Benchmark:

microbenchmark(Fake_Similarity(V, 2), Fake_Similarity2(V, 2))
Unit: microseconds
                   expr     min       lq      mean   median      uq     max neval cld
  Fake_Similarity(V, 2) 298.972 306.0905 328.69384 312.5465 328.108 600.924   100   b
 Fake_Similarity2(V, 2)  25.163  27.1495  30.18863  29.1350  30.460  52.975   100  a

Answer 2

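This may not be able to compete with the Rcpp solution, but I think it gives a good efficiency boost. The key point of this implementation is to run the algorithm once for all N, rather than once per N. This means we only need to call charToRaw once per string instead of once per string per N, and likewise for the sorting, the tabulating, and so on. We can then use the optimized functions colCumsums (from matrixStats) and colSums to calculate all the ratios at once.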

library(matrixStats)
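Fake_Similarity3 = function(V, N) {
    # one column per string: the digit counts, most frequent first
    freq = vapply(V, function(v) {
        s = sort(tabulate(as.integer(charToRaw(v)))[48:57], decreasing = TRUE)
        length(s) = 10  # pad with NA up to ten entries
        return(s)
    }, FUN.VALUE = integer(10), USE.NAMES = FALSE)
    # running totals down each column give the top-1, top-2, ... counts
    cumfreq = colCumsums(freq)
    ratio = t(cumfreq) / (colSums(freq, na.rm = TRUE))
    # non-finite entries (NA padding, or 0 / 0 when there are no digits)
    # and zeros become 1
    ratio[!is.finite(ratio) | ratio == 0] = 1
    return(ratio[, N, drop = FALSE])
}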

With this function, instead of calling it with the arguments (V, 1), (V, 2) and (V, 3), we call it just once with (V, 1:3):

 #           [,1]      [,2]      [,3]
 # [1,] 0.5454545 1.0000000 1.0000000
 # [2,] 1.0000000 1.0000000 1.0000000
 # [3,] 0.6923077 0.8461538 0.9230769
 # [4,] 0.4000000 0.8000000 1.0000000
 # [5,] 0.1111111 0.2222222 0.3333333
 # [6,] 1.0000000 1.0000000 1.0000000
 # [7,] 1.0000000 1.0000000 1.0000000
 # [8,] 1.0000000 1.0000000 1.0000000
 # [9,] 1.0000000 1.0000000 1.0000000


microbenchmark::microbenchmark(
    FS1 = t(rbind(Top1Digit = Fake_Similarity(V, 1), Top2Digits = Fake_Similarity(V, 2), Top3Digits = Fake_Similarity(V, 3))),
    FS3 = Fake_Similarity3(V, 1:3)
)

# Unit: microseconds
#  expr     min      lq     mean   median        uq      max neval cld
#   FS1 896.336 958.490 1103.260 1011.800 1145.0125 2494.136   100   b
#   FS3 311.798 336.853  399.983  358.979  408.0855  886.013   100  a

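So for the top 1, 2 and 3 digits together, this is roughly 3 times faster than the original (median of about 359 versus 1012 microseconds). The more values of N you need, the larger the gain relative to the original.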

Answer 1: score 3, accepted, 2017-12-05 22:52:07

Answer 2: score 1, 2017-12-05 23:43:32
