A faster alternative to file.exists()

Question

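I maintain an R package that needs to check for the existence of a large number of small files, one at a time. Repeated calls to file.exists() cause a noticeable slowdown (benchmark results here). Unfortunately, constraints of the situation prevent me from calling file.exists() once on the whole batch of files in a vectorized way, which I believe would be much faster. Is there a faster way to check whether a single file exists? Maybe in C? This approach does not seem to be any faster on my system (the same one that produced those benchmarks):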

library(inline)
library(microbenchmark)

body <- "
  FILE *fp = fopen(CHAR(STRING_ELT(r_path, 0)), \"r\");
  SEXP result = PROTECT(allocVector(INTSXP, 1));
  INTEGER(result)[0] = fp == NULL? 0 : 1;
  UNPROTECT(1);
  return result;
"

file_exists_c <- cfunction(sig = signature(r_path = "character"), body = body)

tmp <- tempfile()

microbenchmark(
  c = file_exists_c(tmp),
  r = file.exists(tmp)
)
#> Unit: microseconds
#>  expr   min     lq    mean median     uq    max neval
#>     c 4.912 5.0230 5.42443 5.0605 5.1240 25.264   100
#>     r 3.972 4.0525 4.32615 4.1835 4.2675 11.750   100

file.create(tmp)
#> [1] TRUE

microbenchmark(
  c = file_exists_c(tmp),
  r = file.exists(tmp)
)
#> Unit: microseconds
#>  expr    min      lq     mean  median      uq    max neval
#>     c 16.212 16.6245 17.04727 16.7645 16.9860 32.207   100
#>     r  6.242  6.4175  7.16057  7.2830  7.4605 26.781   100

Created on 2019-12-06 by the reprex package (v0.3.0)

Edit: `access()`

access() does seem to be faster, but not by much.

library(inline)
library(microbenchmark)

body <- "
  SEXP result = PROTECT(allocVector(INTSXP, 1));
  INTEGER(result)[0] = access(CHAR(STRING_ELT(r_path, 0)), 0)? 0 : 1;
  UNPROTECT(1);
  return result;
"

file_exists_c <- cfunction(
  sig = signature(r_path = "character"),
  body = body,
  includes = "#include <unistd.h>"
)

tmp <- tempfile()

microbenchmark(
  c = file_exists_c(tmp),
  r = file.exists(tmp)
)
#> Unit: microseconds
#>  expr   min    lq    mean median     uq    max neval
#>     c 1.033 1.048 1.21334 1.0745 1.0910 13.793   100
#>     r 1.051 1.068 1.19280 1.0930 1.1175 10.048   100

file.create(tmp)
#> [1] TRUE

microbenchmark(
  c = file_exists_c(tmp),
  r = file.exists(tmp)
)
#> Unit: microseconds
#>  expr   min     lq    mean median     uq    max neval
#>     c 1.073 1.0910 1.33543 1.1285 1.1500 16.676   100
#>     r 1.172 1.1965 1.32934 1.2335 1.2695  9.916   100

Created on 2019-12-07 by the reprex package (v0.3.0)

Answer 1

Here is the entire file.exists source code (at the time of writing):

https://github.com/wch/r-source/blob/bfe73ecd848198cb9b68427cec7e70c40f96bd72/src/main/platform.c#L1375-L1404

SEXP attribute_hidden do_fileexists(SEXP call, SEXP op, SEXP args, SEXP rho)
{
    SEXP file, ans;
    int i, nfile;
    checkArity(op, args);
    if (!isString(file = CAR(args)))
        error(_("invalid '%s' argument"), "file");
    nfile = LENGTH(file);
    ans = PROTECT(allocVector(LGLSXP, nfile));
    for (i = 0; i < nfile; i++) {
        LOGICAL(ans)[i] = 0;
        if (STRING_ELT(file, i) != NA_STRING) {
#ifdef Win32
            /* Package XML sends arbitrarily long strings to file.exists! */
            size_t len = strlen(CHAR(STRING_ELT(file, i)));
            if (len > MAX_PATH)
                LOGICAL(ans)[i] = FALSE;
            else
                LOGICAL(ans)[i] =
                    R_WFileExists(filenameToWchar(STRING_ELT(file, i), TRUE));
#else
            // returns NULL if not translatable
            const char *p = translateCharFP2(STRING_ELT(file, i));
            LOGICAL(ans)[i] = p && R_FileExists(p);
#endif
        } else LOGICAL(ans)[i] = FALSE;
    }
    UNPROTECT(1); /* ans */
    return ans;
}

As for R_FileExists, it is here:

https://github.com/wch/r-source/blob/bfe73ecd848198cb9b68427cec7e70c40f96bd72/src/main/sysutils.c#L60-L79

#ifdef Win32
Rboolean R_FileExists(const char *path)
{
    struct _stati64 sb;
    return _stati64(R_ExpandFileName(path), &sb) == 0;
}
#else
Rboolean R_FileExists(const char *path)
{
    struct stat sb;
    return stat(R_ExpandFileName(path), &sb) == 0;
}
#endif

(R_ExpandFileName just does what path.expand does.) It relies on the stat system call:

https://en.wikipedia.org/wiki/Stat_(system_call)

https://pubs.opengroup.org/onlinepubs/007908799/xsh/sysstat.h.html
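
For reference, the non-Windows branch above boils down to a single stat() call. Here is a minimal sketch of that check in plain C, assuming a POSIX system. The name exists_via_stat is made up for illustration (R does not export it), and the sketch skips the R_ExpandFileName() step:

```c
#include <sys/stat.h>

/* Hypothetical helper (not part of R): returns 1 if `path` names an
 * existing file system entry, 0 otherwise. It mirrors the non-Windows
 * branch of R_FileExists, minus the R_ExpandFileName() step. */
int exists_via_stat(const char *path)
{
    struct stat sb;
    /* stat() returns 0 on success and -1 on any failure, so a permission
     * error on a parent directory is also reported as "does not exist". */
    return stat(path, &sb) == 0;
}
```

Calling exists_via_stat("/") should give 1 on a POSIX system. Note that the function cannot tell "missing" apart from "could not check", which is the same behavior as the R source quoted above.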

It is built for vectorized input, so, as already noted, calling file.exists(vector_of_files) once is better than repeatedly calling file.exists(single_file).

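As far as I can tell (admittedly, I am no expert on the system utilities involved), any gain in efficiency would come at the cost of robustness.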
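
To make the robustness point concrete: the C versions in the question pass the raw string straight to fopen() or access(), so they skip the NA check, the encoding translation, and the path expansion that file.exists performs. Below is a rough sketch of just the tilde-expansion part, assuming a POSIX-like environment with HOME set. expand_tilde is a hypothetical helper, and it is deliberately simpler than what R_ExpandFileName may do (for instance, it leaves "~user" paths alone):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copies `path` into `out`, replacing a leading
 * "~" or "~/" with the value of $HOME. Returns 0 on success, -1 if HOME
 * is unset when needed or the result does not fit in `outlen` bytes. */
int expand_tilde(const char *path, char *out, size_t outlen)
{
    const char *home = getenv("HOME");
    int n;

    if (path[0] == '~' && (path[1] == '/' || path[1] == '\0')) {
        if (home == NULL) return -1;
        /* Skip the '~' and append the rest of the path to HOME. */
        n = snprintf(out, outlen, "%s%s", home, path + 1);
    } else {
        /* Anything else (including "~user/...") is copied unchanged. */
        n = snprintf(out, outlen, "%s", path);
    }
    /* snprintf reports the length it wanted; reject truncated results. */
    return (n >= 0 && (size_t)n < outlen) ? 0 : -1;
}
```

A bare access("~/data.csv", 0) looks for a directory literally named "~" in the working directory, because the kernel does not expand tildes. Any faster replacement therefore either has to redo this kind of step or has to require callers to pass already-expanded paths.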

Answer 2

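A simple solution in C is to use access(filename, 0); if the function returns 0, the file exists. The second argument, 0, specifies that only existence is checked. Example: I check for the file test.txt in the /test directory.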

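#include <stdio.h>
#ifdef _WIN32
#include <io.h>      /* access() on Windows */
#else
#include <unistd.h>  /* access() on POSIX systems */
#endif

int main(void)
{
    /* Mode 0 (F_OK on POSIX) tests for existence only. */
    if (!access("/test/test.txt", 0)) printf("file exists\n");
    return 0;
}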
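
One caveat with the snippet above: access() also returns nonzero when the check itself fails (for example, no search permission on a parent directory), not only when the file is missing. A minimal sketch that separates the two cases, assuming a POSIX system (so <unistd.h> rather than io.h); exists_via_access is a made-up name:

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if `path` exists, 0 if it definitely
 * does not, and -1 if the check itself failed (errno is left set). */
int exists_via_access(const char *path)
{
    if (access(path, F_OK) == 0) return 1;
    /* ENOENT: no such entry; ENOTDIR: a path component is not a directory. */
    if (errno == ENOENT || errno == ENOTDIR) return 0;
    return -1;
}
```

Whether the -1 case should be treated as "missing" is a design choice. file.exists itself simply returns FALSE whenever stat() fails, so collapsing -1 to 0 would match its behavior.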
