Calling the agrep .Internal C function from Rcpp

In short: How can I call, from within Rcpp C++ code, the internal agrep C function that gets called when users use the regular agrep function from base R?

In long: I have found multiple questions here about how to invoke, from within Rcpp, a C or C++ function created for another package (e.g. "using C function from other package in Rcpp" and "Rcpp: Call C function from a package within Rcpp").

What I am trying to achieve, however, is at once simpler and far less documented: to call directly, from within Rcpp, a .Internal C function that ships with base R rather than with another package, without going through R (that is, without doing what is described in "Call R functions in Rcpp"). How could I do that for the .Internal C function that lies underneath base R's agrep wrapper?

The specific function I am trying to call here is the agrep internal C function. For context, what I am ultimately trying to achieve is to speed up calls to agrep when millions of patterns must each be checked against each of millions of x targets.

Great question. The long and short of it is "you can't" (in many cases), unless the function is visible in one of the header files in "src/include/". At least not that easily.

Not long ago I had a similar fun challenge, where I tried to get access to the do_docall function (called by do.call), and it is not a simple task. First of all, it is not possible to simply #include <agrep.c> (or something similar). That file isn't available for inclusion, as it is not part of "src/include". It is compiled, and the uncompiled source file is removed (not to mention that one should never "include" a .c file).

If one is willing to go the extra mile, the next step to look at is copying and altering the source code. Basically, find the function in "src/main/agrep.c", copy it into your package, and then fix any errors you find.

Problems with this approach:

  1. As documented in R-exts, the internal structure of sexprec_info is not made public (this is the base structure for all objects in R). Many internal functions use the fields within this structure, so one has to copy the structure into one's own source code to make it visible to that code.
  2. If you ever #include <Rcpp.h> prior to this file, you will need to go through each and every call to internal functions and likely add either an R_ or Rf_ prefix.
  3. The function may contain calls to other "internal" functions, which also need to be copied and altered for it to work.
  4. You will also need a clear understanding of what CDR, CAR and similar do. The internal functions have a documented structure, where the first argument contains the full call passed to the function, and functions like those two are used to access parts of the call. I did myself a solid and rewrote do_docall with a different input format to avoid having to consider this, but that takes time. The alternative is to create a pairlist according to the documentation, set its type to a call-sexp (the exact name escapes me at the moment) and pass the appropriate arguments for op, args and env.
  5. And lastly, if you go through the steps and find that it is necessary to copy the internal structures of sexprec_info (as described later), then you will need to be very careful about when you include Rinternals and Rcpp, as either of these will cause your code to crash and burn in the most beautiful and silent way if you include your header and these in the wrong order! Note that this even goes for [[Rcpp::export]], which may indeed turn out to include them in the wrong, arbitrary order!
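
To make item 4 a little more concrete, here is a toy model in plain C++ of how CAR and CDR walk a pairlist of arguments. It is only a sketch: Cell, make_pairlist and unpack_args are hypothetical names invented for illustration, the cells hold strings rather than R objects, and none of this is R's real SEXPREC layout or API.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Toy cons cell. NOT R's real SEXPREC: just a model of a pairlist node.
struct Cell {
    std::string car;            // value stored in this node
    std::shared_ptr<Cell> cdr;  // rest of the list; nullptr plays the role of R_NilValue
};
using List = std::shared_ptr<Cell>;

// Build a pairlist from a vector by consing from the last element backwards.
List make_pairlist(const std::vector<std::string>& values) {
    List head = nullptr;
    for (auto it = values.rbegin(); it != values.rend(); ++it) {
        head = std::make_shared<Cell>(Cell{*it, head});
    }
    return head;
}

// CAR returns the value of the first node, CDR returns everything after it.
const std::string& CAR(const List& x) { assert(x); return x->car; }
List CDR(const List& x) { assert(x); return x->cdr; }

// Hypothetical stand-in for the body of an internal function:
// the arguments arrive as one pairlist and are peeled off one node at a time.
std::vector<std::string> unpack_args(List args) {
    std::vector<std::string> out;
    for (; args != nullptr; args = CDR(args)) {
        out.push_back(CAR(args));
    }
    return out;
}
```

In this model the second argument is CAR(CDR(args)), the third is CAR(CDR(CDR(args))), and so on, which is the access pattern you will keep running into when reading the copied source.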

If you are willing to go this far down the drain, I would suggest carefully reading adv-R "R's C interface" and Chapters 2, 5 and 6 of R-exts, and maybe even the R Internals manual. Finally, once that is done, take a look at do_docall from src/main/coerce.c and compare it to the implementation in my repository cmdline.arguments/src/utils/{cmd_coerce.h, cmd_coerce.c}. In this version I have:

  1. Added all the internal structures that are not public, so that I can access their unmodified form (unmodified by the current session).
    • This includes the table used to store the currently used SEXPs, which was used as a lookup. This caused a problem, as I can't access the modified version, so my code is slightly altered, with the old code blocked by the macro #if --- defined(CMDLINE_ARGUMENTS_MAYBE_IN_THE_FUTURE). Luckily the code causing the problem had a static answer, so I could work around it (but this might not always be the case).
  2. Added quite a few Rf_ prefixes, as the macro versions are not available (since I #include <Rcpp.h> at some point).
  3. Split the code into smaller functions to make it more readable (for my own sake).
  4. Given the function one additional argument (name), which is not used in the internal function, along with some added errors (for my specific need).

This implementation will be frozen "for all time to come", as I've moved on to another branch (and this one is frozen for my own future benefit, should I ever want to walk down this path again).

I spent a few days scouring the internet for information on this and found two different posts discussing how this could be achieved, and my approach basically copies theirs. Whether this is actually allowed in a CRAN package is a whole other question (and not one that I will be testing).

The same approach applies if you want to use non-public code from other packages, although there it is often as simple as copy-pasting their files into your repository.

As a final side note, you mention that the intent is to "speed up" your code for when you have to perform millions upon millions of calls to agrep. This seems like a case where one should consider performing the task in parallel. Even after going through the steps outlined above, creating N parallel sessions that each take care of K evaluations (say 100,000) would be the first step to reducing computing time. Of course, each session should be given a batch rather than a single call to agrep.
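
The batching idea can be sketched in plain C++ as follows. This is only an illustration under loud assumptions: best_substring_distance is a simple edit-distance stand-in (Sellers' variant of the Levenshtein recurrence), not the matcher agrep actually uses, and make_batches and match_batch are hypothetical helper names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Smallest edit distance between `pattern` and ANY substring of `text`.
// Assumption: a stand-in for approximate matching, not agrep's own algorithm.
std::size_t best_substring_distance(const std::string& pattern, const std::string& text) {
    const std::size_t m = pattern.size();
    std::vector<std::size_t> col(m + 1);
    for (std::size_t i = 0; i <= m; ++i) col[i] = i;  // distances against empty text
    std::size_t best = col[m];
    for (char c : text) {
        std::size_t diag = col[0];  // value diagonally up-left of the current cell
        col[0] = 0;                 // a match may start at any position in text
        for (std::size_t i = 1; i <= m; ++i) {
            const std::size_t left = col[i];  // same row, previous text position
            const std::size_t cost = (pattern[i - 1] == c) ? 0 : 1;
            col[i] = std::min({left + 1, col[i - 1] + 1, diag + cost});
            diag = left;
        }
        best = std::min(best, col[m]);
    }
    return best;
}

// Split [0, n_items) into contiguous half-open ranges of at most batch_size items.
std::vector<std::pair<std::size_t, std::size_t>> make_batches(std::size_t n_items,
                                                              std::size_t batch_size) {
    assert(batch_size > 0);
    std::vector<std::pair<std::size_t, std::size_t>> batches;
    for (std::size_t start = 0; start < n_items; start += batch_size) {
        batches.emplace_back(start, std::min(start + batch_size, n_items));
    }
    return batches;
}

// Work for ONE batch: for each pattern in the range, the indices of all
// targets that match within max_distance edits.
std::vector<std::vector<std::size_t>> match_batch(const std::vector<std::string>& patterns,
                                                  std::pair<std::size_t, std::size_t> range,
                                                  const std::vector<std::string>& targets,
                                                  std::size_t max_distance) {
    std::vector<std::vector<std::size_t>> hits;
    for (std::size_t p = range.first; p < range.second; ++p) {
        std::vector<std::size_t> matched;
        for (std::size_t t = 0; t < targets.size(); ++t) {
            if (best_substring_distance(patterns[p], targets[t]) <= max_distance) {
                matched.push_back(t);
            }
        }
        hits.push_back(std::move(matched));
    }
    return hits;
}
```

Each range returned by make_batches is independent of the others, so every call to match_batch could be handed to its own worker or session, and the per-batch results concatenated in batch order afterwards.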
