
Benchmarking in BaseX: how to set up

Currently I am an intern at a research group that makes large sets of texts (corpora) searchable. Not only can one search for literal strings; more importantly, it is also possible to look for syntactic dependency structures similar to a given input, without needing to be proficient in any programming language or corpus annotation style. Clearly, this tool is intended for linguists.

At the start of the project, before I was engaged in it, the tool was limited to rather small corpora (up to 9 million words). The goal is to make large sets of texts searchable as well; we are talking about roughly 500 million words. An attempt has been made that in theory ought to improve speed by reducing the search space (see this paper), but it has not been tested yet. The result of this attempt is a new file structure. Let's call this structure B, as opposed to the non-processed structure A. We expect B to provide faster results when queried with BaseX.

My question is: what is the best way to test and compare both approaches with a Perl script? Below you find my current script to query BaseX locally. It takes two arguments. The first is a directory that stores a number of files; each file stores the XPaths I have selected to benchmark with. The second argument is a limit on the number of results to return; when set to zero, no limit is applied.

Because some parts of the dataset are so incredibly huge, we have also divided them into separate, equally sized files called treebank parts. They are stored in <tb> tags inside treebankparts.lst.

#!/usr/bin/perl

use strict;
use warnings;

$| = 1;    # flush every print

# Directory where XPaths are stored
my $directory = shift(@ARGV);

# Set limit. If set to zero all results will be returned
my $limit = shift(@ARGV);

# Create session, connect to BaseX
# (Session is provided by the BaseX Perl client, BaseXClient.pm)
my $session = Session->new([INFORMATION WITHHELD]);

# List all files in directory
my @xpathfiles = glob("$directory/*.txt");

# Read lines of treebank parts into variable
open( my $tfh, '<', 'treebankparts.lst' ) or die "cannot open file treebankparts.lst: $!";
chomp( my @tlines = <$tfh> );
close $tfh;

# Loop through all XPaths in $directory
foreach my $xpathfile (@xpathfiles) {
    open( my $xfh, '<', $xpathfile ) or die "cannot open file $xpathfile: $!";
    chomp( my @xlines = <$xfh> );
    close $xfh;

    print STDOUT "File = $xpathfile\n";

    # Loop through lines from XPath file (= XPath query)
    foreach my $xline (@xlines) {
        # Loop through the lines of treebank file
        foreach my $tline (@tlines) {
            my ($treebank) = $tline =~ /<tb>(.+)<\/tb>/;
            QuerySonar( $xline, $treebank );
        }
    }
}
$session->close();

sub QuerySonar {
    my ( $xpath, $db ) = @_;

    print STDOUT "Querying $db for $xpath\n";
    print STDOUT "Limit = $limit\n";
    my $x_limit;
    my $x_resultsofxp = 'declare variable $results := db:open("' . $db . '")/treebank/alpino_ds'
      . $xpath . ';';
    my $x_open       = '<results>';
    my $x_totalcount = '<total>{count($results)}</total>';
    my $x_loopinit   = '{for $node at $limitresults in $results';

    # Spaces are important!
    if ( $limit > 0 ) {
        $x_limit = ' where $limitresults <= ' . $limit . ' ';
    }
    # Comment needed to prevent `Incomplete FLWOR expression`
    else { $x_limit = '(: No limit set :)'; }

    my $x_sentenceinfo = 'let $sentid := ($node/ancestor::alpino_ds/@id)
        let $sentence := ($node/ancestor::alpino_ds/sentence)
        let $begin := ($node//@begin)
        let $idlist := ($node//@id)
        let $beginlist := (distinct-values($begin))';

    # Separate sentence info by tab
    my $x_loopexit = 'return <match>{data($sentid)}&#09;
        {string-join($idlist, "-")}&#09;
        {string-join($beginlist, "-")}&#09;
        {data($sentence)}</match>}';
    my $x_close = '</results>';

    # Concatenate all XQuery parts
    my $x_concatquery =
        $x_resultsofxp
      . $x_open
      . $x_totalcount
      . $x_loopinit
      . $x_limit
      . $x_sentenceinfo
      . $x_loopexit
      . $x_close;

    my $querysent = $session->query($x_concatquery);

    my $basexoutput = $querysent->execute();
    print $basexoutput . "\n\n";

    $querysent->close();
}

(Note that this is a stripped-down version and that it may not work as-is. This snippet does not use structure B!)

What happens is: loop through all XPath files, loop through each line in an XPath file, loop through all treebank parts, and then execute the sub. The sub then queries BaseX. This comes down to sending a new XQuery to BaseX and returning the total number of hits and the results (possibly limited by an argument to the Perl script). So I have that working, but now the question is: how can I improve this script so I can get some benchmarking results out of it?
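The obvious first step would be to wall-clock each call to the sub. In the Perl script itself that would be Time::HiRes; the sketch below shows the same idea in Python only because it is a convenient sketch language here, with `send` standing in for the actual BaseX round-trip (all the names are placeholders, not part of any BaseX API):

```python
import time
from collections import defaultdict

timings = defaultdict(list)  # (xpath, treebank) -> list of durations in seconds

def timed_query(send, xpath, treebank):
    """Run one query through `send` and record how long the round-trip took."""
    t0 = time.perf_counter()
    result = send(xpath, treebank)
    timings[(xpath, treebank)].append(time.perf_counter() - t0)
    return result
```

Accumulating a list of durations per (query, database) pair, rather than a single number, makes it easy to compute means and spot outliers afterwards.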

First of all, I'd start by adding a profiler to this script; I guess that bit is obvious. However, I am not sure how I should start comparing structure A with B. Would I put both queries (to different databases) in separate scripts, call a profiler on both, run both scripts a number of times, and compare the mean values? Or would I run each query against both databases in the same script, almost at the same time?

It is important to consider the caching that is happening. Therefore I am not entirely sure what set-up is appropriate for benchmarking a database this huge: first one script, then the other; both at the same time; alternating queries between the two; and so on. There are so many possibilities, but I wonder which would provide the best results. Also, I would repeat the process a couple of times. Would I repeat each query and then continue to the next, or finish all XPath files and then repeat the whole process again?
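The orderings in question (all repeats of one structure before the other, versus alternating between the two) can both be expressed in one small harness, which turns "which is best?" into something measurable. A minimal sketch, again in Python as a neutral sketch language; `run_query` is a placeholder for a real BaseX call:

```python
import time

def run_query(query):
    """Placeholder for one round-trip to BaseX; replace with a real client call."""
    time.sleep(0.001)

def bench(queries_a, queries_b, repeats=3, alternate=False):
    """Time structures A and B, either all repeats of A before any of B
    (alternate=False) or A and B back-to-back on every repeat (alternate=True)."""
    times = {"A": [], "B": []}

    def one_pass(label, queries):
        t0 = time.perf_counter()
        for q in queries:
            run_query(q)
        times[label].append(time.perf_counter() - t0)

    if alternate:
        for _ in range(repeats):
            one_pass("A", queries_a)
            one_pass("B", queries_b)
    else:
        for label, queries in (("A", queries_a), ("B", queries_b)):
            for _ in range(repeats):
                one_pass(label, queries)
    return times
```

Running the same query set through both orderings and comparing the per-repeat timings would show directly how much the ordering matters on this dataset.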

(Reading the description of the benchmark tag, I am confident that this, albeit elaborate, post is suited for SO.)

There are several things we have to separate here. The first issue is that BaseX performance should not be confused with that of your Perl script, as your Perl script seems to simply construct an XQuery (and not an XPath, as you suggested in your question and tags). So for testing I would suggest using some already predefined XQueries suited to your real-world scenarios, as the cost of constructing the XQuery should be negligible. How you pass your query to BaseX, whether via the Perl API or any other means, should not be relevant. Even if your Perl performance is relevant, you should test that performance separately.

Hence, your original question of whether you should test both scenarios in the same script or not is no longer relevant. Instead, simply execute the two separate XQueries for scenarios A and B by themselves, without the Perl script.

You are partly correct to worry about caching; however, it is the Java JIT compiler that will most likely be relevant here (as BaseX is written in Java, it is the JIT and its caching that matter, not BaseX itself). You should therefore use the client/server infrastructure, have a long-running server, and warm it up before running performance measurements.

Regarding performance: the BaseX GUI and the command line already include measurement facilities (on the command line you can set -V to get run times for parsing, compiling, evaluating, and printing). Also, using the -r parameter you can execute a query multiple times, and it will give you the average execution time.
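If you drive these runs from a harness script, the harness then only needs to assemble the command line. A small sketch (Python as a sketch language; the query file name and run count are placeholders, and only the -V and -r flags themselves are taken from the BaseX command line):

```python
def basex_command(query_file, runs=10):
    """Build a BaseX CLI invocation: -V prints parse/compile/evaluate/print
    times, -r<N> executes the query N times and reports the average."""
    return ["basex", "-V", f"-r{runs}", query_file]
```

For example, basex_command("scenario_a.xq", 25) yields ["basex", "-V", "-r25", "scenario_a.xq"], which a harness could hand to its process-spawning facility and whose timing output it could then parse.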

In general, if you want to improve the performance of your script, you should take a look at the query plan and the optimized query, and check whether the appropriate indexes are used. Also, our new selective indexing might be very useful to you. If the index isn't used, your query will definitely not perform well on 500 million words.

Full disclosure: I am with the BaseX team. You might get better help on the BaseX mailing list, or you might want to reference this question there, as our head architect doesn't watch SO as regularly as the ML.

One possible improvement: minimize the number of times you transfer control from Perl to the database, just as you have minimized the number of database connections. (Or at least set yourself up to measure the cost of the transfer of control.) I suspect you will get significantly better results if you move your loop into XQuery rather than running the loop in Perl.

A single call to a database management system asking it to perform 1000 searches is likely to be somewhat faster than 1000 calls to the DBMS, each requesting a single search. The first involves two context switches: one from your script or shell to the DBMS, and one back; the second involves 2000. The last time I measured something like this carefully, each context switch cost something like 500 ms, and it mounted up fast. (That said, this was a long time ago, with a different database. But it was surprising, and sobering, to learn that the difference between the two query formulations I was trying to compare was dwarfed by the difference between running the test loop in a script or inside the DBMS.)
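The shape of this effect is easy to demonstrate without any database at all. A rough sketch (Python), using interpreter start-up as a stand-in for the per-call DBMS round-trip; the count of 20 calls is arbitrary:

```python
import subprocess
import sys
import time

def timed(cmd):
    """Wall-clock one external command."""
    t0 = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - t0

# Twenty separate processes, i.e. twenty round-trips each paying start-up cost...
separate = sum(timed([sys.executable, "-c", "pass"]) for _ in range(20))

# ...versus one process that does all twenty units of work itself.
batched = timed([sys.executable, "-c", "for _ in range(20): pass"])
```

The batched variant should win by a wide margin, which is the same argument for moving the loop into XQuery rather than keeping it in Perl.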

A second suggestion: from what you say, the size of the database and the result sets seems likely to ensure that caching between runs doesn't have a big effect on the results. But this seems to be a testable assertion, and one worth testing. So set up your A and B scripts, and then do a trial run: does

    for runcount in 1 2 3 4 5; do perl A.pl; perl B.pl; done

produce results comparable to

    for runcount in 1 2 3 4 5; do perl A.pl; done; for runcount in 1 2 3 4 5; do perl B.pl; done

? If they are comparable, then you have reason to believe it doesn't matter whether you run A and B separately or in alternation. If they are not comparable, then you know it does matter, which would be very valuable information. Other things being equal, I would expect caching to produce lower times when one query is run several times before moving on to the next, and cache misses to produce higher times if each query runs just once. Probably worth measuring.

In the same spirit, I would recommend that you run tests both with the loop in the Perl script and with the loop in an XQuery query.

A third suggestion: in practice, a query at the corpus query interface will involve several stages, each with measurable time: transmission of the query from the user's browser (assuming it's a web interface) to the server, translation of the request into a form suitable for transmission to the back-end DBMS (here BaseX), a context switch to BaseX, processing within BaseX, a context switch back, handling by the web server, and transmission to the user. It would be useful to have at least rough estimates of the time involved in each of these steps, or at least of the time taken by everything but BaseX.

So if it were me running the tests, I think I'd also prepare a set of vacuous XQuery tests, along the lines of

2 + 3

or just

42

to push the BaseX time as close to zero as possible; the measured time between user initiation of the query and display of the response is the per-query overhead. (Interesting question: should one use many different trivial expressions, to prevent caching of results, or the same expression over and over, to encourage caching of the result? How can we try to ensure that BaseX will cache the result, but the web server won't? ...)
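Once those per-query overheads are collected, subtracting them from the real measurements is simple arithmetic. A minimal sketch (Python used only for the arithmetic; the function and variable names are placeholders):

```python
from statistics import mean

def net_query_times(measured, baseline):
    """Subtract the mean per-query overhead (timed with a vacuous query
    such as `42`) from each real measurement, leaving the BaseX share."""
    overhead = mean(baseline)
    return [t - overhead for t in measured]
```

For example, net_query_times([1.0, 2.0], [0.5, 0.5]) yields [0.5, 1.5]: with half a second of fixed overhead per query, only the remainder reflects the work BaseX actually did.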

A final suggestion: remember that other people who need to do benchmarking will often have the same questions as you. This means you can reformulate every question of the form "Should I do X or Y?" into the form "What measurable effect does the difference between X and Y have on the results of a benchmarking test?" Run some tests to try to measure that effect, and write them up. (I always find it makes things more interesting if I force myself to make a prediction after formulating the question but before measuring the difference.)
