Search Engines Inexact Counting (about xxx results)

When you search in Google (I'm almost sure that AltaVista did the same thing) it says "Results 1-10 of about xxxx"...

This has always amazed me... What does "about" mean?
How can they count roughly?
I do understand why they can't come up with a precise figure in a reasonable time, but how do they even reach this "approximate" one?

I'm sure there's a lot of theory behind this one that I missed...

Most likely it's similar to the estimated row counts that most SQL systems use in their query planning: the number of rows in the table (known exactly as of the last time statistics were collected, but generally not up to date), multiplied by an estimated selectivity (usually based on a statistical distribution model calculated by sampling a small subset of rows).

The PostgreSQL manual has a section on statistics used by the planner that is fairly informative, at least if you follow the links out to pg_stats and various other sections. I'm sure that doesn't really describe what Google does, but it at least shows one model where you could get the first N rows and an estimate of how many more there might be.
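To make the "row count times selectivity" idea concrete, here is a minimal sketch in Python. It is not how PostgreSQL or any search engine actually implements this; the function name `estimate_matches`, the toy table, and the sample size of 500 are all assumptions made for illustration.

```python
import random

def estimate_matches(total_rows_stat, sample, predicate):
    """Estimate matching rows as (recorded row count) x (sampled selectivity).

    total_rows_stat: row count from the last statistics run (may be stale).
    sample: a small subset of rows gathered when statistics were collected.
    predicate: the filter the query applies to each row.
    """
    if not sample:
        return 0
    selectivity = sum(1 for row in sample if predicate(row)) / len(sample)
    return round(total_rows_stat * selectivity)

# Hypothetical "table": 100,000 integers between 0 and 999.
rng = random.Random(0)
table = [rng.randrange(1000) for _ in range(100_000)]

# Statistics collection: remember the row count and keep a small sample.
sample = rng.sample(table, 500)

# Query planning: estimate how many rows satisfy "value < 100"
# without scanning the whole table.
estimate = estimate_matches(len(table), sample, lambda v: v < 100)
exact = sum(1 for v in table if v < 100)
print(f"about {estimate} rows (exact count: {exact})")
```

The estimate only touches the 500 sampled rows, which is why it is cheap, and also why it is only approximate.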

Not relevant to your question, but this reminds me of a little joke a friend of mine made when doing a simple ego-search (and don't tell me you've never Googled your name). He said something like

"Wow, about 5,000 results in just 0.22 seconds! Now, imagine how many results this is in one minute, one hour, one day!"

I imagine the estimate is based on statistics. They aren't going to count all of the relevant page matches, so what they would do (or at least what I would do) is work out roughly what percentage of pages would match the query, based on some heuristic, and then use that as the basis for the count.

One heuristic might be to do a sample count: take a random sample of 1000 or so pages and see what percentage matched. It wouldn't take too many in the sample to get a statistically significant answer.
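As a rough sketch of that sample-count heuristic (an illustration only, not a description of what any real search engine does), the following Python scales a sample's hit rate up to the whole corpus and attaches an approximate 95% margin of error. The function name `estimate_total_matches` and the numbers in the usage line are hypothetical.

```python
import math

def estimate_total_matches(corpus_size, sample_size, sample_hits, z=1.96):
    """Scale a sample's hit rate up to the corpus, with a rough 95% margin.

    Uses the normal approximation to the binomial, which is only reasonable
    when the sample contains a fair number of both hits and misses.
    """
    if sample_size <= 0 or not 0 <= sample_hits <= sample_size:
        raise ValueError("need sample_size > 0 and 0 <= sample_hits <= sample_size")
    p = sample_hits / sample_size                      # observed hit rate
    margin = z * math.sqrt(p * (1 - p) / sample_size)  # margin on the rate
    return corpus_size * p, corpus_size * margin

# Hypothetical numbers: 50 of 1,000 sampled pages match, corpus of 1,000,000.
estimate, margin = estimate_total_matches(1_000_000, 1000, 50)
print(f"about {estimate:,.0f} results, give or take {margin:,.0f}")
```

One caveat with this approach: for a query that matches only a tiny fraction of pages, a sample of 1000 will usually contain no hits at all, so a plain random sample works best for common terms.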

One thing that hasn't been mentioned yet is deduplication. Some search engines (I'm not sure exactly how Google in particular does it) will use heuristics to try to decide if two different URLs contain the same (or extremely similar) content, and are thus duplicate results.

If there are 156 unique URLs, but 9 of those have been marked as duplicates of other results, it is simpler to say "about 150 results" rather than something like "156 results, of which 147 are unique and 9 are duplicates".
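The arithmetic in that example can be sketched as follows. This is only an illustration: the helper name `about` and the choice of two significant figures are assumptions, not a known rule used by any search engine.

```python
def about(count, sig_figs=2):
    """Round a non-negative count to a few significant figures (half rounds up)."""
    if count <= 0:
        return 0
    digits = len(str(count))
    factor = 10 ** max(digits - sig_figs, 0)
    return (count + factor // 2) // factor * factor

total_urls = 156
duplicates = 9
unique_results = total_urls - duplicates  # 147
print(f"about {about(unique_results)} results")
```

Rounding hides the distinction between 147 and 156, so the displayed figure stays the same however the duplicates end up being counted.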

Returning an exact number of results is not worth the overhead of calculating it accurately. Since there's not much value added by knowing there were 1,004,345 results rather than 'about 1,000,000', it's more important from an end-user experience perspective to return the results faster than to spend additional time calculating the total.

From Google themselves: "Google's calculation of the total number of search results is an estimate. We understand that a ballpark figure is valuable, and by providing an estimate rather than an exact account, we can return quality search results faster."
