
How Do I Apply TF-IDF When I Only Have a Subset of the Total Documents?

Practical application:

I have several databases that need to be queried from a single search box. Some of them I have direct access to (they're SQL Server / MySQL); others I can only search via an API.

In an ideal world I would inject all of this data into Elasticsearch and use it to determine relevance. Unfortunately I don't have the resources locally to make that run efficiently. Elasticsearch takes over 400 MB of RAM just idling, without any actual data or queries. It looks like most people running Elasticsearch in production use machines with 32-64 GB of RAM. My organization doesn't have anything near that powerful available for this project.

So my next idea is to query all the databases and connect to the APIs when the user makes a search. Then I need to analyze the results, determine relevance, and return them to the user. I recognize that this is probably a terrible plan in terms of performance. I'm hoping to use memcached to make things more tolerable.
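That fan-out-and-merge step might be sketched roughly as below. Here `search_sql` and `search_api` are hypothetical stand-ins for the real backends, and `functools.lru_cache` stands in for memcached (a real deployment would use a memcached client library instead of an in-process cache):

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

# Hypothetical backends -- stand-ins for the SQL databases and remote APIs.
def search_sql(query):
    return [f"sql result for {query}"]

def search_api(query):
    return [f"api result for {query}"]

BACKENDS = [search_sql, search_api]

@lru_cache(maxsize=1024)  # stand-in for memcached in this sketch
def federated_search(query):
    # Fan the query out to every backend in parallel, then pool the results.
    # Returning a tuple keeps the result hashable/immutable for the cache.
    with ThreadPoolExecutor() as pool:
        result_lists = pool.map(lambda backend: backend(query), BACKENDS)
    return tuple(r for results in result_lists for r in results)
```

Caching the pooled results per query string at least avoids re-hitting every backend for repeated searches.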

While researching algorithms for determining relevance, I came across tf-idf. I'm looking to apply it to the results I get back from all the databases.

The actual question

My understanding of tf-idf is that after tokenizing every document in the corpus, you perform a term-frequency analysis and then multiply it by the inverse document frequency of each word. The inverse document frequency is calculated by dividing the total document count by the number of documents containing the term (typically taking the log of that ratio).
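As a minimal sketch of that computation, assuming documents are already tokenized into lists of terms and using the raw log(N/df) form of idf (real implementations usually smooth it):

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Compute tf-idf weights for every term in every document.

    `corpus` is a list of already-tokenized documents (lists of terms).
    Returns a list of {term: weight} dicts, one per document.
    """
    n_docs = len(corpus)
    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for doc in corpus:
        df.update(set(doc))

    weights = []
    for doc in corpus:
        tf = Counter(doc)
        weights.append({
            # tf = count / doc length; idf = log(N / df)
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights
```

Note that a term appearing in every document gets an idf of log(1) = 0, which matters for the question below.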

The problem with this is that if I'm pulling documents from an API, I don't know the true total number of documents in the corpus. I'm only ever pulling a subset, and based on the way those documents are being pulled, they're naturally all going to contain the query terms. Can I still apply tf-idf by treating the pool of documents returned by these various sources as a single corpus? What's the best way to go about this?
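Concretely, treating the pooled results as the corpus could look like the sketch below. It assumes the results are plain text and that whitespace tokenization is good enough, and it uses a smoothed idf (log((1+N)/(1+df)) + 1, the variant scikit-learn uses) so that query terms present in every pooled result don't zero every score out:

```python
import math
from collections import Counter

def rank_results(query, results):
    """Rank pooled search results by summed tf-idf of the query terms.

    `results` is a list of raw text snippets pooled from every backend;
    they stand in as the "corpus", so idf only reflects this subset.
    """
    docs = [r.lower().split() for r in results]
    n = len(docs)
    df = Counter()
    for d in docs:
        df.update(set(d))

    q_terms = query.lower().split()

    def score(doc):
        tf = Counter(doc)
        return sum(
            # Smoothed idf keeps ubiquitous query terms from scoring zero.
            (tf[t] / len(doc)) * (math.log((1 + n) / (1 + df[t])) + 1)
            for t in q_terms if t in tf
        )

    return sorted(results, key=lambda r: score(r.lower().split()), reverse=True)
```

With smoothing, when every pooled document contains the query terms the ranking degrades gracefully toward pure term frequency rather than collapsing to all-zero scores.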

Bonus question

If you have a suggestion for how to accomplish this without hacking together my own search solution or using Elasticsearch, I'm all ears...

As you have noticed, Elasticsearch is not built to run in memory-constrained environments. If you want to use Elasticsearch but can't set up a dedicated machine, you might consider a hosted search solution (e.g. AWS Elasticsearch, Elastic Cloud, Algolia, etc.). Those solutions still cost money, though!

There are two great alternatives that require a bit more work (but not as much as writing your own search solution). Lucene is the actual search engine that Elasticsearch is built on top of. It does still load quite a bit of its underlying data structures into memory, so depending on the size of the data you want to index, it could still run out of memory. But you should be able to fit quite a bit more data in a single Lucene index than in an entire Elasticsearch instance.

The other alternative, which I know slightly less about, is Sphinx. It is also a search engine, and it allows you to specify how much memory to allocate for it to use; it stores the rest of the data on disk.


