
How Do I Apply TF-IDF When I Only Have a Subset of the Total Documents?

Practical application:

I have several databases that need to be queried from a single search box. Some of them I have direct access to (they're SQL Server / MySQL); others I can only search via an API.

In an ideal world I would inject all of this data into Elasticsearch and use it to determine relevance. Unfortunately I don't have the resources locally to make that run efficiently. Elasticsearch takes over 400 MB of RAM just idling, without any actual data or queries. It looks like most people running Elasticsearch in production use machines with 32-64 GB of RAM. My organization doesn't have anything near that powerful available for this project.

So my next idea is to query all the databases and connect to the APIs when the user makes a search. Then I need to analyze the results, determine relevance, and return them to the user. I recognize that this is probably a terrible plan in terms of performance. I'm hoping to use memcached to make things more tolerable.
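That fan-out-and-merge step might be sketched roughly as below. Here `search_sql` and `search_api` are hypothetical stand-ins for the real backends, and `functools.lru_cache` stands in for memcached (a real deployment would use a memcached client library instead of an in-process cache):

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

# Hypothetical backends -- stand-ins for the SQL databases and remote APIs.
def search_sql(query):
    return [f"sql result for {query}"]

def search_api(query):
    return [f"api result for {query}"]

BACKENDS = [search_sql, search_api]

@lru_cache(maxsize=1024)  # stand-in for memcached in this sketch
def federated_search(query):
    # Fan the query out to every backend in parallel, then pool the results.
    # Returning a tuple keeps the result hashable/immutable for the cache.
    with ThreadPoolExecutor() as pool:
        result_lists = pool.map(lambda backend: backend(query), BACKENDS)
    return tuple(r for results in result_lists for r in results)
```

Caching the pooled results per query string at least avoids re-hitting every backend for repeated searches.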

While researching algorithms for determining relevance, I came across tf-idf. I'm looking to apply it to the results I get back from all the databases.

The actual question

My understanding of tf-idf is that after tokenizing every document in the corpus, you perform a term-frequency analysis and then multiply it by the inverse document frequency of each word. The inverse document frequency is calculated by dividing the total document count by the number of documents containing the term (typically taking the log of that ratio).
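As a minimal sketch of that computation, assuming documents are already tokenized into lists of terms and using the raw log(N/df) form of idf (real implementations usually smooth it):

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Compute tf-idf weights for every term in every document.

    `corpus` is a list of already-tokenized documents (lists of terms).
    Returns a list of {term: weight} dicts, one per document.
    """
    n_docs = len(corpus)
    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for doc in corpus:
        df.update(set(doc))

    weights = []
    for doc in corpus:
        tf = Counter(doc)
        weights.append({
            # tf = count / doc length; idf = log(N / df)
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights
```

Note that a term appearing in every document gets an idf of log(1) = 0, which matters for the question below.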

The problem with this is that if I'm pulling documents from an API, I don't know the true total number of documents in the corpus. I'm only ever pulling a subset, and based on the way those documents are being pulled, they're naturally all going to contain the query terms. Can I still apply tf-idf by treating the pool of documents returned by these various sources as a single corpus? What's the best way to go about this?
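Concretely, treating the pooled results as the corpus could look like the sketch below. It assumes the results are plain text and that whitespace tokenization is good enough, and it uses a smoothed idf (log((1+N)/(1+df)) + 1, the variant scikit-learn uses) so that query terms present in every pooled result don't zero every score out:

```python
import math
from collections import Counter

def rank_results(query, results):
    """Rank pooled search results by summed tf-idf of the query terms.

    `results` is a list of raw text snippets pooled from every backend;
    they stand in as the "corpus", so idf only reflects this subset.
    """
    docs = [r.lower().split() for r in results]
    n = len(docs)
    df = Counter()
    for d in docs:
        df.update(set(d))

    q_terms = query.lower().split()

    def score(doc):
        tf = Counter(doc)
        return sum(
            # Smoothed idf keeps ubiquitous query terms from scoring zero.
            (tf[t] / len(doc)) * (math.log((1 + n) / (1 + df[t])) + 1)
            for t in q_terms if t in tf
        )

    return sorted(results, key=lambda r: score(r.lower().split()), reverse=True)
```

With smoothing, when every pooled document contains the query terms the ranking degrades gracefully toward pure term frequency rather than collapsing to all-zero scores.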

Bonus question

If you have a suggestion for how to accomplish this without hacking together my own search solution or using Elasticsearch, I'm all ears...

As you have noticed, Elasticsearch is not built to run in memory-constrained environments. If you want to use Elasticsearch but can't set up a dedicated machine, you might consider a hosted search solution (e.g. AWS Elasticsearch, Elastic Cloud, Algolia, etc.). Those solutions still cost money, though!

There are two great alternatives that require a bit more work (but not as much as writing your own search solution). Lucene is the actual search engine that Elasticsearch is built on top of. It does still load quite a bit of its underlying data structures into memory, so depending on the size of the data you want to index, it could still run out of memory. But you should be able to fit quite a bit more data in a single Lucene index than in an entire Elasticsearch instance.

The other alternative, which I know slightly less about, is Sphinx. It is also a search engine, and it allows you to specify how much memory to allocate for it to use; it stores the rest of the data on disk.


