Count number of occurrences of dictionary values as nested list based on a query

I built this inverted index:

{
    'experiment': {'d1': [1, [0]], ..., 'd30': [2, [12, 40]], ..., 'd123': [3, [11, 45, 67]], ...}, 

    'studi': {'d1': [1, [1]], 'd2': [2, [0, 36]], ..., 'd207': [3, [19, 44, 59]], ...}

}

For example, the term experiment appears in document 1 once, at index 0; in document 30 twice, at indices 12 and 40; and so on. I am wondering how I could count the number of occurrences of each term in the index, based on a dictionary of queries that looks like this:

{
    'q1'  : ['similar', 'law', ..., 'speed', 'aircraft'],
    'q2'  : ['structur', 'aeroelast', ..., 'speed', 'aircraft'], 
    ...
    'q225': ['design', 'factor', ..., 'number', '5']
}

The desired output would look something like this:

{
    'q1'  : ['d51', 'd874', ..., 'd717'], 
    'q2'  : ['d51', 'd1147', ..., 'd14'],
    ...,
    'q225': ['d1313', 'd996', ..., 'd193']
}

The keys represent the queries and the values represent the documents in which the query terms appear, with each list sorted in descending order of total term frequency.

Map queries to document vectors

A document vector is a dict with items (document, word_count). These vectors can be added together by summing the word counts for matching document keys, with a default word_count of 0.
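As a small illustration of this addition, here is a minimal sketch using two made-up vectors. The names `vec_a`, `vec_b`, and `add_vectors` are hypothetical and exist only for this example.

```python
def add_vectors(left, right):
    """Sum two document vectors, treating a missing document as count 0."""
    total = dict(left)  # copy so the inputs are not modified
    for doc, count in right.items():
        total[doc] = total.get(doc, 0) + count
    return total

# Hypothetical vectors for two different words
vec_a = {'d1': 1, 'd30': 2}
vec_b = {'d1': 1, 'd2': 2}

combined = add_vectors(vec_a, vec_b)
print(combined)  # {'d1': 2, 'd30': 2, 'd2': 2}
```

Only 'd1' appears in both vectors, so its counts are summed; the other documents keep their own counts.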

CONVERT INDEX TO DOC VECTORS

full_index = {
    'experiment': {'d1': [1, [0]],  'd30': [2, [12, 40]],  'd123': [3, [11, 45, 67]] } ,
    'study': {'d1': [1, [1]], 'd2': [2, [0, 36]],  'd207': [3, [19, 44, 59]]}
}

def count_only(docs):
    # Keep only the count (first element) and drop the position list
    return {d: occurrences[0] for d, occurrences in docs.items()}

doc_vector_index = {w: count_only(docs) for w, docs in full_index.items()}

MAP LIST OF QUERY WORDS TO LIST OF DOC VECTORS

for q, words in queries.items():
    # Query words that are not in the index are skipped
    vectors = [doc_vector_index[word] for word in words if word in doc_vector_index]

SUM DOC VECTORS AND SORT

def doc_vector_add(ldoc, rdoc):
    res = ldoc.copy()
    for doc, count in rdoc.items():
        res[doc] = ldoc.get(doc,0) + count
    return res

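import functools

output = {}
for q, words in queries.items():
    # Query words that are not in the index are skipped
    vectors = [doc_vector_index[word] for word in words if word in doc_vector_index]
    # Sum the vectors, then sort the (doc, count) items by count, descending
    total_vector = dict(sorted(functools.reduce(doc_vector_add, vectors, {}).items(),
        key=lambda item: item[1],
        reverse=True))
    output[q] = list(total_vector.keys())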

The summation of doc vectors is handled with functools.reduce(doc_vector_add, vectors, {}). This produces the doc vector that is the sum of the individual vectors for each word in the query. sorted then orders the vector's items by count, descending, so the keys come out ranked from most to least frequent.
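Putting the steps together, here is a self-contained sketch that can be run as-is. The contents of `queries` are invented sample data (the original queries are elided), `rank_documents` is a hypothetical helper name, and `collections.Counter` is swapped in for the reduce-based summation because it also adds counts per key.

```python
from collections import Counter

full_index = {
    'experiment': {'d1': [1, [0]], 'd30': [2, [12, 40]], 'd123': [3, [11, 45, 67]]},
    'study': {'d1': [1, [1]], 'd2': [2, [0, 36]], 'd207': [3, [19, 44, 59]]},
}

# Sample queries, assumed for illustration only
queries = {
    'q1': ['experiment', 'study'],
    'q2': ['study', 'unknown'],
}

def rank_documents(index, query_words):
    """Return document ids ordered by total term frequency, descending."""
    totals = Counter()
    for word in query_words:
        # Words missing from the index contribute nothing
        for doc, (count, _positions) in index.get(word, {}).items():
            totals[doc] += count
    # most_common() sorts by count, descending; ties keep first-seen order
    return [doc for doc, _ in totals.most_common()]

output = {q: rank_documents(full_index, words) for q, words in queries.items()}
print(output['q1'])  # ['d123', 'd207', 'd1', 'd30', 'd2']
print(output['q2'])  # ['d207', 'd2', 'd1']
```

Documents with equal totals keep the order in which they were first encountered, so the result is deterministic for a given index and query.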

LIMIT TO TOP N DOCUMENTS

max_doc_limit = 10
output[q] = list(total_vector.keys())[:max_doc_limit]

The number of documents can be limited by slicing the list of keys before assigning it to the output.
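When a vector is large and only a few documents are needed, `heapq.nlargest` from the standard library can pick the top N without sorting every item. A minimal sketch, assuming a hypothetical `total_vector` of already-summed counts:

```python
import heapq

# Hypothetical summed doc vector for a single query
total_vector = {'d1': 2, 'd30': 2, 'd123': 3, 'd2': 5, 'd207': 4}
max_doc_limit = 3

# Iterating a dict yields its keys; each key is ranked by its count
top_docs = heapq.nlargest(max_doc_limit, total_vector, key=total_vector.get)
print(top_docs)  # ['d2', 'd207', 'd123']
```

For small vectors the slice is simpler and just as fast, so this is only worth considering when the number of matching documents is much larger than the limit.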

ORDER BY COUNT DESC, DOC_ID ASC

sorted(..., key=lambda item: (item[1], -1*int(item[0][1:])), ...)

We can change the sorting order of the output by changing the key function passed to sorted. Because the sort already runs with reverse=True, multiplying the second element of the key tuple (the numeric part of the document ID) by -1 flips that element from descending to ascending, so ties in count are broken by ascending document ID.
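To see the tie-breaking in action, here is a minimal sketch on a hypothetical `total_vector` with tied counts. It uses an equivalent variant of the trick: the count is negated instead of the document number, so no `reverse=True` is needed. The helper name `sort_key` is made up for this example, and the code assumes every document ID is 'd' followed by digits.

```python
# Hypothetical summed doc vector with tied counts
total_vector = {'d30': 2, 'd123': 3, 'd1': 2, 'd207': 3, 'd2': 2}

def sort_key(item):
    doc_id, count = item
    # Negated count gives descending order by count; the numeric part of
    # the id (after the leading 'd') gives ascending order within ties
    return (-count, int(doc_id[1:]))

ranked = [doc for doc, _ in sorted(total_vector.items(), key=sort_key)]
print(ranked)  # ['d123', 'd207', 'd1', 'd2', 'd30']
```

Converting the ID to an integer matters: compared as strings, 'd30' would sort before 'd2'.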
