Text Mining: Query search

I have a dictionary that maps each term to the sorted list of IDs of the documents containing it:

{'Farage': [0, 5, 9, 192,233,341],
 'EU': [0, 1, 5, 6, 9, 23]}

Query1: “Farage” and “EU”
Query2: “Farage” or “EU”

I need to return the documents that match these queries. For query1, for example, the answer should be [0, 5, 9]. I believe the solution should look something like the following, but in Python:

final_list = []
while x ≠ Null and y ≠ Null
    do if docID(x) = docID(y)
           then ADD(final_list, docID(x))
                x ← next(x)
                y ← next(y)
       else if docID(x) < docID(y)
           then x ← next(x)
       else y ← next(y)
return final_list

Please help.

You could write your own function using sets, a built-in Python structure that suits this case well because it makes taking the union and intersection of collections fast:

def getResults(s, argument):
    s = list(s.values())
    if argument == 'OR':
        result = s[0]
        for elem in s[1:]:
            result = sorted(set(result).union(set(elem)))
        return result
    elif argument == 'AND':
        result = s[0]
        for elem in s[1:]:
            result = sorted(set(result).intersection(set(elem)))
        return result
    else:
        return None

inDict = {'Farage': [0, 5, 9, 192,233,341], 'EU': [0, 1, 5, 6, 9, 23]}

query1 = getResults(inDict, 'AND')
query2 = getResults(inDict, 'OR')

print(query1)
print(query2)

Results:

[0, 5, 9]
[0, 1, 5, 6, 9, 23, 192, 233, 341]

Note: You can remove the sorted call if you do not need the results in sorted order.
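If you would rather stay close to the merge-style pseudocode in the question, it can be sketched in Python with two indices walking the lists. This is only a minimal sketch: the function names intersect_sorted and union_sorted are made up for illustration, and both assume each list is already sorted in ascending order with no duplicates.

```python
def intersect_sorted(p1, p2):
    """AND query: doc IDs present in both sorted lists."""
    result = []
    i, j = 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # Same document in both lists: keep it and advance both.
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result


def union_sorted(p1, p2):
    """OR query: doc IDs present in either sorted list, without duplicates."""
    result = []
    i, j = 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            result.append(p1[i])
            i += 1
        else:
            result.append(p2[j])
            j += 1
    # One list is exhausted; whatever remains in the other belongs to the union.
    result.extend(p1[i:])
    result.extend(p2[j:])
    return result


index = {'Farage': [0, 5, 9, 192, 233, 341], 'EU': [0, 1, 5, 6, 9, 23]}
print(intersect_sorted(index['Farage'], index['EU']))  # [0, 5, 9]
print(union_sorted(index['Farage'], index['EU']))      # [0, 1, 5, 6, 9, 23, 192, 233, 341]
```

Unlike the set-based version, this never builds intermediate sets and keeps the output sorted, but it relies on the input lists being sorted.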

You can create a dict of operators and apply the matching set operations to get the final result. This assumes that queries strictly follow the form key1 operator key2 operator key3, and it evaluates them from left to right.

For an arbitrary number of arguments:

import operator
d1={'Farage': [0, 5, 9, 192,233,341],
    'EU': [0, 1, 5, 6, 9, 23],
    'hopeless': [0, 341, 19999]}

d={'and':operator.and_,
  'or':operator.or_}

Queries= ['Farage and EU','Farage and EU or hopeless','Farage or EU']

for query in Queries:
    temp_arr = query.split()
    # Start from the first key's documents, then fold in each
    # (operator, key) pair from left to right.
    res = set(d1.get(temp_arr[0], []))

    for value in range(1, len(temp_arr), 2):
        op = temp_arr[value]
        k2 = temp_arr[value+1]
        res = d[op](res, set(d1.get(k2, [])))
    print(sorted(res))

Output:

[0, 5, 9]
[0, 5, 9, 341, 19999]
[0, 1, 5, 6, 9, 23, 192, 233, 341]
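Because the loop applies operators strictly from left to right, a query such as 'hopeless or Farage and EU' is treated as (hopeless or Farage) and EU. If you want the usual Boolean convention where and binds tighter than or, one possible sketch is below. The function name evaluate_with_precedence is hypothetical, and it assumes that tokens are separated by whitespace, that operators are written in lowercase, and that no key is itself called 'and' or 'or'.

```python
def evaluate_with_precedence(query, index):
    """Evaluate 'k1 and k2 or k3 ...' with AND binding tighter than OR."""
    # Split the tokens into OR-separated groups; each group is a chain
    # of keys implicitly joined by AND.
    or_groups = [[]]
    for token in query.split():
        if token == 'or':
            or_groups.append([])
        elif token != 'and':
            or_groups[-1].append(token)

    result = set()
    for keys in or_groups:
        if keys:
            # Intersect the documents of every key in the group,
            # then add the group's matches to the overall union.
            result |= set.intersection(*(set(index.get(k, [])) for k in keys))
    return sorted(result)


d1 = {'Farage': [0, 5, 9, 192, 233, 341],
      'EU': [0, 1, 5, 6, 9, 23],
      'hopeless': [0, 341, 19999]}

print(evaluate_with_precedence('hopeless or Farage and EU', d1))  # [0, 5, 9, 341, 19999]
```

Left-to-right evaluation of the same query would instead give [0, 5, 9], so it is worth deciding which convention your queries should follow.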

Bear in mind that you can simply convert the lists to sets and use the set operators:

>>> d = {'Farage': [0, 5, 9, 192, 233, 341] , 'EU': [0, 1, 5, 6, 9, 23]}
>>> d
{'EU': [0, 1, 5, 6, 9, 23], 'Farage': [0, 5, 9, 192, 233, 341]}
>>>
>>> set(d['EU']) | set(d['Farage'])
{0, 1, 192, 5, 6, 9, 233, 341, 23}
>>>
>>> set(d['EU']) & set(d['Farage'])
{0, 9, 5}
>>>
>>> set(d['EU']) ^ set(d['Farage'])
{192, 1, 23, 233, 341, 6}
>>>
>>> set(d['EU']) - set(d['Farage'])
{1, 6, 23}

Or, if you are able to change the input format, store the values in the dictionary directly as sets:

>>> d = {'Farage': {0, 5, 9, 192, 233, 341}, 'EU': {0, 1, 5, 6, 9, 23}}
>>> d['EU'] & d['Farage']
{0, 9, 5}
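If the data arrives as lists, you could convert it once up front and then query any number of terms. The following is only a sketch: docs_with_all and docs_with_any are illustrative names, and treating an unknown term as an empty set is an assumption about the behaviour you want.

```python
index = {'Farage': [0, 5, 9, 192, 233, 341], 'EU': [0, 1, 5, 6, 9, 23]}

# Convert every list to a set once, so later queries do not repeat the work.
set_index = {term: set(doc_ids) for term, doc_ids in index.items()}


def docs_with_all(set_index, *terms):
    """AND over any number of terms; unknown terms match no documents."""
    if not terms:
        return []
    return sorted(set.intersection(*(set_index.get(t, set()) for t in terms)))


def docs_with_any(set_index, *terms):
    """OR over any number of terms."""
    return sorted(set().union(*(set_index.get(t, set()) for t in terms)))


print(docs_with_all(set_index, 'Farage', 'EU'))  # [0, 5, 9]
print(docs_with_any(set_index, 'Farage', 'EU'))  # [0, 1, 5, 6, 9, 23, 192, 233, 341]
```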
