
How do I write a program that takes a list of strings as input and returns a dictionary mapping each word to the indices of the strings that contain it?

Rules: In the dictionary, each key is a word k, and its value is a list of the indices of the input strings in which the word k appears.

Words should be treated as lowercase only, i.e. Hello and hello should be treated as the same word.

It can be assumed that the dataset contains only lists of strings. There is no need to check the type of the elements in the dataset.

The string data in the dataset will be clean. There is no need to worry about cleaning, i.e. removing punctuation marks or numbers.

In the example below, the function determines the indices of the words in the given dataset. dataset is the list containing the strings.

The reverse_index function is supposed to create and return the dictionary.


dataset = [
    "Hello world",
    "This is the WORLD",
    "hello again"
 ]
res = reverse_index(dataset)

# This assertion checks that the result equals the expected dictionary
assert(res == {
    'hello': [0, 2],
    'world': [0, 1],
    'this': [1],
    'is': [1],
    'the': [1],
    'again':[2]
  })

I'm not really sure what to do next, but this is how I started:

dataset = [
    "Hello world",
    "This is the WORLD",
    "hello again"
 ] 

def reverse_index(dataset):


You can try this approach:

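def reverse_index(data):
    res = {}
    # enumerate yields each string together with its index in the list
    for i, line in enumerate(data):
        # lowercase every word so that "Hello" and "hello" share one key
        for word in map(str.lower, line.split()):
            if word not in res:
                res[word] = [i]
            else:
                res[word].append(i)
    return res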

output:

{
    'hello': [0, 2],
    'world': [0, 1],
    'this': [1],
    'is': [1],
    'the': [1],
    'again':[2]
}
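One caveat with the loop above, as far as I can tell: if a word occurs twice in the same string, its index is appended twice. Here is a minimal sketch that avoids this, assuming each index should appear at most once per word (the question does not say either way). The name reverse_index_unique is my own, not from the question.

```python
def reverse_index_unique(data):
    res = {}
    for i, line in enumerate(data):
        for word in line.lower().split():
            # setdefault returns the existing list, or stores and returns a new one
            indices = res.setdefault(word, [])
            # indices are appended in increasing order, so a repeat of i
            # can only ever be the last element of the list
            if not indices or indices[-1] != i:
                indices.append(i)
    return res
```

For the dataset in the question this gives the same result as the version above, since no word repeats within a single string there.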

You can use collections.defaultdict as a basis and a small loop:

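from collections import defaultdict

res = defaultdict(list)
for i, s in enumerate(dataset):
    # set() keeps each index at most once per word, even if the word repeats in s
    for w in set(map(str.lower, s.split())):
        res[w].append(i)
dict(res)  # convert back to a plain dict for display or comparison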

output:

{'hello': [0, 2],
 'world': [0, 1],
 'is': [1],
 'the': [1],
 'this': [1],
 'again': [2]}
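The snippet above runs at module level, while the question asks for a reverse_index function. A possible way to wrap it, shown as a sketch rather than the only way to do it, and checked against the assertion from the question:

```python
from collections import defaultdict


def reverse_index(dataset):
    res = defaultdict(list)
    for i, s in enumerate(dataset):
        # set() removes words repeated within one string, so each index
        # is recorded at most once per word
        for w in set(s.lower().split()):
            res[w].append(i)
    # return a plain dict so callers do not get defaultdict's auto-insertion
    return dict(res)


dataset = [
    "Hello world",
    "This is the WORLD",
    "hello again",
]

# dict comparison ignores key order, so the set iteration order does not matter
assert reverse_index(dataset) == {
    'hello': [0, 2],
    'world': [0, 1],
    'this': [1],
    'is': [1],
    'the': [1],
    'again': [2],
}
```

Because of the set, the order of keys in the returned dictionary may differ between runs, but the index lists themselves are always in ascending order.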
