
Python: How to use list of keywords to search for a string in a text

So I'm writing a program that loops through multiple .txt files and searches for any number of pre-specified keywords. I'm having trouble finding a way to pass the keyword list to the search.

The code below currently returns the following error:

TypeError: 'in <string>' requires string as left operand, not list

I'm aware that the error is caused by the keyword list, but I have no idea how to supply a large list of keywords without triggering it.

Current code:

from os import listdir

keywords=['Example', 'Use', 'Of', 'Keywords']
 
with open("/home/user/folder/project/result.txt", "w") as f:
    for filename in listdir("/home/user/folder/project/data"):
        with open('/home/user/folder/project/data/' + filename) as currentFile:
            text = currentFile.read()
            #Error Below
            if (keywords in text):
                f.write('Keyword found in ' + filename[:-4] + '\n')
            else:
                f.write('No keyword in ' + filename[:-4] + '\n')

The error occurs on line 10 of the code above, directly under the #Error Below comment. I'm not sure why I can't use a list to search for the keywords. Any help is appreciated, thanks!

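Try looping through the list and checking whether any element is in the text:

for keyword in keywords:
    if keyword in text:
        f.write('Keyword found in ' + filename[:-4] + '\n')
        break
else:
    # for/else: this branch runs only if the loop never hit `break`,
    # i.e. none of the keywords was found in the text
    f.write('No keyword in ' + filename[:-4] + '\n')

You cannot use in to check whether a list is in a string; the left operand has to be a string.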
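To make the cause of the error concrete, here is a minimal sketch (the sample text is made up for illustration). It reproduces the TypeError from the question and then tests each keyword on its own, which works because both operands of in are strings:

```python
keywords = ['Example', 'Use', 'Of', 'Keywords']
text = 'An Example sentence.'  # made-up sample text, not from the question

try:
    keywords in text  # a list on the left of `in`, a str on the right
    raised = False
except TypeError as exc:
    raised = True
    message = str(exc)  # the same kind of message as in the question

# Checking one keyword at a time is fine: str in str is a substring test.
matches = [keyword for keyword in keywords if keyword in text]
print(raised, matches)
```

Note that the substring test is case-sensitive, so 'Use' would not match 'use'.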

I would use regular expressions as they are purpose-built for searching text for substrings.

Only the re.search block is needed to answer the question. The findall and finditer examples are there to demystify them.

# let's pretend the 4 sentences in `text` are 4 different files
text = '''Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum'''.split(sep='. ')

import re

# add more keywords here
keywords = [r'publishing', r'industry']
regex = '|'.join(keywords)
for t in text:
    lst = re.findall(regex, t, re.I)  # re.I makes the match case-insensitive
    for el in lst:
        print(el)

    iterator = re.finditer(regex, t, re.I)
    for el in iterator:
        print(el.span())

    if re.search(regex, t, re.I):
        print('Keyword found in `' + t + '`\n')
    else:
        print('No keyword in `' + t + '`\n')

Output:

industry
(65, 73)
Keyword found in `Lorem Ipsum is simply dummy text of the printing and typesetting industry`

industry
(25, 33)
Keyword found in `Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book`

No keyword in `It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged`

publishing
(132, 142)
Keyword found in `It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum`
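One caveat with joining keywords using '|': each keyword is then treated as a regular expression, so a keyword containing characters such as + or . may not match literally. A possible sketch of a safer variant, assuming the keywords are meant as plain text, is below. The helper name build_keyword_pattern and its whole_words parameter are hypothetical, not part of the answer above:

```python
import re

def build_keyword_pattern(keywords, whole_words=False):
    """Hypothetical helper: compile one case-insensitive pattern for all keywords."""
    # re.escape makes every keyword match literally, even 'C++' or 'a.b'
    body = '|'.join(re.escape(keyword) for keyword in keywords)
    if whole_words:
        # \b only makes sense for keywords that start and end with word characters
        body = r'\b(?:' + body + r')\b'
    return re.compile(body, re.IGNORECASE)

pattern = build_keyword_pattern(['publishing', 'industry', 'C++'])
found_upper = pattern.search('Desktop Publishing software') is not None
found_symbol = pattern.search('written in c++') is not None
found_none = pattern.search('nothing relevant here') is not None

whole = build_keyword_pattern(['use'], whole_words=True)
in_because = whole.search('because') is not None
in_phrase = whole.search('of use') is not None
print(found_upper, found_symbol, found_none, in_because, in_phrase)
```

Compiling the pattern once and reusing it for every file also avoids rebuilding the regex inside the loop.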

You could replace

if (keywords in text):
   ...

with

if any(keyword in text for keyword in keywords):
   ...
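For completeness, here is a rough sketch of the whole loop from the question with the any() check dropped in. The function name report_keywords and the throwaway files in the usage part are assumptions made for illustration; the paths from the question would be passed in instead:

```python
import os
import tempfile

def report_keywords(data_dir, result_path, keywords):
    """Hypothetical helper: write one line per file saying whether any keyword occurs."""
    with open(result_path, 'w') as result:
        for filename in sorted(os.listdir(data_dir)):
            with open(os.path.join(data_dir, filename)) as current_file:
                text = current_file.read()
            name = os.path.splitext(filename)[0]  # drop the extension
            # any() stops at the first keyword that is a substring of text
            if any(keyword in text for keyword in keywords):
                result.write('Keyword found in ' + name + '\n')
            else:
                result.write('No keyword in ' + name + '\n')

# Usage with two made-up files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    data_dir = os.path.join(tmp, 'data')
    os.mkdir(data_dir)
    with open(os.path.join(data_dir, 'a.txt'), 'w') as fh:
        fh.write('An Example of text')
    with open(os.path.join(data_dir, 'b.txt'), 'w') as fh:
        fh.write('nothing here')
    result_path = os.path.join(tmp, 'result.txt')
    report_keywords(data_dir, result_path, ['Example', 'Use'])
    with open(result_path) as fh:
        lines = fh.read().splitlines()
print(lines)
```

os.path.splitext is used here instead of filename[:-4] so that extensions of other lengths are handled too; that is a design choice of this sketch, not something the original code requires.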

