Removing phrases with custom stop words from a list

I have two lists:

listA = ['New Delhi', 'Moscow', 'Berlin', 'France', 'To Washington']
stopwordlist = ['new', 'To']

I am trying to get something like this:

finalList = ['Moscow', 'Berlin', 'France']

What I have tried so far works only when I am looking for complete words:

listB = []
for item in listA:
    if item not in stopwordlist:
        listB.append(item)
    else:
        continue
....            
....
    return listB

I could split each item and then check its words against the stop word list, but that seems like a lot of workarounds. Or I could use a regex with re.match.

Here is one way to do it:

>>> listA = ['New Delhi', 'Moscow', 'Berlin', 'France', 'To Washington']
>>> stopwordlist = ['new', 'To']
>>> finalList = [i for i in listA if not any(j.lower() in i.lower() for j in stopwordlist)]
>>> finalList
['Moscow', 'Berlin', 'France']

Or you can use the built-in filter function:

>>> listA = ['New Delhi', 'Moscow', 'Berlin', 'France', 'To Washington']
>>> stopwordlist = ['new', 'To']
>>> list(filter(lambda x: not any(j.lower() in x.lower() for j in stopwordlist), listA))
['Moscow', 'Berlin', 'France']
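
One caveat with the substring test above: the in operator matches anywhere inside the string, so a stop word such as 'To' also matches inside 'Stockholm'. The sketch below compares the substring test with a whole-word test. The extra city and the helper names has_stop_substring and has_stop_word are made up for illustration and do not come from the original answer.

```python
listA = ['New Delhi', 'Moscow', 'Stockholm', 'To Washington']
stopwordlist = ['new', 'To']

def has_stop_substring(phrase, stop_words):
    """True if any stop word occurs anywhere in the phrase (substring test)."""
    lowered = phrase.lower()
    return any(word.lower() in lowered for word in stop_words)

def has_stop_word(phrase, stop_words):
    """True if any whitespace-separated word of the phrase is a stop word."""
    stop_set = {word.lower() for word in stop_words}
    return any(token in stop_set for token in phrase.lower().split())

by_substring = [p for p in listA if not has_stop_substring(p, stopwordlist)]
by_word = [p for p in listA if not has_stop_word(p, stopwordlist)]
print(by_substring)  # 'Stockholm' is dropped because it contains 'to'
print(by_word)       # 'Stockholm' is kept
```

Which behaviour is wanted depends on the data; for short stop words, the whole-word test is usually the safer default.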

Another option is str.startswith, which accepts a tuple of prefixes:

sl = tuple(i.lower() for i in stopwordlist)
[i for i in listA if not i.lower().startswith(sl)]

This yields:

['Moscow', 'Berlin', 'France']
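
Note that a prefix test on the whole string also removes items whose first word merely begins with a stop word, for example 'Newcastle'. A minimal sketch of a stricter variant, which compares only the complete first word, might look like this. The name starts_with_stop_word and the extra city are hypothetical.

```python
listA = ['New Delhi', 'Newcastle', 'Moscow', 'To Washington']
stopwordlist = ['new', 'To']
stop_words = {w.lower() for w in stopwordlist}

def starts_with_stop_word(phrase):
    """True if the first whitespace-separated word is a stop word."""
    first = phrase.lower().split(maxsplit=1)[:1]  # [] for an empty phrase
    return bool(first) and first[0] in stop_words

kept = [p for p in listA if not starts_with_stop_word(p)]
print(kept)  # 'Newcastle' survives; a plain startswith test would drop it
```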
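
listA = ['New Delhi', 'Moscow', 'Berlin', 'France', 'To Washington']
stopwordlist = ['new', 'To']
# Lowercase only the stop words; each item is lowercased at comparison
# time so the result keeps its original capitalisation.
stopwordlist = [i.lower() for i in stopwordlist]

listB = []

for item in listA:
    flag = True
    # Check every word of the item against the stop words
    for i in item.lower().split(' '):
        if i in stopwordlist:
            flag = False
    if flag:
        listB.append(item)
print(listB)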

You have to lowercase the stop words, and also lowercase the words you compare against them:

listA = ['New Delhi', 'Moscow', 'Berlin', 'France', 'To Washington']
stopwordlist = ['new', 'To']

stop_words = {e.lower() for e in stopwordlist}
finalList = [e for e in listA if not stop_words.intersection(e.lower().split())]

Or you can use a regex:

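import regex as re  # third-party regex module (pip install regex), not the standard library re

# \L<words> is the regex module's named-list syntax: it matches any string in words
stop_words_regex = re.compile(r"\L<words>", words=stop_words)
finalList = [e for e in listA if not stop_words_regex.findall(e.lower())]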

Output:

['Moscow', 'Berlin', 'France']
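
If installing the third-party regex module is not an option, roughly the same idea can be sketched with the standard library re module. This is an alternative written under that assumption, not the answer's own code: it builds one alternation from the escaped stop words and wraps it in \b word boundaries so that 'To' does not match inside 'Washington'.

```python
import re

listA = ['New Delhi', 'Moscow', 'Berlin', 'France', 'To Washington']
stopwordlist = ['new', 'To']

# Escape each stop word in case it contains regex metacharacters, then
# join them into a single alternation wrapped in word boundaries.
pattern = re.compile(
    r"\b(?:" + "|".join(re.escape(w) for w in stopwordlist) + r")\b",
    re.IGNORECASE,
)

# Keep only the items in which no stop word appears as a whole word
finalList = [e for e in listA if not pattern.search(e)]
print(finalList)  # ['Moscow', 'Berlin', 'France']
```

With an empty stop word list the pattern would match at any word boundary and filter out everything, so that case needs a separate guard.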
