Filter elements from list based on them containing spam terms

So I've made a script that scrapes some sites and builds a list of results. Each result has the following structure:

result = {'id': id,
          'name': name,
          'url': url,
          'datetime': datetime,
          }

I want to filter results from the list of results based on spam terms being in the name. I've defined the following function, and it seems to filter certain results, but not all of them:

def filterSpamGigsList(theList):
    index = 0
    spamTerms = ['paid','hire','work','review','survey',
                 'home','rent','cash','pay','flex',
                 'facebook','sex','$$$','boss','secretary',
                 'loan','supplemental','income','sales',
                 'dollars','money']
    for i in theList:
        for y in spamTerms:
            if y in i['name'].lower():
                theList.pop(index)
                break        
            index += 1
    return theList

Any clue why this might not be filtering out all results that contain these spam terms? Maybe I need to call .split() on name after calling .lower(), as some of the names are phrases?

I guess the problem is that you're modifying theList in place while iterating over it, as Jakub suggested.
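To see why removing items mid-loop lets some spam through, here is a minimal sketch. The function name remove_while_iterating and the sample names are made up for illustration and are not part of the original script. When an element is removed, everything after it shifts left by one, but the loop's internal position still advances, so the element that slid into the freed slot is never examined.

```python
def remove_while_iterating(names, term):
    """Buggy on purpose: removes matches from the list it is looping over."""
    for name in names:
        if term in name.lower():
            # Removing shifts later items left; the loop then skips one.
            names.remove(name)
    return names


# Hypothetical sample data: two adjacent spam entries, then a clean one.
sample = ['Paid survey', 'Get paid today', 'Guitar lessons']
leftover = remove_while_iterating(list(sample), 'paid')
print(leftover)  # 'Get paid today' survives because it was skipped
```

As posted, the original function also appears to have a second issue: index += 1 sits inside the inner loop, so index counts spam terms checked rather than positions in theList, and pop(index) can hit the wrong element or run past the end of the list.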

The obvious way would be to return a new list. I would split this into two functions for readability:

def is_spam(value):
    """Return True if value contains any spam term (case-insensitive)."""
    spam_terms = ['paid','hire','work','review','survey',
                  'home','rent','cash','pay','flex',
                  'facebook','sex','$$$','boss','secretary',
                  'loan','supplemental','income','sales',
                  'dollars','money']
    lowered = value.lower()  # lowercase once, not once per term
    return any(term in lowered for term in spam_terms)

def filter_spam_gigs_list(results):
    return [i for i in results if not is_spam(i['name'])]
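A quick usage sketch of the two-function approach, assuming a shortened spam list and made-up result dicts (the SPAM_TERMS subset, URLs, and dates below are placeholders, not data from the original script). It also touches on the .split() question: the in operator on strings does substring matching, so terms are found inside phrases without splitting.

```python
# Shortened, illustrative subset of the spam terms from the question.
SPAM_TERMS = ['paid', 'survey', 'cash']


def is_spam(value):
    """Return True if any spam term occurs as a substring of value."""
    lowered = value.lower()
    return any(term in lowered for term in SPAM_TERMS)


def filter_spam_gigs_list(results):
    """Build a new list and leave the input list untouched."""
    return [r for r in results if not is_spam(r['name'])]


# Hypothetical results in the shape described in the question.
results = [
    {'id': 1, 'name': 'Paid survey', 'url': 'http://example.com/1', 'datetime': '2020-01-01'},
    {'id': 2, 'name': 'Get PAID today', 'url': 'http://example.com/2', 'datetime': '2020-01-02'},
    {'id': 3, 'name': 'Guitar lessons', 'url': 'http://example.com/3', 'datetime': '2020-01-03'},
    {'id': 4, 'name': 'Quick cash for moving help', 'url': 'http://example.com/4', 'datetime': '2020-01-04'},
]

clean = filter_spam_gigs_list(results)
clean_names = [r['name'] for r in clean]
print(clean_names)
```

Because the match is on substrings, short terms will also match inside longer words (for example 'rent' in 'parent'), so the term list may need tuning if that produces false positives.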
