
Cleaning data: How to iterate through a list, find items that contain a string or are blank or whitespace-only, and delete them in Python

I am trying to iterate through a list of data to clean it up.

Here's a small part of the list:

lines =['Wirkstoffliste 1 –  ','','  ', 'Gaschromatographie (GC) ', 'LOQ ', '[mg/kg] ', 'Acibenzolar-S-methyl', 'Aclonifen', 'Acrinathrin', 'Alachlor', 'Aldrin', 'Allethrin', 'Ametryn', 'Antrachinon', 'Atrazin', 'Atrazin-desethyl', 'Atrazin-desisopropyl', 'Azinphos (-ethyl)', 'Azinphos-methyl', 'Benalaxyl', 'Benfluralin', 'Benzoylprop-ethyl',' Seite 13 von 14 ', '   ', ' ', ' ', 'Wirkstoffliste 4 - ','Version 7.2 ']

I want to remove any list item that contains any of the words "Version", "Seite" or "Wirkstoffliste". You will also see that some strings are either empty or contain only whitespace (of various lengths).

I have already cleaned this data up quite a lot with regex, but now I just want the chemical names. Some other unwanted items keep coming up, e.g. "Version", but they are never quite the same: it might be "Version 7.2" or "Version 8.1". So I thought that testing "if 'Version' in string" would find the word inside the string, and I could then delete the item. However, this doesn't seem to work.

Do I really need to use regex for this too?

Here's a bunch of stuff I tried.

I have tried "if string in item":

if "Wirkstoffliste" in item:
    lines.remove(item)

I have tried using OR logic so I could put more search strings in there, e.g.

if "Seite" or "Wirkstoffliste" or "Version" in item:
    lines.remove(item)

I also tried enumerate with del, combined with an "if ... in" test, e.g.

for n,item in enumerate(lines):
    if "Wirkstoffliste" in item:
        del lines[n]

And finally I tried using a list of search strings:

removables=["Seite","Version","Wirkstoffliste","Gaschromatographie","LOQ"]

for line in lines:
    for r in removables:
        if r in line:
            lines.remove(line)

To delete the empty and whitespace-only items I have tried:

"""delete empty items"""
lines = list(filter(None, lines))
lines = list(filter(bool,lines))

and

for item in lines:
    if item=="" or " ":
        lines.remove(item)

None of the above works, so I am a little confused about what I am doing wrong.

Here is a solution using filter and any:

l1 = ['Wirkstoffliste', 'Seite', 'Version']
# assigning to lines[:] (slice assignment) updates the existing list in place
lines[:] = list(filter(str.strip, lines))  # drop empty or whitespace-only items
lines[:] = [x for x in lines if not any(sub in x for sub in l1)]

# you could also write these lines instead, which bind lines to a new list:
#lines = list(filter(str.strip, lines))
#lines = [x for x in lines if not any(sub in x for sub in l1)]
print(lines)

Output:

['Gaschromatographie (GC) ', 'LOQ ', '[mg/kg] ', 'Acibenzolar-S-methyl', 
 'Aclonifen', 'Acrinathrin', 'Alachlor', 'Aldrin', 'Allethrin', 'Ametryn', 
 'Antrachinon', 'Atrazin', 'Atrazin-desethyl', 'Atrazin-desisopropyl', 
 'Azinphos (-ethyl)', 'Azinphos-methyl', 'Benalaxyl', 
 'Benfluralin', 'Benzoylprop-ethyl']
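
For what it's worth, the attempts in the question seem to fail for two separate reasons, which the minimal sketch below tries to illustrate. The sample list and the names buggy and kept are made up for this illustration. First, "Seite" or "Wirkstoffliste" or "Version" in item is evaluated as "Seite" or (...), and a non-empty string is always truthy, so the condition holds for every item. Second, removing from a list while looping over it makes the loop skip the element right after each removal.

```python
sample = ['Version 7.2 ', 'Seite 1 ', 'Aldrin', '', ' ', 'Atrazin']

# Pitfall 1: `or` joins whole expressions, not the search strings.
# The result is the first truthy operand, which is the string "Seite".
condition = "Seite" or "Wirkstoffliste" or "Version" in "Aldrin"
print(repr(condition))  # 'Seite'

# The intended test checks each search string separately.
matches = any(word in "Aldrin" for word in ("Seite", "Wirkstoffliste", "Version"))
print(matches)  # False

# Pitfall 2: removing from the list being iterated skips elements.
buggy = list(sample)
for item in buggy:
    if "Version" in item or "Seite" in item:
        buggy.remove(item)
print(buggy)  # 'Seite 1 ' is still there

# Building a new list avoids the problem.
kept = [item for item in sample if "Version" not in item and "Seite" not in item]
print(kept)
```

This is also why the solution above builds a filtered list rather than calling remove inside the loop.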

Another way to write the code using filter: filter keeps an item if the function returns True for it.

def remove_whitespaces_and_items(item):
    if item.strip() == '':
        return False      # item is blank: don't keep it
    for x in l1:
        if x in item:
            return False  # item contains a word from l1: don't keep it
    return True           # item is not blank and contains no word from l1: keep it

lines = list(filter(remove_whitespaces_and_items, lines))
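
As a side note on why filter(None, lines) from the question leaves the whitespace-only items in place: filter keeps every item for which the function returns a truthy value, and with None it tests the item itself. Only the empty string is falsy; ' ' is not. A small sketch with made-up data (sample is a hypothetical list, not the one from the question):

```python
sample = ['Aldrin', '', ' ', '   ', 'Atrazin ']

# filter(None, ...) tests each item's own truthiness: only '' is dropped.
only_empty_removed = list(filter(None, sample))
print(only_empty_removed)  # ['Aldrin', ' ', '   ', 'Atrazin ']

# str.strip returns the stripped string, which is '' (falsy) for
# whitespace-only items, so those are dropped as well.
blanks_removed = list(filter(str.strip, sample))
print(blanks_removed)  # ['Aldrin', 'Atrazin ']

# An equivalent list comprehension, which may read more clearly.
same_result = [item for item in sample if item.strip()]
print(same_result)  # ['Aldrin', 'Atrazin ']
```

Note that the kept items are not stripped themselves; 'Atrazin ' keeps its trailing space.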

I'm just a simple man, and going with what you tried, I wrote some code that I think is more human-readable:

words = ['Wirkstoffliste', 'Seite', 'Version', '  ']
new_lines = []
for item in lines:
    if not (any(word in item for word in words)):
        if item != "" and item != " ":
            new_lines.append(item)

You can add anything to words. (I just inserted a string of 2 spaces to catch fields of 2, 3 or 4 spaces.) I think that, for the lines you provided and the purpose you described, "mg/kg" would be one to add.

By the way, Frenchy's version is surely better and more elegant.
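
One caveat with putting a two-space string in words: it also matches any item that merely contains two consecutive spaces, and it still relies on the separate check for a single space. A variant that tests item.strip() handles whitespace of any length. The sketch below assumes that behavior is what you want; clean is a hypothetical helper name, and the short lines list is a trimmed-down stand-in for the data in the question.

```python
def clean(lines, words=('Wirkstoffliste', 'Seite', 'Version')):
    """Return the items that are not blank and contain none of `words`."""
    new_lines = []
    for item in lines:
        if not item.strip():
            continue  # empty or whitespace-only item
        if any(word in item for word in words):
            continue  # item contains an unwanted word
        new_lines.append(item)
    return new_lines


lines = ['Wirkstoffliste 4 - ', '', '  ', 'LOQ ', 'Aldrin',
         ' Seite 13 von 14 ', '   ', ' ', 'Version 7.2 ']
print(clean(lines))  # ['LOQ ', 'Aldrin']
```

With this variant an item such as 'a  b' is kept, whereas the two-space entry in words would drop it.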
