How to filter a row based on multiple indexes and multiple conditions?

I have a file which looks like this:

#This is TEST-data
2020-09-07T00:00:03.230+02:00,ID-10,3,London,Manchester,London,1,1,1
2020-09-07T00:00:03.230+02:00,ID-10,3,London,London,Manchester,1,1
2020-09-07T00:00:03.230+02:00,ID-20,2,London,London,1,1
2020-09-07T00:00:03.230+02:00,ID-20,2,London,London1,1
2020-09-07T00:00:03.230+02:00,ID-30,3,Madrid,Sevila,Sevilla,1,1,1
2020-09-07T00:00:03.230+02:00,ID-30,GGG,Madrid,Sevilla,Madrid,1
2020-09-07T00:00:03.230+02:00,ID-40,GGG,Madrid,Barcelona,1,1,1,1
2020-09-07T00:00:03.230+02:00
2020-09-07T00:00:03.230+02:00

Index[2] in each row shows how many cities are present in that specific row. So the first row has the value 3 at index[2], and its cities are London, Manchester, London.

I am trying to do the following:

For every row I need to check whether row[3] or any of the cities listed after it (based on the number of cities) is present in cities_to_filter. But this only needs to be done if row[2] is a number. I also need to handle the fact that some rows contain fewer than 2 items.

This is my code:

import csv

path = r'c:\data\ELK\Desktop\test_data_countries.txt'

cities_to_filter = ['Sevilla', 'Manchester']

def filter_row(row):
    if row[2].isdigit():
        amount_of_cities = int(row[2]) if len(row) > 2 else True
        
    cities_to_check = row[3:3+amount_of_cities]
    
    condition_1 =  any(city in cities_to_check for city in cities_to_filter)    
    return condition_1

with open (path, 'r') as output_file:
    reader = csv.reader(output_file, delimiter = ',')
    next(reader)
    for row in reader:
        if filter_row(row):
            print(row)

Right now I receive the following error:

UnboundLocalError: local variable 'condition_1' referenced before assignment

You could do something like this:

import csv
import sys

def filter_row(row):
    '''Returns True if the row should be removed'''
    if len(row) > 2:
        if row[2].isdigit():
            amount_of_cities = int(row[2]) 
            cities_to_check = row[3:3+amount_of_cities]
        else:
            # don't have valid city count, just try the rest of the row
            cities_to_check = row[3:]
        return any(city in cities_to_check for city in cities_to_filter)

    print(f'Invalid row: {row}', file=sys.stderr)
    return True

with open(path, 'r') as input_file:
    reader = csv.reader(input_file, delimiter=',')
    next(reader)
    for row in reader:
        if filter_row(row):
            print(row)

In filter_row() the row length is checked to ensure that a possible city count in row[2] is present. If the count is a number, it is used to calculate the upper bound of the slice that extracts the cities to check. Otherwise the row is processed from index 3 to its end, which will include the additional number values, but those are unlikely to match city names.

If there are too few fields, the row is filtered by returning True and an error message is printed.
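The behaviour described above can be checked against a few of the sample rows from the question. This is only a minimal sketch, not the exact code from the answer: it reads the rows from an in-memory string (`SAMPLE`, a name introduced here) instead of a file, passes `cities_to_filter` as a parameter rather than relying on a global, and leaves out the error message for short rows.

```python
import csv
import io

# A few rows copied from the question's sample data; SAMPLE is a hypothetical name.
SAMPLE = """#This is TEST-data
2020-09-07T00:00:03.230+02:00,ID-10,3,London,Manchester,London,1,1,1
2020-09-07T00:00:03.230+02:00,ID-20,2,London,London,1,1
2020-09-07T00:00:03.230+02:00,ID-30,GGG,Madrid,Sevilla,Madrid,1
2020-09-07T00:00:03.230+02:00
"""

def filter_row(row, cities_to_filter):
    """Return True if the row is too short or names a city to filter."""
    if len(row) <= 2:
        return True  # too few fields: treated as filtered, as in the answer
    if row[2].isdigit():
        # Valid count: look only at that many fields starting at index 3.
        cities_to_check = row[3:3 + int(row[2])]
    else:
        # No valid count: fall back to the rest of the row.
        cities_to_check = row[3:]
    return any(city in cities_to_check for city in cities_to_filter)

reader = csv.reader(io.StringIO(SAMPLE))
next(reader)  # skip the leading comment line
results = [filter_row(row, ['Sevilla', 'Manchester']) for row in reader]
print(results)  # [True, False, True, True]
```

The row with count 2 contains only London, so it is the only one that is not flagged. The row with GGG is flagged because Sevilla appears in the rest of the row, and the timestamp-only row is flagged because it is too short.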

I suggest you filter first to optimize everything. Here is the beginning of the path you should explore:

import numpy as np
import pandas as pd

test_data = pd.DataFrame({
    'ID': ['ID-10', 'ID-10', 'ID-20', 'ID-20', 'ID-30', 'ID-30', 'ID-40'],
    'id': [3, 3, 2, 2, 3, 'GGG', 'GGG'],
    'cities': [['London', 'Manchester', 'London', 1, 1, 1], ['London', 'Manchester', 'London', 1, 1],
               ['London', 'London', 1, 1], ['London', 'London', 1, 1],
               ['Madrid', 'Sevilla', 'Sevilla', 1, 1, 1], ['Madrid', 'Sevilla', 'Sevilla', 1],
               ['Madrid', 'Barçelona', 1]]})

cities_to_filter = ['Sevilla', 'Manchester']
# Flag rows whose 'id' (the city count) is a number greater than 2
_condition1 = test_data.index.isin(test_data[test_data.id.str.isnumeric() != False][test_data[test_data.id.str.isnumeric() != False].id > 2].index)
test_data['results'] = np.where(_condition1, 1, 0)
test_data

Output: [image of the resulting DataFrame]

You can then apply an any(... in ...) check to filter the cities, but there are many ways to do this.
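One possible way to do that last step is sketched below. It is an assumption about what the answer has in mind, not its actual code: the frame `df`, the helper `has_filtered_city` and the `match` column are names introduced here, and the data is a shortened, hypothetical version of the frame above.

```python
import numbers

import pandas as pd

cities_to_filter = ['Sevilla', 'Manchester']

# Small hypothetical frame in the same shape as the answer's test_data.
df = pd.DataFrame({
    'ID': ['ID-10', 'ID-20', 'ID-30'],
    'id': [3, 2, 'GGG'],
    'cities': [
        ['London', 'Manchester', 'London', 1, 1, 1],
        ['London', 'London', 1, 1],
        ['Madrid', 'Sevilla', 'Madrid', 1],
    ],
})

def has_filtered_city(record):
    """Check the first `id` entries, or the whole list if `id` is not a number."""
    count = record['id']
    if isinstance(count, numbers.Integral):
        cities = record['cities'][:count]
    else:
        cities = record['cities']
    return any(city in cities for city in cities_to_filter)

# Row-wise apply: simple to read, though not the fastest option on large frames.
df['match'] = df.apply(has_filtered_city, axis=1)
print(df[['ID', 'match']])
```

Applying a Python function row by row keeps the logic identical to the plain csv version. For large files a vectorised approach would likely be faster, but it is harder to write when each row holds a list of different length.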
