How to filter a row based on multiple indexes and multiple conditions?
I have a file which looks like this:
#This is TEST-data
2020-09-07T00:00:03.230+02:00,ID-10,3,London,Manchester,London,1,1,1
2020-09-07T00:00:03.230+02:00,ID-10,3,London,London,Manchester,1,1
2020-09-07T00:00:03.230+02:00,ID-20,2,London,London,1,1
2020-09-07T00:00:03.230+02:00,ID-20,2,London,London1,1
2020-09-07T00:00:03.230+02:00,ID-30,3,Madrid,Sevila,Sevilla,1,1,1
2020-09-07T00:00:03.230+02:00,ID-30,GGG,Madrid,Sevilla,Madrid,1
2020-09-07T00:00:03.230+02:00,ID-40,GGG,Madrid,Barcelona,1,1,1,1
2020-09-07T00:00:03.230+02:00
2020-09-07T00:00:03.230+02:00
Index[2] in each row shows how many cities are present in that specific row. So the first row has the value 3 for index[2], and the cities are London, Manchester, London.
I am trying to do the following:
For every row I need to check whether row[3] or any of the cities listed after it (based on the number of cities) is present in cities_to_filter. This only needs to be done if row[2] is a number. I also need to handle the fact that some rows contain fewer than 2 items.
This is my code:
import csv

path = r'c:\data\ELK\Desktop\test_data_countries.txt'
cities_to_filter = ['Sevilla', 'Manchester']

def filter_row(row):
    if row[2].isdigit():
        amount_of_cities = int(row[2]) if len(row) > 2 else True
        cities_to_check = row[3:3+amount_of_cities]
        condition_1 = any(city in cities_to_check for city in cities_to_filter)
    return condition_1

with open(path, 'r') as output_file:
    reader = csv.reader(output_file, delimiter=',')
    next(reader)
    for row in reader:
        if filter_row(row):
            print(row)
Right now I receive the following error:
UnboundLocalError: local variable 'condition_1' referenced before assignment
You could do something like this:
import csv
import sys

# path and cities_to_filter are defined as in the question

def filter_row(row):
    '''Returns True if the row should be removed'''
    if len(row) > 2:
        if row[2].isdigit():
            amount_of_cities = int(row[2])
            cities_to_check = row[3:3+amount_of_cities]
        else:
            # don't have valid city count, just try the rest of the row
            cities_to_check = row[3:]
        return any(city in cities_to_check for city in cities_to_filter)
    # too few fields to contain a city count
    print(f'Invalid row: {row}', file=sys.stderr)
    return True

with open(path, 'r') as input_file:
    reader = csv.reader(input_file, delimiter=',')
    next(reader)  # skip the comment line at the top of the file
    for row in reader:
        if filter_row(row):
            print(row)
In filter_row() the row length is checked to ensure that a possible city count in row[2] is present. If the count is a number, it is used to calculate the upper bound of the slice that extracts the cities to check. Otherwise the row is processed from index 3 to the end, which will include the additional number values, but probably not city names.
If there are too few fields, the row is filtered by returning True and an error message is printed.
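To try the checks described above without the file on disk, they can be run against a few sample rows held in memory. This is only a sketch under stated assumptions: `row_matches`, `matching_ids` and `SAMPLE` are names made up for illustration, the cities are passed in as a parameter, and short rows are skipped by returning False rather than reported on stderr.

```python
import csv
import io

# Hypothetical sample: the comment line plus a few rows from the question.
SAMPLE = """#This is TEST-data
2020-09-07T00:00:03.230+02:00,ID-10,3,London,Manchester,London,1,1,1
2020-09-07T00:00:03.230+02:00,ID-20,2,London,London,1,1
2020-09-07T00:00:03.230+02:00,ID-30,GGG,Madrid,Sevilla,Madrid,1
2020-09-07T00:00:03.230+02:00
"""

def row_matches(row, cities):
    """Return True if any wanted city appears among the row's city fields."""
    if len(row) <= 2:
        # Too few fields to hold a city count (assumption: skip such rows).
        return False
    if row[2].isdigit():
        # Valid count: look only at that many fields after index 2.
        candidates = row[3:3 + int(row[2])]
    else:
        # No valid count: fall back to the rest of the row.
        candidates = row[3:]
    return any(city in candidates for city in cities)

def matching_ids(text, cities):
    """Collect the ID field of every matching row, skipping the first line."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the '#This is TEST-data' line
    return [row[1] for row in reader if row_matches(row, cities)]

print(matching_ids(SAMPLE, ['Sevilla', 'Manchester']))
```

Because the slice is bounded by the count, a wanted city that appears after the counted fields is ignored, while a row with a non-numeric count is searched to its end.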
I suggest you filter first, so that the more expensive city check only runs on the rows that need it. Here is a starting point you could explore:
import numpy as np
import pandas as pd

test_data = pd.DataFrame({
    'ID': ['ID-10', 'ID-10', 'ID-20', 'ID-20', 'ID-30', 'ID-30', 'ID-40'],
    'id': [3, 3, 2, 2, 3, 'GGG', 'GGG'],
    'cities': [
        ['London', 'Manchester', 'London', 1, 1, 1],
        ['London', 'Manchester', 'London', 1, 1],
        ['London', 'London', 1, 1],
        ['London', 'London', 1, 1],
        ['Madrid', 'Sevilla', 'Sevilla', 1, 1, 1],
        ['Madrid', 'Sevilla', 'Sevilla', 1],
        ['Madrid', 'Barçelona', 1],
    ],
})
cities_to_filter = ['Sevilla', 'Manchester']

# Keep the rows whose 'id' is not a non-numeric string such as 'GGG'
is_number = test_data.id.str.isnumeric() != False
numeric_rows = test_data[is_number]
# Flag the rows whose city count is greater than 2
_condition1 = test_data.index.isin(numeric_rows[numeric_rows.id > 2].index)
test_data['results'] = np.where(_condition1, 1, 0)
test_data
You can then apply an any(... in ...) check to filter on the cities, though there are many ways to do this.
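The any() step mentioned above could look roughly like this. It is a sketch in plain Python so that it runs without pandas; `has_wanted_city` is a hypothetical helper name, and it assumes the count and the city list are shaped like the `id` and `cities` columns of the DataFrame above. Following the question, rows with a non-numeric count are not checked at all.

```python
def has_wanted_city(count, cities, wanted):
    """Return True if a wanted city is among the first `count` entries of `cities`."""
    if not (isinstance(count, int) or str(count).isdigit()):
        # Non-numeric counts such as 'GGG' are skipped (per the question).
        return False
    # Only the first `count` entries are city names; the rest are numbers.
    return any(city in wanted for city in cities[:int(count)])

cities_to_filter = ['Sevilla', 'Manchester']
print(has_wanted_city(3, ['London', 'Manchester', 'London', 1, 1, 1], cities_to_filter))
print(has_wanted_city(2, ['London', 'London', 1, 1], cities_to_filter))
print(has_wanted_city('GGG', ['Madrid', 'Barçelona', 1], cities_to_filter))
```

With the DataFrame above, one option would be to call this per row through `DataFrame.apply` with `axis=1`, for example `test_data.apply(lambda r: has_wanted_city(r['id'], r['cities'], cities_to_filter), axis=1)`, and store the result in a new column.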