Smarter way to check if a string contains an element in a list - python
The list top_brands contains brand names, for example:
top_brands = ['Coca Cola', 'Apple', 'Victoria\'s Secret', ....]
items is a pandas.DataFrame with the structure shown below. My task is to fill in brand_name from item_title wherever brand_name is missing.
row | item_title            | brand_name
1   | Apple 6S              | Apple
2   | New Victoria's Secret | missing  <-- need to fill with Victoria's Secret
3   | Used Samsung TV       | missing  <-- need to fill with Samsung
4   | Used bike             | missing  <-- no need to do anything because there is no brand name in the title
....
My code is below. The problem is that it is far too slow for a DataFrame with 2 million records. Is there a way to handle this task with pandas or numpy?
def get_brand_name(row):
    if row['brand_name'] != 'missing':
        return row['brand_name']
    item_title = row['item_title']
    for brand in top_brands:
        brand_start = brand + ' '
        brand_in_between = ' ' + brand + ' '
        brand_end = ' ' + brand
        if ((brand_in_between in item_title) or item_title.endswith(brand_end) or item_title.startswith(brand_start)):
            print(brand)
            return brand
    return 'missing'  ### end of get_brand_name

items['brand_name'] = items.apply(lambda x: get_brand_name(x), axis=1)
Try this:
pd.concat([df['item_title'], df['item_title'].str.extract('(?P<brand_name>{})'.format("|".join(top_brands)), expand=True).fillna('missing')], axis=1)
Output:
item_title brand_name
0 Apple 6S Apple
1 New Victoria's Secret Victoria's Secret
2 Used Samsung TV Samsung
3 Used Bike missing
I tested this on my machine with a random sample of 2 million items:
import time

import pandas as pd

def read_file():
    df = pd.read_csv('file1.txt')
    new_df = pd.concat([df['item_title'], df['item_title'].str.extract('(?P<brand_name>{})'.format("|".join(top_brands)), expand=True).fillna('missing')], axis=1)
    return new_df

start = time.time()
print(read_file())
end = time.time() - start
print(f'Took {end}s to process')
Output:
item_title brand_name
0 LG watch LG
1 Sony watch Sony
2 Used Burger missing
3 New Bike missing
4 New underwear missing
5 New Sony Sony
6 Used Apple underwear Apple
7 Refurbished Panasonic Panasonic
8 Used Victoria's Secret TV Victoria's Secret
9 Disney phone Disney
10 Used laptop missing
... ... ...
1999990 Refurbished Disney tablet Disney
1999991 Refurbished laptop missing
1999992 Nintendo Coffee Nintendo
1999993 Nintendo desktop Nintendo
1999994 Refurbished Victoria's Secret Victoria's Secret
1999995 Used Burger missing
1999996 Nintendo underwear Nintendo
1999997 Refurbished Apple Apple
1999998 Refurbished Sony Sony
1999999 New Google phone Google
[2000000 rows x 2 columns]
Took 3.2660000324249268s to process
My machine's specs:
Windows 7 Pro 64-bit, Intel i7-4770 @ 3.40 GHz, 12.0 GB RAM
3.266 seconds is pretty fast... right?
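One caveat with the str.extract pattern above: the brand names are joined into the regex unescaped and without word boundaries, so a brand containing regex metacharacters, or one that happens to be a prefix of a longer word, may match in ways the question's original loop would not. Below is a minimal sketch of a stricter pattern, assuming space-delimited whole-word matching is what is wanted. The helper name build_brand_pattern and the sample titles are hypothetical and not part of the original answer.

```python
import re

import pandas as pd


def build_brand_pattern(brands):
    """Build one regex matching any brand as a whole, whitespace-delimited run."""
    # Longest brands first, so a longer name wins over a shorter one it contains.
    ordered = sorted(brands, key=len, reverse=True)
    alternatives = "|".join(re.escape(b) for b in ordered)
    # (?<!\S) and (?!\S): the match must be bounded by whitespace or a string edge.
    return r"(?<!\S)(?P<brand_name>{})(?!\S)".format(alternatives)


top_brands = ["Coca Cola", "Apple", "Victoria's Secret", "Samsung"]
df = pd.DataFrame({"item_title": ["Apple 6S", "New Victoria's Secret",
                                  "Used Samsung TV", "Used bike",
                                  "Applesauce jar"]})

pattern = build_brand_pattern(top_brands)
# expand=False returns a Series because the pattern has a single capture group.
df["brand_name"] = (df["item_title"]
                    .str.extract(pattern, expand=False)
                    .fillna("missing"))
print(df)
```

With the unbounded pattern, a title such as "Applesauce jar" would be tagged as Apple; the lookarounds here are meant to prevent that, at some cost in regex speed that is worth measuring on the real data.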
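Since multi-word brands need to be recognized, this is an NER (named entity recognition) task.
You need to group the words in item_title into chunks of up to n words, for example
['New','New Victoria\'s', 'New Victoria\'s Secret', 'Victoria\'s', 'Victoria\'s Secret', 'Secret']
and then check the chunks against your brand list.
If you expect misspellings, build a trigram index of your brand list, then break each item_title chunk into trigrams and score them against the trigram index. Alternatively, you can use Levenshtein distance on the chunks, with a tolerance of some n steps, to avoid outright mismatches.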
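The chunking idea can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names title_chunks and find_brand are hypothetical, the 0.8 cutoff is an arbitrary choice, and the standard library's difflib similarity ratio stands in for a true Levenshtein distance or trigram index.

```python
import difflib


def title_chunks(title, max_words):
    """Return every run of 1..max_words consecutive words in the title."""
    words = title.split()
    return [
        " ".join(words[i:j])
        for i in range(len(words))
        for j in range(i + 1, min(i + max_words, len(words)) + 1)
    ]


def find_brand(title, brands, cutoff=0.8):
    """Try an exact chunk lookup first, then a fuzzy fallback for misspellings."""
    max_words = max(len(b.split()) for b in brands)
    # Longer chunks first, so multi-word brands are tried before their parts.
    chunks = sorted(title_chunks(title, max_words), key=len, reverse=True)
    brand_set = set(brands)
    for chunk in chunks:
        if chunk in brand_set:
            return chunk
    # Fuzzy fallback (assumption: difflib ratio instead of Levenshtein distance).
    for chunk in chunks:
        close = difflib.get_close_matches(chunk, brands, n=1, cutoff=cutoff)
        if close:
            return close[0]
    return "missing"


top_brands = ["Coca Cola", "Apple", "Victoria's Secret", "Samsung"]
for title in ["New Victoria's Secret", "Used Samsnug TV", "Used bike"]:
    print(title, "->", find_brand(title, top_brands))
```

This runs per title in pure Python, so on 2 million rows it will likely be much slower than a vectorized regex; it may make more sense to apply it only to the rows that the exact match leaves unfilled.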
It seems to me that something like this could work:
top_brands = [r'Coca Cola', r'Apple', r'Victoria\'s Secret', r'Samsung']
df = pd.DataFrame({
    'item_title': ['Apple 6S', 'New Victoria\'s Secret', 'Used Samsung TV', 'Used bike'],
    'brand_name': ['Apple', 'missing', 'missing', 'missing']
}, columns=['item_title', 'brand_name'])
# item_title brand_name
# 0 Apple 6S Apple
# 1 New Victoria's Secret missing
# 2 Used Samsung TV missing
# 3 Used bike missing
# concatenate brand names into regex string
# with each brand as a capture group
top_brands = '|'.join(['(' + x + ')' for x in top_brands])
# "(Coca Cola)|(Apple)|(Victoria\\'s Secret)|(Samsung)"
df.loc[:, 'brand_name'] = df['item_title'].str.extract(
    top_brands).fillna('').sum(axis=1).replace('', 'missing')
# item_title brand_name
# 0 Apple 6S Apple
# 1 New Victoria's Secret Victoria's Secret
# 2 Used Samsung TV Samsung
# 3 Used bike missing
Build a dataset with 2M data points:
import pandas as pd
import time
top_brands = ['Coca Cola', 'Apple', 'Victoria\'s Secret', 'Samsung']
items = pd.DataFrame(
    [['Apple 6S', 'Apple'],
     ['New Victoria\'s Secret', 'missing'],
     ['Used Samsung TV', 'missing'],
     ['Used bike', 'missing']],
    columns=['item_title', 'brand_name'])
items = pd.concat([items]*500000, ignore_index=True)
Timing the original code as a baseline for comparison:
''' Code Block 1 '''
items1 = items.copy()
t = time.time()
def get_brand_name_v1(row):
    if row['brand_name'] != 'missing':
        return row['brand_name']
    item_title = row['item_title']
    for brand in top_brands:
        brand_start = brand + ' '
        brand_in_between = ' ' + brand + ' '
        brand_end = ' ' + brand
        if ((brand_in_between in item_title) or \
                item_title.endswith(brand_end) or \
                item_title.startswith(brand_start)):
            return brand
    return 'missing'
items1['brand_name'] = items1.apply(lambda x: get_brand_name_v1(x), axis=1)
print('Code Block 1 time: {:f}'.format(time.time()-t))
# Code Block 1 time: 53.718933
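A modified version of the code: working with NaN values is generally faster than comparing against the string 'missing'. Also, in my experience, creating a temporary "pointer" to a value in the DataFrame is slightly faster than indexing into the row repeatedly (for example, using brand_name as the pointer instead of calling row['brand_name'] multiple times).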
''' Code Block 2 '''
items2 = items.copy()
t = time.time()
items2.loc[:,'brand_name'].replace(['missing'], [None], inplace=True)
def get_brand_name_v2(row):
    brand_name = row['brand_name']
    if brand_name is not None: return brand_name
    item_title = row['item_title']
    for brand in top_brands:
        if brand in item_title: return brand
items2['brand_name'] = items2.apply(lambda x: get_brand_name_v2(x), axis=1)
items2.loc[:,'brand_name'].fillna('missing', inplace=True)
print('Code Block 2 time: {:f}'.format(time.time()-t))
# Code Block 2 time: 47.940444
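Inspired by Idlehands' answer: this version does not discard the information already in the brand_name column of the original dataset; it only fills in the missing values. You gain speed this way, but it uses more memory.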
''' Code Block 3 '''
items3 = items.copy()
items3.loc[:,'brand_name'].replace(['missing'], [None], inplace=True)
t = time.time()
brands = (items3['item_title'].str.extract(
'(?P<brand_name>{})'.format("|".join(top_brands)), expand=True))
brands.loc[:,'brand_name'].fillna('missing', inplace=True)
items3.loc[:,'brand_name'].fillna(brands.loc[:,'brand_name'], inplace=True)
print('Code Block 3 time: {:f}'.format(time.time()-t))
# Code Block 3 time: 3.388266
If you can afford to use NaN instead of 'missing' in your dataset, and drop all the operations that replace NaN with 'missing', you can make these even faster.
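A minimal sketch of what that could look like, assuming the dataset stores NaN for unknown brands from the start. The small sample frame mirrors the question's data, and escaping the brand names with re.escape is an addition not present in the code blocks above.

```python
import re

import numpy as np
import pandas as pd

top_brands = ["Coca Cola", "Apple", "Victoria's Secret", "Samsung"]
items = pd.DataFrame({
    "item_title": ["Apple 6S", "New Victoria's Secret",
                   "Used Samsung TV", "Used bike"],
    # Assumption: unknown brands are stored as NaN rather than 'missing'.
    "brand_name": ["Apple", np.nan, np.nan, np.nan],
})

# One capture group holding all brand alternatives.
pattern = "({})".format("|".join(re.escape(b) for b in top_brands))
extracted = items["item_title"].str.extract(pattern, expand=False)

# Keep existing brand names; fill only the NaN rows from the extracted values.
items["brand_name"] = items["brand_name"].fillna(extracted)
print(items)
```

Rows with no recognizable brand simply stay NaN, so no replace or fillna('missing') pass is needed before or after the extraction.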