Smarter way to check if a string contains an element in a list - python


top_brands = ['Coca Cola', 'Apple', 'Victoria\'s Secret', ....]

items is a pandas.DataFrame with the structure shown below. My task is to fill in brand_name from item_title whenever brand_name is missing.

row  |  item_title               |  brand_name
1    |  Apple 6S                 |  Apple
2    |  New Victoria's Secret    |  missing  <-- needs to be filled with Victoria's Secret
3    |  Used Samsung TV          |  missing  <-- needs to be filled with Samsung
4    |  Used bike                |  missing  <-- nothing to do: there is no brand name in the title

My code is below. The problem is that it is far too slow for a dataframe with 2 million records. Is there a way to do this with pandas or numpy?

def get_brand_name(row):
    if row['brand_name'] != 'missing':
        return row['brand_name']

    item_title = row['item_title']

    for brand in top_brands:
        brand_start = brand + ' '
        brand_in_between = ' ' + brand + ' '
        brand_end = ' ' + brand
        if ((brand_in_between in item_title) or item_title.endswith(brand_end) or item_title.startswith(brand_start)): 
            return brand

    return 'missing'    ### end of get_brand_name

items['brand_name'] = items.apply(lambda x: get_brand_name(x), axis=1)


pd.concat([df['item_title'], df['item_title'].str.extract('(?P<brand_name>{})'.format("|".join(top_brands)), expand=True).fillna('missing')], axis=1)


              item_title         brand_name
0               Apple 6S              Apple
1  New Victoria's Secret  Victoria's Secret
2        Used Samsung TV            Samsung
3              Used Bike            missing
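One caveat with joining the brand names directly: any regex metacharacter in a name is read as pattern syntax, and a brand can match inside a longer word, which the loop in the question does not allow. Below is a minimal sketch that escapes each name and requires whitespace or a string boundary on both sides. The sample titles (including 'AppleCare plan') and the variable names are made up for illustration. Unlike the loop in the question, it also matches a title that consists of the brand name alone.

```python
import re
import pandas as pd

top_brands = ['Coca Cola', 'Apple', "Victoria's Secret", 'Samsung']

# Longest names first, so that a longer brand wins over a shorter one
# that happens to be its prefix.
alternatives = '|'.join(
    re.escape(b) for b in sorted(top_brands, key=len, reverse=True))

# (?<!\S) and (?!\S): the match must be preceded and followed by
# whitespace or by the start/end of the title.
pattern = r'(?<!\S)(?P<brand_name>{})(?!\S)'.format(alternatives)

# Hypothetical sample data; 'AppleCare plan' must not match 'Apple'.
titles = pd.Series(['Apple 6S', "New Victoria's Secret",
                    'AppleCare plan', 'Used bike'])
brands = titles.str.extract(pattern, expand=False).fillna('missing')
print(brands.tolist())
```

Whether whole-word matching is wanted depends on the data; if partial matches are acceptable, dropping the two lookarounds gives back the plain alternation.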


import time
import pandas as pd

def read_file():
    df = pd.read_csv('file1.txt')
    new_df = pd.concat([df['item_title'], df['item_title'].str.extract('(?P<brand_name>{})'.format("|".join(top_brands)), expand=True).fillna('missing')], axis=1)
    return new_df

start = time.time()
new_df = read_file()  # the call being timed
end = time.time() - start
print(new_df)
print(f'Took {end}s to process')


                                   item_title         brand_name
0                                    LG watch                 LG
1                                  Sony watch               Sony
2                                 Used Burger            missing
3                                    New Bike            missing
4                               New underwear            missing
5                                    New Sony               Sony
6                        Used Apple underwear              Apple
7                       Refurbished Panasonic          Panasonic
8                   Used Victoria's Secret TV  Victoria's Secret
9                                Disney phone             Disney
10                                Used laptop            missing
...                                       ...                ...
1999990             Refurbished Disney tablet             Disney
1999991                    Refurbished laptop            missing
1999992                       Nintendo Coffee           Nintendo
1999993                      Nintendo desktop           Nintendo
1999994         Refurbished Victoria's Secret  Victoria's Secret
1999995                           Used Burger            missing
1999996                    Nintendo underwear           Nintendo
1999997                     Refurbished Apple              Apple
1999998                      Refurbished Sony               Sony
1999999                      New Google phone             Google

[2000000 rows x 2 columns]
Took 3.2660000324249268s to process


Windows 7 Pro 64-bit, Intel i7-4770 @ 3.40 GHz, 12.0 GB RAM




['New','New Victoria\'s', 'New Victoria\'s Secret', 'Victoria\'s', 'Victoria\'s Secret', 'Secret']


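If you expect misspellings, you could build a trigram index of your brand list, break each item_title chunk into trigrams, and score them against that index. Alternatively, you could use the Levenshtein distance on the chunks, with a tolerance of some n steps to rule out really bad matches.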
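The trigram idea can be sketched in plain Python as follows. This is only an illustration: the helper names, the padding scheme, the Jaccard score, and the 0.5 threshold are all assumptions, not something the answer specifies, and the nested loops are not tuned for 2 million rows.

```python
def trigrams(text):
    """Return the set of character trigrams of a padded, lower-cased string."""
    padded = '  ' + text.lower() + ' '
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def similarity(a, b):
    """Jaccard similarity between the trigram sets of two strings."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

def word_chunks(title):
    """All contiguous word sequences of a title: 'New', 'New Apple', 'Apple', ..."""
    words = title.split()
    return [' '.join(words[i:j])
            for i in range(len(words))
            for j in range(i + 1, len(words) + 1)]

def best_brand(title, brands, threshold=0.5):
    """Return the brand that best matches any chunk, or 'missing'."""
    best_score, best = 0.0, 'missing'
    for chunk in word_chunks(title):
        for brand in brands:
            score = similarity(chunk, brand)
            # Keep only matches above the (assumed) threshold.
            if score >= threshold and score > best_score:
                best_score, best = score, brand
    return best

top_brands = ['Coca Cola', 'Apple', "Victoria's Secret", 'Samsung']
print(best_brand('Used Samsunng TV', top_brands))       # misspelled brand
print(best_brand("New Victoria's Secret", top_brands))  # exact brand
print(best_brand('Used bike', top_brands))              # no brand in title
```

A real trigram index would map each trigram to the brands containing it, so that only candidate brands are scored; the brute-force comparison here just keeps the sketch short.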


top_brands = [r'Coca Cola', r'Apple', r'Victoria\'s Secret', r'Samsung']

df = pd.DataFrame({
         'item_title': ['Apple 6S', 'New Victoria\'s Secret', 'Used Samsung TV', 'Used bike'],
         'brand_name': ['Apple', 'missing', 'missing', 'missing']
         }, columns=['item_title' ,'brand_name'])

#               item_title brand_name
# 0               Apple 6S      Apple
# 1  New Victoria's Secret    missing
# 2        Used Samsung TV    missing
# 3              Used bike    missing

# concatenate brand names into regex string
# with each brand as a capture group
top_brands = '|'.join(['(' + x + ')'  for x in top_brands])

# "(Coca Cola)|(Apple)|(Victoria\\'s Secret)|(Samsung)"

df.loc[:, 'brand_name'] = df['item_title'].str.extract(
                          top_brands).fillna('').sum(axis=1).replace('', 'missing')

#               item_title         brand_name
# 0               Apple 6S              Apple
# 1  New Victoria's Secret  Victoria's Secret
# 2        Used Samsung TV            Samsung
# 3              Used bike            missing


import pandas as pd
import time
top_brands = ['Coca Cola', 'Apple', 'Victoria\'s Secret', 'Samsung']
items = pd.DataFrame(
        [['Apple 6S', 'Apple'],
         ['New Victoria\'s Secret', 'missing'],
         ['Used Samsung TV', 'missing'],
         ['Used bike', 'missing']],
         columns=['item_title', 'brand_name'])
items = pd.concat([items]*500000, ignore_index=True)


''' Code Block 1 '''
items1 = items.copy()
t = time.time()
def get_brand_name_v1(row):
    if row['brand_name'] != 'missing':
        return row['brand_name']
    item_title = row['item_title']
    for brand in top_brands:
        brand_start = brand + ' '
        brand_in_between = ' ' + brand + ' '
        brand_end = ' ' + brand
        if ((brand_in_between in item_title) or \
            item_title.endswith(brand_end) or  \
            item_title.startswith(brand_start)):
            return brand
    return 'missing'
items1['brand_name'] = items1.apply(lambda x: get_brand_name_v1(x), axis=1)
print('Code Block 1 time: {:f}'.format(time.time()-t))

# Code Block 1 time: 53.718933

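A modified version of your code: working with NaN values is generally faster than comparing against the string 'missing'. Also, in my experience, creating a temporary "pointer" to a value in the dataframe is a bit faster than calling into the dataframe repeatedly (for example, using brand_name as the pointer instead of calling row['brand_name'] multiple times).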

''' Code Block 2 '''
items2 = items.copy()
t = time.time()
items2.loc[:,'brand_name'].replace(['missing'], [None], inplace=True)
def get_brand_name_v2(row):
    brand_name = row['brand_name']
    if brand_name is not None: return brand_name
    item_title = row['item_title']
    for brand in top_brands:
        if brand in item_title: return brand
items2['brand_name'] = items2.apply(lambda x: get_brand_name_v2(x), axis=1)
items2.loc[:,'brand_name'].fillna('missing', inplace=True)
print('Code Block 2 time: {:f}'.format(time.time()-t))

# Code Block 2 time: 47.940444

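Inspired by Idlehands' answer: this version does not ignore the information in the brand_name column of the original dataset, and fills in only the missing values. You gain speed this way, at the cost of more memory.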

''' Code Block 3 '''
items3 = items.copy()
items3.loc[:,'brand_name'].replace(['missing'], [None], inplace=True)
t = time.time()
brands = (items3['item_title'].str.extract(
        '(?P<brand_name>{})'.format("|".join(top_brands)), expand=True))
brands.loc[:,'brand_name'].fillna('missing', inplace=True)
items3.loc[:,'brand_name'].fillna(brands.loc[:,'brand_name'], inplace=True)
print('Code Block 3 time: {:f}'.format(time.time()-t))

# Code Block 3 time: 3.388266
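A variation on the same idea is to run the extraction only on the rows whose brand is missing, rather than on the whole column. This is a sketch under assumptions, not a benchmarked replacement: the names `brand`, `todo` and `found` are illustrative, and it uses `Series.mask` plus plain assignment instead of `inplace=True` on a `.loc` selection.

```python
import re
import pandas as pd

top_brands = ['Coca Cola', 'Apple', "Victoria's Secret", 'Samsung']
items = pd.DataFrame(
        [['Apple 6S', 'Apple'],
         ["New Victoria's Secret", 'missing'],
         ['Used Samsung TV', 'missing'],
         ['Used bike', 'missing']],
         columns=['item_title', 'brand_name'])

# Turn the 'missing' marker into NaN so pandas' null handling applies.
brand = items['brand_name'].mask(items['brand_name'] == 'missing')
todo = brand.isna()

# Escape the names so they are matched literally.
pattern = '({})'.format('|'.join(re.escape(b) for b in top_brands))

# Extract only from the titles that still need a brand.
found = items.loc[todo, 'item_title'].str.extract(pattern, expand=False)

# fillna aligns on the index, so existing brands are left untouched.
items['brand_name'] = brand.fillna(found).fillna('missing')
print(items)
```

How much this saves depends on the share of rows that are actually missing; with few missing rows the regex runs on a correspondingly small subset.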



