
Remove numbers from list, if not contained in substring of other list

Here's my situation:

I have one list of product names such as:
BLUEAPPLE, GREENBUTTON20, 400100DUCK20 (len = 9000)
and a list of official item names such as:
BLUEAPPLE, GREENBUTTON, 100DUCK (len = 2700)

As I'll be applying fuzzy string matching between products and items, I want to strip the unnecessary numbers from the product names, but keep the numbers that appear in official item names.

I came up with a solution, but the issue is that it runs very slowly.

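import re

def remove_nums(product):
    # Only products that contain a digit need any work
    if re.search(r'\d', product):
        for item in item_nums_list:
            if item in product:
                # Split around the item name, re-inserting the item between the pieces
                substrings = [u for x in product.split(item) for u in (x, item)][:-1]
                # Strip digits from each piece unless the piece occurs inside the item name
                no_num_list = [re.sub(r'(\d+)', '', substring) if substring not in item else substring for substring in substrings]
                return ''.join(no_num_list)
        # No item name found: drop every digit
        return re.sub(r'(\d+)', '', product)
    else:
        return product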

Example:

product_name = '400100DUCK20'
item = '100DUCK'
substrings = ['400','100DUCK','20']
no_num_list = ['','100DUCK','']
returns '100DUCK'

This function is mapped over the product list, so it loops over every product.

I've been trying to figure out a way to use lambdas, map, apply, etc. here, but can't quite wrap my head around it. What would be the most efficient way to accomplish what I am trying to do, either with plain lists or in pandas? Alternatively, I'm getting these item and product lists from a postgres database, so if you think it'd be faster to do in psql I'd go that route.

difflib.get_close_matches() will at least help clean up your code and will probably run faster.

import difflib
p_names = ['BLUEAPPLE', 'GREENBUTTON20', '400100DUCK20']
i_names = ['BLUEAPPLE', 'GREENBUTTON', '100DUCK']
for p in p_names:
    print(difflib.get_close_matches(p, i_names))

>>> 
['BLUEAPPLE']
['GREENBUTTON']
['100DUCK']
>>> 

There are still going to be a lot of comparisons taking place: it has to compare every string in p_names against every string in i_names.
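If only the single best candidate per product is needed, the `n` and `cutoff` parameters of `difflib.get_close_matches()` can narrow the result. The following is a minimal sketch under that assumption; the helper name `best_item` and the dictionary `closest` are illustrative and not part of the original answer, and the number of comparisons is unchanged.

```python
import difflib

p_names = ['BLUEAPPLE', 'GREENBUTTON20', '400100DUCK20']
i_names = ['BLUEAPPLE', 'GREENBUTTON', '100DUCK']

def best_item(product, items, cutoff=0.6):
    """Return the closest item name, or None if nothing reaches the cutoff.

    Hypothetical helper: cutoff=0.6 is simply difflib's default.
    """
    matches = difflib.get_close_matches(product, items, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Map each product name to its closest official item name
closest = {p: best_item(p, i_names) for p in p_names}
print(closest)
```

Raising `cutoff` makes the match stricter, which may matter for product names that only loosely resemble an item name.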


Similar to your approach, using a regular expression to find a match:

import re
for p in p_names:
    for i in i_names:
        # re.escape treats the item name as literal text, not as a pattern
        if re.search(re.escape(i), p):
            print(i)
            # stop looking
            break
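One variation on the loop above, offered as a sketch rather than as part of the original answer, is to compile all item names into a single alternation pattern so that each product is scanned once by the regex engine instead of once per item in Python. The names `pattern`, `strip_nums`, and `cleaned` are assumptions made for illustration, and whether this is faster for 2700 item names would need to be timed on the real data.

```python
import re

p_names = ['BLUEAPPLE', 'GREENBUTTON20', '400100DUCK20']
i_names = ['BLUEAPPLE', 'GREENBUTTON', '100DUCK']

# Longest names first, so a longer item name is preferred over a shorter
# one that starts at the same position.
pattern = re.compile(
    '|'.join(re.escape(i) for i in sorted(i_names, key=len, reverse=True))
)

def strip_nums(product):
    """Remove digits from product, except inside the first item name found."""
    m = pattern.search(product)
    if m is None:
        # No official item name inside this product: drop every digit
        return re.sub(r'\d+', '', product)
    before, after = product[:m.start()], product[m.end():]
    # Digits are removed only outside the matched item name
    return re.sub(r'\d+', '', before) + m.group(0) + re.sub(r'\d+', '', after)

cleaned = [strip_nums(p) for p in p_names]
print(cleaned)  # ['BLUEAPPLE', 'GREENBUTTON', '100DUCK']
```

Only the first matching item name in each product is protected; digits elsewhere in the product are removed.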

Try this:

import re

def remove_nums(product):
    if re.search(r'\d', product):
        for item in item_nums_list:
            if item in product:
                # Return the matching official item name directly
                return item
        # No item name found: drop every digit
        return re.sub(r'(\d+)', '', product)
    else:
        return product

Also, make sure you are using the normal python interpreter. IPython and other interpreters with debugging features are a LOT slower than the regular interpreter.

You might want to consider doing some set operations first though. Here's a little example:

product_set = set(product_list)
item_number_set = set(item_number_list)

# these are the ones that match straight away
product_matches = product_set & item_number_set

# now we can search through the substrings of ones that don't match
non_matches = product_set - item_number_set
for product in non_matches:
    for item_number in item_number_set:
        if item_number in product:
            product_matches.add(product)
            break

# product_matches is now a set of all unique codes contained in both lists by "fuzzy match"
print(product_matches)

You lose the order in which they appeared, but maybe you can find a way to modify this for your use.
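If the original order matters, one possible modification is to keep the set only for the fast exact-match test and build the result by walking the product list in order. This is a sketch, not the answerer's code; the sample data and the names `has_item` and `ordered_matches` are made up for illustration.

```python
# Sample data, assumed for illustration only
product_list = ['BLUEAPPLE', 'GREENBUTTON20', 'REDHAT7', '400100DUCK20']
item_number_list = ['BLUEAPPLE', 'GREENBUTTON', '100DUCK']

item_number_set = set(item_number_list)

def has_item(product):
    """True if product equals an item name or contains one as a substring."""
    # The exact-match test on a set is cheap, so try it first
    if product in item_number_set:
        return True
    # Fall back to the substring scan; any() stops at the first hit
    return any(item in product for item in item_number_set)

# The list comprehension keeps products in their original order
ordered_matches = [p for p in product_list if has_item(p)]
print(ordered_matches)  # ['BLUEAPPLE', 'GREENBUTTON20', '400100DUCK20']
```

Unlike the set version, duplicates in `product_list` are kept here, which may or may not be what you want.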

