Checking the availability of products in a catalog

  1. The objective is to check whether products are available in a catalog database.
  2. Each product title is split into multiple items on whitespace.
  3. The process loops over all items in array a and looks each one up in array b, ignoring NaN values.
  4. The score is the number of items found in the catalog divided by the total number of non-NaN items.

Single Array:

a=['Black', 'Pen','NaN'] #product
b=['Black', 'Book', 'Big'] #catalog


c=[]
for i in a:
    if i != "NaN":
        c.append(i)

matched_count=0    
for i in c:
    if i in b:
        matched_count +=1

matched_count
score = float(matched_count) / len(c)
print(score)

Output:

0.5
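
The same steps can be wrapped in a small function so they can be reused for each product. This is only a sketch: the name `availability_score` is made up for illustration, and converting the catalog to a set is an optional choice that makes the membership test cheaper.

```python
def availability_score(product, catalog):
    """Fraction of non-NaN product items that also appear in the catalog."""
    items = [item for item in product if item != "NaN"]  # drop the NaN placeholder
    if not items:                                        # nothing left to score
        return 0.0
    catalog_items = set(catalog)                         # set membership is fast
    matched = sum(1 for item in items if item in catalog_items)
    return matched / len(items)

print(availability_score(['Black', 'Pen', 'NaN'], ['Black', 'Book', 'Big']))  # 0.5
```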

I would like to replicate the same process for multiple products like the following. Kindly let me know how to tackle this.

Input - Multiple Array:

products_bag =([['mai', 'dubai', '200ml', 'NaN'],
                ['mai', 'dubai', 'cup'],
                ['mai', 'dubai', '1.5l']]) #multiple products

catalogs_bag =([['natural','mineral','water', 'cups', '200', 'ml', 'pack', 'of', '24', 'mai', 'dubai'],
                ['2-piece', 'glitzi', 'power', 'inox', 'power', 'dish'],
                ['15-piece', 'bones', 'for', 'dog', 'multicolour', 'rich']]) #bigger catalog

Expected Output:

['mai', 'dubai', '200ml', 'NaN']    -> ['natural','mineral','water', 'cups', '200', 'ml', 'pack', 'of', '24', 'mai', 'dubai']    -> 67%

['mai', 'dubai', 'cup']             -> ['natural','mineral','water', 'cups', '200', 'ml', 'pack', 'of', '24', 'mai', 'dubai']    -> 67%

['mai', 'dubai', '1.5l']            -> ['natural','mineral','water', 'cups', '200', 'ml', 'pack', 'of', '24', 'mai', 'dubai']    -> 67%


You can do this without numpy or pandas, using a second for loop, nothing fancy:

products_bag =([['Black', 'Pen','NaN'],
    ['Yellow', 'Pen','Small']]) #multiple products

catalogs_bag =([['Black', 'Pen', 'Big'],
    ['Black', 'Pen', 'Small']]) #bigger catalog

def find_distribution(products, catalog):
    item_counter = 0
    matched_count = 0
    for product in products:
        if product.lower() != "nan": # skip NaN placeholders, but keep words that merely contain "nan"
            item_counter += 1
            if product in catalog:
                matched_count += 1
    if item_counter == 0: # in case products is empty or only has NaN values
        return 0
    return matched_count / item_counter

for i in range(len(products_bag)):
    print("{} \t-> {} \t-> \t {}%".format(products_bag[i], catalogs_bag[i], round(100*find_distribution(products_bag[i],catalogs_bag[i]),2)))

Output:

['Black', 'Pen', 'NaN']     -> ['Black', 'Pen', 'Big']  ->   100.0%
['Yellow', 'Pen', 'Small']  -> ['Black', 'Pen', 'Small']    ->   66.67%
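
Note that the loop above pairs the i-th product with the i-th catalog entry. If each product should instead be scored against every catalog entry and the best one kept, as the expected output in the question seems to suggest, a sketch could look like the following. The names `score` and `best_catalog` are illustrative only; `score` repeats the same counting logic so that the block runs on its own.

```python
products_bag = [['mai', 'dubai', '200ml', 'NaN'],
                ['mai', 'dubai', 'cup'],
                ['mai', 'dubai', '1.5l']]

catalogs_bag = [['natural', 'mineral', 'water', 'cups', '200', 'ml', 'pack', 'of', '24', 'mai', 'dubai'],
                ['2-piece', 'glitzi', 'power', 'inox', 'power', 'dish'],
                ['15-piece', 'bones', 'for', 'dog', 'multicolour', 'rich']]

def score(product, catalog):
    # share of non-NaN items of the product that appear in this catalog entry
    items = [item for item in product if item.lower() != "nan"]
    if not items:
        return 0.0
    return sum(item in catalog for item in items) / len(items)

def best_catalog(product, catalogs):
    # max() with a key returns the first entry that reaches the highest score
    return max(catalogs, key=lambda catalog: score(product, catalog))

for product in products_bag:
    catalog = best_catalog(product, catalogs_bag)
    print("{} \t-> {} \t-> \t {}%".format(product, catalog, round(100 * score(product, catalog), 2)))
```

With the data from the question, every product ends up matched to the first catalog entry, because only that entry contains 'mai' and 'dubai'.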

Edit:

In case you want to use a DataFrame:

import pandas as pd

# define the two lists (e.g. read from a CSV) and the function find_distribution as above

df = pd.DataFrame(list(zip(products_bag,catalogs_bag)), columns=["products","catalogs"])
# multiply before rounding so the percentage keeps two decimals, as in the plain-loop output
df["distribution"] = df.apply(lambda row: round(100*find_distribution(row["products"],row["catalogs"]),2), axis=1)


for index, row in df.iterrows():
    print("{} \t-> {} \t-> \t {}%".format(row["products"], row["catalogs"], row["distribution"]))

Important: use sensible names for your variables.


Version for one product, which returns the product's items, the best-matching catalog entry, and the associated score:

  1. Use a list comprehension to filter out NaN values: shorter.
  2. If the catalog is 1D, wrap it to make a 2D catalog, one inner list per catalog entry ( ["pen"] -> [["pen"]] ).
  3. Compute the score for each catalog entry and keep the highest one: use a tuple (score, catalog_item) to get both the score and the corresponding entry at the end.

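def available(product, catalog):
    # keep only the real items of the product
    items = [item for item in product if item != "NaN"]
    # wrap a flat (1D) catalog so it is always a list of entries
    if isinstance(catalog[0], str):
        catalog = [catalog]

    max_match = (0, [])
    for entry in catalog:
        matched_count = 0
        for item in items:
            if item in entry:
                matched_count += 1
        max_match = max(max_match, (matched_count, entry)) # tuple score + catalog_item

    # return the filtered items, the best entry and its score (0.0 if there are no items)
    score = max_match[0] / len(items) if items else 0.0
    return items, max_match[1], score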

# USE    
a = ['Black', 'Pen', 'NaN']
b = ['Black', 'Book', 'Big']
print(available(a, b))  # (['Black', 'Pen'], ['Black', 'Book', 'Big'], 0.5)


# Shorter version, using built-in functions and a generator expression
def available(product, catalog):
    items = [item for item in product if item != "NaN"]
    if isinstance(catalog[0], str):
        catalog = [catalog]
    # one (matched_count, entry) tuple per catalog entry; max() keeps the best
    max_match = max((sum(1 for item in items if item in entry), entry) for entry in catalog)
    return items, max_match[1], max_match[0] / len(items)

Multi-product version: apply the one-product version to each product

def availables(products, catalog):
    return [available(product, catalog) for product in products]

# USE (products_bag and catalogs_bag as defined in the question)
for res in availables(products_bag, catalogs_bag):
    print(res)

# (['mai', 'dubai', '200ml'], ['natural', 'mineral', 'water', 'cups', '200', 'ml', 'pack', 'of', '24', 'mai', 'dubai'], 0.6666666666666666)
# (['mai', 'dubai', 'cup'], ['natural', 'mineral', 'water', 'cups', '200', 'ml', 'pack', 'of', '24', 'mai', 'dubai'], 0.6666666666666666)
# (['mai', 'dubai', '1.5l'], ['natural', 'mineral', 'water', 'cups', '200', 'ml', 'pack', 'of', '24', 'mai', 'dubai'], 0.6666666666666666)

To get your formatting with arrows, just do:

for res in availables(products_bag, catalogs_bag):
    print(" -> ".join(map(str, res)))


['mai', 'dubai', '200ml'] -> ['natural', 'mineral', 'water', 'cups', '200', 'ml', 'pack', 'of', '24', 'mai', 'dubai'] -> 0.6666666666666666
['mai', 'dubai', 'cup'] -> ['natural', 'mineral', 'water', 'cups', '200', 'ml', 'pack', 'of', '24', 'mai', 'dubai'] -> 0.6666666666666666
['mai', 'dubai', '1.5l'] -> ['natural', 'mineral', 'water', 'cups', '200', 'ml', 'pack', 'of', '24', 'mai', 'dubai'] -> 0.6666666666666666
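
If the score should appear as a whole-number percentage, as in the question's expected output, one option is the `%` format type. The helper name `format_result` and the sample tuple below are made up for illustration; the tuple just mimics the (items, catalog entry, score) shape returned above.

```python
def format_result(result):
    """Render one (items, catalog_entry, score) tuple with arrows and a percentage."""
    items, catalog_entry, score = result
    # the "%" format type multiplies by 100 and appends the percent sign
    return "{} -> {} -> {:.0%}".format(items, catalog_entry, score)

# hand-written sample tuple in the same shape as the results above
sample = (['mai', 'dubai', 'cup'], ['mai', 'dubai', 'cups'], 2 / 3)
print(format_result(sample))  # ['mai', 'dubai', 'cup'] -> ['mai', 'dubai', 'cups'] -> 67%
```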

