Extract a list of substrings from a list of strings

I have a list of product names that have a consistent structure.

product_names = ['brand_name1 product1', 'brand_name2 product2', 'brand_name1 product3', 'brand_name3 product4']

Also, I scraped a list of brands from the site filter:

brand_names = ['brand_name1','brand_name2','brand_name3']

Each element in [brand_names] can be found in several elements of [product_names], because several products can belong to the same brand.

Output:

I would like to extract brand_names from product_names and get a .csv file with two columns: Brand, Product.

Solution:

Thanks everyone. I tried to use a list comprehension myself, but coded it totally wrong.

import pandas as pd

product_names = ['brand_name1 product1', 'brand_name2 product2','brand_name1 product3']
brand_names = ['brand_name1','brand_name2','brand_name3']

brands = [i for j in product_names for i in brand_names if i in j]

result = pd.DataFrame(
    {'Brand': brands,
     'Product': product_names
     })

result.to_csv('result.csv', index=False)
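
One caveat with the comprehension above: it only lines up with product_names when every product matches exactly one brand, and the Product column still contains the brand text. A minimal sketch that addresses both, assuming the brand always appears at the start of the product name. The helper name `split_brand` is hypothetical and not part of the original post:

```python
def split_brand(product_name, brand_names):
    """Return (brand, remainder) for the longest brand that prefixes the name.

    Returns (None, product_name) when no brand matches.
    """
    name = product_name.strip()
    # Try longer brands first so 'brand_name10' wins over 'brand_name1'
    for brand in sorted(brand_names, key=len, reverse=True):
        if name.startswith(brand):
            return brand, name[len(brand):].strip()
    return None, name


product_names = ['brand_name1 product1', 'brand_name2 product2',
                 'brand_name1 product3', 'brand_name3 product4']
brand_names = ['brand_name1', 'brand_name2', 'brand_name3']

# One (brand, product) pair per product name, in the original order
pairs = [split_brand(p, brand_names) for p in product_names]
```

The resulting pairs can be passed to pd.DataFrame(pairs, columns=['Brand', 'Product']) in place of the two separate lists.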

You can just try this:

for brand in brand_names:
    for product in product_names:
        if (brand in product):
            print(brand,product)

Another solution is to use a list comprehension:

matching=[]
for brand in brand_names:
    matching.append([product for product in product_names if brand in product])
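
The nested list above relies on position to tell which brand each sublist belongs to. A small sketch that keeps the brand as a dictionary key instead; the name `products_by_brand` is illustrative and not from the original answer:

```python
product_names = ['brand_name1 product1', 'brand_name2 product2',
                 'brand_name1 product3', 'brand_name3 product4']
brand_names = ['brand_name1', 'brand_name2', 'brand_name3']

# Map each brand to the list of product names that contain it
products_by_brand = {
    brand: [product for product in product_names if brand in product]
    for brand in brand_names
}
```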

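Hope this is what you are looking for:

import pandas as pd

csv_file_name = 'result.csv'  # assumed output path; not given in the original answer
# Split each name on the first run of white space: brand first, product second
data = []
for i in product_names:
    data.append(i.split(maxsplit=1))
df = pd.DataFrame(data, columns=["Brand", "Product"])
df.to_csv(csv_file_name)

Output: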

    Brand       Product
0   brand_name1 product1
1   brand_name2 product2
2   brand_name1 product3
3   brand_name3 product4

In the most general case, where there is no pattern in your product names, use this:

product_brand = [i for j in product_names for i in brand_names if i in j]

But if there is a pattern, you should leverage that to speed up the process.

Output:

product_brand
['brand_name1', 'brand_name2', 'brand_name1', 'brand_name3']
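
As one illustration of leveraging a pattern: if the brand is always the first space-separated token (an assumption that holds for the sample data, but not for brands that contain spaces), each name can be split once and checked against a set. This avoids scanning every brand for every product. A sketch under that assumption:

```python
product_names = ['brand_name1 product1', 'brand_name2 product2',
                 'brand_name1 product3', 'brand_name3 product4']
brand_names = ['brand_name1', 'brand_name2', 'brand_name3']

known_brands = set(brand_names)  # set membership is O(1) on average

rows = []
for name in product_names:
    # partition splits on the first space only
    brand, _, product = name.partition(' ')
    if brand in known_brands:
        rows.append((brand, product))
```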

To write these as columns into a CSV file, use this:

import csv

rows = zip(product_names,product_brand)
with open('file.txt', "w", newline='') as f:  # newline='' avoids blank rows on Windows
    writer = csv.writer(f)
    for row in rows:
        writer.writerow(row)

Output:

brand_name1 product1,brand_name1
brand_name2 product2,brand_name2
brand_name1 product3,brand_name1
brand_name3 product4,brand_name3

Here's an approach:

Assuming your brand names are a list of unique brands, you can try this:

import pandas as pd

# Brand and Product lists
product_names = ['brand_name1 product1', 'brand_name2 product2', 'brand_name1 product3', 'brand_name3 product4']
brand_names = ['brand_name1','brand_name2','brand_name3']

# Empty list to save the results
res_ls = []

# Iterate over each brand
for b in brand_names:
    # Select products for your current brand
    brand_products = [i for i in product_names if b in i]
    res_ls.append({
        'brand': b,
        'products': ', '.join(brand_products)
    })

# Get result as a pandas dataframe
res_df = pd.DataFrame(res_ls)

# Save your dataframe to csv
res_df.to_csv('/path/to/save', index=False)

This is what the pandas DataFrame will look like: (image not available)

This is my little version. It does a bit of sorting first and allows for white space in brand and product names.

Sorting makes things easier and nicer. strip() is used to avoid problems due to leading and trailing white space. However, if the brand part of a product name contains white space and some of it is accidentally doubled, it is treated as a different brand name. Handling this might need regular expressions.

product_names = ['brand_name1   product1', 'brand name2 product2', 'brand_name1 product 3', ' brand_name3 product4', 'brand name2 product 2']
prbrand_names = ['brand_name1','brand name2','brand_name3']

product_names = sorted( [ s.strip() for s in product_names ] )
prbrand_names = sorted( [ s.strip() for s in prbrand_names ])

with open( "out.csv", "w") as fpntr:
    cnt = 0
    for bn in prbrand_names:
        # second case is not tested if first is already false -> no IndexError
        while cnt < len( product_names ) and product_names[cnt].startswith( bn ):
            
            pn = product_names[cnt][len( bn ) : ]
            # pn might have unnecessary spaces that can be stripped
            fpntr.write( "{}, {}\n".format( bn, pn.strip() ) )
            cnt += 1

out.csv is:

brand name2, product 2
brand name2, product2
brand_name1, product1
brand_name1, product 3
brand_name3, product4
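
The doubled-space problem mentioned above can be handled by collapsing runs of white space before sorting and comparing. A minimal sketch using the standard `re` module; the helper name `normalize` and the sample strings are hypothetical, not from the original answer:

```python
import re

def normalize(s):
    """Trim the ends and collapse internal runs of white space to one space."""
    return re.sub(r'\s+', ' ', s.strip())

# Sample data with an accidentally doubled space inside a brand name
product_names = ['brand  name2 product2', ' brand_name1   product1']
prbrand_names = ['brand_name1', 'brand name2']

product_names = sorted(normalize(s) for s in product_names)
prbrand_names = sorted(normalize(s) for s in prbrand_names)
```

After normalization, 'brand  name2 product2' starts with 'brand name2', so the startswith() check in the loop above would match it.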
