
Check string for specific format of substring, how to..?

I have two strings. My item's name:

Parfume name EDT 50ml

And the competitor's item name:

Parfume another name EDP 60ml

I have a long list of these names in one column of a dataframe and the competitors' names in another column. I want to keep only the rows where my name and the competitor's name state the same amount of ml, no matter what the rest of each string looks like. So how do I find a substring ending with 'ml' inside a bigger string? Ideally I could simply do something like

"**ml" in competitors_name

to see if they both contain the same amount of ml.

Thank you

UPDATE

'ml' is not always at the end of the string. It might look like this:

Parfume yet another great name 60ml EDP

Try this:

import re

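def same_measurement(my_item, competitor_item, unit="ml"):
    # Capture the digits immediately before the unit, anywhere in the string.
    matcher = re.compile(r"(\d+){}".format(re.escape(unit)))
    my_match = matcher.search(my_item)
    competitor_match = matcher.search(competitor_item)
    # Return a real bool; False when either name has no measurement.
    return bool(my_match and competitor_match
                and my_match.group(1) == competitor_match.group(1))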

my_item = "Parfume name EDT 50ml"
competitor_item = "Parfume another name EDP 50ml"
assert same_measurement(my_item, competitor_item)

my_item = "Parfume name EDT 50ml"
competitor_item = "Parfume another name EDP 60ml"
assert not same_measurement(my_item, competitor_item)
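
Real product names are often less tidy than the examples above (for instance "50 ML" instead of "50ml"). The following is a minimal sketch of a more tolerant comparison, assuming that optional whitespace and mixed letter case should be accepted; `extract_amount` and `same_amount` are hypothetical helper names, not part of the answer above. Decimal amounts such as "1.5ml" are not handled here.

```python
import re

# Hypothetical helper: digits, optional whitespace, then the unit in any letter case.
def extract_amount(name, unit="ml"):
    pattern = r"(\d+)\s*{}\b".format(re.escape(unit))
    match = re.search(pattern, name, flags=re.IGNORECASE)
    # Compare as integers so "050ml" and "50ml" count as the same amount.
    return int(match.group(1)) if match else None

def same_amount(my_item, competitor_item, unit="ml"):
    mine = extract_amount(my_item, unit)
    theirs = extract_amount(competitor_item, unit)
    # False when either name has no amount at all.
    return mine is not None and mine == theirs

print(same_amount("Parfume name EDT 50ml", "Parfume yet another great name 50 ML EDP"))
print(same_amount("Parfume name EDT 50ml", "Parfume another name EDP 60ml"))
print(same_amount("Parfume name EDT", "Parfume another name EDP 50ml"))
```

The first call prints True and the other two print False. Returning None for a missing amount keeps "no measurement found" distinct from "measurements differ", should that distinction matter later.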

You could use Python's re (regular expression) module to pull the 'xxml' value out of each name in a data row, then compare the two values to check whether they match.

import re

data_rows = [["Parfume name EDT", "Parfume another name EDP 50ml"]]

for data_pairs in data_rows:
    my_ml = None
    comp_ml = None

    # Capture only the digits so that "50ML" and "50ml" compare equal
    my_ml_matches = re.search(r'(\d+)[Mm][Ll]', data_pairs[0])
    if my_ml_matches is not None:
        my_ml = my_ml_matches.group(1)
    else:
        print("my_ml has no ml")

    # Same extraction for the competitor's name
    comp_ml_matches = re.search(r'(\d+)[Mm][Ll]', data_pairs[1])
    if comp_ml_matches is not None:
        comp_ml = comp_ml_matches.group(1)
    else:
        print("comp_ml has no ml")

    # Print outputs only when both names contain an ml amount
    if (my_ml is not None) and (comp_ml is not None):
        if my_ml == comp_ml:
            print("my_ml: {0} == comp_ml: {1}".format(my_ml, comp_ml))
        else:
            print("my_ml: {0} != comp_ml: {1}".format(my_ml, comp_ml))

Here data_rows holds every row in the data set, and each data_pairs is one [your_item_name, competitor_item_name] pair.
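
Since the question asks to keep only the matching rows rather than print them, the loop can also collect rows into a new list. This is a rough sketch under the assumption that rows where either name lacks an ml amount should be dropped; `ML_PATTERN`, `ml_amount`, and `keep_matching_rows` are illustrative names.

```python
import re

# Digits, optional whitespace, then "ml" in any letter case.
ML_PATTERN = re.compile(r"(\d+)\s*ml", re.IGNORECASE)

def ml_amount(name):
    # Return the digits before "ml" as a string, or None if there is no match.
    match = ML_PATTERN.search(name)
    return match.group(1) if match else None

def keep_matching_rows(rows):
    kept = []
    for my_name, competitor_name in rows:
        mine = ml_amount(my_name)
        theirs = ml_amount(competitor_name)
        # Keep the row only when both amounts exist and are equal.
        if mine is not None and mine == theirs:
            kept.append([my_name, competitor_name])
    return kept

data_rows = [
    ["Parfume name EDT 50ml", "Parfume another name EDP 50ml"],
    ["Parfume name EDT 50ml", "Parfume another name EDP 60ml"],
    ["Parfume name EDT", "Parfume another name EDP 50ml"],
]
print(keep_matching_rows(data_rows))
```

Only the first pair survives the filter; the second has different amounts and the third has no amount on one side.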

You could use a lambda function with DataFrame.apply to do that.

import pandas as pd
import re
d = {
    'Us':
        ['Parfume one 50ml', 'Parfume two 100ml'],
    'Competitor':
        ['Parfume uno 50ml', 'Parfume dos 200ml']
}
df = pd.DataFrame(data=d)

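def ml_amount(name):
    # Return the digits before 'ml', or None if the name has no such token
    # (avoids an AttributeError from calling .group() on a failed search).
    m = re.search(r'(\d+)ml', name)
    return m.group(1) if m else None

df['Eq'] = df.apply(lambda x: 'Yes' if ml_amount(x['Us']) is not None and ml_amount(x['Us']) == ml_amount(x['Competitor']) else 'No', axis=1)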

Result: the first row (50ml vs. 50ml) gets Eq = Yes, and the second row (100ml vs. 200ml) gets Eq = No.

It doesn't matter whether 'ml' is at the end or in the middle of the string.
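
To actually drop the non-matching rows, as the question asks, one possible variation is to extract the amounts column-wise with `Series.str.extract` and filter with a boolean mask. This is only a sketch: the third row and the variable names (`us_ml`, `competitor_ml`, `mask`, `filtered`) are made up for illustration, and it assumes rows without an ml amount should be discarded.

```python
import pandas as pd

df = pd.DataFrame({
    "Us": ["Parfume one 50ml", "Parfume two 100ml", "Parfume three"],
    "Competitor": ["Parfume uno 50ml", "Parfume dos 200ml", "Parfume tres 30ml"],
})

# One capture group plus expand=False yields a Series; names without a match become missing values.
pattern = r"(\d+)ml"
us_ml = df["Us"].str.extract(pattern, expand=False)
competitor_ml = df["Competitor"].str.extract(pattern, expand=False)

# Keep a row only when both amounts were found and they are equal.
mask = us_ml.notna() & competitor_ml.notna() & (us_ml == competitor_ml)
filtered = df[mask]
print(filtered)
```

Here only the first row remains in `filtered`. Compared with a row-wise `apply`, this runs the regular expression once per column and handles missing amounts without raising an exception.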
