Match list of substrings and strings and return substring if it matches

I've seen many questions on this topic, but most are the opposite of mine. I have a list of strings (a column of a data frame) and a list of substrings. I want to compare each string to the list of substrings: if it contains a substring, return that substring; otherwise return 'no match'.

    subs = ['cat', 'dog', 'mouse']

    df

      Name       Number     SubMatch
     dogfood      1           dog
     catfood      3           cat
     dogfood      2           dog
     mousehouse   1           mouse
     birdseed     1           no match

My current output looks like this, though:

     Name       Number     SubMatch
     dogfood      1           dog
     catfood      3           dog
     dogfood      2           dog
     mousehouse   1           dog
     birdseed     1           dog

I suspect my code is just returning the first thing in the series. How do I change that so it returns the correct thing in the series? Here is the function:

    def matchy(col, subs):
        for name in col:
            for s in subs:
                if any(s in name for s in subs):
                    return s
                else:
                    return 'No Match'

The pandaic way to solve this would be to not use loops at all. You could do this pretty simply with str.extract:

p = '({})'.format('|'.join(subs))
df['SubMatch'] = df.Name.str.extract(p, expand=False).fillna('no match')

df

         Name  Number  SubMatch
0     dogfood       1       dog
1     catfood       3       cat
2     dogfood       2       dog
3  mousehouse       1     mouse
4    birdseed       1  no match
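One caveat with building the pattern by joining `subs` directly: any substring containing regex metacharacters (such as `.` or `+`) would be interpreted as a pattern rather than literal text. The following is a minimal standard-library sketch of the same alternation idea with each substring escaped first. The helper names `build_pattern` and `first_match` are hypothetical, and the plain list `names` stands in for the `Name` column.

```python
import re

subs = ['cat', 'dog', 'mouse']
names = ['dogfood', 'catfood', 'dogfood', 'mousehouse', 'birdseed']

def build_pattern(subs):
    """Compile an alternation of the substrings, escaped so they match literally."""
    return re.compile('|'.join(re.escape(s) for s in subs))

def first_match(name, pattern, default='no match'):
    """Return the leftmost substring matched in `name`, or `default` if none is found."""
    m = pattern.search(name)
    return m.group(0) if m else default

pattern = build_pattern(subs)
sub_match = [first_match(name, pattern) for name in names]
print(sub_match)  # ['dog', 'cat', 'dog', 'mouse', 'no match']
```

The same escaping should carry over to the pandas version by building the pattern as `'({})'.format('|'.join(map(re.escape, subs)))`. Note that a regex alternation returns the match that starts earliest in the string, whereas a loop over `subs` returns whichever substring comes first in the list; the two only differ when a name contains more than one of the substrings.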

How about this:

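def matchy(col, subs):
    # Collect one result per name instead of returning on the first name.
    matches = []
    for name in col:
        try:
            # First substring (in list order) that occurs in this name.
            matches.append(next(x for x in subs if x in name))
        except StopIteration:
            # The generator was empty: no substring matched.
            matches.append('No Match')
    return matches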

The problem with your code was that you were checking for matches with any, which only tells you that some substring matches, and then returning the first item of the iteration (dog) rather than the substring that actually matched.


EDIT: kudos to @Coldspeed

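def matchy(col, subs):
    # next() accepts a default, so no try/except is needed; the generator
    # expression must be parenthesized when it is not the only argument.
    return [next((x for x in subs if x in name), 'No match') for name in col]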
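To illustrate the `next()`-with-default idiom on the question's sample data, here is a minimal sketch. `first_sub` is a hypothetical helper name, and the plain lists stand in for the data frame column and the substring list.

```python
subs = ['cat', 'dog', 'mouse']
names = ['dogfood', 'catfood', 'dogfood', 'mousehouse', 'birdseed']

def first_sub(name, subs, default='No match'):
    # The generator yields matching substrings in list order; next() returns
    # the first one, or `default` when nothing matches.
    return next((s for s in subs if s in name), default)

results = [first_sub(name, subs) for name in names]
print(results)  # ['dog', 'cat', 'dog', 'mouse', 'No match']
```

With a data frame, the same helper could presumably be applied per row, for example `df['SubMatch'] = df['Name'].map(lambda n: first_sub(n, subs))`.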

I think you are overcomplicating things with a nested loop and then the any test inside. Would this work better:

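def matchy(col, subs):
    matches = []
    for name in col:
        for s in subs:
            if s in name:
                matches.append(s)
                break
        else:
            # for/else: runs only when the inner loop found no match.
            matches.append('No Match')
    return matches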

Unless there is code missing that accounts for it, it would appear that your code returns the result of the very first comparison and never looks at any of the other items in the col list. If you would rather stick with nested loops, I would suggest modifying your code like so:

def matchy(col, subs):
    subMatch = []
    for name in col:
        subMatch.append('No Match')
        for s in subs:
            if s in name:
                subMatch[-1] = s
                break
    return subMatch

This assumes that col is a list of strings containing the column information (dogfood, mousehouse, etc.) and that subs is a list of strings containing the substrings you wish to search for. subMatch is the list of strings returned by matchy, containing the search result for each item in col.

For each value in col we examine, we append the 'No Match' string to subMatch, basically assuming we did not find a match. Then we iterate through subs, checking whether the substring s is contained within name. If there is a match, subMatch[-1] = s replaces the most recent 'No Match' we appended with the matching substring, and we break to move on to the next item in col, since we don't need to search for any more values. Note that subMatch[-1] = s can be replaced with other methods, such as doing subMatch.pop() followed by subMatch.append(s), though at that point I think it is more a matter of personal preference. Once all elements in col have been checked, subMatch is returned, at which point you can process it however you like.
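As a quick check of the walkthrough above, the sketch below runs the nested-loop version on the question's sample data. The function body is repeated (with a snake_case local name) only so the snippet runs on its own, and the plain list `names` is an assumed stand-in for the data frame column.

```python
def matchy(col, subs):
    sub_match = []
    for name in col:
        sub_match.append('No Match')   # assume no match until one is found
        for s in subs:
            if s in name:
                sub_match[-1] = s      # overwrite the placeholder
                break                  # stop at the first matching substring
    return sub_match

names = ['dogfood', 'catfood', 'dogfood', 'mousehouse', 'birdseed']
subs = ['cat', 'dog', 'mouse']

result = matchy(names, subs)
print(list(zip(names, result)))
```

Because the function returns exactly one entry per input item, the result lines up with the rows of the column; with the question's data frame it could be assigned back as `df['SubMatch'] = matchy(df['Name'], subs)`.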
