How to find which strings in an array are substrings of another string in Python?

I've got a NumPy array of strings (str8192) where the second column holds the names of things. For the sake of this question, let's say the array is called thingList. I have two strings, string1 and string2. I'm trying to get a list of every item in the second column of thingList that is in string1 or in string2. Currently I do this with a for loop, but I was hoping there is a faster way I don't know about; I'm pretty new to programming.

Once I find a match, I also want to record what is in the first column of the same row as the match.

Any help speeding this up is greatly appreciated, as thingList is pretty large and this function is run quite a lot with various arrays.

tempThing = []
tempCode = []

for i in range(thingList.shape[0]):
    if thingList[i][1].lower() in string1.lower() or thingList[i][1].lower() in string2.lower():
        tempThing.append(thingList[i][1])
        tempCode.append(thingList[i][0])

This code works fine, but it is definitely the bottleneck in my program and is slowing it down a lot.

You could use a list comprehension, which is typically faster than a traditional for loop. There are also a few minor improvements that would make your code run faster:

thing_list = [['Thing1', 'bo'], ['Thing2', 'b'], [ 'Thing3', 'ca'],
              ['Thing4', 'patrick']]*100
string1 = 'bobby'
string2 = 'patrick neils'

# Compute your lower strings before the for loops to avoid
# calling the function at each loop
st1_lower = string1.lower()
st2_lower = string2.lower()

# You can store both the item and the name in the same array to reduce
# the computing time and do it in one list comprehension
result = [[x[0], x[1]] for x in thing_list
          if (x[1].lower() in st1_lower) or (x[1].lower() in st2_lower) ]

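Output (shown for a single copy of the four rows; with the *100 repetition above, each match appears 100 times):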

[['Thing1', 'bo'], ['Thing2', 'b'], ['Thing4', 'patrick']]

Performance:

For loop: 172 µs ± 9.59 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

List comprehension: 81.1 µs ± 2.17 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
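A comparison like the one above can be reproduced roughly with the standard-library timeit module. The sketch below is only an illustration: the function names with_loop and with_comprehension are made up for this example, the sample data is the one from this answer, and the absolute timings will differ from machine to machine.

```python
import timeit

thing_list = [['Thing1', 'bo'], ['Thing2', 'b'], ['Thing3', 'ca'],
              ['Thing4', 'patrick']] * 100
string1 = 'bobby'
string2 = 'patrick neils'


def with_loop():
    # Question's style: index-based loop, lowering both strings on every pass.
    result = []
    for i in range(len(thing_list)):
        if (thing_list[i][1].lower() in string1.lower()
                or thing_list[i][1].lower() in string2.lower()):
            result.append([thing_list[i][0], thing_list[i][1]])
    return result


def with_comprehension():
    # Lower the two target strings once, then filter in one comprehension.
    st1_lower = string1.lower()
    st2_lower = string2.lower()
    return [[x[0], x[1]] for x in thing_list
            if x[1].lower() in st1_lower or x[1].lower() in st2_lower]


loop_result = with_loop()
comprehension_result = with_comprehension()

# Average time per call in seconds; the values depend on the machine.
runs = 1000
loop_time = timeit.timeit(with_loop, number=runs) / runs
comprehension_time = timeit.timeit(with_comprehension, number=runs) / runs
print(f"for loop:           {loop_time * 1e6:.1f} µs per call")
print(f"list comprehension: {comprehension_time * 1e6:.1f} µs per call")
```

Both functions return the same rows, so the comparison measures only the looping style and the repeated lower() calls.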

NumPy arrays iterate over their rows by default, so there is no need for for i in range(...):

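import numpy as np

# Wrap both rows in one list; a second positional argument would be read as dtype
x = np.array([list(range(3)), list(range(3, 6))])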

for i in x:
    print(i)

[0 1 2]
[3 4 5]

# This yields the same result, so use the former
for i in range(x.shape[0]):
    print(x[i])

[0 1 2]
[3 4 5]

Next, you are spending a lot of time calling str.lower() over and over again. I'd pre-lower all of your strings ahead of time:

y = np.array([list('ABC'), list('DEF')])

np.char.lower(y)
array([['a', 'b', 'c'],
       ['d', 'e', 'f']],
      dtype='<U1')

# apply this to string1 and string2
l_str1, l_str2 = string1.lower(), string2.lower()

Now your loop should look like:

l_str1, l_str2 = string1.lower(), string2.lower()

# Each row is (code, name): column 0 is the code, column 1 is the name
for code, name in thingList:
    to_check = name.lower()

    if to_check in l_str1 or to_check in l_str2:
        tempThing.append(name)
        tempCode.append(code)

Now you can apply the same idea in a comprehension (here a generator expression):

# Zip the original rows with their lower-cased copies so that
# str.lower() is not called inside the if condition
tmp = (uprow for uprow, (a, b) in zip(thingList, np.char.lower(thingList))
       if b in l_str1 or b in l_str2)

# Unpack the (code, name) pairs; this raises ValueError if nothing matched
tempCode, tempThing = zip(*tmp)
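For reference, here is a minimal self-contained sketch that combines the steps above into one function. It is an illustration, not a drop-in replacement: the function name find_matches and the sample codes c1 to c4 are invented for this example, and it assumes a 2-D string array with the code in column 0 and the name in column 1, as described in the question. It uses a boolean mask instead of zip(*...), so it returns two empty lists when nothing matches.

```python
import numpy as np


def find_matches(thing_list, string1, string2):
    """Return (codes, names) for rows whose name occurs in either string.

    Assumes column 0 holds the code and column 1 holds the name.
    """
    l_str1, l_str2 = string1.lower(), string2.lower()
    # Lower-case the whole name column in one vectorized call.
    lowered = np.char.lower(thing_list[:, 1])
    # Boolean mask: True where the lowered name is a substring of either string.
    mask = np.array([(name in l_str1) or (name in l_str2) for name in lowered],
                    dtype=bool)
    matched = thing_list[mask]
    return matched[:, 0].tolist(), matched[:, 1].tolist()


# Hypothetical sample data: codes c1..c4 are placeholders.
thing_list = np.array([['c1', 'Bo'], ['c2', 'B'], ['c3', 'Ca'], ['c4', 'Patrick']])
codes, names = find_matches(thing_list, 'bobby', 'patrick neils')
print(codes)
print(names)
```

The substring test itself is still a Python-level loop, since NumPy has no vectorized "element is contained in this string" operation; the gain comes from lowering the column once and from avoiding repeated indexing.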
