How do I find which strings in an array are substrings of another string in Python?

Question

I have a NumPy array of strings (str8192) whose second column holds the names of things. For the sake of this question, let's say the array is called thingList. I also have two strings, string1 and string2. I'm trying to get a list of every item in the second column of thingList that appears in string1 or in string2. Currently I do this with a for loop, but I'm pretty new to programming and was hoping there is a faster way I don't know about.

Once I find a match, I also want to record the value in the first column of the same row.

Any help speeding this up is greatly appreciated, as thingList is pretty large and this function is run quite a lot with various arrays.

tempThing = []
tempCode = []

for i in range(thingList.shape[0]):
    if thingList[i][1].lower() in string1.lower() or thingList[i][1].lower() in string2.lower():
        tempThing.append(thingList[i][1])
        tempCode.append(thingList[i][0])

This code works fine, but it is definitely the bottleneck in my program and slows it down a lot.

Answer 1

You could use a list comprehension, which is usually faster than an equivalent for loop that calls append. There are also a few minor improvements that would make your code run faster:

thing_list = [['Thing1', 'bo'], ['Thing2', 'b'], [ 'Thing3', 'ca'],
              ['Thing4', 'patrick']]*100
string1 = 'bobby'
string2 = 'patrick neils'

# Compute your lower strings before the for loops to avoid
# calling the function at each loop
st1_lower = string1.lower()
st2_lower = string2.lower()

# You can store both the item and the name in the same array to reduce
# the computing time and do it in one list comprehension
result = [[x[0], x[1]] for x in thing_list
          if (x[1].lower() in st1_lower) or (x[1].lower() in st2_lower) ]

Output (shown for a single copy of the four rows; with the *100 above, this pattern repeats 100 times):

[['Thing1', 'bo'], ['Thing2', 'b'], ['Thing4', 'patrick']]

Performance:

For loop: 172 µs ± 9.59 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

List comprehension: 81.1 µs ± 2.17 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
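
To reproduce a comparison like the one above on your own machine, here is a minimal sketch using the standard-library timeit module. The function names with_for_loop and with_comprehension are made up for this example, and the absolute numbers will differ from the ones quoted above depending on hardware and Python version.

```python
import timeit

thing_list = [['Thing1', 'bo'], ['Thing2', 'b'], ['Thing3', 'ca'],
              ['Thing4', 'patrick']] * 100
string1 = 'bobby'
string2 = 'patrick neils'


def with_for_loop():
    # Same logic as the loop in the question: lower() is called on every pass
    temp_thing, temp_code = [], []
    for i in range(len(thing_list)):
        if (thing_list[i][1].lower() in string1.lower()
                or thing_list[i][1].lower() in string2.lower()):
            temp_thing.append(thing_list[i][1])
            temp_code.append(thing_list[i][0])
    return temp_code, temp_thing


def with_comprehension():
    # Lower the two search strings once, then filter in one comprehension
    st1_lower = string1.lower()
    st2_lower = string2.lower()
    return [[x[0], x[1]] for x in thing_list
            if (x[1].lower() in st1_lower) or (x[1].lower() in st2_lower)]


runs = 1000
loop_time = timeit.timeit(with_for_loop, number=runs)
comp_time = timeit.timeit(with_comprehension, number=runs)
print(f"for loop:           {loop_time / runs * 1e6:.1f} us per call")
print(f"list comprehension: {comp_time / runs * 1e6:.1f} us per call")
```

Both functions return the same matches, so any difference in the printed timings comes from how the loop is written rather than from what it computes.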

Answer 2

NumPy arrays iterate over rows by default, so there is no need for for i in range(...):

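import numpy as np

# the two rows must be wrapped in one outer list; a second positional
# argument to np.array would be interpreted as the dtype
x = np.array([list(range(3)), list(range(3, 6))])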

for i in x:
    print(i)

[0 1 2]
[3 4 5]

# This yields the same result, so use the former
for i in range(x.shape[0]):
    print(x[i])

[0 1 2]
[3 4 5]

Next, you are spending a lot of time calling str.lower() over and over again. I'd pre-lower all of your strings ahead of time:

y = np.array([list('ABC'), list('DEF')])

np.char.lower(y)
array([['a', 'b', 'c'],
       ['d', 'e', 'f']],
      dtype='<U1')

# apply this to string1 and string2
l_str1, l_str2 = string1.lower(), string2.lower()

Now your loop should look like this:

l_str1, l_str2 = string1.lower(), string2.lower()

# each row is (code, name): the name to test is in the second column
for code, name in thingList:
    to_check = name.lower()

    if to_check in l_str1 or to_check in l_str2:
        tempThing.append(name)
        tempCode.append(code)

Now you can turn this into a generator expression (the lazy form of a list comprehension):

# zip each original row with its lower-cased copy so str.lower()
# is not called inside the if test
tmp = (tuple(uprow) for uprow, (_, b) in zip(thingList, np.char.lower(thingList))
       if b in l_str1 or b in l_str2)

# this unpacks the (code, name) pairs into two tuples;
# it raises ValueError if there were no matches
tempCode, tempThing = zip(*tmp)
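
As a variation on the same idea, here is a minimal sketch that builds a boolean mask and lets NumPy select the matching rows, which also handles the no-match case without raising. The sample data and the names lower_names, mask and matches are assumptions made for this example; the substring test itself is still a Python-level loop, so treat this as a tidier layout rather than a guaranteed speed-up.

```python
import numpy as np

# Sample data for illustration: column 0 is the code, column 1 is the name
thingList = np.array([['Thing1', 'bo'], ['Thing2', 'B'],
                      ['Thing3', 'ca'], ['Thing4', 'Patrick']])
string1 = 'Bobby'
string2 = 'patrick neils'

l_str1, l_str2 = string1.lower(), string2.lower()

# Lower-case the whole name column once, up front
lower_names = np.char.lower(thingList[:, 1])

# One True/False per row; dtype=bool keeps the mask valid even when empty
mask = np.array([name in l_str1 or name in l_str2 for name in lower_names],
                dtype=bool)

# Boolean indexing keeps the original (not lower-cased) rows
matches = thingList[mask]
tempCode = matches[:, 0].tolist()
tempThing = matches[:, 1].tolist()

print(tempCode)
print(tempThing)
```

If the same thingList is checked against many different string pairs, lower_names only needs to be computed once per array and can be reused across calls.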
