Find the string in a list that is closest to a character

Question

I have a PDF document that I have parsed into a list, say:

listTxt = ['met een motor, losse delen van caravans, losse delen van ',
           'aanhangwagens die in uw woonhuis, schuur of garage op ',
           'hetzelfde adres staan tot maximaal € 1.250,-.',
           ' ',
           ' horen deze losse delen bij een bedrijf? Of zijn ze bedoeld ',
           'aanhangwagens die niet kapot zijn verzekerd',
           '• Schade door grondwater dat onverwacht het woonhuis ',
           'binnenstroomt door afvoerleidingen en apparaten die daarop ',
           'zijn aangesloten.',
           '• Schade door water dat uit een aquarium stroomt als het ',
           'aquarium onverwacht kapot is gegaan. We betalen ook voor de ',
           'inhoud van het aquarium tot maximaal € 1.250,-.',
           '• Schade door water dat uit een waterbed stroomt. Maar alleen als ',
           'het waterbed onverwacht kapot is gegaan.']

Now I want to return the string that is closest (in distance) to the euro symbol (€). I have looked at various algorithms such as Levenshtein distance, but my task is actually quite simple: the distance can simply be the number of characters.

Looping with a condition kind of works:

for t in listTxt:
    # Note: this parses as 'aanhangwagens' and ('€' in t), so only '€' is tested.
    if 'aanhangwagens' and '€' in t:
        print(t)

Result:

hetzelfde adres staan tot maximaal € 1.250,-.
inhoud van het aquarium tot maximaal € 1.250,-.

But I want the 'aanhangwagens' in listTxt[1] to count as really close to the next string, listTxt[2] (the one with the €), so the desired output is:

'aanhangwagens die in uw woonhuis, schuur of garage op ', 'hetzelfde adres staan tot maximaal € 1.250,-.'

For the phrase aquarium it works fine, because aquarium and € are in the same string, i.e. listTxt[11]:

'inhoud van het aquarium tot maximaal € 1.250,-.'

Answer 1

Following your definition, I wrote something that looks for the closest lines containing certain characters. First you need to compute two lists, "resa" and "rese". They tell you which strings in your list contain a given substring. For instance, if you look for "a" in the list ["abc", "ccd", "efg", "agf"], the resulting list will be [1,0,0,1]. You need to compute these for 'aanhangwagens' and for the euro symbol. With these lists you can check the distances between the 1s in the euro list and the 1s in the 'aanhangwagens' list.
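As a small illustration of the indicator lists described above, here is a minimal sketch. The helper name `indicator` is hypothetical and not part of the original answer.

```python
def indicator(strings, needle):
    """Return 1 for each string that contains needle, else 0."""
    return [1 if needle in s else 0 for s in strings]


print(indicator(["abc", "ccd", "efg", "agf"], "a"))  # [1, 0, 0, 1]
```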

In your example, the search for 'aanhangwagens' gives [0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0] and the search for the euro symbol gives [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0].

The algorithm that I wrote keeps the closest string, but if two strings are at the same distance it puts both of them in the list of results. Please run some tests before using this code; I cannot guarantee that it works in every case.

resa=[]
rese=[]
for t in listTxt:
    if 'aanhangwagens' in t:
        resa.append(1)
    else:
        resa.append(0)
    if '€' in t:
        rese.append(1)
    else:
        rese.append(0)

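def close_line(aliste, alista, alistTxt):
    # For every line flagged in aliste (the '€' lines), collect the lines
    # flagged in alista (the keyword lines) that are nearest by list position.
    all_closest_lines = []
    for i in range(len(aliste)):
        if aliste[i] == 0:
            continue
        closest_line = []
        # Upper bound: no two positions can be further apart than this.
        amin = max(len(aliste), len(alista))
        for j in range(len(alista)):
            if alista[j] == 0:
                continue
            if abs(i - j) < amin:
                # Strictly closer: drop earlier candidates and start over.
                amin = abs(i - j)
                closest_line = [[alistTxt[j], "Closest to € in position{}".format(i)]]
            elif abs(i - j) == amin:
                # Same distance: keep both.
                closest_line.append([alistTxt[j], "Closest to € in position{}".format(i)])
        all_closest_lines += closest_line
    return all_closest_lines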

print(close_line(rese, resa, listTxt))

Results:

[['aanhangwagens die in uw woonhuis, schuur of garage op ', 'Closest to € in position2'], ['aanhangwagens die niet kapot zijn verzekerd', 'Closest to € in position11']]
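The same line-distance idea can be written more compactly by working with the positions of the matching lines instead of 0/1 lists. This is only a sketch under the same assumptions as the code above (distance is the difference in list positions, ties are all kept). The function name `closest_lines` and the shortened `sample` list are made up for illustration.

```python
def closest_lines(lines, target, keyword):
    """For each line containing target, return the keyword lines nearest to it.

    Distance is the difference in list positions; ties are all kept.
    Each result is a (keyword_line, target_position) tuple.
    """
    target_idx = [i for i, s in enumerate(lines) if target in s]
    keyword_idx = [j for j, s in enumerate(lines) if keyword in s]
    result = []
    if not keyword_idx:
        return result  # nothing to be close to
    for i in target_idx:
        best = min(abs(i - j) for j in keyword_idx)
        for j in keyword_idx:
            if abs(i - j) == best:
                result.append((lines[j], i))
    return result


sample = [
    'aanhangwagens die in uw woonhuis, schuur of garage op ',
    'hetzelfde adres staan tot maximaal € 1.250,-.',
    ' ',
    'aanhangwagens die niet kapot zijn verzekerd',
]
print(closest_lines(sample, '€', 'aanhangwagens'))
```

With this sample, the only '€' line is at position 1 and the nearest 'aanhangwagens' line is the one directly before it.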

Answer 2

You could try to assign a score to each sentence and then find groups of scores that correspond to groups of useful sentences. You then end up with a total score for each 'match'. I made a crude implementation below.

import numpy as np


listTxt = ['met een motor, losse delen van caravans, losse delen van ',
           'aanhangwagens die in uw woonhuis, schuur of garage op ',
           'hetzelfde adres staan tot maximaal € 1.250,-.',
           ' ',
           ' horen deze losse delen bij een bedrijf? Of zijn ze bedoeld ',
           'aanhangwagens die niet kapot zijn verzekerd',
           '• Schade door grondwater dat onverwacht het woonhuis ',
           'binnenstroomt door afvoerleidingen en apparaten die daarop ',
           'zijn aangesloten.',
           '• Schade door water dat uit een aquarium stroomt als het ',
           'aquarium onverwacht kapot is gegaan. We betalen ook voor de ',
           'inhoud van het aquarium tot maximaal € 1.250,-.',
           '• Schade door water dat uit een waterbed stroomt. Maar alleen als ',
           'het waterbed onverwacht kapot is gegaan.']

euro = np.array([string.count('€') for string in listTxt])
ahw = np.array([string.count('aanhangwagen') for string in listTxt])

all_values = np.add(euro,ahw)


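score = []
matches = []
for i, value in enumerate(all_values):
    if value > 0:
        # Still inside a run of matching sentences: extend the current group.
        score.append(value)
        matches.append(listTxt[i])
    elif score:
        # A zero ends the run: report the group and start a new one.
        print(sum(score), matches)
        score = []
        matches = []
# Report a group that runs up to the very end of the list.
if score:
    print(sum(score), matches)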

It counts the number of times either '€' or 'aanhangwagen' is found in each sentence, then adds the two counts. A small loop then finds the groups of 'close' (consecutive non-zero) values between the zeroes.

That way you get a ranking of different (groups of) sentences, each with a score showing how many times your search words occur in those sentences.

In this case, the result is:

2 ['aanhangwagens die in uw woonhuis, schuur of garage op ', 'hetzelfde adres staan tot maximaal € 1.250,-.']
1 ['aanhangwagens die niet kapot zijn verzekerd']
1 ['inhoud van het aquarium tot maximaal € 1.250,-.']

Which is what you wanted!
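The grouping loop in this answer could also be sketched with `itertools.groupby` from the standard library, without NumPy. This is an assumption-laden variant rather than the answer's own code: the function name `score_groups` and the shortened `sample` list are hypothetical, and the groups are additionally sorted by score, which the loop above does not do.

```python
from itertools import groupby


def score_groups(lines, terms):
    """Group consecutive lines that contain any term; return (score, lines) pairs."""
    # Per-line score: total number of occurrences of all search terms.
    counts = [sum(s.count(t) for t in terms) for s in lines]
    groups = []
    # groupby splits the sequence into runs of "has a hit" / "has no hit".
    for has_hit, run in groupby(zip(counts, lines), key=lambda pair: pair[0] > 0):
        if has_hit:
            run = list(run)
            groups.append((sum(c for c, _ in run), [s for _, s in run]))
    # Highest score first; sorted() is stable, so ties keep document order.
    return sorted(groups, key=lambda g: g[0], reverse=True)


sample = [
    'aanhangwagens die in uw woonhuis, schuur of garage op ',
    'hetzelfde adres staan tot maximaal € 1.250,-.',
    ' ',
    'aanhangwagens die niet kapot zijn verzekerd',
]
for score, group in score_groups(sample, ['€', 'aanhangwagen']):
    print(score, group)
```

Because `groupby` yields every run, a group that reaches the end of the list is included without a separate check after the loop.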


Answer 1
1 2020-01-31 13:25:58

Answer 2
1 Accepted 2020-01-31 13:28:13
