Python locate specific words without duplicates

I have a problem. I am trying to find device names in a string. All the device names that I am looking for are stored in a list. One requirement is very important:

  • A command can contain multiple devices.

Now the problem I have is this:

I have two devices (Fan and Fan Light). When I give the command "Turn on Fan Light", both devices are found, but I want only Fan Light to be found. I tried checking all the devices that were found and keeping only the one with the longest name, like this:

# Create 2 dummy devices
device1 = {
    "name": "fan"
}

device2 = {
    "name": "fan light"
}


# Add devices to list
devices = []
devices.append(device1)
devices.append(device2)


# Given command
command = "Turn on fan light"
    
    
foundDevices = []

# Search devices in sentence
for device in devices:

    # Splits a device name if it has multiple words
    deviceSplit = device["name"].split()
    numOfSubNames = len(deviceSplit)

    # Checks for every sub-name if it is found in the string
    i = 0
    for subName in deviceSplit:
        if subName in command:
            i += 1

    # Checks if all names were located in string
    if i == numOfSubNames:
        foundDevices.append(device["name"])

# Checks if multiple devices have been found
if len(foundDevices) >= 2:
    largestNameLength = 0

    # Checks which device has the largest name
    for device in foundDevices:
        if (len(device) > largestNameLength):
            largestName = device
            largestNameLength = len(device)


    # Clears list and only add longest one
    foundDevices.clear()
    foundDevices.append(largestName)


print(foundDevices)

But that causes a problem when I say, for example, "Turn on Fan Light and the Fan", because that command really does contain multiple devices. How can I scan for devices the way I want?

A regular expression search is one way of quickly doing what you want, with a pattern built from the different device names.

import re

def find_with_regex(command, pattern):
    return list(set(re.findall(pattern, command, re.IGNORECASE)))
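One way the pattern argument could be built is sketched below. The helper name build_pattern is hypothetical (it is not part of the answer). It assumes device names are plain text that should match as whole words: re.escape protects names containing regex metacharacters, and \b keeps "fan" from matching inside a longer word.

```python
import re

def build_pattern(names):
    # Longer names first, so "fan light" is tried before "fan"
    ordered = sorted(names, key=len, reverse=True)
    # re.escape guards against names that contain regex metacharacters
    alternatives = "|".join(re.escape(name) for name in ordered)
    # \b keeps "fan" from matching inside a longer word such as "fanatic"
    return r"\b(?:" + alternatives + r")\b"

pattern = build_pattern(["fan", "fan light"])
found = re.findall(pattern, "Turn on Fan Light and the fan", re.IGNORECASE)
print(found)  # ['Fan Light', 'fan']
```

Note that with re.IGNORECASE the matches keep the casing used in the command, so lower-casing them before further lookups may be needed.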

I would also suggest building a reversed dictionary with a device: name shape; it may help to quickly look up the code name of a given device.

devices = [{'name': 'fan light'}, {'name': 'fan'}]

# build a quick-reference dict with device -> name structure
transformed = {dev: name for x in devices for name, dev in x.items()}
# this also collapses duplicated devices:
# a repeated key silently overwrites the earlier entry (no error is raised)

# print(transformed)
# {'fan light': 'name', 'fan': 'name'}
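Note that a dict comprehension does not raise on a repeated key; the later entry simply replaces the earlier one. If duplicated device names should be reported rather than silently merged, a minimal sketch with collections.Counter could look like this (duplicate_names is a hypothetical helper, not part of the answer):

```python
from collections import Counter

def duplicate_names(devices):
    # Count how often each device name occurs in the list of device dicts
    counts = Counter(device["name"] for device in devices)
    # Keep only the names that occur more than once
    return sorted(name for name, n in counts.items() if n > 1)

devices_with_repeat = [{"name": "fan light"}, {"name": "fan"}, {"name": "fan"}]
print(duplicate_names(devices_with_repeat))  # ['fan']
```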

Special thanks to buddemat for pointing out that the device names need to be in a particular order for this solution to work. I fixed it with reversed(sorted(...)) on the pattern-building line in the next code block.

Testing the function

test_cases = [
    'Turn on fan light',
    'Turn on fan light and fan',
    'Turn on fan and fan light',
    'Turn on fan and fan',
]

pattern = '|'.join(reversed(sorted(transformed)))
for command in test_cases:
    matches = find_with_regex(command, pattern)
    print(matches)

Output (the order within each list may vary, because the matches pass through a set)

['fan light']
['fan', 'fan light']
['fan', 'fan light']
['fan']
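To see why the order of the alternatives matters with re, the following small comparison may help. The re module tries alternatives from left to right and stops at the first one that matches, so a short name listed first hides any longer name that starts with it. The variable names are illustrative only.

```python
import re

command = "Turn on fan light and fan"

# "fan" listed first: it wins at every position, so "fan light" is never reported
short_first = re.findall("fan|fan light", command)
print(short_first)  # ['fan', 'fan']

# Longer name first: "fan light" is tried before "fan" at each position
long_first = re.findall("fan light|fan", command)
print(long_first)  # ['fan light', 'fan']
```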

You can use the third-party Python regex module instead of the re module (to improve upon RichieV's nice answer) if you don't want to rely on sorting the list of devices to ensure the correct result.

The problem with re is that it is not POSIX compliant, and thus the pipe operator | does not ensure that the longest leftmost match is returned (see also "How to order regular expression alternatives to get longest match?").

However, in regex you can specify (?p) before a pattern to ensure POSIX matching.

Altogether

import regex

devices = [{'name': 'fan'}, {'name': 'fan light'}]

test_cases = [
    'Turn on fan light',
    'Turn on fan light and fan',
    'Turn on fan and fan light',
    'Turn on fan and fan',
]

transformed = {dev: name for x in devices for name, dev in x.items()}

pattern = '|'.join(transformed)

for command in test_cases:
    matches = regex.findall(r'(?p)'+pattern,command)
    print(matches)

will give you

['fan light']
['fan light', 'fan']
['fan', 'fan light']
['fan', 'fan']

regardless of the order of the dictionaries in devices.
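If each device should be reported only once per command while keeping the order of first appearance, dict.fromkeys is one option. The sketch below uses the standard re module with longest-first ordering so that it runs without third-party packages; that is a substitute for the (?p) flag, not the same mechanism, and unique_devices is a hypothetical helper name.

```python
import re

def unique_devices(command, names):
    # Longest names first, so re's leftmost-first alternation prefers "fan light" over "fan"
    ordered = sorted(names, key=len, reverse=True)
    pattern = "|".join(re.escape(name) for name in ordered)
    matches = re.findall(pattern, command, re.IGNORECASE)
    # Normalise case, then dict.fromkeys drops repeats while keeping first-seen order
    return list(dict.fromkeys(match.lower() for match in matches))

names = ["fan", "fan light"]
print(unique_devices("Turn on fan and fan", names))        # ['fan']
print(unique_devices("Turn on Fan Light and fan", names))  # ['fan light', 'fan']
```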
