
Find all lowercase words in a sentence

I have to find all lowercase words in a sentence using Python. I've thought about using a regular expression as follows:

import re
re.findall(r'\b[^A-Z()\s\d]+\b', 'A word, TWO words')

It works except when I have, for instance, Aword. How can I solve it?

In general, the regex should match the following cases:

Aword --> output: [word]
A word --> output: [word]
A word word --> output: [word, word]
A(word) AND A pers --> output: [word, pers]
AwordWOrd --> output: [word, rd]

You don't actually need regex for this task; you can use str methods. The regex-based approach is quite fast, but it's possible to do it even faster using str.translate.

Here's the fastest solution that I've found. We create a translation table (a dictionary) that maps each non-lowercase ASCII character to a space. Then we use str.split to split the resulting string into a list; str.split() splits on any whitespace and discards it, leaving only the desired words.

from string import ascii_lowercase

# Create a translation table that maps all ASCII chars
# except lowercase letters to space
bad = bytes(set(range(128)) - set(ascii_lowercase.encode()))
table = dict.fromkeys(bad, ' ')

def find_lower(s):
    """ Translate non-lowercase chars to space """
    return s.translate(table).split()

Here's some test code that compares various approaches, including the regex solution of Ajax1234, as well as a few suggestions from regulars in the sopython chat room, including Kevin and user3483203.

The test data for this code consists of strings containing datalen words, with datalen running from 32 to 1024. Each word consists of 8 random characters; the random word generator mostly chooses lowercase letters.

As the timeit.Timer.repeat docs mention, the important number in these results is the minimum one (the first in each list); the other numbers just indicate how much variations in system load affected the results.

#! /usr/bin/env python3

""" Find all "words" of lowercase chars in a string

    Speed tests, using the timeit module, of various approaches

    See https://stackoverflow.com/q/51710087

    Written by Ajax1234, PM 2Ring, Kevin, and user3483203
    2018.08.07
"""

import re
from string import ascii_lowercase, printable
from timeit import Timer
from random import seed, choice

seed(17)

# A collection of chars with lots of lowercase
# letters to use for making random words
test_chars = 5 * ascii_lowercase + printable

def randword(n):
    """ Make a random "word" of n chars."""
    return ''.join([choice(test_chars) for _ in range(n)])

# Create a translation table that maps all ASCII chars
# except lowercase letters to space
bad = bytes(set(range(128)) - set(ascii_lowercase.encode()))
table = dict.fromkeys(bad, ' ')
def find_lower_pm2r(s, table=table):
    """ Translate non-lowercase chars to space """
    return s.translate(table).split()

def find_lower_pm2r_byte(s):
    """ Convert to bytes & test the ASCII code to see if it's in range """
    return bytes(b if 97 <= b <= 122 else 32 for b in s.encode()).decode().split()

def find_lower_ajax(s):
    """ Use a regex """
    return re.findall('[a-z]+', s)

def find_lower_kevin(s):
    """ Use the str.islower method """
    return "".join([c if c.islower() else " " for c in s]).split()

lwr = set(ascii_lowercase)
def find_lower_3483203(s, lwr=lwr):
    """ Test using a set """
    return ''.join([i if i in lwr else ' ' for i in s]).split()

functions = (
    find_lower_ajax,
    find_lower_pm2r,
    find_lower_pm2r_byte,
    find_lower_kevin,
    find_lower_3483203,
)

def verify(data, verbose=False):
    """ Check that all functions give the same results """
    if verbose:
        print('Verifying:', repr(data))
    results = []
    for func in functions:
        result = func(data)
        results.append(result)
        if verbose:
            print('{:20} : {}'.format(func.__name__, result))
    head, *tail = results
    return all(u == head for u in tail)

def time_test(loops, data):
    """ Perform the timing tests """
    timings = []
    for func in functions:
        t = Timer(lambda: func(data))
        result = sorted(t.repeat(3, loops))
        timings.append((result, func.__name__))
    timings.sort()
    for result, name in timings:
        print('{:20} : {:.6f}, {:.6f}, {:.6f}'.format(name, *result))
    print()

# Check that all functions perform correctly
datalen = 8
data = ' '.join([randword(8) for _ in range(datalen)])
print(verify(data, True), '\n')

# Time it!
loops = 1024
datalen = 32
for _ in range(6):
    data = ' '.join([randword(8) for _ in range(datalen)])
    print('loops', loops, 'len', datalen, verify(data, False))
    time_test(loops, data)
    loops //= 2
    datalen *= 2

Output

Verifying: '3c/zpws% OO8Dtcgl u;Zdm{y. dx]JTyjb pj;+ ym\t O6d.Jbg8 f\tRxrbau z`rxnkI:'
find_lower_ajax      : ['c', 'zpws', 'tcgl', 'u', 'dm', 'y', 'dx', 'yjb', 'pj', 'ym', 'd', 'bg', 'f', 'xrbau', 'z', 'rxnk']
find_lower_pm2r      : ['c', 'zpws', 'tcgl', 'u', 'dm', 'y', 'dx', 'yjb', 'pj', 'ym', 'd', 'bg', 'f', 'xrbau', 'z', 'rxnk']
find_lower_pm2r_byte : ['c', 'zpws', 'tcgl', 'u', 'dm', 'y', 'dx', 'yjb', 'pj', 'ym', 'd', 'bg', 'f', 'xrbau', 'z', 'rxnk']
find_lower_kevin     : ['c', 'zpws', 'tcgl', 'u', 'dm', 'y', 'dx', 'yjb', 'pj', 'ym', 'd', 'bg', 'f', 'xrbau', 'z', 'rxnk']
find_lower_3483203   : ['c', 'zpws', 'tcgl', 'u', 'dm', 'y', 'dx', 'yjb', 'pj', 'ym', 'd', 'bg', 'f', 'xrbau', 'z', 'rxnk']
True 

loops 1024 len 32 True
find_lower_pm2r      : 0.038420, 0.075005, 0.082880
find_lower_ajax      : 0.065296, 0.083511, 0.117944
find_lower_3483203   : 0.136276, 0.139128, 0.139208
find_lower_kevin     : 0.225619, 0.241822, 0.250794
find_lower_pm2r_byte : 0.249634, 0.257480, 0.268771

loops 512 len 64 True
find_lower_pm2r      : 0.026582, 0.026888, 0.027445
find_lower_ajax      : 0.059608, 0.061116, 0.074781
find_lower_3483203   : 0.129526, 0.130411, 0.163533
find_lower_kevin     : 0.217885, 0.219185, 0.219834
find_lower_pm2r_byte : 0.237033, 0.237225, 0.237880

loops 256 len 128 True
find_lower_pm2r      : 0.020133, 0.020144, 0.020194
find_lower_ajax      : 0.059215, 0.060153, 0.076451
find_lower_3483203   : 0.125678, 0.125989, 0.127963
find_lower_kevin     : 0.215228, 0.215832, 0.218419
find_lower_pm2r_byte : 0.234180, 0.237770, 0.240791

loops 128 len 256 True
find_lower_pm2r      : 0.017107, 0.017151, 0.017376
find_lower_ajax      : 0.061019, 0.062389, 0.074479
find_lower_3483203   : 0.123576, 0.123802, 0.126174
find_lower_kevin     : 0.212917, 0.213197, 0.214432
find_lower_pm2r_byte : 0.231248, 0.232049, 0.233519

loops 64 len 512 True
find_lower_pm2r      : 0.014723, 0.014752, 0.014787
find_lower_ajax      : 0.054442, 0.055595, 0.068130
find_lower_3483203   : 0.121101, 0.121847, 0.122723
find_lower_kevin     : 0.210416, 0.211491, 0.211810
find_lower_pm2r_byte : 0.232548, 0.232655, 0.234670

loops 32 len 1024 True
find_lower_pm2r      : 0.013886, 0.014000, 0.014106
find_lower_ajax      : 0.051643, 0.052614, 0.065182
find_lower_3483203   : 0.121135, 0.121708, 0.124333
find_lower_kevin     : 0.210581, 0.212073, 0.212232
find_lower_pm2r_byte : 0.245451, 0.251015, 0.252851

The results are for Python 3.6.0, on my ancient single-core 32-bit 2GHz machine running a Debian derivative of Linux. YMMV.
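
If you only want to time one function on your own machine, the "take the minimum" convention can be sketched in a few lines. This is a stripped-down illustration, not the test script above; the sample string and the loop count are arbitrary assumptions.

```python
import re
from timeit import Timer

# Arbitrary sample data (an assumption, not the random words used above)
data = 'A(word) AND A pers ' * 100

# Time the regex approach: 3 repeats of 1000 calls each
timer = Timer(lambda: re.findall('[a-z]+', data))
runs = timer.repeat(repeat=3, number=1000)

# The minimum is the figure to compare between approaches;
# the larger values mostly reflect other load on the system
best = min(runs)
print('best of {}: {:.6f} s'.format(len(runs), best))
```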


user3483203 has added some Pandas and matplotlib code to produce a graph from the timeit results.

[Figure: graph of the timing results]

You can use [a-z]+:

import re
_input = ['AwordWOrd', 'Aword', 'A word', 'A word word', 'A(word) AND A pers']
results = [re.findall('[a-z]+', i) for i in _input] 

Output:

[['word', 'rd'], ['word'], ['word'], ['word', 'word'], ['word', 'pers']]
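
As far as I can tell, the pattern in the question misses Aword because of its \b anchors: A and w are both word characters, so there is no word boundary between them and a match can never start inside the word. A small sketch comparing the two patterns (the variable names `original` and `simple` are only for illustration):

```python
import re

original = r'\b[^A-Z()\s\d]+\b'   # pattern from the question
simple = r'[a-z]+'                # pattern suggested in this answer

# No word boundary exists between 'A' and 'w', so the original
# pattern finds nothing in 'Aword'
print(re.findall(original, 'Aword'))               # []
print(re.findall(simple, 'Aword'))                 # ['word']

# With a space after 'A' there is a boundary, so the original works
print(re.findall(original, 'A word, TWO words'))   # ['word', 'words']
```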

I believe this should do the trick:

import re
re.findall(r'[a-z\s\d]+\b', 'Aword, TWO words')
