Dividing a string at various punctuation marks using split()

I'm trying to divide a string into words, removing spaces and punctuation marks.

I tried using the split() method, passing all the punctuation at once, but my results were incorrect:

>>> test='hello,how are you?I am fine,thank you. And you?'
>>> test.split(' ,.?')
['hello,how are you?I am fine,thank you. And you?']

I actually know how to do this with regexes already, but I'd like to figure out how to do it using split(). Please don't give me a regex solution.
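The output above follows from how str.split() treats its argument: the whole string ' ,.?' is one multi-character separator, not a set of individual characters. A minimal sketch of that behavior (the names `whole` and `demo` and the second sample string are only illustrative):

```python
test = 'hello,how are you?I am fine,thank you. And you?'

# The argument is a single separator string. The exact sequence ' ,.?'
# never occurs in test, so nothing is split.
whole = test.split(' ,.?')

# When that exact sequence does occur, the split happens there.
demo = 'left ,.?right'.split(' ,.?')

print(whole)
print(demo)
```

This is why the answers below either split once per delimiter or first convert every delimiter to whitespace.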

If you want to split a string on multiple delimiters, as in your example, you're going to need to use the re module despite your bizarre objections, like this:

>>> import re
>>> re.split('[?.,]', test)
['hello', 'how are you', 'I am fine', 'thank you', ' And you', '']

It's possible to get a similar result using split, but you need to call split once for every delimiter character, and each call has to iterate over the results of the previous split. This works, but it's ugly:

>>> sum([z.split() 
... for z in sum([y.split('?') 
... for y in sum([x.split('.') 
... for x in test.split(',')],[])], [])], [])
['hello', 'how', 'are', 'you', 'I', 'am', 'fine', 'thank', 'you', 'And', 'you']

This uses sum() with an empty list as the start value to flatten the list of lists returned by the previous iteration.
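The same split-then-flatten idea can be written as a loop, which may be easier to read than the nested comprehension. This is only a sketch: it swaps sum() for itertools.chain.from_iterable as the flattening step, and the names `parts` and `words` are illustrative.

```python
from itertools import chain

test = 'hello,how are you?I am fine,thank you. And you?'

# Start with the whole string, then split every piece on each
# delimiter in turn, flattening the list of lists after each pass.
parts = [test]
for sep in ',.?':
    parts = list(chain.from_iterable(p.split(sep) for p in parts))

# Finally split each piece on whitespace; empty pieces contribute nothing.
words = [w for p in parts for w in p.split()]
print(words)
```

chain.from_iterable also avoids the repeated list copying that sum(..., []) performs on long inputs.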

This is the best way I can think of without using the re module:

"".join((char if char.isalpha() else " ") for char in test).split()
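One caveat with the isalpha() approach is that digits and apostrophes are treated as separators too. The sketch below shows this on a made-up sample string and one possible adjustment; the helper name `split_words` and the choice of which characters to keep are assumptions, not part of the original answer.

```python
def split_words(text, keep=lambda ch: ch.isalpha()):
    """Replace every character rejected by keep() with a space, then split."""
    return "".join(ch if keep(ch) else " " for ch in text).split()

sample = "I'm fine, room 42"

# Original rule: letters only, so the apostrophe and the digits are dropped.
letters_only = split_words(sample)

# Adjusted rule (an assumption): keep letters, digits and apostrophes.
relaxed = split_words(sample, keep=lambda ch: ch.isalnum() or ch == "'")

print(letters_only)
print(relaxed)
```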

Since you don't want to use the re module, you can use this:

 test.replace(',',' ').replace('.',' ').replace('?',' ').split()
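The chain of replace() calls can be collapsed into a single pass with str.maketrans() and str.translate(), both standard str methods. A sketch, with `table`, `full_table`, `words` and `words_all` as illustrative names:

```python
import string

test = 'hello,how are you?I am fine,thank you. And you?'

# Map each delimiter to a space, then let split() handle the whitespace.
table = str.maketrans(',.?', '   ')
words = test.translate(table).split()

# Variant: treat every ASCII punctuation character as a delimiter.
full_table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
words_all = test.translate(full_table).split()

print(words)
print(words_all)
```

Note that the second variant also splits on apostrophes and hyphens, which may or may not be what you want.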

A modified version of larsks' answer, where you don't need to type all the punctuation characters yourself:

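>>> import re, string
>>> # re.escape() keeps characters such as ] and \ from being read as regex syntax
>>> re.split("[" + re.escape(string.punctuation) + "]+", test)
['hello', 'how are you', 'I am fine', 'thank you', ' And you', '']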

You can write a function to extend the usage of .split():

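def multi_split(s, separators):
    buf = [s]
    for sep in separators:
        # Build a new list on each pass instead of editing buf while
        # iterating over it, which can skip pieces when an item is removed.
        buf = [piece for text in buf for piece in text.split(sep) if piece]
    return buf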

And try it:

>>> multi_split('hello,how are you?I am fine,thank you. And you?', ' ,.?')
['hello', 'how', 'are', 'you', 'I', 'am', 'fine', 'thank', 'you', 'And', 'you']

This is clearer and can be reused in other situations.

Apologies for necroing: this thread comes up as the first result for non-regex splitting of a sentence. Since I had to come up with a non-Python-specific method for my students, and this thread didn't answer my question, I thought I would share just in case.

The point of the code is to use no libraries (and it's quick on large files):

sentence = "George Bernard-Shaw was a fine chap, I'm sure - who can really say?"
alpha = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
words = []
word = ""
mode = 0  # 1 means the previous character was a separator
for ch in sentence:
    if mode == 1:
        # Skip the empty word produced by two separators in a row (", ")
        if word:
            words.append(word)
        word = ""
        mode = 0
    if ch in alpha or ch == "'" or ch == "-":
        word += ch
    else:
        mode = 1
# Flush the last word, if there is one
if word:
    words.append(word)
print(words)

Output:

['George', 'Bernard-Shaw', 'was', 'a', 'fine', 'chap', "I'm", 'sure', '-', 'who', 'can', 'really', 'say']

I literally wrote this in about half an hour, so I'm sure the logic could be cleaned up. I also acknowledge that it may need additional logic to handle caveats such as hyphens correctly, since their use is inconsistent compared to something like an inverted comma (note the stray '-' in the output). Is there, indeed, any module that can do this correctly anyway?

A simple way to keep punctuation or other delimiters is:

import re

test='hello,how are you?I am fine,thank you. And you?'

re.findall('[^.?,]+.?', test)

Result:

['hello,', 'how are you?', 'I am fine,', 'thank you.', ' And you?']

Maybe this can help someone.
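If the delimiters should be kept as separate items rather than attached to the preceding text, re.split() with a capturing group is another option: per the re documentation, the text matched by a capturing group in the pattern is included in the result list. A sketch using the same test string (the name `pieces` is illustrative):

```python
import re

test = 'hello,how are you?I am fine,thank you. And you?'

# The parentheses form a capturing group, so each matched delimiter
# is returned as its own list element between the pieces of text.
pieces = re.split('([.?,])', test)
print(pieces)
```

The trailing empty string appears because the input ends with a delimiter; it can be filtered out with a list comprehension if it is not wanted.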

Notice: technical posts on this site are licensed under CC BY-SA 4.0. If you repost this content, please cite this site's URL or the original source.
