Split more than one word in Python

How can I write a program in Python that can split on more than one word or character? For example, I have these sentences: Hi, This is a test. Are you surprised? In this example I need my program to split the sentences on ',', '!', '?' and '.'. I know about split in the str library and about NLTK, but I need to know whether there is any built-in Pythonic way like split.

Use re.split:

import re

string = 'Hi, This is a test. Are you surprised?'
words = re.split('[,!?.]', string)
print(words)
# ['Hi', ' This is a test', ' Are you surprised', '']
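The result above keeps the leading spaces and ends with an empty string, because the input ends with a delimiter. If that is not wanted, one possible cleanup is sketched below. The helper name split_sentences and its delimiters parameter are made up for this illustration and are not part of the re module.

```python
import re

def split_sentences(text, delimiters=",!?."):
    # Hypothetical helper: build a character class from the delimiters.
    # re.escape protects characters that are special inside a pattern.
    pattern = "[" + re.escape(delimiters) + "]"
    # Strip surrounding whitespace and drop pieces that end up empty.
    return [part.strip() for part in re.split(pattern, text) if part.strip()]

print(split_sentences("Hi, This is a test. Are you surprised?"))
# ['Hi', 'This is a test', 'Are you surprised']
```

Filtering after the split keeps the pattern itself simple; the same list comprehension works with any other delimiter set.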

You are looking for the tokenize module of the NLTK package. NLTK stands for Natural Language Toolkit.

Or try re.split from the re module.

From the re documentation:

>>> re.split(r'\W+', 'Words, words, words.')
['Words', 'words', 'words', '']
>>> re.split(r'(\W+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']
>>> re.split(r'\W+', 'Words, words, words.', maxsplit=1)
['Words', 'words, words.']
>>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)
['0', '3', '9']

I think I found a tricky way to solve my question. I don't need to use any modules for it. I can use the replace method of str to replace characters like ! or ? with ., and then use the split method to split the text on '.'.
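A minimal sketch of this replace-then-split idea might look like the following. The function name split_on_any and its parameters are hypothetical, chosen only for this illustration; the comma is included among the replaced characters because the question asks to split on it as well.

```python
def split_on_any(text, delimiters="!?,", target="."):
    # Hypothetical helper: rewrite every delimiter to one target
    # character, so that a single str.split call can do the work.
    for d in delimiters:
        text = text.replace(d, target)
    # Strip surrounding whitespace and drop empty pieces, such as the
    # one produced by a delimiter at the very end of the text.
    return [part.strip() for part in text.split(target) if part.strip()]

print(split_on_any("Hi, This is a test. Are you surprised?"))
# ['Hi', 'This is a test', 'Are you surprised']
```

This uses only str methods, at the cost of one pass over the text per delimiter.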

def get_words(s):
    l = []
    w = ''
    for c in s:
        if c in '-!?,. ':
            if w != '': 
                l.append(w)
            w = ''
        else:
            w = w + c
    if w != '': 
        l.append(w)
    return l



>>> s = "Hi, This is a test. Are you surprised?"
>>> print(get_words(s))
['Hi', 'This', 'is', 'a', 'test', 'Are', 'you', 'surprised']


If you change '-!?,. ' to '-!?,.' (that is, remove the space from the set of delimiters), the output will be:
['Hi', ' This is a test', ' Are you surprised']
