
Python: Splitting a string into words, saving separators

I have a string:

'Specified, if char, else 10 (default).'

I want to split it into two tuples:

words=('Specified', 'if', 'char', 'else', '10', 'default')

separators=(',', ' ', ',', ' ', ' (', ').')

Does anyone have a quick solution for this?

PS: the symbol '-' is a word separator, not part of a word.

import re
line = 'Specified, if char, else 10 (default).'
words = re.split(r'\)?[, .]\(?', line)
# words = ['Specified', '', 'if', 'char', '', 'else', '10', 'default', '']
separators = re.findall(r'\)?[, .]\(?', line)
# separators = [',', ' ', ' ', ',', ' ', ' ', ' (', ').']

If you really want tuples, pass the results to tuple(). If you do not want words to contain the empty entries (which come from between the commas and spaces), use the following:

words = [x for x in re.split(r'\)?[, .]\(?', line) if x]

or

words = tuple(x for x in re.split(r'\)?[, .]\(?', line) if x)

You can use a regex for that.

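>>> a = 'Specified, if char, else 10 (default).'
>>> from re import split
>>> # Every part of the original pattern was optional, so it could match the empty
>>> # string, and Python 3.7+ splits on empty matches. Require at least one separator.
>>> split(r"[ ,.()]+", a)
['Specified', 'if', 'char', 'else', '10', 'default', '']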

In this solution you have to write the pattern yourself. If you want to use your separators tuple instead, you have to convert its contents into a regex pattern.
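As a rough sketch of that conversion, assuming the separators are available as a tuple of literal strings (the `separators` values and the `build_pattern` name below are made up for illustration, not taken from the question):

```python
import re

# Hypothetical set of separator strings; adjust to your own data.
separators = (',', ' ', '(', ')', '.', '-')

def build_pattern(seps):
    # Escape each separator so characters like '(' and '.' match literally,
    # and try longer separators first so multi-character ones win.
    escaped = (re.escape(s) for s in sorted(seps, key=len, reverse=True))
    # A run of one or more separators counts as a single split point.
    return re.compile('(?:' + '|'.join(escaped) + ')+')

pattern = build_pattern(separators)
line = 'Specified, if char, else 10 (default).'
# Filter out the empty string left after the trailing ').'.
words = tuple(w for w in pattern.split(line) if w)
print(words)
```

Because the pattern always consumes at least one character, it never matches the empty string, which keeps re.split well behaved on recent Python versions.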

A regex to find all separators (assuming that anything that is not alphanumeric is a separator):

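import re
string = 'Specified, if char, else 10 (default).'
# Each non-word character is returned as its own match.
re.findall(r'[^\w]', string)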
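If you would rather get runs of separator characters grouped together (so that ', ' stays one separator), a small variation might look like this. It is only a sketch under the same assumption that every non-word character is a separator, which also covers '-'. Note that \w includes the underscore.

```python
import re

line = 'Specified, if char, else 10 (default).'

# \w+ matches runs of letters, digits and underscores (the words).
words = tuple(re.findall(r'\w+', line))
# \W+ matches runs of everything else, so ', ' stays a single separator.
separators = tuple(re.findall(r'\W+', line))

print(words)
print(separators)
```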

I would probably first .split() on spaces into a list, then iterate through the list, using a regex to capture whatever follows the word boundary.

import re
s = 'Specified, if char, else 10 (default).'
words = s.split()
separators = []
finalwords = []
for word in words:
    # Capture the word itself, then any punctuation trailing it.
    match = re.search(r'(\w+)\b(.*)', word)
    if match is None:
        continue  # token contains no word characters at all
    finalwords.append(match.group(1))
    separators.append(match.group(2))

To get both separators and words in one pass, you could use findall as follows:

import re
line = 'Specified, if char, else 10 (default).'
words = []
seps = []
# Each match is a (word, separator run) pair.
for w, s in re.findall(r"(\w*)([), .(]+)", line):
    words.append(w)
    seps.append(s)
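One caveat with a pattern that requires a separator after every word: if the string ends with a word, that last word is silently dropped. A possible variation, sketched here with a different pattern of my own choosing (any non-word character counts as a separator), makes the separator optional and unpacks the pairs into two tuples with zip:

```python
import re

line = 'Specified, if char, else 10'  # note: ends with a word, not a separator

# (\w+) requires a real word; (\W*) allows an empty separator after the last word.
pairs = re.findall(r'(\w+)(\W*)', line)

# zip(*pairs) transposes the list of pairs into two tuples.
words, seps = zip(*pairs) if pairs else ((), ())

print(words)
print(seps)
```

The trade-off is that a separator at the very start of the string is not reported, since every pair has to begin with a word.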

Here's my crack at it:

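>>> import re
>>> # Require at least one separator character so the pattern never matches
>>> # the empty string (Python 3.7+ would otherwise split on empty matches).
>>> p = re.compile(r'([ ,.()]+)')
>>> tmp = p.split('Specified, char, else 10 (default).')
>>> words = tmp[::2]
>>> separators = tmp[1::2]
>>> print(words)
['Specified', 'char', 'else', '10', 'default', '']
>>> print(separators)
[', ', ', ', ' ', ' (', ').']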

The only problem is that you can get a '' at the end or the beginning of words if there is a separator at the end or beginning of the sentence with nothing after or before it. However, that is easily checked for and eliminated.
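For what it's worth, here is one way that clean-up could be sketched. The helper name `split_keep_separators` is hypothetical, and the pattern assumes any non-word character (including '-') is a separator:

```python
import re

def split_keep_separators(text):
    # The capturing group makes re.split return the separators too,
    # alternating word, separator, word, ...
    parts = re.split(r'(\W+)', text)
    # Even positions hold words; drop the '' that appears at either end
    # when the text starts or ends with a separator.
    words = tuple(p for p in parts[::2] if p)
    # Odd positions hold the separator runs.
    separators = tuple(parts[1::2])
    return words, separators

words, separators = split_keep_separators('(Specified, if char-else).')
print(words)
print(separators)
```

Once the empty strings are dropped, the two tuples no longer line up index by index when the text begins with a separator, so keep the unfiltered list around if you need to reassemble the original string.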

