Splitting a string into words and punctuation
I'm trying to split a string up into words and punctuation, adding the punctuation to the list produced by the split.
For instance:
>>> c = "help, me"
>>> print c.split()
['help,', 'me']
What I really want the list to look like is:
['help', ',', 'me']
So, I want the string split at whitespace with the punctuation split from the words.
I've tried to parse the string first and then run the split:
>>> separatedPunctuation = ""
>>> for character in c:
...     if character in ".,;!?":
...         outputCharacter = " %s" % character
...     else:
...         outputCharacter = character
...     separatedPunctuation += outputCharacter
...
>>> print separatedPunctuation
help , me
>>> print separatedPunctuation.split()
['help', ',', 'me']
This produces the result I want, but is painfully slow on large files.
Is there a way to do this more efficiently?
This is more or less the way to do it:
>>> import re
>>> re.findall(r"[\w']+|[.,!?;]", "Hello, I'm a string!")
['Hello', ',', "I'm", 'a', 'string', '!']
The trick is not to think about where to split the string, but what to include in the tokens.
Here is a Unicode-aware version:
re.findall(r"\w+|[^\w\s]", text, re.UNICODE)
The first alternative catches sequences of word characters (as defined by Unicode, so "résumé" won't turn into ['r', 'sum']); the second catches individual non-word characters, ignoring whitespace.
Note that, unlike the top answer, this treats the single quote as separate punctuation (e.g. "I'm" -> ['I', "'", 'm']). This appears to be standard in NLP, so I consider it a feature.
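For example, in Python 3 (where str patterns are Unicode-aware by default, so the re.UNICODE flag is no longer needed) the pattern keeps accented words intact; the French sample string here is my own illustration:

```python
import re

# Sketch: the Unicode-aware pattern from the answer above, applied to
# accented text. \w+ matches runs of word characters (including é, î),
# [^\w\s] matches single non-word, non-space characters.
text = "Le résumé, s'il vous plaît!"
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)
# → ['Le', 'résumé', ',', 's', "'", 'il', 'vous', 'plaît', '!']
```

Note how the apostrophe in "s'il" comes out as its own token, as described above.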
Here's my entry.
I have my doubts as to how well this will hold up in the sense of efficiency, or whether it catches all cases (note the "!!!" grouped together; this may or may not be a good thing).
>>> import re
>>> import string
>>> s = "Helo, my name is Joe! and i live!!! in a button; factory:"
>>> l = [item for item in map(string.strip, re.split("(\W+)", s)) if len(item) > 0]
>>> l
['Helo', ',', 'my', 'name', 'is', 'Joe', '!', 'and', 'i', 'live', '!!!', 'in', 'a', 'button', ';', 'factory', ':']
>>>
One obvious optimization, if you're going to be doing this on a line-by-line basis, would be to compile the regex beforehand (using re.compile).
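A minimal sketch of that optimization, using the pattern from the top answer; the sample lines are illustrative:

```python
import re

# Compile the token pattern once, then reuse it for every line.
token_re = re.compile(r"[\w']+|[.,!?;]")

lines = ["help, me", "Hello, I'm a string!"]
for line in lines:
    print(token_re.findall(line))
# → ['help', ',', 'me']
# → ['Hello', ',', "I'm", 'a', 'string', '!']
```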
Here's a minor update to your implementation. If you're trying to do anything more detailed, I suggest looking into the NLTK that le dorfier suggested.
This might only be a little faster, since ''.join() is used in place of +=, and join is known to be faster.
import string

d = "Hello, I'm a string!"

result = []
word = ''

for char in d:
    if char not in string.whitespace:
        if char not in string.ascii_letters + "'":
            if word:
                result.append(word)
            result.append(char)
            word = ''
        else:
            word = ''.join([word, char])
    else:
        if word:
            result.append(word)
        word = ''

print result
['Hello', ',', "I'm", 'a', 'string', '!']
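One caveat with this loop: it only flushes the current word on whitespace or punctuation, so a string that ends in a letter (e.g. "help, me") drops its final word. A sketch in Python 3 syntax that collects characters in a list, joins once per word, and flushes the final word:

```python
import string

def tokenize(text):
    # Same character-by-character approach as above, but accumulating
    # characters in a list and joining once per word, with a final
    # flush so a trailing word is not lost.
    result, word = [], []
    for char in text:
        if char in string.whitespace:
            if word:
                result.append(''.join(word))
                word = []
        elif char in string.ascii_letters + "'":
            word.append(char)
        else:
            if word:
                result.append(''.join(word))
                word = []
            result.append(char)  # punctuation becomes its own token
    if word:
        result.append(''.join(word))  # flush trailing word
    return result

print(tokenize("Hello, I'm a string!"))
# → ['Hello', ',', "I'm", 'a', 'string', '!']
```

Unlike the loop above, tokenize("help, me") keeps the trailing "me".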
This worked for me:
import re
i = 'Sandra went to the hallway.!!'
l = re.split('(\W+?)', i)
print(l)
empty = ['', ' ']
l = [el for el in l if el not in empty]
print(l)
Output:
['Sandra', ' ', 'went', ' ', 'to', ' ', 'the', ' ', 'hallway', '.', '', '!', '', '!', '']
['Sandra', 'went', 'to', 'the', 'hallway', '.', '!', '!']
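The two-step filter can be collapsed into a single comprehension that drops both empty strings and bare whitespace; a sketch in Python 3 syntax:

```python
import re

# Same split as above, but filtering in one pass: tok.strip() is falsy
# for both '' and strings of pure whitespace.
i = 'Sandra went to the hallway.!!'
l = [tok for tok in re.split(r'(\W+?)', i) if tok.strip()]
print(l)
# → ['Sandra', 'went', 'to', 'the', 'hallway', '.', '!', '!']
```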
I came up with a way to tokenize all words and \W+ patterns using \b which doesn't need hardcoding:
>>> import re
>>> sentence = 'Hello, world!'
>>> tokens = [t.strip() for t in re.findall(r'\b.*?\S.*?(?:\b|$)', sentence)]
>>> tokens
['Hello', ',', 'world', '!']
Here .*?\S.*? is a pattern matching anything that is not a space, and $ is added to match the last token in a string if it's a punctuation symbol.
Note the following, though: this will group punctuation that consists of more than one symbol:
>>> print [t.strip() for t in re.findall(r'\b.*?\S.*?(?:\b|$)', '"Oh no", she said')]
['Oh', 'no', '",', 'she', 'said']
Of course, you can find and split such groups with:
>>> for token in [t.strip() for t in re.findall(r'\b.*?\S.*?(?:\b|$)', '"You can", she said')]:
... print re.findall(r'(?:\w+|\W)', token)
['You']
['can']
['"', ',']
['she']
['said']
Try this:
string_big = "One of Python's coolest features is the string format operator This operator is unique to strings"
my_list = []
x = len(string_big)
position_of_space = 0
while position_of_space < x:
    for i in range(position_of_space, x):
        if string_big[i] == ' ':
            break
        else:
            continue
    print string_big[position_of_space:(i + 1)]
    my_list.append(string_big[position_of_space:(i + 1)])
    position_of_space = i + 1
print my_list
Have you tried using a regex?
http://docs.python.org/library/re.html#re-syntax
By the way, why do you need the "," in the second list at all? You will know that one comes after each piece of text, i.e.
[0]
","
[1]
","
So if you want to add the "," you can just do it after each iteration when you use the array.
In case you are not allowed to import anything, use this!
word = "Hello,there"
word = word.replace(",", " ,")
word = word.replace(".", " .")
print word.split()
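The replace trick generalizes to a whole set of punctuation marks; a sketch in Python 3 syntax, where the function name and the default punctuation set are my own choices:

```python
def split_words_and_punct(text, punctuation=",.;!?"):
    # Pad each punctuation mark with spaces so that split() separates
    # it from the surrounding words. No imports needed.
    for mark in punctuation:
        text = text.replace(mark, " " + mark + " ")
    return text.split()

print(split_words_and_punct("Hello,there"))
# → ['Hello', ',', 'there']
```

Like the original replace approach, this groups nothing: "!!!" would come out as three separate '!' tokens.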