Tokenize based on whitespace and trailing punctuation?

Question

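I am trying to come up with a regular expression that splits a string into a list based on whitespace or trailing punctuation.

For example: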

s = 'hel-lo  this has whi(.)te, space. very \n good'

What I want is:

['hel-lo', 'this', 'has', 'whi(.)te', ',', 'space', '.', 'very', 'good']

s.split() gets me most of the way there, except that it does not take care of the trailing punctuation.
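
For reference, this is what plain s.split() returns on the sample string. The whitespace runs are handled, but the comma and the full stop stay glued to the preceding word:

```python
s = 'hel-lo  this has whi(.)te, space. very \n good'

# With no arguments, str.split() splits on any run of whitespace,
# but trailing punctuation stays attached to the word before it.
tokens = s.split()
print(tokens)
# ['hel-lo', 'this', 'has', 'whi(.)te,', 'space.', 'very', 'good']
```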

Answer 1

import re
s = 'hel-lo  this has whi(.)te, space. very \n good'
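# The optional capture group keeps trailing punctuation as its own item;
# "if x" drops the None and empty-string entries that re.split also returns.
[x for x in re.split(r"([.,!?]+)?\s+", s) if x]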
# => ['hel-lo', 'this', 'has', 'whi(.)te', ',', 'space', '.', 'very', 'good']

You may need to adjust what counts as "punctuation".
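
One way to make the punctuation set adjustable is to build the character class from a parameter. The sketch below is a variation, not the code above: it swaps re.split for re.findall, and the names tokenize and punct are made up for illustration. It also covers one case the split-based pattern misses. That pattern requires whitespace after the punctuation, so punctuation at the very end of the string stays attached to the last word.

```python
import re

def tokenize(text, punct=".,!?"):
    """Split on whitespace; return a trailing run of `punct` as its own token."""
    # re.escape keeps characters such as ']' or '-' literal inside the class.
    cls = "[" + re.escape(punct) + "]"
    pattern = re.compile(
        rf"{cls}+(?=\s|$)"           # a punctuation run that ends a word
        rf"|\S+?(?={cls}*(?:\s|$))"  # shortest word body before such a run
    )
    return pattern.findall(text)

s = 'hel-lo  this has whi(.)te, space. very \n good'
print(tokenize(s))
# ['hel-lo', 'this', 'has', 'whi(.)te', ',', 'space', '.', 'very', 'good']
print(tokenize("wait; what?!", punct=";?!"))
# ['wait', ';', 'what', '?!']
```

The punctuation alternative comes first so that a run such as "?!" is returned as one token rather than one character at a time.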

Answer 2

A rough solution using spaCy. It already tokenizes words quite well.

import spacy
s = 'hel-lo  this has whi(.)te, space. very \n good'
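# 'en' is a shortcut name from older spaCy releases; spaCy 3+ expects a full
# model name such as 'en_core_web_sm'.
nlp = spacy.load('en')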
ls = [t.text for t in nlp(s) if t.text.strip()]

>> ['hel', '-', 'lo', 'this', 'has', 'whi(.)te', ',', 'space', '.', 'very', 'good']

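However, it also splits words that have a - in the middle, so I borrowed an existing solution to merge the pieces around each - back together.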

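# (start, end) slice bounds covering each '-' token and its two neighbours
merge = [(i-1, i+2) for i, tok in enumerate(ls) if i >= 1 and tok == '-']
# work right to left so earlier indices stay valid while the list shrinks
for t in merge[::-1]:
    merged = ''.join(ls[t[0]:t[1]])
    ls[t[0]:t[1]] = [merged]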

>> ['hel-lo', 'this', 'has', 'whi(.)te', ',', 'space', '.', 'very', 'good']
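
The merge step does not depend on spaCy, so it can be tried on a plain list of strings. Here is a minimal sketch of the same idea wrapped in a function. The name merge_around and the joiner parameter are hypothetical, and the input list is copied from the tokenizer output shown above.

```python
def merge_around(tokens, joiner="-"):
    """Return a new list with each `joiner` token glued to its two neighbours."""
    out = list(tokens)  # copy so the caller's list is left untouched
    spans = [(i - 1, i + 2) for i, tok in enumerate(out) if i >= 1 and tok == joiner]
    # Right to left, so the spans further left still point at the right items.
    for start, end in reversed(spans):
        out[start:end] = ["".join(out[start:end])]
    return out

ls = ['hel', '-', 'lo', 'this', 'has', 'whi(.)te', ',', 'space', '.', 'very', 'good']
print(merge_around(ls))
# ['hel-lo', 'this', 'has', 'whi(.)te', ',', 'space', '.', 'very', 'good']
```

Because the spans are processed from the right, a chain such as ['a', '-', 'b', '-', 'c'] collapses into the single item 'a-b-c'.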

Answer 3

I am using Python 3.6.1.

import re

s = 'hel-lo  this has whi(.)te, space. very \n good'
a = []  # this list stores the items
for i in s.split():  # split on whitespace
    j = re.split(r'([,.])$', i)  # split on your definition of trailing punctuation marks
    if len(j) > 1:
        a.extend(j[:-1])  # drop the empty string re.split leaves after the match
    else:
        a.append(i)
# a -> ['hel-lo', 'this', 'has', 'whi(.)te', ',', 'space', '.', 'very', 'good']


Answer 1
2 accepted 2017-04-27 02:26:17

Answer 2
0 2017-04-27 02:12:13

Answer 3
0 2017-04-27 04:03:38
