Sentence segmentation using Regex
I have a few text (SMS) messages and I want to segment them using the period ('.') as a delimiter. I am unable to handle the following types of messages. How can I segment these messages using a regex in Python?
Before segmentation:
'hyper count 16.8mmol/l.plz review b4 5pm.just to inform u.thank u'
'no of beds 8.please inform person in-charge.tq'
After segmentation:
'hyper count 16.8mmol/l'
'plz review b4 5pm'
'just to inform u'
'thank u'
'no of beds 8'
'please inform person in-charge'
'tq'
Each line is a separate message.
Updated:
I am doing natural language processing and I feel it's okay to treat '16.8mmol/l' and 'no of beds 8.2 cups of tea.' as the same. 80% accuracy is enough for me, but I want to reduce false positives as much as possible.
Some weeks ago, I searched for a regex that would catch every substring representing a number in a string, whatever the form in which the number is written, including scientific notation and Indian numbers with commas: see this thread.
I use this regex in the following code to give a solution to your problem.
Contrary to the other answers, in my solution the dot in '8.' isn't considered a dot on which a split must be done, because '8.' can be read as a float having no digit after the dot.
import re
# Raw strings keep the backslashes intact and avoid invalid-escape warnings.
# Full pattern: signed numbers, comma-grouped numbers, scientific notation.
regx = re.compile(r'(?<![\d.])(?!\.\.)'
                  r'(?<![\d.][eE][+-])(?<![\d.][eE])(?<!\d[.,])'
                  '' #---------------------------------
                  r'([+-]?)'
                  r'(?![\d,]*?\.[\d,]*?\.[\d,]*?)'
                  r'(?:0|,(?=0)|(?<!\d),)*'
                  r'(?:'
                  r'((?:\d(?!\.[1-9])|,(?=\d))+)[.,]?'
                  r'|\.(0)'
                  r'|((?<!\.)\.\d+?)'
                  r'|([\d,]+\.\d+?))'
                  r'0*'
                  '' #---------------------------------
                  r'(?:'
                  r'([eE][+-]?)(?:0|,(?=0))*'
                  r'(?:'
                  r'(?!0+(?=\D|\Z))((?:\d(?!\.[1-9])|,(?=\d))+)[.,]?'
                  r'|((?<!\.)\.(?!0+(?=\D|\Z))\d+?)'
                  r'|([\d,]+\.(?!0+(?=\D|\Z))\d+?))'
                  r'0*'
                  r')?'
                  '' #---------------------------------
                  r'(?![.,]?\d)')
# Simpler pattern: plain integers and decimals only.
simpler_regex = re.compile(r'(?<![\d.])0*(?:'
                           r'(\d+)\.?|\.(0)'
                           r'|(\.\d+?)|(\d+\.\d+?)'
                           r')0*(?![\d.])')
def split_outnumb(string, regx=regx, a=0):
    # Positions of the dots that belong to a number matched by regx.
    excluded_pos = {x for mat in regx.finditer(string)
                    for x in range(*mat.span()) if string[x] == '.'}
    li = []
    # Split on every remaining dot.
    for xdot in (x for x, c in enumerate(string) if c == '.' and x not in excluded_pos):
        li.append(string[a:xdot])
        a = xdot + 1
    li.append(string[a:])
    return li
for sentence in ('hyper count 16.8mmol/l.plz review b4 5pm.just to inform u.thank u',
                 'no of beds 8.please inform person in-charge.tq',
                 'no of beds 8.2 cups of tea.tarabada',
                 'this number .977 is a float',
                 'numbers 214.21E+45 , 478945.E-201 and .12478E+02 are in scientific.notation',
                 'an indian number 12,45,782.258 in this.sentence and 45,78,325. is another',
                 'no dot in this sentence',
                 ''):
    print('sentence =', sentence)
    print('splitted eyquem =', split_outnumb(sentence))
    print('splitted eyqu 2 =', split_outnumb(sentence, regx=simpler_regex))
    print('splitted gurney =', re.split(r"\.(?!\d)", sentence))
    print('splitted stema =', re.split(r'(?<!\d)\.|\.(?!\d)', sentence))
    print()
Result:
sentence = hyper count 16.8mmol/l.plz review b4 5pm.just to inform u.thank u
splitted eyquem = ['hyper count 16.8mmol/l', 'plz review b4 5pm', 'just to inform u', 'thank u']
splitted eyqu 2 = ['hyper count 16.8mmol/l', 'plz review b4 5pm', 'just to inform u', 'thank u']
splitted gurney = ['hyper count 16.8mmol/l', 'plz review b4 5pm', 'just to inform u', 'thank u']
splitted stema = ['hyper count 16.8mmol/l', 'plz review b4 5pm', 'just to inform u', 'thank u']
sentence = no of beds 8.please inform person in-charge.tq
splitted eyquem = ['no of beds 8.please inform person in-charge', 'tq']
splitted eyqu 2 = ['no of beds 8.please inform person in-charge', 'tq']
splitted gurney = ['no of beds 8', 'please inform person in-charge', 'tq']
splitted stema = ['no of beds 8', 'please inform person in-charge', 'tq']
sentence = no of beds 8.2 cups of tea.tarabada
splitted eyquem = ['no of beds 8.2 cups of tea', 'tarabada']
splitted eyqu 2 = ['no of beds 8.2 cups of tea', 'tarabada']
splitted gurney = ['no of beds 8.2 cups of tea', 'tarabada']
splitted stema = ['no of beds 8.2 cups of tea', 'tarabada']
sentence = this number .977 is a float
splitted eyquem = ['this number .977 is a float']
splitted eyqu 2 = ['this number .977 is a float']
splitted gurney = ['this number .977 is a float']
splitted stema = ['this number ', '977 is a float']
sentence = numbers 214.21E+45 , 478945.E-201 and .12478E+02 are in scientific.notation
splitted eyquem = ['numbers 214.21E+45 , 478945.E-201 and .12478E+02 are in scientific', 'notation']
splitted eyqu 2 = ['numbers 214.21E+45 , 478945.E-201 and .12478E+02 are in scientific', 'notation']
splitted gurney = ['numbers 214.21E+45 , 478945', 'E-201 and .12478E+02 are in scientific', 'notation']
splitted stema = ['numbers 214.21E+45 , 478945', 'E-201 and ', '12478E+02 are in scientific', 'notation']
sentence = an indian number 12,45,782.258 in this.sentence and 45,78,325. is another
splitted eyquem = ['an indian number 12,45,782.258 in this', 'sentence and 45,78,325. is another']
splitted eyqu 2 = ['an indian number 12,45,782.258 in this', 'sentence and 45,78,325. is another']
splitted gurney = ['an indian number 12,45,782.258 in this', 'sentence and 45,78,325', ' is another']
splitted stema = ['an indian number 12,45,782.258 in this', 'sentence and 45,78,325', ' is another']
sentence = no dot in this sentence
splitted eyquem = ['no dot in this sentence']
splitted eyqu 2 = ['no dot in this sentence']
splitted gurney = ['no dot in this sentence']
splitted stema = ['no dot in this sentence']
sentence =
splitted eyquem = ['']
splitted eyqu 2 = ['']
splitted gurney = ['']
splitted stema = ['']
I added a simpler_regex that detects numbers, taken from a post of mine in this thread.
It doesn't detect Indian numbers or numbers in scientific notation, but it in fact gives the same results on these examples.
You can use a negative lookahead assertion to match a "." not followed by a digit, and use re.split on this:
>>> import re
>>> splitter = r"\.(?!\d)"
>>> s = 'hyper count 16.8mmol/l.plz review b4 5pm.just to inform u.thank u'
>>> re.split(splitter, s)
['hyper count 16.8mmol/l', 'plz review b4 5pm', 'just to inform u', 'thank u']
>>> s = 'no of beds 8.please inform person in-charge.tq'
>>> re.split(splitter, s)
['no of beds 8', 'please inform person in-charge', 'tq']
What about:
re.split(r'(?<!\d)\.|\.(?!\d)', 'hyper count 16.8mmol/l.plz review b4 5pm.just to inform u.thank u')
The lookarounds ensure that at least one side of the dot is not a digit, so this also covers the 16.8 case. This expression will not split if there are digits on both sides of the dot.
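A short sketch of how this two-sided pattern behaves on a few inputs taken from this page. The function name split_either_side is only an illustrative assumption. The last sample shows a side effect: a leading decimal such as '.977' is split, because the character before its dot is not a digit.

```python
import re

# A dot is a split point when the character before it OR after it is not a digit.
EITHER_SIDE = re.compile(r'(?<!\d)\.|\.(?!\d)')

def split_either_side(text):
    """Hypothetical wrapper around re.split with the two-sided pattern."""
    return EITHER_SIDE.split(text)

print(split_either_side('16.8mmol/l.plz review'))
# ['16.8mmol/l', 'plz review']
print(split_either_side('no of beds 8.please'))
# ['no of beds 8', 'please']
print(split_either_side('this number .977 is a float'))
# ['this number ', '977 is a float']
```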
It depends on your exact sentences, but you could try:
.*?[a-zA-Z0-9]\.(?!\d)
See if that works. This will keep in the quotes, but you can then remove them if needed.
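One possible way to apply this pattern is with re.findall, sketched below. The choice of findall, the helper name segments_with_findall, and the removal of the trailing period are assumptions on my part; the answer does not say how the pattern is meant to be used. Each match ends with the period that closed it, and a final segment without a closing period (here 'tq') is not matched at all.

```python
import re

# Lazily match up to an alphanumeric character followed by a period
# that is not itself followed by a digit.
PATTERN = re.compile(r'.*?[a-zA-Z0-9]\.(?!\d)')

def segments_with_findall(message):
    """Hypothetical helper: collect matches and strip the closing period."""
    return [match.rstrip('.') for match in PATTERN.findall(message)]

print(segments_with_findall('no of beds 8.please inform person in-charge.tq'))
# ['no of beds 8', 'please inform person in-charge']
```

If the last segment matters, re.split with a lookahead is probably the simpler tool.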
"...".split(".")
split is a built-in Python string method that separates a string at every occurrence of a given separator.
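For comparison, here is what the plain string method does on one of the messages from the question. It splits on every period, including the one inside 16.8, which is the false positive the question wants to avoid.

```python
message = 'hyper count 16.8mmol/l.plz review b4 5pm'

# str.split has no notion of context: the decimal point is treated as a delimiter too.
pieces = message.split('.')
print(pieces)
# ['hyper count 16', '8mmol/l', 'plz review b4 5pm']
```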