How to extract character ngram from sentences? - python
The following word2ngrams function extracts character 3-grams from a word:
>>> x = 'foobar'
>>> n = 3
>>> [x[i:i+n] for i in range(len(x)-n+1)]
['foo', 'oob', 'oba', 'bar']
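This post shows character n-gram extraction for a single word: "Quick implementation of character n-grams using python".
However, if I have sentences and want to extract character n-grams from them, is there a faster method than calling word2ngrams() iteratively?
What would a regex version that produces the same word2ngrams and sent2ngrams output look like? Would it be faster?
I've tried: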
import string, random, time
from itertools import chain

def word2ngrams(text, n=3):
    """ Convert word into character ngrams. """
    return [text[i:i+n] for i in range(len(text)-n+1)]

def sent2ngrams(text, n=3):
    """ Split the sentence into words, then collect the ngrams of each word. """
    return list(chain(*[word2ngrams(i,n) for i in text.lower().split()]))

def sent2ngrams_simple(text, n=3):
    """ Slide a window over the whole sentence, skipping windows that contain a space. """
    text = text.lower()
    return [text[i:i+n] for i in range(len(text)-n+1) if " " not in text[i:i+n]]

# Generate 100 random sentences, each made of 100 random 10-letter words.
sents = [" ".join([''.join(random.choice(string.ascii_uppercase) for j in range(10)) for i in range(100)]) for k in range(100)]

start = time.time()
x = [sent2ngrams(i) for i in sents]
print(time.time() - start)

start = time.time()
y = [sent2ngrams_simple(i) for i in sents]
print(time.time() - start)

print(x==y)
[OUT]:
0.0205280780792
0.0271739959717
True
EDITED
The regex approach looks elegant, but it runs slower than calling word2ngrams() iteratively:
import string, random, time, re
from itertools import chain

def word2ngrams(text, n=3):
    """ Convert word into character ngrams. """
    return [text[i:i+n] for i in range(len(text)-n+1)]

def sent2ngrams(text, n=3):
    return list(chain(*[word2ngrams(i,n) for i in text.lower().split()]))

def sent2ngrams_simple(text, n=3):
    text = text.lower()
    return [text[i:i+n] for i in range(len(text)-n+1) if " " not in text[i:i+n]]

def sent2ngrams_regex(text, n=3):
    # A lookahead with a capture group yields overlapping matches.
    rgx = '(?=(' + r'\S'*n + '))'
    # Lowercase like the other two versions; otherwise x==y==z is False
    # for the uppercase test data below.
    return re.findall(rgx, text.lower())

# Generate 100 random sentences, each made of 100 random 10-letter words.
sents = [" ".join([''.join(random.choice(string.ascii_uppercase) for j in range(10)) for i in range(100)]) for k in range(100)]

start = time.time()
x = [sent2ngrams(i) for i in sents]
print(time.time() - start)

start = time.time()
y = [sent2ngrams_simple(i) for i in sents]
print(time.time() - start)

start = time.time()
z = [sent2ngrams_regex(i) for i in sents]
print(time.time() - start)

print(x==y==z)
[OUT]:
0.0211708545685
0.0284190177917
0.0303599834442
True
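A single time.time() difference of a few hundredths of a second is easily dominated by noise. As a rough sketch (not part of the original post), the comparison could be repeated with the standard timeit module, taking the best of several runs. The helper names run_iterative and run_regex, the fixed random seed, and the repeat counts are assumptions chosen for illustration, and the regex version here lowercases its input so that the outputs are comparable.

```python
import random
import re
import string
import timeit
from itertools import chain


def word2ngrams(text, n=3):
    """Convert a word into character ngrams."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def sent2ngrams(text, n=3):
    """Split into words and collect each word's ngrams."""
    words = text.lower().split()
    return list(chain.from_iterable(word2ngrams(w, n) for w in words))


def sent2ngrams_regex(text, n=3):
    """Overlapping matches of n non-whitespace characters via a lookahead."""
    return re.findall(r"(?=(\S{%d}))" % n, text.lower())


# Assumption: a fixed seed so repeated runs see the same data.
random.seed(0)
sents = [
    " ".join(
        "".join(random.choice(string.ascii_uppercase) for _ in range(10))
        for _ in range(100)
    )
    for _ in range(100)
]


def run_iterative():
    return [sent2ngrams(s) for s in sents]


def run_regex():
    return [sent2ngrams_regex(s) for s in sents]


# Check that both approaches agree before comparing their speed.
same_output = run_iterative() == run_regex()

# Best of 3 repeats, 5 calls each; the minimum is the least noisy figure.
timings = {
    name: min(timeit.repeat(fn, number=5, repeat=3))
    for name, fn in [("iterative", run_iterative), ("regex", run_regex)]
}

print(same_output)
for name, seconds in timings.items():
    print(name, seconds)
```

The absolute numbers depend on the machine and Python version, so only the relative ordering is worth reading from the result.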
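Why not just (?=(...))
Edit: the same thing, but matching non-whitespace only: (?=(\S\S\S))
Edit 2: you can restrict it to whatever you want as well. For example, alphanumerics only: (?=([^\W_]{3}))
This uses a lookahead to capture 3 characters. The engine then bumps its position forward by 1 after each
match and captures the next 3.
The result for foobar is
foo
oob
oba
bar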
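As a small illustration of how the three patterns above behave with Python's re.findall, here is a sketch; the sample strings "foo bar" and "foo_bar baz1" are made up for the demonstration and are not from the original answer.

```python
import re

# Any three characters (except newline): windows may include spaces.
any_char = re.findall(r"(?=(...))", "foobar")
any_char_with_space = re.findall(r"(?=(...))", "foo bar")

# Three non-whitespace characters: windows that touch a space are skipped.
non_space = re.findall(r"(?=(\S\S\S))", "foo bar")

# Three alphanumerics: [^\W_] is "word character, but not underscore".
alnum = re.findall(r"(?=([^\W_]{3}))", "foo_bar baz1")

print(any_char)             # ['foo', 'oob', 'oba', 'bar']
print(any_char_with_space)  # ['foo', 'oo ', 'o b', ' ba', 'bar']
print(non_space)            # ['foo', 'bar']
print(alnum)                # ['foo', 'bar', 'baz', 'az1']
```

Because the lookahead itself consumes no characters, every match is zero-width and findall returns only the captured group, which is what makes the overlapping windows possible.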
# Compressed regex
# (?=(...))

# Expanded regex
(?=            # Start Lookahead assertion
    (          # Capture group 1 start
        .      # dot - metachar, matches any character except newline
        .      # dot - metachar
        .      # dot - metachar
    )          # Capture group 1 end
)              # End Lookahead assertion
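In Python, an expanded pattern like this can be used directly by compiling it with the re.VERBOSE flag, which ignores whitespace in the pattern and treats # as the start of a comment. A minimal sketch, where the name TRIGRAM is an assumption for illustration:

```python
import re

# re.VERBOSE lets the pattern carry its own layout and comments.
TRIGRAM = re.compile(
    r"""
    (?=        # Start lookahead assertion
      (        # Capture group 1 start
        . . .  # Three dots: any three characters except newline
      )        # Capture group 1 end
    )          # End lookahead assertion
    """,
    re.VERBOSE,
)

result = TRIGRAM.findall("foobar")
print(result)  # ['foo', 'oob', 'oba', 'bar']
```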