
Trim string (both left and right) to nearest word or sentence

I'm writing a function that finds the text near each occurrence of a given string in a larger piece of text. So far so good, just not pretty.

I'm having trouble trimming the resulting string to the nearest sentence or whole word without leaving any characters hanging over. The trim distance should be based on a number of words on either side of the keyword.

keyword = "marble"
string = "Right. This marble is as slippery as this marble. Kwoo-oooo-waaa! Ahhhk!"

With a 1-word distance (on either side of the keyword) it should result in:
2 occurrences found
"This marble is..."
"...this marble. Kwoo-oooo-waaa!"

With a 2-word distance:
2 occurrences found
"Right. This marble is as..."
"...as this marble. Kwoo-oooo-waaa! Ahhhk!"

What I've got so far is based on character distance, not word distance.

2 occurrences found
"ght. This marble is as sli"
"y as this marble. Kwoo-ooo"

However, a regex could split it to the nearest whole word or sentence. Is that the most Pythonic way to achieve this? This is what I've got so far:

import re

def trim_string(s, num):
    # Keep the first `num` characters plus the rest of the word they end in.
    # This only trims one end of the string, and not very well.
    trimmed = re.sub(r"^(.{%d}\S*).*" % num, r"\1", s)
    # ^(.*)(marble)(.+)  # only finds second occurrence???
    return trimmed

s = "Right. This marble is as slippery as this marble. Kwoo-oooo-waaa! Ahhhk!"
t = "Marble"

if t.lower() in s.lower():
    count = s.lower().count(t.lower())
    print("%s occurrences of %s" % (count, t))

    original_s = s

    for i in range(count):
        # Search case-insensitively so "Marble" also finds "marble"
        idx = s.lower().index(t.lower())

        dist = 10  # context size in characters, not words
        start = max(idx - dist, 0)  # a negative start would wrap around
        end = idx + len(t) + dist
        a = s[start:end]

        print(a)
        print(trim_string(a, 5))

        # Continue searching after this occurrence
        s = s[idx + len(t):]

Thank you.

You can use this regex to match up to N whitespace-delimited words on either side of marble:

2 words:

(?:(?:\S+\s+){0,2})?\bmarble\b\S*(?:\s+\S+){0,2}

RegEx breakdown:

(?:(?:\S+\s+){0,2})? # match up to 2 whitespace-delimited words before the keyword (optional, greedy)
\bmarble\b\S*        # match the word "marble" followed by zero or more non-whitespace characters
(?:\s+\S+){0,2}      # match up to 2 whitespace-delimited words after the keyword


1 word regex:

(?:(?:\S+\s+){0,1})?\bmarble\b\S*(?:\s+\S+){0,1}

You can do this without re if you ignore the punctuation:

import itertools as it
import string

def nwise(iterable, n):
    ts = it.tee(iterable, n)
    for c, t in enumerate(ts):
        next(it.islice(t, c, c), None)
    return zip(*ts)

def grep(s, k, n):
    m = str.maketrans('', '', string.punctuation)
    return [' '.join(x) for x in nwise(s.split(), n*2+1) if x[n].translate(m).lower() == k]

In []:
keyword = "marble"
sentence = "Right. This marble is as slippery as this marble. Kwoo-oooo-waaa! Ahhhk!"
print('...\n...'.join(grep(sentence, keyword, n=2)))

Out[]:
Right. This marble is as...
...as this marble. Kwoo-oooo-waaa! Ahhhk!

In []:
print('...\n...'.join(grep(sentence, keyword, n=1)))

Out[]:
This marble is...
...this marble. Kwoo-oooo-waaa!
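
One thing to be aware of: nwise only yields full windows of 2n+1 words, so an occurrence that sits fewer than n words from the start or end of the text never lands in the middle slot and is skipped (with the sample sentence, n=3 returns nothing). A possible index-based variant that clamps the window at the text boundaries is sketched below. The name grep_clamped is hypothetical, and this is an alternative under that assumption, not the answer's own method.

```python
import string

def grep_clamped(s, k, n):
    """Return up to n words of context on each side of every occurrence of k."""
    words = s.split()
    # Translation table that deletes punctuation, as in grep() above
    strip_punct = str.maketrans('', '', string.punctuation)
    results = []
    for i, word in enumerate(words):
        if word.translate(strip_punct).lower() == k.lower():
            lo = max(i - n, 0)  # clamp so the slice never wraps around
            # Slicing past the end of the list is safe in Python
            results.append(' '.join(words[lo:i + n + 1]))
    return results

sentence = "Right. This marble is as slippery as this marble. Kwoo-oooo-waaa! Ahhhk!"
print(grep_clamped(sentence, "marble", 1))
# ['This marble is', 'this marble. Kwoo-oooo-waaa!']
print(grep_clamped(sentence, "marble", 3))
# ['Right. This marble is as slippery', 'slippery as this marble. Kwoo-oooo-waaa! Ahhhk!']
```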

Using the ngrams() function from this answer, here's one approach which just takes all the n-grams and then chooses the ones with keyword in the middle:

def get_ngrams(document, n):
    words = document.split(' ')
    ngrams = []
    for i in range(len(words)-n+1):
        ngrams.append(words[i:i+n])
    return ngrams

keyword = "marble"
string = "Right. This marble is as slippery as this marble. Kwoo-oooo-waaa! Ahhhk!"

n = 3
pos = int(n/2 - .5)
# ignore punctuation by matching the middle word up to the number of chars in keyword
result = [ng for ng in get_ngrams(string, n) if ng[pos][:len(keyword)] == keyword]
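
To express this in terms of the word distance from the question, the n-gram size can be derived as 2 * distance + 1, with the keyword expected at index distance. A minimal sketch along those lines follows. keyword_ngrams is a hypothetical helper name, and like the code above it compares only the first len(keyword) characters, so a word such as "marbles" would also match.

```python
def get_ngrams(document, n):
    """Return every run of n consecutive space-separated words."""
    words = document.split(' ')
    return [words[i:i + n] for i in range(len(words) - n + 1)]

def keyword_ngrams(document, keyword, distance):
    """Return the n-grams that have keyword exactly in the middle, as strings."""
    n = 2 * distance + 1  # `distance` words on each side plus the keyword
    return [' '.join(ng) for ng in get_ngrams(document, n)
            # Prefix comparison ignores trailing punctuation such as "marble."
            if ng[distance][:len(keyword)] == keyword]

keyword = "marble"
string = "Right. This marble is as slippery as this marble. Kwoo-oooo-waaa! Ahhhk!"
print(keyword_ngrams(string, keyword, 1))
# ['This marble is', 'this marble. Kwoo-oooo-waaa!']
print(keyword_ngrams(string, keyword, 2))
# ['Right. This marble is as', 'as this marble. Kwoo-oooo-waaa! Ahhhk!']
```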

more_itertools.adjacent [1] is a tool that probes neighboring elements.

import operator as op
import itertools as it

import more_itertools as mit


# Given
keyword = "marble"
iterable = "Right. This marble is as slippery as this marble. Kwoo-oooo-waaa! Ahhhk!"

Code

words = iterable.split(" ")
pred = lambda x: x in (keyword, "".join([keyword, "."]))

neighbors = mit.adjacent(pred, words, distance=1)    
[" ".join([items[1] for items in g]) for k, g in it.groupby(neighbors, op.itemgetter(0)) if k]
# Out: ['This marble is', 'this marble. Kwoo-oooo-waaa!']

neighbors = mit.adjacent(pred, words, distance=2)
[" ".join([items[1] for items in g]) for k, g in it.groupby(neighbors, op.itemgetter(0)) if k]
# Out: ['Right. This marble is as', 'as this marble. Kwoo-oooo-waaa! Ahhhk!']

The OP may adjust the final output of these results as desired.


Details

The given string is split into an iterable of words. A simple predicate [2] is defined that returns True when a word equals the keyword (or the keyword with a trailing period).

words = iterable.split(" ")
pred = lambda x: x in (keyword, "".join([keyword, "."]))

neighbors = mit.adjacent(pred, words, distance=1)
list(neighbors)

more_itertools.adjacent yields (bool, word) tuples, shown here as a list:

Output

[(False, 'Right.'),
 (True, 'This'),
 (True, 'marble'),
 (True, 'is'),
 (False, 'as'),
 (False, 'slippery'),
 (False, 'as'),
 (True, 'this'),
 (True, 'marble.'),
 (True, 'Kwoo-oooo-waaa!'),
 (False, 'Ahhhk!')]

The first element of each tuple is True for every occurrence of the keyword and for every word within a distance of 1 of it. We use this boolean with itertools.groupby to group consecutive, neighboring items together. For example:

neighbors = mit.adjacent(pred, words, distance=1)
[(k, list(g)) for k, g in it.groupby(neighbors, op.itemgetter(0))]

Output

[(False, [(False, 'Right.')]),
 (True, [(True, 'This'), (True, 'marble'), (True, 'is')]),
 (False, [(False, 'as'), (False, 'slippery'), (False, 'as')]),
 (True, [(True, 'this'), (True, 'marble.'), (True, 'Kwoo-oooo-waaa!')]),
 (False, [(False, 'Ahhhk!')])]

Finally, we apply a condition to filter out the False groups and join the words of each remaining group together.

neighbors = mit.adjacent(pred, words, distance=1)    
[" ".join([items[1] for items in g]) for k, g in it.groupby(neighbors, op.itemgetter(0)) if k]

Output

['This marble is', 'this marble. Kwoo-oooo-waaa!']
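
For readers without the third-party library, the flag-then-group steps walked through above can be imitated with the standard library alone. This is only a sketch: flag_adjacent is a hypothetical stand-in written for illustration, not the more_itertools implementation, and keyword_groups is likewise an assumed helper name.

```python
import itertools as it
import operator as op

def flag_adjacent(pred, items, distance=1):
    """Pair each item with True if it, or any item within `distance`
    positions of it, satisfies pred (hypothetical stand-in)."""
    items = list(items)
    hits = [i for i, item in enumerate(items) if pred(item)]
    return [(any(abs(i - h) <= distance for h in hits), item)
            for i, item in enumerate(items)]

def keyword_groups(text, keyword, distance=1):
    words = text.split(" ")
    pred = lambda w: w in (keyword, keyword + ".")
    flagged = flag_adjacent(pred, words, distance)
    # Group consecutive items by their flag and keep only the True groups
    return [" ".join(word for _, word in group)
            for is_near, group in it.groupby(flagged, op.itemgetter(0))
            if is_near]

text = "Right. This marble is as slippery as this marble. Kwoo-oooo-waaa! Ahhhk!"
print(keyword_groups(text, "marble", 1))
# ['This marble is', 'this marble. Kwoo-oooo-waaa!']
print(keyword_groups(text, "marble", 2))
# ['Right. This marble is as', 'as this marble. Kwoo-oooo-waaa! Ahhhk!']
```

Note that in this sketch, once the distance is large enough for two neighborhoods to touch (distance=3 for the sample sentence), groupby merges them into a single result.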

[1] more_itertools is a third-party library that implements many useful tools, including the itertools recipes.

[2] Note: stronger predicates can certainly be written for keywords with any punctuation, but this one was used for simplicity.
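
As one possible example of such a stronger predicate, the sketch below strips leading and trailing punctuation and ignores case before comparing. The factory name make_keyword_pred is hypothetical. It uses only the standard library and should be usable wherever a predicate on single words is expected.

```python
import string

def make_keyword_pred(keyword):
    """Build a predicate that ignores case and surrounding punctuation."""
    target = keyword.lower()
    def pred(word):
        # str.strip removes any leading/trailing characters found in the set
        return word.strip(string.punctuation).lower() == target
    return pred

pred = make_keyword_pred("marble")
words = "Right. This marble is as slippery as this marble. Kwoo-oooo-waaa! Ahhhk!".split(" ")
print([w for w in words if pred(w)])
# ['marble', 'marble.']
```

Punctuation inside a word is left alone, so "marbles" and "marble-like" do not match.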

