
How to create a Google-like text snippet from a string in Python?

I am trying to build something similar to Google's text snippet. The Google snippet contains highlighted keywords and "shifts" the text nicely in case a keyword does not appear right at the beginning of the analyzed string.

For example:

keyword "nike"

haystack string "lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor it is no wonder that nike is one of the largest brands in the world is not lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor 干草堆字符串“ lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorsum lorem ipsum dorlor lorem lorsum iperum lorsum lorsum ipsum lorsum lorsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorem dorlor难怪耐克不是世界上最大的品牌之一dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor

should become this snippet:

... lorem ipsum dorlor it is no wonder that nike is one of the largest brands in the world is not lorem ipsum dorlor lorem dorlor lorem ipsum dorlor loremdorlor lorem ipsum dorlor loremipsum dorlor lorem ipsum dorlor lorem ...

This is what I have so far as an idea:

keywordPosition = haystack.lower().index(keyword.lower())
snippetStart = keywordPosition - 100
snippetEnd = keywordPosition + 200
haystack = " ..." + haystack[snippetStart:snippetEnd] + " ..."

Is there an elegant way in Python to dynamically adjust snippetStart and snippetEnd? In many cases the above approach obviously throws an exception since the haystack slice indices are out of range.
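A note on the failure modes of the code above: Python slices do not raise on out-of-range indices. What can go wrong is that `str.index` raises `ValueError` when the keyword is absent, and that a negative `snippetStart` is read as an offset from the end of the string, which usually produces an empty or unexpected slice. A minimal sketch of clamping the window is shown here. The helper name `make_snippet` and its `before`/`after` parameters are hypothetical, with defaults taken from the 100 and 200 character margins above, and the sketch assumes that lowercasing does not change the string length (true for ASCII text).

```python
def make_snippet(haystack, keyword, before=100, after=200):
    """Return a character window around the first match of keyword, or ''."""
    # str.find returns -1 instead of raising when the keyword is missing.
    pos = haystack.lower().find(keyword.lower())
    if pos == -1:
        return ""
    # Clamp both ends so a negative start is never read as "from the end".
    start = max(pos - before, 0)
    end = min(pos + after, len(haystack))
    # Only add an ellipsis on a side where text was actually cut off.
    prefix = "... " if start > 0 else ""
    suffix = " ..." if end < len(haystack) else ""
    return prefix + haystack[start:end] + suffix
```

Because the window is measured in characters, it can still cut a word in half at either end; a token-based window avoids that.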

I created a little example with comments for you here.

http://pythonfiddle.com/google-snippet

haystack = "lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor it is no wonder that nike is one of the largest brands in the world is not lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor"

needle = "nike342"

lookahead = 7  # Number of tokens to show before "nike"

tokens = haystack.split(" ")  # Split string into a list of tokens

found_index = -1  # Index of the matching token. Initialize to -1 and assume it doesn't exist.

# Look up the needle in the token list. tokens.index raises ValueError when the needle is absent, so the lookup happens inside the try block below.

try:
    found_index = tokens.index(needle)
    # Get the max of the found index minus the number of words to show before the needle, and 0
    found_index = max(found_index - lookahead, 0)        

    # Create a sub list of the tokens from the found_index and end, then join those terms back together with a space.
    snippet = " ".join(tokens[found_index:len(tokens)])

except ValueError:
    snippet = ""  # No snippet or whatever error handling you are going to do

print(snippet)
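The example above trims only the left side and does not highlight the keyword. The following is a sketch of one possible extension, not part of the original answer: it keeps a fixed number of tokens on both sides, matches case-insensitively, and wraps the hit in a marker. The function name `token_snippet`, the `before`/`after`/`mark` parameters, and the choice of `*` as the highlight marker are all assumptions made for illustration.

```python
def token_snippet(haystack, needle, before=7, after=7, mark="*"):
    """Return a word window around the first token equal to needle, or ''."""
    tokens = haystack.split()
    lowered = [t.lower() for t in tokens]
    try:
        hit = lowered.index(needle.lower())  # raises ValueError if absent
    except ValueError:
        return ""
    # Clamp the window to the bounds of the token list.
    start = max(hit - before, 0)
    end = min(hit + after + 1, len(tokens))
    window = tokens[start:end]
    # Highlight the matched token inside the window.
    window[hit - start] = mark + window[hit - start] + mark
    # Only add an ellipsis on a side where tokens were dropped.
    prefix = "... " if start > 0 else ""
    suffix = " ..." if end < len(tokens) else ""
    return prefix + " ".join(window) + suffix


text = "a b c d e f g h it is no wonder that nike is one of the largest brands"
print(token_snippet(text, "nike", before=3, after=2))
# prints: ... no wonder that *nike* is one ...
```

Note that this compares whole tokens, so a keyword followed by punctuation (for example `nike,`) would not match; stripping punctuation from each token before the comparison is one way to handle that.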
