
Python String Replace: Keywords into URLs

I want to replace some keywords in a string with URLs, for example:

content.replace("Google", '<a href="http://www.google.com">Google</a>')

However, I want to replace a keyword only if it is not already wrapped in a link.

The content is simple HTML:

<p><b>This is an example!</b></p><p>I love <a href="http://www.google.com">Google</a></p><p><a href="http://www.google.com"><img src="/google.jpg" /></a></p>

Mainly <a> and <img> tags.

The main question: how can I determine whether a keyword is already wrapped in an <a> or <img> tag?

Here is a similar question for PHP, "find and replace keywords with urls ONLY if not already wrapped in a url", but the answer there is not efficient.

Are there better solutions in Python? Code examples would be appreciated. Thanks!

I use Beautiful Soup for parsing my HTML, since parsing HTML with regular expressions can prove tricky. With Beautiful Soup you can use previous_sibling and previous_element to figure out what you need.

You interact with it in this fashion:

soup.find_all('img')
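To make that a bit more concrete, here is a minimal sketch, assuming the bs4 package (Beautiful Soup 4) and the sample HTML from the question plus one extra paragraph added for illustration. The helper name is_inside_link is hypothetical; it uses find_parent, rather than the sibling attributes mentioned above, to tell whether a text node is already inside a link.

```python
import re
from bs4 import BeautifulSoup

# Sample HTML from the question; the last paragraph is added for illustration.
html = (
    '<p><b>This is an example!</b></p>'
    '<p>I love <a href="http://www.google.com">Google</a></p>'
    '<p><a href="http://www.google.com"><img src="/google.jpg" /></a></p>'
    '<p>Search with Google</p>'
)
soup = BeautifulSoup(html, "html.parser")

def is_inside_link(text_node):
    """Hypothetical helper: True if the text node has an <a> ancestor."""
    return text_node.find_parent("a") is not None

# string= matches text nodes only, so attribute values such as
# src="/google.jpg" are never candidates for replacement.
matches = soup.find_all(string=re.compile("Google"))
status = [(str(node), is_inside_link(node)) for node in matches]
print(status)
```

Because only text nodes are searched, a keyword can never be found "inside" an <img> tag; the only case that needs an explicit check is text nested in an <a>.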

As Chris-Top said, BeautifulSoup is the way to go:

from bs4 import BeautifulSoup
import re

html = """
<div>
    <p>The quick brown <a href='http://en.wikipedia.org/wiki/Dog'>fox</a> jumped over the lazy Dog</p>
    <p>The <a href='http://en.wikipedia.org/wiki/Dog'>dog</a>, who was, in reality, not so lazy, gave chase to the fox.</p>
    <p>See image for reference:</p>
    <img src='dog_chasing_fox.jpg' title='Dog chasing fox'/>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# (search term, URL) pairs
keywords = [("dog", "http://en.wikipedia.org/wiki/Dog"),
            ("fox", "http://en.wikipedia.org/wiki/Fox")]

def insert_links(keyword, href):
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    # find_all returns a list, so the tree can be modified while looping
    for text_node in soup.find_all(string=pattern):
        if text_node.parent.name != 'a':  # skip text that is already a link
            link = soup.new_tag('a', href=href)
            link.string = keyword
            # split around the first match only; later matches stay as text
            before, after = pattern.split(text_node, maxsplit=1)
            before_node = soup.new_string(before)
            text_node.replace_with(before_node)
            before_node.insert_after(link)
            link.insert_after(soup.new_string(after))


for keyword, href in keywords:
    insert_links(keyword, href)

print(soup)

This will yield:

<div>
    <p>The quick brown <a href="http://en.wikipedia.org/wiki/Dog">fox</a> jumped over the lazy <a href="http://en.wikipedia.org/wiki/Dog">dog</a></p>
    <p>The <a href="http://en.wikipedia.org/wiki/Dog">dog</a>, who was, in reality, not so lazy, gave chase to the <a href="http://en.wikipedia.org/wiki/Fox">fox</a>.</p>
    <p>See image for reference:</p>
    <img src="dog_chasing_fox.jpg" title="Dog chasing fox"/>
</div>
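Note that the matched text is replaced by the search term itself ("Dog" becomes "dog"), and only one occurrence per text node is linked. A possible variant that links every occurrence and keeps the original capitalisation is sketched below, again assuming bs4; the function name link_keywords and the sample string are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

def link_keywords(html, keywords):
    """Wrap each keyword occurrence in <a>, skipping text already inside a link."""
    soup = BeautifulSoup(html, "html.parser")
    for keyword, href in keywords:
        # The capturing group makes re.split keep the matched text in its result.
        pattern = re.compile("(" + re.escape(keyword) + ")", re.IGNORECASE)
        for text_node in soup.find_all(string=pattern):
            if text_node.find_parent("a") is not None:
                continue  # already linked
            new_nodes = []
            for i, piece in enumerate(pattern.split(text_node)):
                if i % 2 == 1:  # odd indexes hold the matched keyword
                    link = soup.new_tag("a", href=href)
                    link.string = piece  # keep the original capitalisation
                    new_nodes.append(link)
                elif piece:
                    new_nodes.append(soup.new_string(piece))
            # Put the new nodes in front of the old text node, then drop it.
            for node in new_nodes:
                text_node.insert_before(node)
            text_node.extract()
    return str(soup)

# Illustrative input, not taken from the original post.
sample = "<p>Dog and dog, see <a href='x'>dog</a></p>"
result = link_keywords(sample, [("dog", "http://en.wikipedia.org/wiki/Dog")])
print(result)
```

Like the original, this matches substrings, so "dog" would also be linked inside "dogma"; wrapping the escaped keyword in \b word boundaries is one way to tighten that if needed.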

You can try a regular expression, as mentioned in the previous post. First check your string against a regular expression to see whether it is already wrapped in a URL. This should be easy: a simple call to the re library's search() method should do the trick.

Here is a nice tutorial on regular expressions, and the search method specifically, if you need one: http://www.tutorialspoint.com/python/python_reg_expressions.htm

Once you have checked whether the string is already wrapped in a URL, you can call the replace function if it is not.

Here is a quick example that I wrote:

    import re

    x = '<a href="http://www.google.com">Google</a>'
    y = 'Google'

    def checkURL(string):
        # Crude check: any "<a href" in the string counts as already wrapped
        if re.search(r'<a href.+', string):
            print("URL Wrapped Already")
            print(string)
        else:
            string = string.replace('Google', '<a href="http://www.google.com">Google</a>')
            print("URL Not Wrapped:")
            print(string)

    checkURL(x)
    checkURL(y)
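The check above treats the whole string as wrapped as soon as it contains one link, so it cannot handle content where some occurrences are linked and others are not. A regex-only sketch that decides per occurrence might look like the following. The names LINK, PATTERN and link_google are hypothetical, and the approach assumes simple, well-formed HTML without nested links; it is not a general HTML parser.

```python
import re

LINK = '<a href="http://www.google.com">Google</a>'

# Alternative 1 consumes a whole <a>...</a> element, alternative 2 any other tag.
# Only text that neither alternative has consumed can match the bare keyword.
PATTERN = re.compile(r'<a\b[^>]*>.*?</a>|<[^>]+>|(?P<kw>Google)', re.DOTALL)

def link_google(content):
    """Link bare 'Google' occurrences; leave existing links and tags untouched."""
    def repl(match):
        # Existing markup is returned unchanged; a bare keyword becomes a link.
        return LINK if match.group("kw") else match.group(0)
    return PATTERN.sub(repl, content)

# Illustrative input, loosely based on the HTML in the question.
sample = ('<p>I love <a href="http://www.google.com">Google</a></p>'
          '<p><img title="Google" src="/google.jpg" /></p>'
          '<p>Try Google</p>')
linked = link_google(sample)
print(linked)
```

Since existing links are consumed before the keyword can match, running the function twice on the same content should not nest links.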

I hope this answers your question!
