Python: replace URLs with title names from a string

I would like to remove URLs from a string and replace each one with the title of the page it points to.

For example:

mystring = "Ah I like this site: http://www.stackoverflow.com. Also I must say I like http://www.digg.com"

sanitize(mystring) # it becomes "Ah I like this site: Stack Overflow. Also I must say I like Digg - The Latest News Headlines, Videos and Images"

To replace a URL with its title, I have written this snippet:

import urllib
import BeautifulSoup  # BeautifulSoup 3 on Python 2

# get_title: string -> string
def get_title(url):
    """Returns the title of the input URL"""

    output = BeautifulSoup.BeautifulSoup(urllib.urlopen(url))
    return output.title.string

I somehow need to apply this function to a string so that it finds the URLs and converts each one to its title via get_title.

Here is a question with information on validating a URL in Python: How do you validate a URL with a regular expression in Python?

The urlparse module (urllib.parse in Python 3) is probably your best bet. You will still have to decide what constitutes a valid URL in the context of your application.

To check the string for URLs, iterate over each word in the string, check whether it is a valid URL, and if it is, replace it with the page title.

Example code (you will need to write valid_url):

def sanitize(mystring):
  for word in mystring.split(" "):
    if valid_url(word):
      mystring = mystring.replace(word, get_title(word))
  return mystring
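
One possible valid_url, as a minimal sketch rather than a definitive check. It assumes Python 3, where the urlparse module lives at urllib.parse, and it assumes that "valid" means an http or https scheme plus a host name. The accepted schemes and the stripping of trailing punctuation are illustrative choices, not part of the original answer.

```python
from urllib.parse import urlparse

# Assumption: only these schemes count as URLs worth replacing.
ALLOWED_SCHEMES = {"http", "https"}

def valid_url(word):
    """Return True if word looks like an absolute http(s) URL."""
    # Sentence punctuation often sticks to the last word, e.g. "http://a.com."
    candidate = word.rstrip(".,;:!?")
    try:
        parts = urlparse(candidate)
    except ValueError:
        # urlparse rejects some malformed inputs, e.g. a broken IPv6 literal.
        return False
    return parts.scheme in ALLOWED_SCHEMES and bool(parts.netloc)
```

Note that sanitize still passes the unstripped word to get_title, so a word such as "http://www.stackoverflow.com." would need the same stripping there before the title is fetched.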

You can probably solve this with regular expressions and substitution. re.sub accepts a function, which is passed the Match object for each occurrence and returns the string to replace it with:

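import re
url = re.compile(r"http://\S+")  # rough pattern: up to the next whitespace
# sub() passes a Match object, so hand get_title the matched text
text = url.sub(lambda m: get_title(m.group(0)), text)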

The difficult part is writing a regexp that matches a URL, no more and no less.
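
To make the callback idea concrete, here is a rough sketch. The pattern is deliberately simple: it takes anything from http:// or https:// up to the next whitespace, then hands trailing sentence punctuation back to the text. It will not cover every URL. The names replace_urls, URL_PATTERN, TRAILING_PUNCTUATION and the lookup parameter are made up for this example, and lookup stands in for get_title so that the sketch runs without network access.

```python
import re

# Assumption: a URL starts with http:// or https:// and runs to the next whitespace.
URL_PATTERN = re.compile(r"https?://\S+")
TRAILING_PUNCTUATION = ".,;:!?)"

def replace_urls(text, lookup):
    """Replace each URL in text with lookup(url), keeping trailing punctuation."""
    def substitute(match):
        raw = match.group(0)
        url = raw.rstrip(TRAILING_PUNCTUATION)
        trailing = raw[len(url):]
        try:
            return lookup(url) + trailing
        except Exception:
            # If the title cannot be looked up, leave the URL in place.
            return raw
    return URL_PATTERN.sub(substitute, text)

# Stand-in for get_title (hypothetical titles, no network needed).
titles = {
    "http://www.stackoverflow.com": "Stack Overflow",
    "http://www.digg.com": "Digg",
}
mystring = ("Ah I like this site: http://www.stackoverflow.com. "
            "Also I must say I like http://www.digg.com")
print(replace_urls(mystring, titles.__getitem__))
# prints: Ah I like this site: Stack Overflow. Also I must say I like Digg
```

In real use, get_title would be passed as lookup. Catching the exception inside the callback means one unreachable page leaves its URL untouched instead of aborting the whole substitution.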
