
Use regex to extract a URL

How can I use a regex to extract the URL from the following text:

/url?q=http://www.linkedin.com/in/sujachandrasekaran&sa=u&ei=gptuu5b6kogtyatduicidq&ved=0cbqqfjaa&usg=afqjcnejdwki_gcnxgzsd4apxey1k2swlw

The desired result is:

http://www.linkedin.com/in/sujachandrasekaran

I used this:

a = "/url?q=http://www.linkedin.com/in/sujachandrasekaran&sa=u&ei=1jxuu8qxgtwaygs_u4gaaq&ved=0cceqfjaa&usg=afqjcnfl2pecdcddktw_pw9nelfohjp0ca"
linkedin_links = re.findall('(http.*)&',a)

and it gave me this:

u'http://www.linkedin.com/in/sujachandrasekaran&sa=u&ei=1jxuu8qxgtwaygs_u4gaaq&ved=0cceqfjaa'

Instead of a regex, use the appropriate tool for the job...

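try:
    from urllib.parse import urlparse, parse_qs  # Python 3
except ImportError:
    from urlparse import urlparse, parse_qs  # Python 2

url = '/url?q=http://www.linkedin.com/in/sujachandrasekaran&sa=u&ei=gptuu5b6kogtyatduicidq&ved=0cbqqfjaa&usg=afqjcnejdwki_gcnxgzsd4apxey1k2swlw'
# parse_qs maps each parameter name to a list of its values
qs = parse_qs(urlparse(url).query)['q']
# ['http://www.linkedin.com/in/sujachandrasekaran']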

It'll handle escaping and multiple q params, and you don't have to worry about where q appears in the query string.
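To see the claims about escaping and parameter position in action, here is a minimal sketch for Python 3. The helper name extract_target and the percent-encoded sample URL are made up for illustration; only urlparse and parse_qs come from the answer above.

```python
from urllib.parse import urlparse, parse_qs


def extract_target(redirect_url, param="q"):
    """Return the first value of `param` in the query string, or None.

    extract_target is a hypothetical helper, not part of any library.
    """
    values = parse_qs(urlparse(redirect_url).query).get(param)
    return values[0] if values else None


url = ("/url?q=http://www.linkedin.com/in/sujachandrasekaran"
       "&sa=u&ei=gptuu5b6kogtyatduicidq&ved=0cbqqfjaa"
       "&usg=afqjcnejdwki_gcnxgzsd4apxey1k2swlw")
print(extract_target(url))

# Hypothetical input: q is percent-encoded and is not the first parameter.
encoded = "/url?sa=u&q=http%3A%2F%2Fexample.com%2Fa%3Fb%3D1"
print(extract_target(encoded))
```

Returning None for a missing parameter is a design choice of this sketch; indexing with ['q'] as in the answer raises KeyError instead.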

TL;DR: Use '(http.*?)&' instead of '(http.*)&' .

Your regex contains .* , which is greedy by default, meaning that it tries to match as much as possible. In your case, it therefore matches everything up to (but excluding) the last & . Because you want to match only up to the first & , you must make the quantifier non-greedy with the ? modifier. .*? tries to match as few characters as possible. Ordinarily, that would be an empty string, but because in your case it must be followed by & , it matches up to the first & .
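The difference between the two quantifiers can be checked side by side. This is a small sketch using the string from the question; the variable names greedy and lazy are chosen here for illustration.

```python
import re

a = ("/url?q=http://www.linkedin.com/in/sujachandrasekaran"
     "&sa=u&ei=1jxuu8qxgtwaygs_u4gaaq&ved=0cceqfjaa"
     "&usg=afqjcnfl2pecdcddktw_pw9nelfohjp0ca")

# Greedy: .* grabs as much as it can, then backtracks to the LAST "&".
greedy = re.findall(r"(http.*)&", a)

# Non-greedy: .*? grows one character at a time and stops at the FIRST "&".
lazy = re.findall(r"(http.*?)&", a)

print(greedy)
print(lazy)
```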

Here is a simple regular expression that will do the job correctly in most cases: http://[^&]*

...where [^&]* means: match any character other than & , as many times as possible. A better regular expression, however, would match only the characters allowed in a URL (not all characters, as in my example).
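A quick sketch of the negated character class on the string from the question. Two things here are assumptions that go beyond the answer: the s? that also admits https, and the second sample string without a trailing & .

```python
import re

a = ("/url?q=http://www.linkedin.com/in/sujachandrasekaran"
     "&sa=u&ei=1jxuu8qxgtwaygs_u4gaaq&ved=0cceqfjaa"
     "&usg=afqjcnfl2pecdcddktw_pw9nelfohjp0ca")

# [^&]* consumes every character that is not "&", so the match
# ends right before the first "&" (or at the end of the string).
url_pattern = re.compile(r"https?://[^&]*")  # "s?" is an added assumption

match = url_pattern.search(a)
print(match.group(0) if match else None)

# Hypothetical input with no trailing "&": the class-based pattern
# still matches, while the lazy pattern '(http.*?)&' finds nothing.
bare = "/url?q=http://example.com/page"
print(url_pattern.findall(bare))
print(re.findall(r"(http.*?)&", bare))
```

Because the pattern does not require a closing & , it also copes with the target URL being the last parameter.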

A dedicated tool may well be the best choice, but depending on the complexity of the task, a regular expression can be a perfectly fine and simpler approach.

You can use this expression and select the first group:

/url\\?q=([^&]+)

This will select everything after /url?q= and before the first & .

Because the expression does not hard-code http, it also supports other URLs, such as https and ftp.
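As a minimal sketch, the expression can be applied in Python like this. It is written as a raw string, so a single backslash before ? is enough. The helper name first_group and the https/ftp sample strings are hypothetical.

```python
import re

# Capture everything between "/url?q=" and the next "&" (or end of string).
PATTERN = re.compile(r"/url\?q=([^&]+)")


def first_group(text):
    """Return capture group 1 of the first match, or None if no match."""
    m = PATTERN.search(text)
    return m.group(1) if m else None


a = ("/url?q=http://www.linkedin.com/in/sujachandrasekaran"
     "&sa=u&ei=1jxuu8qxgtwaygs_u4gaaq&ved=0cceqfjaa"
     "&usg=afqjcnfl2pecdcddktw_pw9nelfohjp0ca")
print(first_group(a))

# Hypothetical inputs showing that the scheme does not matter.
print(first_group("/url?q=https://example.com/page&sa=u"))
print(first_group("/url?q=ftp://example.com/file.txt&sa=u"))
```

Note that this only works when q is the first parameter, directly after /url? .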

#!/usr/bin/python

import re

a = "/url?q=http://www.linkedin.com/in/sujachandrasekaran&sa=u&ei=1jxuu8qxgtwaygs_u4gaaq&ved=0cceqfjaa&usg=afqjcnfl2pecdcddktw_pw9nelfohjp0ca"

# Split on "&" and keep the first piece: "/url?q=http://..."
output = re.split("&", a)

# Split that piece on "=": element 1 is whatever follows "q="
final = re.split("=", output[0])

print(final[1])
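The same splitting idea can be sketched without the re module. This variant uses str.partition, a swapped-in technique rather than the answer's own code, so that a target URL containing its own = is not cut short. Like the code above, it assumes q is the first parameter.

```python
a = ("/url?q=http://www.linkedin.com/in/sujachandrasekaran"
     "&sa=u&ei=1jxuu8qxgtwaygs_u4gaaq&ved=0cceqfjaa"
     "&usg=afqjcnfl2pecdcddktw_pw9nelfohjp0ca")

# Keep only the text before the first "&": "/url?q=http://..."
first_param = a.split("&", 1)[0]

# partition splits at the FIRST "=" only, so later "=" characters survive.
_, _, value = first_param.partition("=")

print(value)
```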
