
Matching regular expressions in Python which contain URLs

I have a list of URLs from which I am trying to extract just the ID numbers. I am trying to solve this using a combination of urlparse and regular expressions. Here is what my function looks like:

import re
from urllib.parse import urlparse  # Python 3 location; the post does not show its imports

def url_cleanup(url):
    parsed_url = urlparse(url)
    if parsed_url.query=="fref=ts":
        return 'https://www.facebook.com/'+re.sub('/', '', parsed_url.path)
    else:
        qry =  parsed_url.query
        result = re.search('id=(.*)&fref=ts',qry)
        return 'https://www.facebook.com/'+result.group(1)

However, the regular expression in result = re.search('id=(.*)&fref=ts',qry) fails to match some of the URLs, as shown in the examples below.

#1 
id=10001332443221607 #No match

#2 
id=6383662222426&fref=ts #matched

Following the suggestion in this answer, I rephrased my regular expression as id=(.*).+?(?=&fref=ts), which again matches #2 but not #1 in the examples above.

I am not sure what I am missing here. Any suggestion or hint would be much appreciated.

Your regexes are indeed wrong.

Using the expression id=(.*)&fref=ts, you will only match ids that are followed by the literal text &fref=ts.

Using id=(.*).+?(?=&fref=ts) does much the same thing, but with a lookahead, which is a zero-width assertion rather than a capturing group: it checks that &fref=ts follows without making it part of the match. This means that your match will be only the id=blablabla part, but only if it is followed by &fref=ts.
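To make the difference between the two patterns concrete, here is a small sketch that runs both of them against the two query strings from the question. The variable names are made up for illustration. One side effect worth noticing: because .+? must consume at least one character, the capture group in the lookahead version loses the last digit of the id.

```python
import re

with_suffix = "id=6383662222426&fref=ts"
without_suffix = "id=10001332443221607"

literal = re.compile(r"id=(.*)&fref=ts")
lookahead = re.compile(r"id=(.*).+?(?=&fref=ts)")

# Both patterns require "&fref=ts" after the id, so neither matches
# the query string that lacks it.
print(literal.search(without_suffix))    # None
print(lookahead.search(without_suffix))  # None

literal_match = literal.search(with_suffix)
lookahead_match = lookahead.search(with_suffix)

# The literal version includes "&fref=ts" in the overall match.
print(literal_match.group(0))    # id=6383662222426&fref=ts
# The lookahead version stops before "&fref=ts" ...
print(lookahead_match.group(0))  # id=6383662222426
# ... but ".+?" takes the final digit away from the capture group.
print(lookahead_match.group(1))  # 638366222242
```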

Moreover, id=(.*) will match ids made up of digits, letters, symbols... almost anything (any character except a newline). Using id=\d+ will match digits-only ids.

So, try using

result = re.search(r'id=(\d+)', qry)

This lets you capture (using the parentheses) just the digits for later use, assuming your ids are always numeric.
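As a quick check, the following sketch applies the digits-only pattern to both query strings from the question; the names id_pattern, queries and extracted are only illustrative.

```python
import re

id_pattern = re.compile(r"id=(\d+)")

queries = ["id=10001332443221607", "id=6383662222426&fref=ts"]

# group(1) holds only the digits captured by the parentheses.
extracted = [id_pattern.search(query).group(1) for query in queries]
print(extracted)  # ['10001332443221607', '6383662222426']
```

Note that re.search looks for the pattern anywhere in the string, so a parameter whose name merely ends in id (for example uid=5) would also match; whether that matters depends on the URLs in your list.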

For further reference, see http://www.regular-expressions.info/python.html

Your regex needs tweaking slightly. Try:

result = re.search(r'id=(\d+)(&fref=ts)?', qry)

id=(\d+) matches one or more digits following id=, and (&fref=ts)? makes the trailing &fref=ts group optional. Capturing it as a group would allow you to add it back in if necessary.
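A short sketch of how the optional group behaves on the two query strings from the question (the variable names are illustrative): when &fref=ts is absent, the second group is simply None.

```python
import re

pattern = re.compile(r"id=(\d+)(&fref=ts)?")

bare = pattern.search("id=10001332443221607")
suffixed = pattern.search("id=6383662222426&fref=ts")

print(bare.group(1), bare.group(2))          # 10001332443221607 None
print(suffixed.group(1), suffixed.group(2))  # 6383662222426 &fref=ts

# Adding the suffix back only when it was present in the input.
rebuilt = "id=" + suffixed.group(1) + (suffixed.group(2) or "")
print(rebuilt)  # id=6383662222426&fref=ts
```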

You should also note that this will throw an error if no match is found, because re.search then returns None and calling .group(1) on None raises an AttributeError. So you might want to change it slightly to:

result = re.search(r'id=(\d+)(&fref=ts)?', qry)
if result:
    return 'https://www.facebook.com/'+result.group(1)
else:
    return None  # no id found; handle this case in the caller
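Putting the pieces together, one possible version of the whole function might look like the sketch below. It assumes Python 3 (urllib.parse), returns None when no id is found, and uses made-up sample URLs; the constant names ID_PATTERN and BASE are not from the original post.

```python
import re
from urllib.parse import urlparse

ID_PATTERN = re.compile(r"id=(\d+)(&fref=ts)?")
BASE = "https://www.facebook.com/"


def url_cleanup(url):
    """Return a canonical profile URL, or None if no id can be extracted."""
    parsed_url = urlparse(url)
    if parsed_url.query == "fref=ts":
        # No id parameter: the path is used instead, as in the question.
        return BASE + parsed_url.path.replace("/", "")
    result = ID_PATTERN.search(parsed_url.query)
    if result:
        return BASE + result.group(1)
    return None  # nothing matched; the caller decides how to handle it


# Hypothetical sample URLs, for illustration only.
samples = [
    "https://www.facebook.com/profile.php?id=10001332443221607",
    "https://www.facebook.com/profile.php?id=6383662222426&fref=ts",
    "https://www.facebook.com/some.user?fref=ts",
    "https://www.facebook.com/profile.php?foo=bar",
]
for sample in samples:
    print(url_cleanup(sample))
```

Returning None rather than raising keeps the function easy to use in a list comprehension, at the cost of requiring the caller to filter out the None values.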
