Regex quantifiers

Question

I'm new to regex and this is stumping me.

In the following example, I want to extract facebook.com/pages/Dr-Morris-Westfried-Dermatologist/176363502456825?id=176363502456825&sk=info. I've read up on lazy quantifiers and lookbehinds, but I still can't piece together the right regex. I'd expect facebook.com\\/.*?sk=info to work, but it captures too much. Can you guys help?

<i class="mrs fbProfileBylineIcon img sp_2p7iu7 sx_96df30"></i></span><span class="fbProfileBylineLabel"><span itemprop="address" itemscope="itemscope" itemtype="http://schema.org/PostalAddress"><a href="https://www.facebook.com/pages/Dr-Morris-Westfried-Dermatologist/176363502456825?sk=page_map" target="_self">7508 15th Avenue, Brooklyn, New York 11228</a></span></span></span><span class="fbProfileBylineFragment"><span class="fbProfileBylineIconContainer"><i class="mrs fbProfileBylineIcon img sp_2p7iu7 sx_9f18df"></i></span><span class="fbProfileBylineLabel"><span itemprop="telephone">(718) 837-9004</span></span></span></div></div></div><a class="title" href="https://www.facebook.com/pages/Dr-Morris-Westfried-Dermatologist/176363502456825?id=176363502456825&amp;sk=info" aria-label="About Dr. Morris Westfried - Dermatologist">

Answer 1

As much as I love regex, this is an HTML parsing task:

>>> from bs4 import BeautifulSoup
>>> html = ...  # that whole text in the question
>>> soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly
>>> # tag.get() avoids a KeyError for <a> tags that have no href attribute
>>> pred = lambda tag: tag.get('href', '').endswith('sk=info')
>>> [tag['href'] for tag in filter(pred, soup.find_all('a'))]
['https://www.facebook.com/pages/Dr-Morris-Westfried-Dermatologist/176363502456825?id=176363502456825&sk=info']

Answer 2

This works :)

facebook\.com\/[^>]*?sk=info

With only .* it finds the first facebook.com, and then continues until the sk=info. Since there's another facebook.com in between, the match spans both links.

The one thing in between that you don't want is a > (or <, among other characters), so changing "anything" to "anything but a >" finds the facebook.com closest to the sk=info, as you want.

And yes, regex should only be used on HTML for basic tasks. Otherwise, use a parser.
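
As a rough check of the pattern above, here is a minimal Python sketch. The html string is a shortened stand-in for the markup in the question (an assumption, not the full page), and the pattern drops the backslash before / because Python's re module does not need it.

```python
import re

# Shortened stand-in for the question's HTML: two links, only the second ends in sk=info.
html = (
    '<a href="https://www.facebook.com/pages/Dr-Morris-Westfried-Dermatologist/'
    '176363502456825?sk=page_map" target="_self">7508 15th Avenue</a>'
    '<a class="title" href="https://www.facebook.com/pages/Dr-Morris-Westfried-Dermatologist/'
    '176363502456825?id=176363502456825&amp;sk=info" aria-label="About">'
)

# [^>] cannot cross the ">" that closes the first <a ...> tag, so the attempt
# starting at the first facebook.com fails and the engine moves on to the second.
pattern = re.compile(r'facebook\.com/[^>]*?sk=info')
match = pattern.search(html)
print(match.group(0))
```

Note that the match is taken from the raw markup, so it still contains &amp; rather than a plain &.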

Answer 3

The problem is that there is another facebook.com part. You can restrict the .* so that it does not match ", which keeps the match within one attribute:

facebook\.com\/[^"]*;sk=info
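
The ; in this pattern presumably matches the end of the &amp; entity in the raw markup. A minimal Python sketch of that reading follows; raw_html is a shortened stand-in for the question's HTML (an assumption), and html.unescape from the standard library converts the entity back to a plain & afterwards.

```python
import re
from html import unescape

# Shortened stand-in for the question's HTML, with the entity &amp; left as in the raw source.
raw_html = (
    '<a href="https://www.facebook.com/pages/Dr-Morris-Westfried-Dermatologist/'
    '176363502456825?sk=page_map" target="_self">7508 15th Avenue</a>'
    '<a class="title" href="https://www.facebook.com/pages/Dr-Morris-Westfried-Dermatologist/'
    '176363502456825?id=176363502456825&amp;sk=info" aria-label="About">'
)

# [^"]* stops at the quote that closes the href attribute,
# so the match cannot run from the first link into the second one.
m = re.search(r'facebook\.com/[^"]*;sk=info', raw_html)
raw_url = m.group(0)

# Decode HTML entities to recover the URL as the question states it.
url = unescape(raw_url)
print(url)
```

If the input had already been entity-decoded, the ; would be absent and this pattern would not match, so the choice depends on which form of the text is being searched.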

Answer 4

Why your pattern doesn't work:

Your pattern doesn't work because the regex engine tries your pattern from left to right in the string.

When the regex engine meets the first facebook.com\\/ in the string, and since you use .*? after it, the engine adds to the (possible) match all the characters (including ", > or spaces) until it finds sk=info (since . can match any character except newlines).

This is why fejese suggests replacing the dot with [^"] and aliteralmind suggests replacing it with [^>]: either change makes the pattern fail at this position in the string (the first one).

Using an HTML parser is the easiest way if you want to deal with HTML. However, for a one-off match or search/replace, note that although an HTML parser provides safety and simplicity, it has a cost in terms of performance, since you need to load the whole tree of your document for a single task.
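
The left-to-right behaviour described above can be observed with a small Python sketch. The html string is a shortened stand-in for the question's markup (an assumption), and the pattern is the question's, written without the extra escaping.

```python
import re

# Shortened stand-in for the question's HTML: two facebook.com links on one line.
html = (
    '<a href="https://www.facebook.com/pages/Dr-Morris-Westfried-Dermatologist/'
    '176363502456825?sk=page_map" target="_self">7508 15th Avenue</a>'
    '<a class="title" href="https://www.facebook.com/pages/Dr-Morris-Westfried-Dermatologist/'
    '176363502456825?id=176363502456825&amp;sk=info" aria-label="About">'
)

lazy = re.search(r'facebook\.com/.*?sk=info', html)
first = html.find('facebook.com')

# The lazy quantifier only decides where the match ends;
# the start is fixed at the leftmost position where a match is possible.
print(lazy.start() == first)                 # True
print(lazy.group(0).count('facebook.com'))   # 2: the match spans both links
```

Laziness makes the match as short as possible from a given start, but it does not make the engine prefer a later start.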
