Regex for a substring with two possible terminators

Question

I extracted one long string from a webpage, using:

 x=re.findall(r"(?:l'article)\s\d+\w+.*;", xpath)

It extracted the following 2 strings:

 l'article 1382 du code civil ;
 l'article 700 du code de procédure civile, les condamne à payer à la société Financière du cèdre la somme globale de 3 000 euros et rejette leurs demandes ;

However, the latter one is a bit long. All I need is the part up to the ','. Is there a way to do this directly? That is, can my original regex look for either the ';' or the ',', whichever it encounters first?

If not, can I apply a regex to a list, or do I need to write a loop for that?

Required outcome: a list with:

 l'article 1382 du code civil
 l'article 700 du code de procédure civile

Note that I have to apply this to many pages, and there might be many more of these matches in a page. Doing anything by hand, or by explicitly indexing an entry in a list, is not possible.

Answer 1

You seem to be missing a couple of things. The first is the ungreedy (lazy) operator, ?, which forces the regex to stop searching after it finds the first occurrence. Additionally, you can check for any one of several characters by using [] (see the pattern that follows). Here is the new pattern:

(?:l'article)\s\d+\w+.*?[;,]

Regex101:

https://regex101.com/r/tYkNHK/1
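
As a minimal sketch of how this pattern behaves in Python: the sample text below is assembled from the two matches quoted in the question, which is an assumption since the real page content is not shown. The lazy pattern keeps the ';' or ',' at the end of each match, so the sketch strips it afterwards to reach the required outcome.

```python
import re

# Sample text built from the two matches quoted in the question
# (an assumption; the actual page content is not shown).
text = (
    "l'article 1382 du code civil ; "
    "l'article 700 du code de procédure civile, les condamne à payer "
    "à la société Financière du cèdre la somme globale de 3 000 euros "
    "et rejette leurs demandes ;"
)

# The lazy .*? stops at the first ';' or ','.
# The delimiter itself is part of each match.
pattern = re.compile(r"(?:l'article)\s\d+\w+.*?[;,]")
raw = pattern.findall(text)

# Drop the trailing delimiter, then any surrounding whitespace.
articles = [m.rstrip(";,").strip() for m in raw]
print(articles)
```

Stripping after the fact is only one option. Wrapping the wanted part in a capturing group would also work, since re.findall returns the group contents when the pattern has exactly one group.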

Answer 2

You can simplify your regex a lot:

(?:l'article) -> there is no need for the non-capturing group, so you can just remove it.
\s\d+\w+ -> the check for \w+ seems rather pointless (especially as \w also matches digits, so this matches numbers without letters), so I think you could remove it. Or perhaps you are missing a whitespace character, to match e.g. "1382 du".
.*; -> to match anything up to , or ; you can simply use a negated character class such as [^;,]*, which matches every character that is not one of those.

So your final regex could be either

l'article\s\d+[^;,]*

or

l'article\s\d+\s\w+[^;,]*
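
A short sketch of the first variant applied to several pages, which also covers the question about applying a regex to a list. The helper name extract_articles, the pages list, and the second sample page are hypothetical, introduced only for illustration.

```python
import re

# Negated character class: match up to, but not including, the first ';' or ','.
ARTICLE_RE = re.compile(r"l'article\s\d+[^;,]*")


def extract_articles(page_text):
    """Return every article reference in one page, without trailing whitespace."""
    return [m.strip() for m in ARTICLE_RE.findall(page_text)]


# Hypothetical page contents: the first is shortened from the question,
# the second is made up for illustration.
pages = [
    "l'article 1382 du code civil ; "
    "l'article 700 du code de procédure civile, les condamne ;",
    "Vu l'article 455 du code de procédure civile ;",
]

# re.findall works on one string at a time, so a list comprehension
# (or an explicit loop) applies it to each page in turn.
results = [extract_articles(page) for page in pages]
print(results)
```

Because the delimiter is never consumed, only the whitespace before it needs stripping. Note that [^;,]* also matches newlines, so a reference that is followed by neither ';' nor ',' would run to the end of the text.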


Answer 1
3 (accepted) 2017-04-24 13:46:36

Answer 2
2 2017-04-24 14:00:01
