
Substring with two possibilities regex

I extracted one long string from a webpage, using:

 x=re.findall(r"(?:l'article)\s\d+\w+.*;", xpath)

It extracted the following 2 strings:

 l'article 1382 du code civil ;
 l'article 700 du code de procédure civile, les condamne à payer à la société Financière du cèdre la somme globale de 3 000 euros et rejette leurs demandes ;

However, the latter one is a bit long. All I need is the text up to the ','. Is there a way to do this directly? That is, have my original regex look for either the ';' or the ',', whichever it encounters first.

If not, can I apply a regex to a list, or do I need to write a loop for that?

Required outcome, a list with:

 l'article 1382 du code civil
 l'article 700 du code de procédure civile

Note, I have to apply this to many pages, and there might be many more of these in a page. Doing anything by hand, or by specifically indicating an entry in a list, is not possible.

You seem to be missing a couple of things. First, the ungreedy (lazy) operator, ?, which forces the regex to stop at the first occurrence of the delimiter rather than the last. Additionally, you can check for multiple characters by using [] (refer to the following). Here would be the new code:

(?:l'article)\s\d+\w+.*?[;,]

Regex101: https://regex101.com/r/tYkNHK/1
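Note that with this pattern the delimiter itself ends up in each match, so it has to be stripped to get the list the question asks for. A minimal sketch of how that might look in Python; the sample text is a hypothetical string assembled from the two extracts in the question, and the rstrip cleanup step is an assumption, not part of the original answer:

```python
import re

# Hypothetical sample assembled from the two extracts shown in the question.
text = (
    "vu l'article 1382 du code civil ; "
    "vu l'article 700 du code de procédure civile, les condamne à payer "
    "à la société la somme globale de 3 000 euros et rejette leurs demandes ;"
)

# Lazy .*? stops at the first ';' or ',' instead of the last one.
pattern = re.compile(r"(?:l'article)\s\d+\w+.*?[;,]")
raw_matches = pattern.findall(text)

# Each match still ends with the delimiter (and possibly a space before it),
# so trim those trailing characters.
cleaned = [m.rstrip(";, ") for m in raw_matches]
print(cleaned)
```

findall already returns every match in the string, so no explicit loop over the matches is needed beyond the cleanup.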

You can simplify your regex a lot:

  • (?:l'article) -> there is no need for the non-capturing group, so you could just remove it.
  • \s\d+\w+ -> the check for \w+ seems rather pointless (especially as \w also matches digits, so this matches numbers without letters), so I think you could remove it. Or you are missing a space character to match e.g. 1382 du.
  • .*; -> to match anything up to , or ; you can simply use a negated character class, like [^;,]* which will match everything that's not one of those.

So your final regex could be either

l'article\s\d+[^;,]*

or

l'article\s\d+\s\w+[^;,]*
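Since the question mentions applying this to many pages, here is a rough sketch using the first of the two patterns above. The helper name extract_articles and the sample pages are hypothetical, made up for illustration; the strip() call is an assumption added to drop the space that precedes ' ;' in the source text:

```python
import re

# Negated character class: match everything up to (not including) ';' or ','.
ARTICLE_RE = re.compile(r"l'article\s\d+[^;,]*")

def extract_articles(page_text):
    """Return every article reference in one page, without trailing spaces."""
    return [m.strip() for m in ARTICLE_RE.findall(page_text)]

# Hypothetical pages; in practice these would be the strings scraped per page.
pages = [
    "vu l'article 1382 du code civil ; vu l'article 700 du code de "
    "procédure civile, les condamne à payer ;",
    "aucun article cité",
]

results = [extract_articles(page) for page in pages]
print(results)
```

One thing to keep in mind: [^;,]* also matches newlines, so if a reference is not followed by ';' or ',' the match will run on to the next delimiter or the end of the text.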
