Substring with two possibilities regex
I extracted one long string from a webpage. Using:
x=re.findall(r"(?:l'article)\s\d+\w+.*;", xpath)
It extracted the following two strings:
l'article 1382 du code civil ;
l'article 700 du code de procédure civile, les condamne à payer à la société Financière du cèdre la somme globale de 3 000 euros et rejette leurs demandes ;
However, the latter one is a bit long. All I need is the part up to the ','. Is there a way to do this directly? That is, have my original regex look for either the ';' or the ',', whichever it encounters first.
If not, can I apply a regex to a list, or do I need to write a loop for that?
Required outcome is a list with:
l'article 1382 du code civil
l'article 700 du code de procédure civile
Note, I have to apply this to many pages and there might be many more of these in a page. Doing anything by hand or by specifically indicating an entry in a list is not possible.
You seem to be missing a couple of things. The first is the ungreedy (lazy) operator, ?, which forces the regex to stop searching after it finds the first occurrence. Additionally, you can check for multiple characters by using [] (refer to the following). Here would be the new code:
(?:l'article)\s\d+\w+.*?[;,]
Regex101: https://regex101.com/r/tYkNHK/1
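To see the lazy pattern in action from Python, here is a minimal sketch. The sample text is assembled from the two strings quoted in the question, and the variable names are only illustrative. Note that each match still ends with the delimiter it stopped at, so the sketch trims it afterwards to get the list the question asks for.

```python
import re

# Sample text assembled from the two strings quoted in the question.
text = (
    "l'article 1382 du code civil ; "
    "l'article 700 du code de procédure civile, les condamne à payer "
    "à la société Financière du cèdre la somme globale de 3 000 euros "
    "et rejette leurs demandes ;"
)

# The lazy .*? stops at the first ';' or ',' rather than the last ';'.
pattern = re.compile(r"(?:l'article)\s\d+\w+.*?[;,]")

raw_matches = pattern.findall(text)

# Each match still includes its closing delimiter; trim it and any padding.
articles = [m.rstrip(" ;,") for m in raw_matches]
print(articles)
```

If the trailing delimiter should never be captured in the first place, a negated character class is an alternative to trimming.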
You can simplify your regex a lot:
(?:l'article) -> there is no need for the non-capturing group, so you could just remove it.
\s\d+\w+ -> the check for \w+ seems rather pointless (especially as \w also matches digits, so this matches numbers without letters), so I think you could remove it. Or you are missing a space character to match e.g. "1382 du".
.*; -> to match anything up to , or ; you can simply use a negated character class, like [^;,]*, which will match everything that's not one of those.
So your final regex could be either
l'article\s\d+[^;,]*
or
l'article\s\d+\s\w+[^;,]*
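Since the question also asks about running this over many pages, here is a rough sketch using the first of these patterns. The function name extract_articles and the sample pages list are hypothetical stand-ins for however the page texts are actually collected. A list comprehension is enough here, so no explicit loop is needed.

```python
import re

# Negated character class: consume everything that is not ';' or ','.
ARTICLE_RE = re.compile(r"l'article\s\d+[^;,]*")


def extract_articles(pages):
    """Return one list of trimmed article references per page text."""
    return [[m.strip() for m in ARTICLE_RE.findall(page)] for page in pages]


# Hypothetical page texts standing in for the scraped strings.
pages = [
    "l'article 1382 du code civil ;",
    "l'article 700 du code de procédure civile, les condamne à payer ;",
]

results = extract_articles(pages)
print(results)
```

One caveat with this approach: [^;,]* also matches newlines, and if a reference is not followed by ';' or ',' the match runs to the end of the text, so it may be worth checking a few real pages before relying on it.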