Python regex not giving desired output

I'm scraping a site which contains the following string:

"1 Year+ in Category"

or, in some cases:

"1 Year+ by user in Category"

I want to separate the Year, the Category and the User. I tried a regular split, but it doesn't work in this case because there are two delimiters, 'in' and 'by'. So I used a regex. It kind of works, but not properly. Here is the snippet:

dateandcat=re.split(r'.\s[in , by]',rightside[0])

rightside[0] contains the date, category and user. It results in the following output:

['1 Year', 'n Movies']
['1 Year', 'y user', 'n TV shows']
['1 Year', 'y user', 'n TV shows']
['1 Year', 'n Movies']

I could just trim off the first two characters of [1] and [2], but I want to fix the regex. Why is the second character of 'in' and 'by' still showing? How do I fix this?

Try using:

import re

value = "1 Year+ in Category by User"

match = re.match(r"(\d+ \w+\+?) in (\w+)(?: by (\w+)*)?", value)
if match:
    print(match.groups())

Output:

('1 Year+', 'Category', 'User')
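Note that the sample value above puts "by User" after the category, while the strings in the question put "by user" before "in Category", and some categories ("TV shows") contain a space. A possible adaptation for that order is sketched below. The function name parse_listing is made up for illustration, and the pattern assumes the user name is a single run of word characters and that the category is everything after " in ".

```python
import re

# Assumed shape, taken from the question: "<age> [by <user>] in <category>"
PATTERN = re.compile(r"(\d+ \w+\+?)(?: by (\w+))? in (.+)")

def parse_listing(text):
    """Return (age, user, category), or None if the text does not fit.

    user is None when the optional " by <user>" part is absent.
    parse_listing is a hypothetical helper, not part of the original code.
    """
    match = PATTERN.match(text)
    if match is None:
        return None
    age, user, category = match.groups()
    return age, user, category

print(parse_listing("1 Year+ in Movies"))            # ('1 Year+', None, 'Movies')
print(parse_listing("1 Year+ by user in TV shows"))  # ('1 Year+', 'user', 'TV shows')
```

If user names on the real site can contain spaces or punctuation, the (\w+) group would need to be loosened accordingly.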

You can use regex101 to learn more about that regex and others.
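As for why the second character of 'in' and 'by' is left over in the question's split: square brackets define a character class, so [in , by] matches exactly one character out of i, n, space, comma, b, y. It does not match the words "in" or "by". The pattern therefore consumes one arbitrary character (the "+"), one whitespace character and only the first letter of the delimiter. A minimal sketch of a split that uses alternation instead, with made-up helper names and sample strings modelled on the question:

```python
import re

# Pattern from the question: the brackets are a character class (one character).
broken = re.compile(r'.\s[in , by]')

# Alternation matches the whole word; the optional \+ also drops the trailing "+".
fixed = re.compile(r'\+?\s+(?:in|by)\s+')

def split_broken(text):
    return broken.split(text)

def split_fixed(text):
    return fixed.split(text)

print(split_broken("1 Year+ in Movies"))           # ['1 Year', 'n Movies']
print(split_fixed("1 Year+ in Movies"))            # ['1 Year', 'Movies']
print(split_fixed("1 Year+ by user in TV shows"))  # ['1 Year', 'user', 'TV shows']
```

One caveat with the split approach: a category or user name that itself contains " in " or " by " as a separate word would be split again, so a single anchored match may be the safer choice for such data.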
