Python regex not giving desired output
I'm scraping a site that contains the following string:
"1 Year+ in Category"
or, in some cases:
"1 Year+ by user in Category"
I want to separate the year, the category, and the user. I tried a regular split, but it doesn't work in this case because there are two delimiters, 'in' and 'by'. So I used a regex. It kind of works, but not properly. Here is the snippet:
dateandcat=re.split(r'.\s[in , by]',rightside[0])
rightside[0] contains the date, category, and user. It results in the following output:
['1 Year', 'n Movies']
['1 Year', 'y user', 'n TV shows']
['1 Year', 'y user', 'n TV shows']
['1 Year', 'n Movies']
I could just trim off the first two characters of [1] and [2], but I want to fix the regex. Why is the second character of 'in' and 'by' still showing? How do I fix this?
Try using:
import re

value = "1 Year+ in Category by User"
# Groups: duration (with optional "+"), category, and an optional user after " by ".
match = re.match(r"(\d+ \w+\+?) in (\w+)(?: by (\w+)*)?", value)
if match:
    print(match.groups())
Output:
('1 Year+', 'Category', 'User')
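Note that this pattern expects the order "in Category by User", whereas the second sample in the question puts the user first ("by user in Category"). A minimal sketch for that order follows, assuming user names contain no whitespace and the category runs to the end of the string. The group names `age`, `user`, and `category`, the helper `parse`, and the sample strings are made up for illustration.

```python
import re

# Hypothetical pattern for the question's formats:
#   "<number> <unit>+ in <category>"
#   "<number> <unit>+ by <user> in <category>"
PATTERN = re.compile(
    r"(?P<age>\d+\s+\w+)\+?"          # e.g. "1 Year", trailing "+" left out of the group
    r"(?:\s+by\s+(?P<user>\S+))?"     # optional " by <user>" (assumes no spaces in the user name)
    r"\s+in\s+(?P<category>.+)"       # " in <category>", category may contain spaces
)

def parse(text):
    """Return a dict with age, user, and category, or None if the text does not match."""
    m = PATTERN.fullmatch(text.strip())
    return m.groupdict() if m else None

print(parse("1 Year+ in Movies"))
print(parse("1 Year+ by user in TV shows"))
```

When the "by user" part is absent, `user` comes back as None rather than raising, so both formats can go through the same code path.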
You can use regex101 to learn more about that regex and others.
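As for why the second character of 'in' and 'by' is still showing: square brackets define a character class, so `[in , by]` matches exactly one character out of `i`, `n`, space, comma, `b`, and `y`, not the words "in" or "by". The leading `.` also consumes one character before the space, which is why the `+` disappears. If you would rather keep `re.split`, here is a sketch using alternation instead. The sample strings are hypothetical, modelled on the two formats in the question.

```python
import re

# Hypothetical sample strings modelled on the two formats in the question.
samples = ["1 Year+ in Movies", "1 Year+ by user in TV shows"]

# (?:in|by) is an alternation of whole words; a class like [in , by] matches one character only.
# \+? drops the trailing plus sign, and \s+ on both sides keeps "in"/"by" as standalone words.
splitter = re.compile(r"\+?\s+(?:in|by)\s+")

for s in samples:
    print(splitter.split(s))
```

One caveat: this splits on every standalone "in" or "by", so a category that itself contains one of those words would be cut into extra pieces.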