Python RegEx: Not capturing all the data (python3.6, scrapy)
I was trying to scrape length information from a website using the following simple code:
# Raw string so the regex escapes reach re untouched; avoid shadowing the built-in list
matches = re.findall(r'(?<=Length:\s\s)[:\d]+', response.text)
if len(matches) > 0:
    data['Length'] = matches[0]
else:
    data['Length'] = '00:00'
However, it only gets the information if the length is less than one hour. For example, it gets 51:00 but not 01:08:47. I checked the page source for lengths both shorter and longer than one hour. It seems that for lengths of more than one hour there is one fewer whitespace character. So I tried the code below, but this time the result contains only whitespace. Does anybody know how to get both the short and the long lengths? Thank you very much!
# Second attempt: allow whitespace inside the character class
matches = re.findall(r'(?<=Length:)[\s:\d]+', response.text)
if len(matches) > 0:
    data['Length'] = matches[0]
else:
    data['Length'] = '00:00'
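One likely reason the first pattern misses the longer durations is that the lookbehind (?<=Length:\s\s) demands exactly two whitespace characters before the digits. The following is a minimal sketch of that behaviour. The sample strings are made up for illustration, since the real page source is not shown here, so the exact spacing is an assumption.

```python
import re

# Hypothetical snippets; the real page markup may differ.
short_sample = "Length:  51:00"    # two spaces after the colon
long_sample = "Length: 01:08:47"   # one space after the colon

# The original pattern: the lookbehind needs exactly two whitespace characters.
strict = re.compile(r'(?<=Length:\s\s)[:\d]+')

short_hits = strict.findall(short_sample)
long_hits = strict.findall(long_sample)

print(short_hits)  # the two-space case is found
print(long_hits)   # the one-space case is missed
```

Python's re module only accepts fixed-width lookbehinds, so making the whitespace optional has to happen outside the lookbehind, as the answers below do.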
You need '(?<=Length:)\s*(\d\d[\s*:\s*\d\d]+)'.
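As a rough sketch of how this pattern could be dropped into the original code, the snippet below wraps it in a small helper. The name extract_length and the sample strings are hypothetical, introduced only for illustration. Because the character class also admits whitespace, the captured text may carry trailing spaces, so the sketch strips them.

```python
import re

# Pattern from the answer above, written as a raw string.
pattern = re.compile(r'(?<=Length:)\s*(\d\d[\s*:\s*\d\d]+)')


def extract_length(text, default='00:00'):
    """Return the first captured length, or the default when nothing matches."""
    # With one capture group, findall returns a list of the captured strings.
    found = pattern.findall(text)
    # The character class also matches whitespace, so trim the result.
    return found[0].strip() if found else default


# Hypothetical inputs with one and two spaces after the colon.
print(extract_length("Length:  51:00"))
print(extract_length("Length: 01:08:47 Rating"))
print(extract_length("no duration here"))
```

In the Scrapy callback this would presumably be called as data['Length'] = extract_length(response.text).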
Try this regex and extract whatever is present in group 1:
Length\s*:\s*(\d+\s*(?::\s*\d+\s*){1,2})
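For reference, here is one way the group 1 extraction might look in Python. This is a sketch under assumptions: parse_length and LENGTH_RE are hypothetical names, and the sample strings stand in for response.text.

```python
import re

# The regex above, compiled once; group 1 holds the duration.
LENGTH_RE = re.compile(r'Length\s*:\s*(\d+\s*(?::\s*\d+\s*){1,2})')


def parse_length(text, default='00:00'):
    """Return group 1 of the first match, or the default if there is none."""
    match = LENGTH_RE.search(text)
    if match is None:
        return default
    # The trailing \s* inside the group can capture whitespace; strip it.
    return match.group(1).strip()


# Hypothetical inputs covering both spacings from the question.
print(parse_length("Length:  51:00"))
print(parse_length("Length: 01:08:47"))
print(parse_length("nothing to see"))
```

Using search with an explicit group avoids the lookbehind entirely, so the amount of whitespace after the colon no longer matters.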
Explanation:
Length\s* - matches Length literally, followed by 0+ occurrences of a whitespace character, as many as possible
:\s* - matches a : followed by 0+ whitespace characters
\d+\s* - matches 1+ occurrences of a digit followed by 0+ whitespace characters. We start capturing the text from here in Group 1 and capture until the end of the match.
(?::\s*\d+\s*){1,2} - matches either 1 or 2 occurrences of the pattern (?::\s*\d+\s*)
    (?:) - indicates a non-capturing group
    :\s* - matches a : followed by 0+ occurrences of a whitespace character
    \d+ - matches 1+ occurrences of a digit
    \s* - matches 0+ occurrences of a whitespace character
Alternative Regex (without any group):