Python RegEx: Not capturing all the data (python3.6, scrapy)

I was trying to scrape length information from a website using the following simple code:

# Runs inside the Scrapy callback; assumes `import re` at module level.
matches = re.findall(r'(?<=Length:\s\s)[:\d]+', response.text)
if len(matches) > 0:
    data['Length'] = matches[0]
else:
    data['Length'] = '00:00'

However, it only captures the value when the length is less than one hour. For example, it gets 51:00 but not 01:08:47. I checked the page source for both a shorter and a longer entry. It seems that for lengths of more than one hour there is one fewer whitespace character after "Length:". So I tried the following, but this time the first match contains only whitespace. Does anybody know how to capture both the short and the long values? Thank you very much!

# Second attempt: allow whitespace inside the matched run after "Length:".
matches = re.findall(r'(?<=Length:)[\s:\d]+', response.text)
if len(matches) > 0:
    data['Length'] = matches[0]
else:
    data['Length'] = '00:00'


You need '(?<=Length:)\\s*(\\d\\d[\\s*:\\s*\\d\\d]+)'
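To see what this pattern captures, here is a minimal sketch. It assumes the doubled backslashes are meant for a regular (non-raw) Python string literal, and it uses made-up sample strings rather than the real page source. The helper name `first_length` is hypothetical.

```python
import re

# The pattern above, copied verbatim as a regular (non-raw) string literal,
# so each doubled backslash reaches the regex engine as a single one.
PATTERN = '(?<=Length:)\\s*(\\d\\d[\\s*:\\s*\\d\\d]+)'

def first_length(text):
    """Return group 1 of the first match, stripped, or None (hypothetical helper)."""
    match = re.search(PATTERN, text)
    # The [...] part is a character class: it matches any run of whitespace,
    # '*', ':' or digits, so group 1 can carry trailing whitespace.
    return match.group(1).strip() if match else None

# Hypothetical inputs: two spaces before the short value, one before the long one.
print(first_length('Length:  51:00 '))
print(first_length('Length: 01:08:47 '))
```

Because the bracketed part is a character class rather than a sequence, the pattern is more permissive than it looks; stripping the result is probably a good idea.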

Try this regex and extract whatever is present in group 1:

Length\s*:\s*(\d+\s*(?::\s*\d+\s*){1,2})
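As a rough sketch of how the pattern above could replace the `re.findall` call from the question: the function name `extract_length` and the sample strings are assumptions for illustration, not taken from the real page.

```python
import re

# Pattern from the answer above; group 1 holds the duration.
LENGTH_RE = re.compile(r'Length\s*:\s*(\d+\s*(?::\s*\d+\s*){1,2})')

def extract_length(text, default='00:00'):
    """Return the first duration found after 'Length:', or `default`."""
    match = LENGTH_RE.search(text)
    if match is None:
        return default
    # Group 1 can end with whitespace because the pattern closes with \s*.
    return match.group(1).strip()

# Hypothetical inputs: two spaces before the short value, one before the long one.
print(extract_length('Length:  51:00 '))
print(extract_length('Length: 01:08:47 '))
print(extract_length('no duration here'))
```

In the spider this would be called as `data['Length'] = extract_length(response.text)`, keeping the '00:00' fallback from the question.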


Explanation:

  • Length\s* - matches Length literally, followed by 0+ whitespace characters, as many as possible
  • :\s* - matches a : followed by 0+ whitespace characters
  • \d+\s* - matches 1+ digits followed by 0+ whitespace characters. Group 1 starts capturing here and continues to the end of the match.
  • (?::\s*\d+\s*){1,2} - matches either 1 or 2 occurrences of the pattern (?::\s*\d+\s*)
    • (?:...) - indicates a non-capturing group
    • :\s* - matches a : followed by 0+ whitespace characters
    • \d+ - matches 1+ digits
    • \s* - matches 0+ whitespace characters
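The breakdown above can also be written straight into the pattern with `re.VERBOSE`, which ignores whitespace in the pattern and allows `#` comments. This is only a sketch; the sample text is made up.

```python
import re

LENGTH_RE = re.compile(r'''
    Length \s*              # the literal word, then optional whitespace
    : \s*                   # a colon, then optional whitespace
    (                       # group 1 starts here
        \d+ \s*             # first number
        (?:                 # non-capturing group
            : \s* \d+ \s*   # a colon and another number
        ){1,2}              # once for MM:SS, twice for HH:MM:SS
    )
''', re.VERBOSE)

# Hypothetical text containing both a short and a long entry.
sample = 'Length:  51:00\nLength: 01:08:47\n'

# With exactly one capturing group, findall returns the group 1 strings.
lengths = [value.strip() for value in LENGTH_RE.findall(sample)]
print(lengths)
```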

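Alternative regex (without any group):

(?<=Length:\s\s)\d+\s*(?::\s*\d+\s*){1,2}

Note that Python's re module only accepts fixed-width lookbehinds, so this version matches only when exactly two whitespace characters follow Length:.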
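A small sketch of how the group-free version behaves, using made-up strings that mirror the spacing described in the question. `FLEX_RE` and `find_length` are not from the answer; they are an assumed variation that leaves only the fixed text in the lookbehind and strips the whitespace afterwards.

```python
import re

# Alternative pattern from above: the whole match is the duration.
ALT_RE = re.compile(r'(?<=Length:\s\s)\d+\s*(?::\s*\d+\s*){1,2}')

# Variation (an assumption, not from the answer): keep only the fixed part in
# the lookbehind and let the match itself absorb the leading whitespace.
FLEX_RE = re.compile(r'(?<=Length:)\s*\d+(?:\s*:\s*\d+){1,2}')

def find_length(pattern, text):
    """Return the stripped whole match, or None if nothing matches."""
    match = pattern.search(text)
    return match.group(0).strip() if match else None

# Hypothetical inputs: two spaces, then one space, after "Length:".
two_spaces = 'Length:  51:00'
one_space = 'Length: 01:08:47'

print(find_length(ALT_RE, two_spaces))
print(find_length(ALT_RE, one_space))
print(find_length(FLEX_RE, two_spaces))
print(find_length(FLEX_RE, one_space))

# A variable-width lookbehind is rejected by the re module at compile time.
try:
    re.compile(r'(?<=Length:\s*)\d+')
except re.error as exc:
    print('re.error:', exc)
```

If the amount of whitespace varies between pages, the group-based pattern is likely the simpler choice.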
