
Python RegEx: Not capturing all the data (python3.6, scrapy)

I was trying to scrape length information from a website using the following simple code:

import re

# Raw string keeps \s and \d intact; `matches` avoids shadowing the built-in list.
matches = re.findall(r'(?<=Length:\s\s)[:\d]+', response.text)
if matches:
    data['Length'] = matches[0]
else:
    data['Length'] = '00:00'

However, it only captures the length when it is under one hour. For example, it gets 51:00 but not 01:08:47. I checked the page source for lengths both shorter and longer than one hour; the screenshots show how they look. It seems that for lengths over one hour there is one less whitespace character. So I tried the code below, but this time the result contains only a whitespace. Does anybody know how to capture both short and long lengths? Thank you very much!

# Second attempt: lookbehind without the two whitespace characters.
matches = re.findall(r'(?<=Length:)[\s:\d]+', response.text)
if matches:
    data['Length'] = matches[0]
else:
    data['Length'] = '00:00'

[Two screenshots of the page source were attached here.]

You need '(?<=Length:)\\s*(\\d\\d[\\s*:\\s*\\d\\d]+)'

Try this Regex and extract whatever is present in group 1:

Length\s*:\s*(\d+\s*(?::\s*\d+\s*){1,2})


Explanation:

  • Length\s* - matches Length literally, followed by 0+ whitespace characters, as many as possible
  • :\s* - matches a : followed by 0+ whitespace characters
  • \d+\s* - matches 1+ digits followed by 0+ whitespace characters. Group 1 starts capturing here and runs to the end of the match.
  • (?::\s*\d+\s*){1,2} - matches either 1 or 2 occurrences of the pattern (?::\s*\d+\s*)
    • (?:) - indicates a non-capturing group
    • :\s* - matches a : followed by 0+ whitespace characters
    • \d+ - matches 1+ digits
    • \s* - matches 0+ whitespace characters
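
As a rough sketch of how this pattern might be used from Python, the snippet below applies it to two made-up strings. The sample inputs and the helper name extract_length are assumptions for illustration, not part of the original spider; in the Scrapy callback the text would come from response.text.

```python
import re

# Pattern from the answer; group 1 holds the time, with optional hours.
LENGTH_RE = re.compile(r'Length\s*:\s*(\d+\s*(?::\s*\d+\s*){1,2})')

def extract_length(text, default='00:00'):
    """Return the first length found in text, or default if none (hypothetical helper)."""
    match = LENGTH_RE.search(text)
    if match is None:
        return default
    # Group 1 can end with trailing whitespace because of the final \s*, so strip it.
    return match.group(1).strip()

# Made-up samples: two spaces before the short length, one before the long one.
short_sample = 'Length:  51:00'
long_sample = 'Length: 01:08:47'

print(extract_length(short_sample))
print(extract_length(long_sample))
print(extract_length('no length here'))
```

Because the whitespace after the colon is matched with \s* rather than inside a lookbehind, the same pattern should cope with one, two, or no whitespace characters before the digits.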

Alternative regex (without any group):

(?<=Length:\s\s)\d+\s*(?::\s*\d+\s*){1,2}
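
One caveat with the lookbehind variant: Python's re module only accepts fixed-width lookbehinds, so this pattern insists on exactly two whitespace characters after the colon. The sketch below, using made-up sample strings, suggests that it would find the two-space case but miss a length preceded by a single space, which may be why the group-based pattern is the safer choice here.

```python
import re

# Lookbehind alternative from the answer: exactly two whitespace characters required.
LOOKBEHIND_RE = re.compile(r'(?<=Length:\s\s)\d+\s*(?::\s*\d+\s*){1,2}')

# Made-up samples mirroring the two layouts described in the question.
two_spaces = LOOKBEHIND_RE.findall('Length:  51:00')
one_space = LOOKBEHIND_RE.findall('Length: 01:08:47')

# A variable-width lookbehind such as (?<=Length:\s*) is rejected at compile time.
try:
    re.compile(r'(?<=Length:\s*)\d+')
    variable_width_ok = True
except re.error:
    variable_width_ok = False

print(two_spaces, one_space, variable_width_ok)
```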
