Find multiple occurrences of different URLs in a big string, where each URL is between two specific substrings using Python

I have a file containing just one long string which has multiple URLs embedded in it. The URLs are all different but are always enclosed between two specific substrings. How can I extract all the URLs?

My file contents look like the following:

data-starred-src="www.example.com" data-non-starred-src asdf asdf ghgh data-starred-src="www.someurl.com" data-non-starred-src gjsltg ajshssl ahssfh data-starred-src="www.anotherurl.com" data-non-starred-src

I want to extract URLs in the form

www.example.com
www.someurl.com
www.anotherurl.com

On the example input, this works:

import re
print(re.findall(r'data-starred-src\s*=\s*"([^"]*)"', line))  # line holds the file contents

Gives:

['www.example.com', 'www.someurl.com', 'www.anotherurl.com']
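Since the question starts from a file rather than a string already in memory, the same pattern could be wrapped roughly as follows. This is only a sketch: the helper names `extract_starred_urls` and `extract_from_file`, and the UTF-8 default, are assumptions for illustration and not part of the original answer.

```python
import re

# Pattern from the answer above: capture whatever sits between the quotes
# that follow data-starred-src (optional whitespace allowed around "=").
STARRED_SRC = re.compile(r'data-starred-src\s*=\s*"([^"]*)"')


def extract_starred_urls(text):
    """Return every quoted data-starred-src value, in order of appearance."""
    return STARRED_SRC.findall(text)


def extract_from_file(path, encoding="utf-8"):
    """Read the whole file into memory and extract the URLs from it."""
    with open(path, encoding=encoding) as handle:
        return extract_starred_urls(handle.read())


# The sample input from the question.
sample = (
    'data-starred-src="www.example.com" data-non-starred-src asdf asdf ghgh '
    'data-starred-src="www.someurl.com" data-non-starred-src gjsltg ajshssl ahssfh '
    'data-starred-src="www.anotherurl.com" data-non-starred-src'
)

for url in extract_starred_urls(sample):
    print(url)
```

Calling `extract_from_file` with the path of your own file should give the same list for the real data, provided the whole file fits comfortably in memory.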

This pattern should do it. It matches any double-quoted text that contains at least two dots:

(?<=\")([^"]+\.[^"]+\.[^"]+)(?=\")

Working regex example:

http://regex101.com/r/sI2jL7
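To try the lookaround pattern in Python, something like the sketch below should work. The `tricky` input is a made-up string, added only to show a caveat: because the quotes are asserted rather than consumed, the text between a closing quote and the next opening quote is also examined, and it is returned as a match if it happens to contain two dots.

```python
import re

# The lookaround pattern from above; the surrounding quotes are asserted
# by the lookbehind/lookahead but are not part of the match.
QUOTED_DOTTED = re.compile(r'(?<=")([^"]+\.[^"]+\.[^"]+)(?=")')

sample = (
    'data-starred-src="www.example.com" data-non-starred-src asdf '
    'data-starred-src="www.someurl.com" data-non-starred-src'
)
print(QUOTED_DOTTED.findall(sample))
# ['www.example.com', 'www.someurl.com']

# Hypothetical input with a dotted version number between two URLs.
tricky = 'data-starred-src="www.a.com" see v1.2.3 notes data-starred-src="www.b.com"'
matches = QUOTED_DOTTED.findall(tricky)
print(len(matches))  # 3: the stretch containing "v1.2.3" is matched as well
```

If the surrounding text can contain dots, anchoring on the `data-starred-src=` prefix is the safer choice.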

Try the following:

import re
r1 = re.compile('(?:AAA ")([^"]*)(?:" BBB)')
s = 'AAA "www.example.com" BBB asdf asdf ghgh AAA "www.someurl.com" BBB gjsltg ajshssl ahssfh AAA "www.anotherurl.com" BBB'
res = r1.findall(s)

You may also consider using finditer() if s is really long.
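A minimal sketch of the finditer() variant, reusing the r1 and s from above. The generator name `iter_urls` is an assumption for illustration. Note that this only avoids building the full result list at once; the input string itself still has to be in memory.

```python
import re

r1 = re.compile('(?:AAA ")([^"]*)(?:" BBB)')
s = ('AAA "www.example.com" BBB asdf asdf ghgh AAA "www.someurl.com" BBB '
     'gjsltg ajshssl ahssfh AAA "www.anotherurl.com" BBB')


def iter_urls(pattern, text):
    """Yield each captured URL lazily, one match at a time."""
    for match in pattern.finditer(text):
        # group(1) is the only capturing group: the text between the quotes.
        yield match.group(1)


for url in iter_urls(r1, s):
    print(url)
```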

With the delimiters from the question, the updated regex looks like this:

r1 = re.compile('(?:data-starred-src=")([^"]*)(?:" data-non-starred-src)')

I have only replaced AAA and BBB with the new delimiters, so if the original version did not work for you, this one may not either.
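If the delimiters are likely to change again, one option is to build the pattern from them instead of editing the regex by hand. The helper below is a sketch under that assumption; the name `between` is hypothetical, and it uses a non-greedy `.*?` rather than the `[^"]*` from the answer so that it works with delimiters that do not involve quotes.

```python
import re


def between(text, start, end):
    """Return every substring of text found between start and end."""
    # re.escape treats the delimiters as literal text; .*? is non-greedy,
    # so each match stops at the first end delimiter; DOTALL lets a match
    # span line breaks.
    pattern = re.compile(re.escape(start) + r'(.*?)' + re.escape(end), re.DOTALL)
    return pattern.findall(text)


text = (
    'data-starred-src="www.example.com" data-non-starred-src asdf asdf ghgh '
    'data-starred-src="www.someurl.com" data-non-starred-src gjsltg ajshssl ahssfh '
    'data-starred-src="www.anotherurl.com" data-non-starred-src'
)

print(between(text, 'data-starred-src="', '" data-non-starred-src'))
# ['www.example.com', 'www.someurl.com', 'www.anotherurl.com']
```

The same call with `'AAA "'` and `'" BBB'` as delimiters would cover the earlier placeholder version.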
