Find multiple occurrences of different URLs in a big string, where each URL is between two specific substrings using Python

Question

I have a file containing just one long string which has multiple URLs embedded in it. The URLs are all different but are always enclosed between two specific substrings. How can I extract all the URLs?

My file contents look like the following:

data-starred-src="www.example.com" data-non-starred-src asdf asdf ghgh data-starred-src="www.someurl.com" data-non-starred-src gjsltg ajshssl ahssfh data-starred-src="www.anotherurl.com" data-non-starred-src

I want to extract the URLs in this form:

www.example.com
www.someurl.com
www.anotherurl.com

Answer 1

On the example, with the file contents stored in the variable line, this:

import re
print(re.findall(r'data-starred-src\s*=\s*"([^"]*)"', line))

Gives:

['www.example.com', 'www.someurl.com', 'www.anotherurl.com']
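
A self-contained version of the same idea might look like the sketch below. The names PATTERN, extract_urls and sample are illustrative and not part of the original answer; the pattern itself is the one shown above.

```python
import re

# Same pattern as in the answer: optional whitespace around "=",
# then everything up to the next double quote is captured.
PATTERN = re.compile(r'data-starred-src\s*=\s*"([^"]*)"')

def extract_urls(text):
    """Return every quoted value that follows data-starred-src= in text."""
    return PATTERN.findall(text)

# Sample data copied from the question.
sample = (
    'data-starred-src="www.example.com" data-non-starred-src asdf asdf ghgh '
    'data-starred-src="www.someurl.com" data-non-starred-src gjsltg ajshssl ahssfh '
    'data-starred-src="www.anotherurl.com" data-non-starred-src'
)

for url in extract_urls(sample):
    print(url)
```

To run it against the file from the question, the text could be read with open(path).read() and passed to extract_urls, where path is whatever your file is called.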

Answer 2

This should do it:

(?<=\")([^"]+\.[^"]+\.[^"]+)(?=\")

Working regex example:

http://regex101.com/r/sI2jL7
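
In Python, the pattern above could be applied roughly as follows. This is only a sketch: the variable names are made up for illustration, and the sample text is the one from the question. Note that the pattern does not look at data-starred-src at all; it accepts any run of characters between two double quotes that contains at least two dots, so it may pick up other quoted values if the real file has them.

```python
import re

# Lookbehind and lookahead require a double quote on each side without
# consuming it; the group requires at least two dots between the quotes.
quoted_url = re.compile(r'(?<=")([^"]+\.[^"]+\.[^"]+)(?=")')

text = (
    'data-starred-src="www.example.com" data-non-starred-src asdf asdf ghgh '
    'data-starred-src="www.someurl.com" data-non-starred-src gjsltg ajshssl ahssfh '
    'data-starred-src="www.anotherurl.com" data-non-starred-src'
)

# With a single capturing group, findall returns the group contents.
urls = quoted_url.findall(text)
print(urls)
```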

Answer 3

Try the following:

import re
r1 = re.compile('(?:AAA ")([^"]*)(?:" BBB)')
s = 'AAA "www.example.com" BBB asdf asdf ghgh AAA "www.someurl.com" BBB gjsltg ajshssl ahssfh AAA "www.anotherurl.com" BBB'
res = r1.findall(s)

You may also consider using finditer() if s is really long.

Update: with the delimiters from the question, the regex looks like this:

r1 = re.compile('(?:data-starred-src=")([^"]*)(?:" data-non-starred-src)')

I have simply replaced AAA and BBB with the new delimiters, so if the earlier version did not work for you, this one may not either.
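
The finditer() suggestion could be sketched as below, using the delimiters from the question. The helper name iter_urls is hypothetical, and the non-capturing groups have been dropped here because they do not change what is matched. Keep in mind that finditer() only avoids building the full list of results; the string s itself still has to be in memory.

```python
import re

# Same delimiters as the updated pattern, written without the (?:...) wrappers.
r1 = re.compile(r'data-starred-src="([^"]*)" data-non-starred-src')

def iter_urls(s):
    """Yield each captured URL lazily instead of building one list."""
    for match in r1.finditer(s):
        yield match.group(1)

s = (
    'data-starred-src="www.example.com" data-non-starred-src asdf asdf ghgh '
    'data-starred-src="www.someurl.com" data-non-starred-src gjsltg ajshssl ahssfh '
    'data-starred-src="www.anotherurl.com" data-non-starred-src'
)

for url in iter_urls(s):
    print(url)
```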

3 answers

solution1
2 2014-02-08 06:12:39

solution2
1 ACCEPTED 2014-02-08 05:57:37

solution3
0 2014-02-08 05:52:42
