Find multiple occurrences of different URLs in a large string, where each URL lies between two specific substrings, using Python

Question

I have a file containing just one long string with multiple URLs embedded in it. The URLs are all different, but each one is always enclosed between the same two specific substrings. How can I extract all the URLs?

My file contents look like the following:

data-starred-src="www.example.com" data-non-starred-src asdf asdf ghgh data-starred-src="www.someurl.com" data-non-starred-src gjsltg ajshssl ahssfh data-starred-src="www.anotherurl.com" data-non-starred-src

I want to extract the URLs in this form:

www.example.com
www.someurl.com
www.anotherurl.com

Answer 1

On the example, with line holding the contents of the file, this:

import re
print(re.findall(r'data-starred-src\s*=\s*"([^"]*)"', line))

gives:

['www.example.com', 'www.someurl.com', 'www.anotherurl.com']
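A minimal, self-contained sketch of the same approach follows. The helper name extract_urls and the sample variable are illustrative choices, not part of the answer, and the sample text is copied from the question rather than read from a file.

```python
import re

# Pattern from the answer above: capture whatever sits between the quotes
# that follow data-starred-src= (optional whitespace allowed around "=").
STARRED_SRC = re.compile(r'data-starred-src\s*=\s*"([^"]*)"')


def extract_urls(text):
    """Return every quoted value that follows data-starred-src= in text."""
    return STARRED_SRC.findall(text)


# Sample copied from the question. For a real file, the text would come from
# something like open("input.txt").read() (the filename is hypothetical).
sample = (
    'data-starred-src="www.example.com" data-non-starred-src asdf asdf ghgh '
    'data-starred-src="www.someurl.com" data-non-starred-src gjsltg ajshssl ahssfh '
    'data-starred-src="www.anotherurl.com" data-non-starred-src'
)

for url in extract_urls(sample):
    print(url)
```

Because the pattern has exactly one capturing group, re.findall returns a list of the captured strings rather than the full matches.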

Answer 2

This should do it:

(?<=\")([^"]+\.[^"]+\.[^"]+)(?=\")

Working regex example:

http://regex101.com/r/sI2jL7
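As a rough sketch, here is how this lookaround pattern could be applied from Python. The names QUOTED_DOTTED, sample, and tricky are illustrative assumptions. Note that the pattern does not mention data-starred-src at all: it accepts any text between two double quotes that contains at least two dots, and because the lookarounds do not consume the quotes, text sitting between a closing quote and the next opening quote can also match.

```python
import re

# Pattern from the answer above: text containing at least two dots,
# preceded and followed by a double quote (the quotes are not consumed).
QUOTED_DOTTED = re.compile(r'(?<=")([^"]+\.[^"]+\.[^"]+)(?=")')

# Sample copied from the question.
sample = (
    'data-starred-src="www.example.com" data-non-starred-src asdf asdf ghgh '
    'data-starred-src="www.someurl.com" data-non-starred-src gjsltg ajshssl ahssfh '
    'data-starred-src="www.anotherurl.com" data-non-starred-src'
)
print(QUOTED_DOTTED.findall(sample))

# Hypothetical input showing the caveat: the text between the two quoted
# values contains two dots, so it is picked up as a match as well.
tricky = 'a="www.one.com" v1.2.3 b="www.two.com"'
print(QUOTED_DOTTED.findall(tricky))
```

On the question's sample this works because the filler text between the URLs contains no dots. If the real file may contain dotted text outside the quotes, anchoring on data-starred-src as in Answer 1 is the safer choice.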

Answer 3

Try the following:

import re
r1 = re.compile('(?:AAA ")([^"]*)(?:" BBB)')
s = 'AAA "www.example.com" BBB asdf asdf ghgh AAA "www.someurl.com" BBB gjsltg ajshssl ahssfh AAA "www.anotherurl.com" BBB'
res = r1.findall(s)

You may also consider using finditer() if s is really long, since it yields matches one at a time instead of building the whole list up front.
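A short sketch of the finditer() variant, reusing r1 and s from the snippet above. The generator name iter_urls is an illustrative choice, not part of the answer.

```python
import re

# Same pattern and sample string as in the answer above.
r1 = re.compile(r'(?:AAA ")([^"]*)(?:" BBB)')
s = (
    'AAA "www.example.com" BBB asdf asdf ghgh '
    'AAA "www.someurl.com" BBB gjsltg ajshssl ahssfh '
    'AAA "www.anotherurl.com" BBB'
)


def iter_urls(text):
    """Yield each captured URL lazily instead of building a full list."""
    for match in r1.finditer(text):
        # group(1) is the text captured between the two delimiters.
        yield match.group(1)


for url in iter_urls(s):
    print(url)
```

finditer() returns match objects, so the start and end offsets of each URL are also available through match.start(1) and match.end(1) if they are needed.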

With the delimiters from the question, the updated regex looks like this:

r1 = re.compile('(?:data-starred-src=")([^"]*)(?:" data-non-starred-src)')

I've simply replaced AAA and BBB with the new delimiters, so if the original pattern did not work on your input, this one will not work either.
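To check the pattern with the real delimiters against the sample from the question, something like the following sketch can be used. The names strict, loose, sample, and reflowed are illustrative, and the loose pattern is a hypothetical variant that is not part of the answer: it tolerates arbitrary whitespace where the strict pattern requires exactly one space.

```python
import re

# Pattern from the answer: requires exactly '" data-non-starred-src'
# (one space) after the URL.
strict = re.compile(r'(?:data-starred-src=")([^"]*)(?:" data-non-starred-src)')

# Hypothetical looser variant (an assumption, not from the answer):
# allows any whitespace around "=" and before the closing delimiter.
loose = re.compile(r'data-starred-src\s*=\s*"([^"]*)"\s+data-non-starred-src')

# Sample copied from the question.
sample = (
    'data-starred-src="www.example.com" data-non-starred-src asdf asdf ghgh '
    'data-starred-src="www.someurl.com" data-non-starred-src gjsltg ajshssl ahssfh '
    'data-starred-src="www.anotherurl.com" data-non-starred-src'
)
print(strict.findall(sample))

# Same text, but with a newline instead of a space before the closing delimiter.
reflowed = sample.replace('" data-non-starred-src', '"\ndata-non-starred-src')
print(strict.findall(reflowed))  # the single-space delimiter no longer matches
print(loose.findall(reflowed))
```

Whether the extra tolerance is wanted depends on how regular the file really is. If the delimiters are always formatted identically, the strict pattern is enough.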

3 answers

Answer 1
2 2014-02-08 06:12:39

Answer 2
1 accepted 2014-02-08 05:57:37

Answer 3
0 2014-02-08 05:52:42
