Extracting clean URLs with Python regex

Question

Thanks! I used @nu11p01n73R's answer from this post, and I got mostly the URLs, but there is still some extra "noise" at the beginning and end. Ideally I'd like it to print just the URL, http://something.some, so the regex would remove the <a href=" at the beginning of the URL and the " data-metrics='{"action" : "Click Story 2"}'> at the end of it. I tried modifying the expression to get that, but I'm having trouble because the URL begins and ends with a ", and I think that is messing up my regex. Any suggestions?

URLs are embedded like this in the .txt file:

<a href="http://www.npr.org/blogs/parallels/2014/11/11/363018388/how-the-islamic-state-wages-its-propaganda-war" data-metrics='{"action":"Click Story 1"}' >

I'd love the output to be:

http://www.npr.org/blogs/parallels/2014/11/11/363018388/how-the-islamic-state-wages-its-propaganda-war

The most recent code I used was:

file  = open("/Users/shannonmcgregor/Desktop/npr.txt", 'r')
for line in file:
    if re.search('<a href=[^>]*(islamic|praying|marines|comets|dyslexics)', line):
        print line

But this prints the whole line, for example:

<a href="http://www.npr.org/blogs/parallels/2014/11/11/363018388/how-the-islamic-state-wages-its-propaganda-war" data-metrics='{"action":"Click Story 1"}' >

Answer 1

Regex is not the right tool for parsing HTML files. But since that is what you intend to use, here is a solution.

>>> import re
>>> file  = open("/Users/shannonmcgregor/Desktop/npr.txt", 'r')
>>> for i in file:
        if re.search('<a href="[^>"]*(islamic|praying|marines|comets|dyslexics)', i):
            i = re.sub(r'^.*?<a href="([^"]*)".*', r'\1', i)
            print(i)

OR

>>> for i in file:
        if re.search('<a href="[^>"]*(islamic|praying|marines|comets|dyslexics)', i):
            print(re.search(r'^.*?<a href="([^"]*)".*', i).group(1))
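Since regex is not the ideal tool here, a parser-based alternative might look like the sketch below. It uses the standard library's html.parser module (Python 3) instead of a regex to pull out the href. The names HrefCollector, extract_urls, KEYWORDS and sample are made up for illustration and do not come from the original post; the keyword list is copied from the question.

```python
import re
from html.parser import HTMLParser

# Keywords taken from the question; the constant name is hypothetical.
KEYWORDS = re.compile(r'islamic|praying|marines|comets|dyslexics')


class HrefCollector(HTMLParser):
    """Collects the href attribute of every <a> start tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with the quotes already removed
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.hrefs.append(value)


def extract_urls(text):
    """Return the hrefs in text whose URL contains one of the keywords."""
    parser = HrefCollector()
    parser.feed(text)
    parser.close()
    return [url for url in parser.hrefs if KEYWORDS.search(url)]


# Sample line from the question
sample = ('<a href="http://www.npr.org/blogs/parallels/2014/11/11/363018388/'
          'how-the-islamic-state-wages-its-propaganda-war" '
          'data-metrics=\'{"action":"Click Story 1"}\' >')
print(extract_urls(sample))
```

One difference from the regex above: this filters on the URL itself, so a keyword that only appears in another attribute such as data-metrics would not count as a match.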

Answer 2

You can use the re.findall function to extract the content:

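import re

file = open("/Users/shannonmcgregor/Desktop/npr.txt", 'r')
for line in file:
    # Keep only lines whose <a href=...> tag mentions one of the keywords
    if re.search('<a href=[^>]*(islamic|praying|marines|comets|dyslexics)', line):
        # The first double-quoted string on the line is the href value
        print(re.findall(r'(?<=")[^"]*(?=")', line)[0])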

This will produce the output:

http://www.npr.org/blogs/parallels/2014/11/11/363018388/how-the-islamic-state-wages-its-propaganda-war
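To see why indexing with [0] works, it may help to look at everything the lookaround pattern returns. The sketch below (Python 3) wraps the same two regexes in a small function and guards against lines with no quoted text. The names KEYWORD_LINK, QUOTED, first_quoted and sample are hypothetical, chosen only for this example.

```python
import re

# Same patterns as in the answer above, precompiled; the names are made up.
KEYWORD_LINK = re.compile(r'<a href=[^>]*(?:islamic|praying|marines|comets|dyslexics)')
QUOTED = re.compile(r'(?<=")[^"]*(?=")')


def first_quoted(line):
    """Return the first double-quoted string on a matching line, else None."""
    if not KEYWORD_LINK.search(line):
        return None
    matches = QUOTED.findall(line)
    # findall returns every run of text sitting between two double quotes,
    # including the text *between* attributes, so only the first is the URL.
    return matches[0] if matches else None


# Sample line from the question
sample = ('<a href="http://www.npr.org/blogs/parallels/2014/11/11/363018388/'
          'how-the-islamic-state-wages-its-propaganda-war" '
          'data-metrics=\'{"action":"Click Story 1"}\' >')
print(first_quoted(sample))
print(QUOTED.findall(sample))
```

The second print shows that later list entries are fragments such as the text between the href and the data-metrics value. The approach therefore relies on href being the first double-quoted attribute on the line, which holds for the sample in the question but is an assumption about the rest of the file.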

Answer 1
1 2014-11-19 17:50:16

Answer 2
0 Accepted 2014-11-19 17:48:04
