Using Python regex to extract clean URLs

Thanks! I used @nu11p01n73R's answer from this post and got mostly the URLs, but there is still some extra "noise" at the beginning and end. Ideally I'd like it to print just the URL (http://something.some), so the regex would strip the <a href=" at the beginning of the URL and the " data-metrics='{"action" : "Click Story 2"}'> at the end. I tried modifying the expression to do that, but the URL begins and ends with a ", and I think that is messing up my regex. Any suggestions?

The URLs are embedded like this in a .txt file:

<a href="http://www.npr.org/blogs/parallels/2014/11/11/363018388/how-the-islamic-state-wages-its-propaganda-war" data-metrics='{"action":"Click Story 1"}' >

I'd love the output to be:

http://www.npr.org/blogs/parallels/2014/11/11/363018388/how-the-islamic-state-wages-its-propaganda-war

The most recent code I used was:

import re

file = open("/Users/shannonmcgregor/Desktop/npr.txt", 'r')
for line in file:
    # Matches the right lines, but prints the whole line, tag and all
    if re.search('<a href=[^>]*(islamic|praying|marines|comets|dyslexics)', line):
        print(line)

But this returns, for example:

<a href="http://www.npr.org/blogs/parallels/2014/11/11/363018388/how-the-islamic-state-wages-its-propaganda-war" data-metrics='{"action":"Click Story 1"}' >

Regex is not the right tool for parsing HTML files, but since you intend to use it, here is a regex solution.

>>> import re
>>> file = open("/Users/shannonmcgregor/Desktop/npr.txt", 'r')
>>> for i in file:
...     if re.search('<a href="[^>"]*(islamic|praying|marines|comets|dyslexics)', i):
...         i = re.sub(r'^.*?<a href="([^"]*)".*', r'\1', i)
...         print(i)

Or:

>>> # Reopen the file (or call file.seek(0)) first if it was already iterated above
>>> for i in file:
...     if re.search('<a href="[^>"]*(islamic|praying|marines|comets|dyslexics)', i):
...         print(re.search(r'^.*?<a href="([^"]*)".*', i).group(1))

You can use the re.findall function to extract the content:

import re

file = open("/Users/shannonmcgregor/Desktop/npr.txt", 'r')
for line in file:
    if re.search('<a href=[^>]*(islamic|praying|marines|comets|dyslexics)', line):
        # The first double-quoted string on the line is the href value
        print(re.findall(r'(?<=")[^"]*(?=")', line)[0])

This produces the output:

http://www.npr.org/blogs/parallels/2014/11/11/363018388/how-the-islamic-state-wages-its-propaganda-war
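
As a possible variation on the findall approach, the filter and the extraction could be folded into one pattern with a capturing group, so only the text between the quotes of href="..." is returned. This is only a sketch: build_pattern and extract_urls are hypothetical helper names, and it assumes the href value is double-quoted and directly follows <a, as in the sample line.

```python
import re

# Keywords taken from the question's regex alternation
KEYWORDS = ("islamic", "praying", "marines", "comets", "dyslexics")


def build_pattern(keywords):
    """Hypothetical helper: compile a pattern that captures matching hrefs."""
    alternation = "|".join(re.escape(word) for word in keywords)
    # Capture only what sits between the double quotes of href="...",
    # and require one keyword somewhere inside that quoted URL.
    return re.compile(r'<a\s+href="([^"]*(?:%s)[^"]*)"' % alternation)


def extract_urls(text, keywords=KEYWORDS):
    """Return every captured URL; with one group, findall yields the group."""
    return build_pattern(keywords).findall(text)


text = (
    '<a href="http://www.npr.org/blogs/parallels/2014/11/11/363018388/'
    'how-the-islamic-state-wages-its-propaganda-war" '
    'data-metrics=\'{"action":"Click Story 1"}\' >\n'
    '<a href="http://www.npr.org/sections/news/" >\n'
)
for url in extract_urls(text):
    print(url)
```

Because the pattern runs over the whole text, it also picks up several links on one line, which the [0] index in the loop above would miss.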
