Using Python regex to extract clean URLs
Thanks! I used @nu11p01n73R's answer from this post and got mostly the URLs, but there is still some extra "noise" at the beginning and end. Ideally I want it to print just the URL (http://something.some), so the regex would remove the <a href=" at the beginning of the URL and the " data-metrics='{"action" : "Click Story 2"}'> at the end of it. I tried modifying the expression to get that, but the URL begins and ends with a ", and I think that is messing up my regex. Any suggestions?
URLs are embedded like this in the .txt file:
<a href="http://www.npr.org/blogs/parallels/2014/11/11/363018388/how-the-islamic-state-wages-its-propaganda-war" data-metrics='{"action":"Click Story 1"}' >
I'd love the output to be:
http://www.npr.org/blogs/parallels/2014/11/11/363018388/how-the-islamic-state-wages-its-propaganda-war
The most recent code I used was:
    import re

    file = open("/Users/shannonmcgregor/Desktop/npr.txt", 'r')
    for line in file:
        if re.search('<a href=[^>]*(islamic|praying|marines|comets|dyslexics)', line):
            print(line)
But this returns, for example:
<a href="http://www.npr.org/blogs/parallels/2014/11/11/363018388/how-the-islamic-state-wages-its-propaganda-war" data-metrics='{"action":"Click Story 1"}' >
Regex is not the right tool for parsing HTML files, but since that is what you asked for, here is a regex solution.
    >>> import re
    >>> file = open("/Users/shannonmcgregor/Desktop/npr.txt", 'r')
    >>> for i in file:
    ...     if re.search('<a href="[^>"]*(islamic|praying|marines|comets|dyslexics)', i):
    ...         i = re.sub(r'^.*?<a href="([^"]*)".*', r'\1', i)
    ...         print(i)
Or:
    >>> for i in file:
    ...     if re.search('<a href="[^>"]*(islamic|praying|marines|comets|dyslexics)', i):
    ...         print(re.search(r'^.*?<a href="([^"]*)".*', i).group(1))
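Since regex is not the right tool for HTML, here is a minimal sketch of the same extraction with the standard library's html.parser module instead. The class name HrefCollector, the helper extract_urls and the KEYWORDS tuple are illustrative names, not part of the original code, and the sketch assumes the keyword should appear inside the URL itself, as in the stricter pattern above.

```python
from html.parser import HTMLParser

# Keywords copied from the question's regex alternation.
KEYWORDS = ("islamic", "praying", "marines", "comets", "dyslexics")


class HrefCollector(HTMLParser):
    """Collect href values of <a> tags whose URL contains a keyword."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with quotes already removed.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and any(k in value for k in KEYWORDS):
                self.urls.append(value)


def extract_urls(html_text):
    """Return the matching URLs found in html_text (hypothetical helper)."""
    parser = HrefCollector()
    parser.feed(html_text)
    parser.close()
    return parser.urls


sample = (
    '<a href="http://www.npr.org/blogs/parallels/2014/11/11/363018388/'
    'how-the-islamic-state-wages-its-propaganda-war" '
    "data-metrics='{\"action\":\"Click Story 1\"}' >"
)
print(extract_urls(sample))
```

To run it on the file from the question, pass the result of file.read() to extract_urls. The parser handles the quoting itself, so the surrounding " characters never end up in the output.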
You can use the re.findall function to extract the content:
    import re

    file = open("/Users/shannonmcgregor/Desktop/npr.txt", 'r')
    for line in file:
        if re.search('<a href=[^>]*(islamic|praying|marines|comets|dyslexics)', line):
            print(re.findall(r'(?<=")[^"]*(?=")', line)[0])
This will produce the output:
http://www.npr.org/blogs/parallels/2014/11/11/363018388/how-the-islamic-state-wages-its-propaganda-war
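One caveat: the findall call above returns the first double-quoted string on the line, which is the URL only when href is the first quoted attribute. On a line such as <a class="story" href="..."> it would return story instead. A possible variant, sketched here under the assumption that href values are always double-quoted, filters and extracts in one pattern with a capture group. LINK_RE and matching_urls are hypothetical names, not from the original answers.

```python
import re

# Hypothetical helper; the keyword list is copied from the question.
# The capture group holds only the href value, so no quotes or other
# attributes leak into the result.
LINK_RE = re.compile(
    r'<a\b[^>]*?\bhref="([^"]*(?:islamic|praying|marines|comets|dyslexics)[^"]*)"'
)


def matching_urls(text):
    """Return every href value in text that contains one of the keywords."""
    return LINK_RE.findall(text)


line = (
    '<a href="http://www.npr.org/blogs/parallels/2014/11/11/363018388/'
    'how-the-islamic-state-wages-its-propaganda-war" '
    "data-metrics='{\"action\":\"Click Story 1\"}' >"
)
for url in matching_urls(line):
    print(url)
```

Because the pattern is applied with findall, it also works on the whole file contents at once (file.read()) rather than line by line.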