How can I find a file name in a block of text using Python?
I have gotten the HTML of a webpage using Python, and I now want to find all of the .CSS files that are linked to in the header and save each as its own variable (I know how to do this part). I tried partitioning, as shown below, but I got the error "IndexError: string index out of range" upon running it.
sytle = src.partition(".css")
style = style[0].partition('<link href=')
print style[2]
c =1
I do not think that this is the right way to approach this, so I would love some advice. Many thanks in advance.
Here is a section of the kind of text I need to extract .CSS file(s) from.
<meta name="viewport" content="width=device-width, initial-scale=1.0, minimum-scale=1.0" />
<!--[if gte IE 7]><!-->
<link href="/stylesheets/master.css?1342791430" media="screen, projection" rel="stylesheet" type="text/css" />
<link href="/stylesheets/adapt.css?1342791413" media="screen, projection" rel="stylesheet" type="text/css" />
<!-- <![endif]-->
<link href="/stylesheets/print.css?1342791421" media="print" rel="stylesheet" type="text/css" />
<link href="/apple-touch-icon-precomposed.png" rel="apple-touch-icon-precomposed" />
<link href="http://dribbble.com/shots/popular.rss" rel="alternate" title="RSS" type="application/rss+xml" />
You should use a regular expression for this. Try the following:
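/href="([^"]*\.css[^"]*)/g
EDIT: the same pattern in Python (using [^"]* rather than .* so a match cannot run past the closing quote):
import re
matches = re.findall(r'href="([^"]*\.css[^"]*)', html)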
print(matches)
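To make the regex approach concrete, here is a minimal sketch run against a shortened copy of the question's sample. The function name find_css_hrefs and the trailing quote in the pattern are my own assumptions, not part of the answer above, and it only handles double-quoted href values.

```python
import re

# Assumed pattern: [^"]* keeps each match inside a single quoted value,
# and the trailing quote requires the attribute to be properly closed.
CSS_HREF = re.compile(r'href="([^"]*\.css[^"]*)"')

def find_css_hrefs(html_text):
    """Return every double-quoted href value that contains '.css'."""
    return CSS_HREF.findall(html_text)

sample = (
    '<link href="/stylesheets/master.css?1342791430" rel="stylesheet" type="text/css" />\n'
    '<link href="/stylesheets/print.css?1342791421" media="print" rel="stylesheet" type="text/css" />\n'
    '<link href="/apple-touch-icon-precomposed.png" rel="apple-touch-icon-precomposed" />\n'
)

for href in find_css_hrefs(sample):
    print(href)
```

Note that the query string (the part after "?") is still attached to each result, so it has to be stripped separately if only the file name is wanted.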
My answer is along the same lines as Jon Clements' answer, but I tested mine and added a drop of explanation.
You should not use a regex. You can't parse HTML with a regex. The regex answer might work, but writing a robust solution is very easy with lxml. This approach returns the full href attribute of every <link rel="stylesheet"> tag in the head and no others.
from lxml import html
def extract_stylesheets(page_content):
    doc = html.fromstring(page_content)  # Parse
    return doc.xpath('//head/link[@rel="stylesheet"]/@href')  # Search
There is no need to check the filenames, since the results of the xpath search are already known to be stylesheet links, and there's no guarantee that the filenames will have a .css extension anyway. The simple regex will catch only a very specific form, but the general HTML parser solution will also do the right thing in cases such as this, where the regex would fail miserably:
<link REL="stylesheet" hREf =
'/stylesheets/print?1342791421'
media="print"
><!-- link href="/css/stylesheet.css" -->
It could also be easily extended to select only stylesheets for a particular media type.
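For readers without lxml installed, the same idea can be sketched with the standard library's html.parser. This is a swapped-in technique, not the lxml code from this answer: the names StylesheetCollector and extract_stylesheets_stdlib and the media parameter are hypothetical, and unlike the XPath version it does not restrict the search to <head>.

```python
from html.parser import HTMLParser

class StylesheetCollector(HTMLParser):
    """Collect href values of <link rel="stylesheet"> tags (hypothetical helper)."""

    def __init__(self, media=None):
        super().__init__()
        self.media = media  # e.g. "print"; None accepts any media
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "link":
            return
        attr = dict(attrs)
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "stylesheet" not in rel_tokens or not attr.get("href"):
            return
        if self.media is not None:
            # media can be a comma-separated list such as "screen, projection"
            wanted = self.media.lower()
            listed = [m.strip().lower() for m in (attr.get("media") or "").split(",")]
            if wanted not in listed:
                return
        self.hrefs.append(attr["href"])

def extract_stylesheets_stdlib(page_content, media=None):
    """Return stylesheet hrefs, optionally limited to one media type."""
    collector = StylesheetCollector(media)
    collector.feed(page_content)
    collector.close()
    return collector.hrefs

tricky = """<link REL="stylesheet" hREf =
'/stylesheets/print?1342791421'
media="print"
><!-- link href="/css/stylesheet.css" -->
<link href="/stylesheets/master.css?1342791430" media="screen, projection" rel="stylesheet" type="text/css" />"""

print(extract_stylesheets_stdlib(tricky))
print(extract_stylesheets_stdlib(tricky, media="print"))
```

The commented-out link is skipped because the parser reports comments separately from start tags, and the odd casing and line breaks do not matter because attribute names are normalized before the handler sees them.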
For what it's worth, here is a version that uses lxml.html as the parsing library (untested):
import lxml.html
from urllib.parse import urlparse  # Python 3; in Python 2 use "from urlparse import urlparse"
sample_html = """<meta name="viewport" content="width=device-width, initial-scale=1.0, minimum-scale=1.0" />
<!--[if gte IE 7]><!-->
<link href="/stylesheets/master.css?1342791430" media="screen, projection" rel="stylesheet" type="text/css" />
<link href="/stylesheets/adapt.css?1342791413" media="screen, projection" rel="stylesheet" type="text/css" />
<!-- <![endif]-->
<link href="/stylesheets/print.css?1342791421" media="print" rel="stylesheet" type="text/css" />
<link href="/apple-touch-icon-precomposed.png" rel="apple-touch-icon-precomposed" />
<link href="http://dribbble.com/shots/popular.rss" rel="alternate" title="RSS" type="application/rss+xml" />
"""
page = lxml.html.fromstring(sample_html)
# Keep only the path component of each href (this drops the ?query part)
link_hrefs = (p.path for p in map(urlparse, page.xpath('//head/link/@href')))
for href in link_hrefs:
    if href.rsplit('.', 1)[-1].lower() == 'css':  # implement smarter error handling here
        pass  # do whatever
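The extension check in that loop can be tried on its own, without lxml. Below is a small sketch that uses only the standard library; the helper name css_filename is hypothetical, and it assumes "file name" means the last path segment of the href.

```python
import posixpath
from urllib.parse import urlparse

def css_filename(href):
    """Return the file name from href if its path ends in .css, else None."""
    path = urlparse(href).path          # drops any ?query or #fragment
    name = posixpath.basename(path)     # last path segment, e.g. "master.css"
    extension = posixpath.splitext(name)[1]
    return name if extension.lower() == ".css" else None

for href in ["/stylesheets/master.css?1342791430",
             "/apple-touch-icon-precomposed.png",
             "http://dribbble.com/shots/popular.rss"]:
    print(href, "->", css_filename(href))
```

As the other answer points out, a stylesheet link is not required to end in .css, so this check is only useful when the extension itself is what you care about.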