
How can I find a file name in a block of text using python?

I have gotten the HTML of a webpage using Python, and I now want to find all of the .CSS files that are linked to in the header and save each as its own variable (I know how to do this part). I tried partitioning, as shown below, but I got the error "IndexError: string index out of range" upon running it.

style = src.partition(".css")
style = style[0].partition('<link href=')
print style[2]
c =1

I do not think that this is the right way to approach this, so I would love some advice. Many thanks in advance. Here is a section of the kind of text I need to extract the .CSS file(s) from.

    <meta name="viewport" content="width=device-width, initial-scale=1.0, minimum-scale=1.0" />

<!--[if gte IE 7]><!-->
<link href="/stylesheets/master.css?1342791430" media="screen, projection" rel="stylesheet" type="text/css" />

<link href="/stylesheets/adapt.css?1342791413" media="screen, projection" rel="stylesheet" type="text/css" />
<!-- <![endif]-->
<link href="/stylesheets/print.css?1342791421" media="print" rel="stylesheet" type="text/css" />
<link href="/apple-touch-icon-precomposed.png" rel="apple-touch-icon-precomposed" />
<link href="http://dribbble.com/shots/popular.rss" rel="alternate" title="RSS" type="application/rss+xml" />

You should use a regular expression for this. Try the following:

/href="(.*\.css[^"]*)/g

EDIT

import re
# Raw string so that \. reaches the regex engine as an escaped dot
matches = re.findall(r'href="(.*\.css[^"]*)', html)
print(matches)
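One caveat with the pattern above: `.*` is greedy, so when several tags share a line it can run from the first `href="` to the last `.css` on that line. The following is a minimal sketch of a tighter variant, assuming the href values are double-quoted. The names `GREEDY`, `TIGHT`, `find_css_hrefs` and the one-line sample are illustrative and not part of the answer.

```python
import re

# Hypothetical one-line input with two href attributes on the same line.
ONE_LINE = '<a href="index.html">Home</a><link href="/a.css" rel="stylesheet">'

# The pattern from the answer: .* may cross quotes and tags.
GREEDY = re.compile(r'href="(.*\.css[^"]*)')

# [^"]* cannot cross a closing quote, so a match stays inside one attribute.
TIGHT = re.compile(r'href="([^"]*\.css[^"]*)"')


def find_css_hrefs(text):
    """Return every double-quoted href value that contains '.css'."""
    return TIGHT.findall(text)


print(GREEDY.findall(ONE_LINE))   # over-matches across both tags
print(find_css_hrefs(ONE_LINE))   # only the stylesheet href
```

The captured value still includes any query string, such as `?1342791430` in the question's sample, so it may need further trimming before it is used as a file name.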

My answer is along the same lines as Jon Clements' answer, but I tested mine and added a drop of explanation.

You should not use a regex. You can't parse HTML with a regex. The regex answer might work, but writing a robust solution is very easy with lxml. This approach is guaranteed to return the full href attribute of all <link rel="stylesheet"> tags in the head and no others.

from lxml import html

def extract_stylesheets(page_content):
    doc = html.fromstring(page_content)                        # Parse
    return doc.xpath('//head/link[@rel="stylesheet"]/@href')   # Search

There is no need to check the filenames, since the results of the xpath search are already known to be stylesheet links, and there's no guarantee that the filenames will have a .css extension anyway. The simple regex will catch only a very specific form, but the general HTML parser solution will also do the right thing in cases such as this, where the regex would fail miserably:

<link REL="stylesheet" hREf = 

     '/stylesheets/print?1342791421'
  media="print"
><!-- link href="/css/stylesheet.css" -->

It could also be easily extended to select only the stylesheets for a particular media type.
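As a rough sketch of that extension, here is a variant that swaps lxml for the standard library's `html.parser`, so it runs without third-party packages. It is not the lxml code from this answer: the class `StylesheetCollector` and the `media` parameter are assumptions made for illustration, and unlike the xpath it does not check that the tag sits inside the head.

```python
from html.parser import HTMLParser


class StylesheetCollector(HTMLParser):
    """Collect href values of <link rel="stylesheet"> tags (hypothetical helper)."""

    def __init__(self, media=None):
        super().__init__()
        self.media = media   # e.g. "print"; None keeps every stylesheet
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names; comments never reach here.
        if tag != "link":
            return
        attributes = dict(attrs)
        rel = (attributes.get("rel") or "").lower().split()
        if "stylesheet" not in rel or not attributes.get("href"):
            return
        # media is a comma-separated list such as "screen, projection".
        media_types = [m.strip().lower()
                       for m in (attributes.get("media") or "").split(",")]
        if self.media is None or self.media in media_types:
            self.hrefs.append(attributes["href"])


def extract_stylesheets(page_content, media=None):
    """Return stylesheet hrefs, optionally only those listing the given media type."""
    collector = StylesheetCollector(media)
    collector.feed(page_content)
    collector.close()
    return collector.hrefs


page = ('<head>'
        '<link href="/a.css" media="screen, projection" rel="stylesheet" />'
        '<link href="/p.css" media="print" rel="stylesheet" />'
        '<link href="/icon.png" rel="apple-touch-icon-precomposed" />'
        '</head>')
print(extract_stylesheets(page))
print(extract_stylesheets(page, media="print"))
```

In this sketch a link with no media attribute is skipped whenever a filter is given. Treating a missing attribute as "applies to all media" would be an equally reasonable choice.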

For what it's worth, here is a version using lxml.html as the parsing library.

(untested)

import lxml.html
from urllib.parse import urlparse  # on Python 2: from urlparse import urlparse

sample_html = """<meta name="viewport" content="width=device-width, initial-scale=1.0, minimum-scale=1.0" />

<!--[if gte IE 7]><!-->
<link href="/stylesheets/master.css?1342791430" media="screen, projection" rel="stylesheet" type="text/css" />

<link href="/stylesheets/adapt.css?1342791413" media="screen, projection" rel="stylesheet" type="text/css" />
<!-- <![endif]-->
<link href="/stylesheets/print.css?1342791421" media="print" rel="stylesheet" type="text/css" />
<link href="/apple-touch-icon-precomposed.png" rel="apple-touch-icon-precomposed" />
<link href="http://dribbble.com/shots/popular.rss" rel="alternate" title="RSS" type="application/rss+xml" />
"""

page = lxml.html.fromstring(sample_html)
# Keep only the path of each href, so a query string such as ?1342791430 is ignored
link_paths = (p.path for p in map(urlparse, page.xpath('//head/link/@href')))
for path in link_paths:
    if path.rsplit('.', 1)[-1].lower() == 'css': # implement smarter error handling here
        pass # do whatever
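The extension check in the loop above can also be tried on its own, without lxml. This is a small sketch using only the standard library. The helper name `css_file_name` is hypothetical, and the hrefs are copied from the question's sample.

```python
import posixpath
from urllib.parse import urlparse


def css_file_name(href):
    """Return the file name if the href's path ends in .css, else None."""
    path = urlparse(href).path                 # drops ?query and #fragment
    name = posixpath.basename(path)            # last segment of the path
    extension = posixpath.splitext(name)[1]    # "" when the name has no dot
    return name if extension.lower() == ".css" else None


hrefs = [
    "/stylesheets/master.css?1342791430",
    "/apple-touch-icon-precomposed.png",
    "http://dribbble.com/shots/popular.rss",
]
print([css_file_name(h) for h in hrefs])
```

Going through `urlparse` first matters because the hrefs in the sample carry a cache-busting query string after the extension.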
