Selecting specific text from a webpage using Python
Although I love the program, I've gotten extremely tired of Calibre's weekly update habit. To counteract that, I'm trying to write a Python script that will automate the process.
I have successfully opened the page, but I'm having trouble figuring out how to capture a specific piece of it as a string, since Calibre's download link depends on the version number, which needs to be retrieved. Currently line 218 contains the following:
<a href="/projects/calibre/files/latest/download?source=files" title="/0.8.34/calibre-portable-0.8.34.zip: released on 2012-01-06 07:22:08 UTC">
I need to retrieve "calibre-ebook.0.8.34" from the line. Any suggestions on how to make that work?
import urllib2
print("Calibre is Updating")
url = urllib2.urlopen ( "http://sourceforge.net/projects/calibre/files" ).read()
print(url)
An amendment to your code:
import urllib2
import re
print("Calibre is Updating")
url = urllib2.urlopen ( "http://sourceforge.net/projects/calibre/files" ).read()
result = re.search('title="/[0-9.]*/([a-zA-Z\-]*-[0-9\.]*)', url).groups()[0][:-1]
print(result)
What I'm doing here is using the re module to search for a string that matches your request and saving the captured group to result.
I end up stripping the last character because my regex also captures the dot before the file extension. With some patience you can refine the pattern so that it matches only what you need.
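As one possible refinement, here is a minimal sketch of a tighter pattern that stops before the file extension, so no trailing character has to be stripped. It is written for Python 3 and runs against the sample line from the question instead of the live page. The names SAMPLE, TITLE_RE and extract_release are made up for illustration, and the pattern assumes the title attribute keeps the "/<version>/<name>-<version>.<ext>" shape shown above.

```python
import re

# Sample anchor tag copied from the question (line 218 of the fetched page).
SAMPLE = (
    '<a href="/projects/calibre/files/latest/download?source=files" '
    'title="/0.8.34/calibre-portable-0.8.34.zip: released on 2012-01-06 07:22:08 UTC">'
)

# Assumed layout of the title attribute: /<version>/<name>-<version>.<ext>
# The version part only accepts a dot when digits follow it, so the match
# ends before ".zip" and nothing needs to be sliced off afterwards.
TITLE_RE = re.compile(r'title="/[0-9.]+/([a-zA-Z-]+-[0-9]+(?:\.[0-9]+)*)')


def extract_release(html):
    """Return the '<name>-<version>' part of the title, or None if absent."""
    match = TITLE_RE.search(html)
    return match.group(1) if match else None


print(extract_release(SAMPLE))
```

Returning None when nothing matches avoids the AttributeError that calling .groups() on a failed search would raise, which may be useful if the page layout changes.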