Selecting specific text from a web page using Python

Question

Although I love the program, I've gotten extremely tired of Calibre's weekly updating habit. To counteract that problem, I'm working on a Python script that will automate the process.

I have successfully opened the page, but I'm having trouble figuring out how to capture a specific piece of it as a string. Calibre's download link depends on the version number, so that number needs to be retrieved. Currently line 218 contains the following:

  <a href="/projects/calibre/files/latest/download?source=files" title="/0.8.34/calibre-portable-0.8.34.zip: released on 2012-01-06 07:22:08 UTC">

I need to retrieve "calibre-ebook.0.8.34" from the line. Any suggestions on how to make that work?

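import urllib2  # Python 2 standard library
print("Calibre is Updating")
# Despite its name, url holds the HTML of the files page, not the address.
url = urllib2.urlopen("http://sourceforge.net/projects/calibre/files").read()
print(url)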

Answer 1

An amendment to your code:

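import urllib2  # Python 2 standard library
import re

print("Calibre is Updating")
# Despite its name, url holds the HTML of the files page, not the address.
url = urllib2.urlopen("http://sourceforge.net/projects/calibre/files").read()

# Capture the file name that follows the version directory in the title
# attribute; the group ends with an extra dot, so [:-1] slices it off.
result = re.search(r'title="/[0-9.]*/([a-zA-Z\-]*-[0-9\.]*)', url).groups()[0][:-1]
print(result)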

What I'm doing here is using the re module to search the page for a string that matches your request and saving it to result.
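
To see what the search actually captures, here is a minimal sketch that applies the same pattern to the sample line from the question instead of the live page. The names sample_line, pattern, match and captured are only illustrative. It also checks for None, since re.search returns None when nothing matches, which could happen if the page layout changes.

```python
import re

# Sample anchor tag copied from the question (line 218 of the page at the time).
sample_line = ('<a href="/projects/calibre/files/latest/download?source=files" '
               'title="/0.8.34/calibre-portable-0.8.34.zip: '
               'released on 2012-01-06 07:22:08 UTC">')

# Same pattern as in the answer, written as a raw string.
pattern = r'title="/[0-9.]*/([a-zA-Z\-]*-[0-9\.]*)'

match = re.search(pattern, sample_line)
if match is None:
    # Assumption: a missing match means the page layout has changed.
    captured = None
    print("No version string found")
else:
    captured = match.group(1)  # still ends with the dot that precedes "zip"
    print(captured[:-1])       # drop that trailing dot
```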

I end up stripping the last character because my regex captures an extra dot. With some patience you can refine the pattern until it captures only what you need.
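
One possible refinement, as a sketch rather than a definitive pattern: anchor the match on the ".zip" extension so the trailing dot is never captured, and capture the version number in a second group. The helper name extract_release and the assumption that the file name always ends in ".zip" are mine, not part of the original answer.

```python
import re

# Hypothetical tightened pattern: group 1 is the file name without its
# extension, group 2 is just the dotted version number.
# Assumption: the title attribute always names a ".zip" file.
RELEASE_RE = re.compile(
    r'title="/[0-9.]+/([a-zA-Z-]+-([0-9]+(?:\.[0-9]+)*))\.zip'
)

def extract_release(html):
    """Return (file_name, version) from the page text, or None if absent."""
    match = RELEASE_RE.search(html)
    if match is None:
        return None
    return match.group(1), match.group(2)

# Sample anchor tag copied from the question.
sample_line = ('<a href="/projects/calibre/files/latest/download?source=files" '
               'title="/0.8.34/calibre-portable-0.8.34.zip: '
               'released on 2012-01-06 07:22:08 UTC">')

print(extract_release(sample_line))
```

With the version available on its own, the download link can be built from it directly, and no slicing is needed afterwards.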

Answer 1
1 Accepted 2012-01-11 05:27:07
