Extracting specific lines from a website

Question

</span>
                    <div class="clearB paddingT5px"></div>
                    <small>
                        10/12/2015 5:49:00 PM -  Seeking Alpha
                    </small>
                    <div class="clearB paddingT10px"></div>

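Suppose I have the source code of a website, and part of it looks like the snippet above. I am trying to get the line between <small> and </small>. There are many such lines throughout the page, each enclosed between <small> and </small>, and I want to extract all of them.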

I am trying to do this with a regular expression, like so:

regex = '<small>(.+?)</small>'
datestamp = re.compile(regex)
urls = re.findall(datestamp, htmltext)

This returns only an empty result. Please advise.

Answer 1

You can solve this in either of the following two ways.

First, with a regular expression, which is not recommended:

import re

html = """</span>
    <div class="clearB paddingT5px"></div>
    <small>
        10/12/2015 5:49:00 PM -  Seeking Alpha
    </small>
    <div class="clearB paddingT10px"></div>"""

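# \s* absorbs the newlines and indentation around the text;
# re.S lets "." span line breaks if the content itself is multi-line.
for item in re.findall(r'<small>\s*(.*?)\s*</small>', html, re.I | re.S):
    print('"{}"'.format(item))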

Second, use something like BeautifulSoup to parse the HTML for you:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
for item in soup.find_all("small"):
    print('"{}"'.format(item.text.strip()))

Both give the following output:

"10/12/2015 5:49:00 PM -  Seeking Alpha"

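As for why the pattern in the question returns nothing: by default `.` does not match a newline, and in the snippet the text sits on its own line between the tags. The following is a minimal sketch, assuming the page resembles the snippet from the question; the helper name `extract_small` is hypothetical and not part of the original answer.

```python
import re

html = """</span>
    <div class="clearB paddingT5px"></div>
    <small>
        10/12/2015 5:49:00 PM -  Seeking Alpha
    </small>
    <div class="clearB paddingT10px"></div>"""

# Pattern from the question: "." stops at newlines, so nothing matches here.
original = re.findall(r'<small>(.+?)</small>', html)


def extract_small(text):
    """Return the stripped text of every <small>...</small> pair (hypothetical helper)."""
    # re.DOTALL lets "." cross line breaks; strip() drops the surrounding whitespace.
    matches = re.findall(r'<small>(.+?)</small>', text, re.DOTALL)
    return [m.strip() for m in matches]


fixed = extract_small(html)
print(original)
print(fixed)
```

The non-greedy `(.+?)` keeps each match inside a single pair of tags, so a page with many `<small>` elements should yield one list entry per element.
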
Answer 2

Use xml.etree here. With it, you can fetch the HTML data from the web page using urllib2 and return whichever tags you want, like this:

import urllib2
from xml.etree import ElementTree

url = "whateverwebpageyouarelookingin"  # placeholder: the URL of the page you are looking in
request = urllib2.Request(url, headers={"Accept" : "application/xml"})
u = urllib2.urlopen(request)
tree = ElementTree.parse(u)
rootElem = tree.getroot()
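# ".//small" matches <small> at any depth; a bare "small" only matches direct children of the root.
yourdata = rootElem.findall(".//small")
print(yourdata)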
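
Note that ElementTree is an XML parser, so this approach only works when the page is well-formed XML (for example XHTML); typical HTML may raise a parse error. Below is a minimal sketch on an in-memory fragment, without any network access. The `xhtml` string is made up for illustration, and `iter` is used so that `small` elements are found at any depth.

```python
from xml.etree import ElementTree

# A made-up, well-formed fragment standing in for the downloaded page.
xhtml = """<div>
    <span>headline</span>
    <small>
        10/12/2015 5:49:00 PM -  Seeking Alpha
    </small>
    <div class="clearB paddingT10px"></div>
</div>"""

root = ElementTree.fromstring(xhtml)

# iter("small") walks the whole tree; skip elements that have no text.
stamps = [el.text.strip() for el in root.iter("small") if el.text]
print(stamps)
```

Each entry in `stamps` is the stripped text of one `small` element, rather than the element objects that `findall` returns.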

2 answers

Answer 1
2, accepted, 2015-10-13 10:23:50

Answer 2
0, 2015-10-13 10:24:09
