Parsing meta tags with Beautiful Soup and Python

Question

I'm having trouble parsing an HTML page using Beautiful Soup 3 and Python 2.6.

The HTML content is this:

content='<div class="egV2_EventReportCardLeftBlockShortWidth">
<span class="egV2_EventReportCardTitle">When</span>
<span class="egV2_EventReportCardBody">
<meta itemprop="startDate" content="2012-11-23T10:00:00.0000000">
<span class='egV2_archivedDateEnded'>STARTS</span>Fri 23 Nov,10:00AM<br/>
<meta itemprop="endDate" content="2012-12-03T18:00:00.0000000">
<span class='egV2_archivedDateEnded'>ENDS</span>Mon 03 Dec,6:00PM</span>
<span class="egV2_EventReportCardBody"></span>
<div class="egV2_div_cal" onclick=" showExportEvent()">
<div class="egV2_div_cal_outerFix">
<div class="egV2_div_cal_InnerAdjust"> Cal </div>
</div></div></div>'

I want to get the string 'Fri 23 Nov,10:00AM' out of the middle and into a variable, so I can concatenate it and send it back to a PHP page.

To read this content, I use the following code. (The content above comes from reading an HTML page: http://everguide.com.au/melbourne/event/2012-nov-23/life-with-bird-spring-warehouse-sale/)

import urllib2
req = urllib2.Request(URL)
response = urllib2.urlopen(req)
html = response.read()
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(html.decode('utf-8'))
soup.prettify()
import re
for node in soup.findAll(itemprop="name"):
    n = ''.join(node.findAll(text=True)) 
for node in soup.findAll("div", { "class" : "egV2_EventReportCardLeftBlockShortWidth" }):
    d = ''.join(node.findAll(text=True))
print n,"|", d

Which returns:

[(ssh user)]# python testscrape.py

LIFE with BIRD Spring Warehouse Sale | 
When
<span class="egV2_EventReportCardDateTitle">STARTS</span>
STARTSFri 23 Nov,10:00AMENDSMon 03 Dec,6:00PM
<span class="egV2_EventReportCardDateTitle">ENDS</span>



 Cal 



[(ssh user)]#

(And it includes all those line breaks etc.)

So you can see there at the end, I'm grouping both of those stripped strings into one printout, with a separator character in the middle so PHP can read the string back as one and then break it apart.

The problem is that the Python code can read the page and store the text, but it includes all that rubbish, tags etc., which confuses the PHP app.

I really just want this returned:

Fri 23 Nov,10:00AM

Is it because I'm using the findAll(text=True) method?

How can I drill down and get just the text in that div, without the span tags?

Any help would be greatly appreciated, thank you.

Rick - Melbourne.

Answer 1

Why not try something like

In [95]: soup = BeautifulSoup(content)

In [96]: soup.find("span", {"class": "egV2_archivedDateEnded"})
Out[96]: <span class="egV2_archivedDateEnded">STARTS</span>

In [97]: soup.find("span", {"class": "egV2_archivedDateEnded"}).next
Out[97]: u'STARTS'

In [98]: soup.find("span", {"class": "egV2_archivedDateEnded"}).next.next
Out[98]: u'Fri 23 Nov,10:00AM'

or even

In [99]: soup.find("span", {"class": "egV2_archivedDateEnded"}).nextSibling
Out[99]: u'Fri 23 Nov,10:00AM'
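For readers without Beautiful Soup 3 to hand, the same "text right after the span" idea can be sketched with only the standard library. This is not the answer's method: it swaps in Python 3's html.parser, and the class name TextAfterSpan and the CONTENT sample (a trimmed copy of the question's markup) are made up for illustration. It does not handle spans nested inside the matching span.

```python
from html.parser import HTMLParser

# Trimmed copy of the question's markup, used here only as sample input.
CONTENT = """<span class="egV2_EventReportCardBody">
<meta itemprop="startDate" content="2012-11-23T10:00:00.0000000">
<span class='egV2_archivedDateEnded'>STARTS</span>Fri 23 Nov,10:00AM<br/>
<meta itemprop="endDate" content="2012-12-03T18:00:00.0000000">
<span class='egV2_archivedDateEnded'>ENDS</span>Mon 03 Dec,6:00PM</span>"""


class TextAfterSpan(HTMLParser):
    """Collect the text node that directly follows each matching <span>."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.inside = False    # currently inside a matching span
        self.capture = False   # the next text node is the one we want
        self.found = []

    def handle_starttag(self, tag, attrs):
        # Any new tag ends the "directly after the span" window.
        self.capture = False
        if tag == "span" and dict(attrs).get("class") == self.css_class:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "span" and self.inside:
            # The matching span just closed: grab whatever text comes next.
            self.inside = False
            self.capture = True
        else:
            self.capture = False

    def handle_data(self, data):
        if self.capture and data.strip():
            self.found.append(data.strip())
        self.capture = False


parser = TextAfterSpan("egV2_archivedDateEnded")
parser.feed(CONTENT)
parser.close()

# First entry is the text after STARTS, second is the text after ENDS.
start_text = parser.found[0]
print(start_text)
```

Because the stripped string contains no markup or line breaks, it can be joined with the "|" separator from the question without further clean-up.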

Answer 2

If you are just trying to extract a single tag that is easily identified by a particular attribute, pyparsing makes this pretty simple (I would go after the meta tag with its ISO 8601 time string value):

from pyparsing import makeHTMLTags,withAttribute

meta = makeHTMLTags('meta')[0]
# only want matching <meta> tags if they have the attribute itemprop="startDate"
meta.setParseAction(withAttribute(itemprop="startDate"))

# scanString is a generator that yields (tokens,startloc,endloc) triples, we just 
# want the tokens
firstmatch = next(meta.scanString(content))[0]

Now convert to a datetime object, which can be formatted any way you like, written to a database, used to compute elapsed times, etc.:

from datetime import datetime
dt = datetime.strptime(firstmatch.content[:19], "%Y-%m-%dT%H:%M:%S")

print (firstmatch.content)
print (dt)

Prints:

2012-11-23T10:00:00.0000000
2012-11-23 10:00:00
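If pyparsing is not available, roughly the same result can be had from the standard library alone. The sketch below is an assumption-laden alternative, not the answer's code: it uses Python 3's html.parser in place of pyparsing, and MetaContent, parse_iso_prefix and SAMPLE are hypothetical names introduced here. It reads the meta tag's content attribute, parses it as above, and then formats it back into the shape the question asked for.

```python
from datetime import datetime
from html.parser import HTMLParser

# The two meta tags from the question's markup, used as sample input.
SAMPLE = ('<meta itemprop="startDate" content="2012-11-23T10:00:00.0000000">'
          '<meta itemprop="endDate" content="2012-12-03T18:00:00.0000000">')


class MetaContent(HTMLParser):
    """Record the content attribute of every <meta itemprop=...> tag."""

    def __init__(self):
        super().__init__()
        self.values = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and "itemprop" in attrs:
            # Keep the first value seen for each itemprop name.
            self.values.setdefault(attrs["itemprop"], attrs.get("content"))


def parse_iso_prefix(value):
    # The page emits seven fractional digits, one more than %f accepts,
    # so keep only the first 19 characters (YYYY-MM-DDTHH:MM:SS).
    return datetime.strptime(value[:19], "%Y-%m-%dT%H:%M:%S")


parser = MetaContent()
parser.feed(SAMPLE)
parser.close()

start = parse_iso_prefix(parser.values["startDate"])
# %a, %b and %p follow the current locale; English names are assumed here.
label = start.strftime("%a %d %b,%I:%M%p")
print(label)
```

Working from the meta tag rather than the visible text means the value does not depend on how the page happens to lay out its spans, at the cost of having to rebuild the display format yourself.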

Answer 1: 3, accepted, 2012-11-25 23:32:25
Answer 2: 0, 2013-06-23 02:38:06
