Beautiful Soup and searching in results

These are my first steps with Python, so please bear with me.

Basically, I want to parse a table of contents from a single DokuWiki page with Beautiful Soup. The TOC looks like this:

<div id="dw__toc">
<h3 class="toggle">Table of Contents</h3>
<div>

<ul class="toc">
<li class="level1"><div class="li"><a href="#section">#</a></div>
<ul class="toc">
<li class="level2"><div class="li"><a href="#link1">One</a></div></li>
<li class="level2"><div class="li"><a href="#link2">Two</a></div></li>
<li class="level2"><div class="li"><a href="#link3">Three</a></div></li>

I would like to be able to search the content of the a tags and, if a match is found, return its text along with its href link. So if I search for "one", the result should be:

One
#link1

What I have done so far:

#!/usr/bin/python2

from BeautifulSoup import BeautifulSoup
import urllib2


#Grab and open URL, create BeautifulSoup object
url = "http://www.somewiki.at/wiki/doku.php"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
#Grab Table of Contents
grab_toc = soup.find('div', {"id":"dw__toc"})

#Look for all divs with class: li
ftext = grab_toc.findAll('div', {"class":"li"})
#Look for links
links = grab_toc.findAll('a',href=True)

#Iterate
for everytext in ftext:
    text = ''.join(everytext.findAll(text=True))
    data = text.strip()
    print data

for everylink in links:
    print everylink['href']

This prints out the data I want, but I'm at a loss how to rewrite it so that I can search within the results and only return the search term. I tried something like:

if data == 'searchterm':
    print data
    break
else:
    print 'Nothing found'

But this is kind of a weak search. Is there a nicer way to do this? In my example the Beautiful Soup result set is converted into a list. Would it be better to search the result set in the first place, and if so, how?

Instead of searching through the links one by one, have BeautifulSoup search for you, using a regular expression:

import re

matching_link = grab_toc.find('a', text=re.compile('one', re.IGNORECASE))

This finds the first a link in the table of contents with the three characters one somewhere in its text. Then just print the text and the link:

print matching_link.string
print matching_link['href']
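
Note that .find() returns None when nothing in the document matches, and calling .string or ['href'] on None raises an error. A minimal sketch of a guard, reusing the grab_toc result and the re import from above:

matching_link = grab_toc.find('a', text=re.compile('one', re.IGNORECASE))
if matching_link is not None:
    # a match was found: print its text and link target
    print matching_link.string
    print matching_link['href']
else:
    print 'Nothing found'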

Short demo based on your sample:

>>> from bs4 import BeautifulSoup
>>> import re
>>> soup = BeautifulSoup('''\
... <div id="dw__toc">
... <h3 class="toggle">Table of Contents</h3>
... <div>
... 
... <ul class="toc">
... <li class="level1"><div class="li"><a href="#section">#</a></div>
... <ul class="toc">
... <li class="level2"><div class="li"><a href="#link1">One</a></div></li>
... <li class="level2"><div class="li"><a href="#link2">Two</a></div></li>
... <li class="level2"><div class="li"><a href="#link3">Three</a></div></li>
... </ul></ul>''')
>>> matching_link = soup.find('a', text=re.compile('one', re.IGNORECASE))
>>> print matching_link.string
One
>>> print matching_link['href']
#link1

In BeautifulSoup version 3, the above .find() call returns the contained NavigableString object instead. To get back to the parent a element, use the .parent attribute:

matching_link = grab_toc.find('a', text=re.compile('one', re.IGNORECASE)).parent
print matching_link.string
print matching_link['href']
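
Putting this together with the original script, a minimal BeautifulSoup 3 sketch (assuming the same URL and dw__toc markup from the question) could look like this:

#!/usr/bin/python2

import re
import urllib2

from BeautifulSoup import BeautifulSoup

#Grab and open URL, create BeautifulSoup object
url = "http://www.somewiki.at/wiki/doku.php"
soup = BeautifulSoup(urllib2.urlopen(url).read())

#Grab Table of Contents
grab_toc = soup.find('div', {"id": "dw__toc"})

#BeautifulSoup 3 returns the matching NavigableString,
#so step up to the enclosing <a> tag with .parent
match = grab_toc.find('a', text=re.compile('one', re.IGNORECASE))
if match is not None:
    link = match.parent
    print link.string
    print link['href']
else:
    print 'Nothing found'

The same structure works with bs4: switch the import to from bs4 import BeautifulSoup and drop the .parent step, since find() then returns the a tag directly, as in the demo above.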
