
Python parsing with Beautiful Soup

I have a question regarding HTML parsing with BeautifulSoup. The website I am trying to parse is this one: http://www.auc.nl/news-events/events-and-lectures/events-and-lectures.html?page=1&pageSize=40

First I needed to write a function that would give me all h3 tags and all p tags. I did that as follows:

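    from bs4 import BeautifulSoup
    import urllib2

    # Fetch the page; the response is a file-like object that BeautifulSoup can read.
    website = urllib2.urlopen("http://www.auc.nl/news-events/events-and-lectures/events-and-lectures.html")

    def parseUsingSoup2(content):
        # Build the soup from the content that was passed in
        soup = BeautifulSoup(content, "html.parser")
        list1 = soup.findAll('h3')
        list2 = soup.findAll('p')
        return list1 + list2

    parseUsingSoup2(website)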

The next part of the problem asks for a list of events (although there is only one event on the website), each given as a 4-tuple: the time slot, the title, the type and the description.

I don't really know how to start with that. My first attempt was this:

    def GeneratingListofEvents(content):
        event = {}
        fields = ['time', 'title', 'feature', 'description']
        for item in fields:
            pass  # unfinished: unsure how to fill in each field

However, I have no idea whether this is heading in the right direction, and I haven't managed to retrieve, for instance, the time from the HTML document without typing it in manually. Thank you in advance.

Notice that all the info you need is in <div class="agendaright">:

    from bs4 import BeautifulSoup
    import urllib2

    html = urllib2.urlopen("http://www.auc.nl/news-events/events-and-lectures/events-and-lectures.html")
    soup = BeautifulSoup(html, "html.parser")

    # The event details all sit inside this one container
    agenda = soup.find('div', class_="agendaright")
    time = agenda.find('span', class_="event-time").text
    # u'18:00 - 20:00'
    title = agenda.h3.text
    # u'Images Without Borders Violence, Visuality, and Landscape in Postwar Ambon, Indonesia'
    feature = agenda.find('span', class_="feature").text
    # u' | Lecture'
    description = agenda.find('p', class_="event-description").text
    # u'This lecture explores the thematization of the visual and expansion of\nits terrain exemplified by the gigantic hijacked billboards with Jesus\nfaces and the painted murals with Christian themes which arose during\nthe ...'

    # Collect the four fields for this event
    event = [time, title, feature, description]
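
If the page ever lists more than one event, the same lookups can probably be repeated for each agendaright block. Below is a rough sketch of that idea, not a tested solution for the live site: SAMPLE_HTML is made-up markup that only borrows the class names used above, the second event is invented, and the helper names text_or_none and generate_list_of_events are hypothetical. It is written for Python 3 and runs on the inline string, so no network access is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: the class names come from the answer above,
# but the second event and the exact nesting are invented for illustration.
SAMPLE_HTML = """
<div class="agendaright">
  <span class="event-time">18:00 - 20:00</span>
  <h3>Images Without Borders</h3>
  <span class="feature"> | Lecture</span>
  <p class="event-description">This lecture explores ...</p>
</div>
<div class="agendaright">
  <span class="event-time">12:00 - 13:00</span>
  <h3>Lunch Seminar</h3>
  <p class="event-description">No feature span here.</p>
</div>
"""


def text_or_none(tag):
    """Return the stripped text of a tag, or None if the tag was not found."""
    return tag.get_text().strip() if tag is not None else None


def generate_list_of_events(html):
    """Return one (time, title, feature, description) tuple per event block."""
    soup = BeautifulSoup(html, "html.parser")
    events = []
    for block in soup.find_all("div", class_="agendaright"):
        feature = text_or_none(block.find("span", class_="feature"))
        if feature is not None:
            # Drop the leading "| " separator seen in the answer's output.
            feature = feature.lstrip("| ").strip()
        events.append((
            text_or_none(block.find("span", class_="event-time")),
            text_or_none(block.find("h3")),
            feature,
            text_or_none(block.find("p", class_="event-description")),
        ))
    return events


events = generate_list_of_events(SAMPLE_HTML)
for event in events:
    print(event)
```

Returning None for a missing field, instead of calling .text on the result of find directly, keeps one incomplete event from raising an AttributeError and aborting the whole list.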


 