
Parsing web page in python using Beautiful Soup

I am having some trouble getting data from a website. The website source is here:

view-source:http://release24.pl/wpis/23714/%22La+mer+a+boire%22+%282011%29+FRENCH.DVDRip.XviD-AYMO

There is something like this:

INFORMACJE O FILMIE

Tytuł............................................: La mer à boire

Ocena.............................................: IMDB - 6.3/10 (24)

Produkcja.........................................: Francja

Gatunek...........................................: Dramat

Czas trwania......................................: 98 min.

Premiera..........................................: 22.02.2012 - Świat

Reżyseria........................................: Jacques Maillot

Scenariusz........................................: Pierre Chosson, Jacques Maillot

Aktorzy...........................................: Daniel Auteuil, Maud Wyler, Yann Trégouët, Alain Beigel

And I want to get the data from this website into a Python list of strings:

[[Tytuł, "La mer à boire"]
[Ocena, "IMDB - 6.3/10 (24)"]
[Produkcja, Francja]
[Gatunek, Dramat]
[Czas trwania, 98 min.]
[Premiera, "22.02.2012 - Świat"]
[Reżyseria, "Jacques Maillot"]
[Scenariusz, "Pierre Chosson, Jacques Maillot"]
[Aktorzy, "Daniel Auteuil, Maud Wyler, Yann Trégouët, Alain Beigel"]]

I wrote some code using BeautifulSoup but I can't get any further. I just don't know how to get the rest of the data from the website source and how to convert it to strings... Please help!

My code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import urllib2
from bs4 import BeautifulSoup

try:
    web_page = urllib2.urlopen("http://release24.pl/wpis/23714/%22La+mer+a+boire%22+%282011%29+FRENCH.DVDRip.XviD-AYMO").read()
    soup = BeautifulSoup(web_page)
    c = soup.find('span', {'class': 'vi'}).contents
    print(c)
except urllib2.HTTPError:
    print("HTTPERROR!")
except urllib2.URLError:
    print("URLERROR!")

The secret of using BeautifulSoup is to find the hidden patterns in your HTML document. For example, your loop

for ul in soup.findAll('p') :
    print(ul)

is in the right direction, but it will return all paragraphs, not only the ones you are looking for. The paragraphs you are looking for, however, have the helpful property of having a class i. Inside these paragraphs one can find two spans, one with the class i and another with the class vi. We are lucky, because those spans contain the data you are looking for:

<p class="i">
    <span class="i">Tytuł............................................</span>
    <span class="vi">: La mer à boire</span>
</p>

So, first get all the paragraphs with the given class:

>>> ps = soup.findAll('p', {'class': 'i'})
>>> ps
[<p class="i"><span class="i">Tytuł... <LOTS OF STUFF> ...pan></p>]

Now, using list comprehensions, we can generate a list of pairs, where each pair contains the first and the second span from the paragraph:

>>> spans = [(p.find('span', {'class': 'i'}), p.find('span', {'class': 'vi'})) for p in ps]
>>> spans
[(<span class="i">Tyt... ...</span>, <span class="vi">: La mer à boire</span>), 
 (<span class="i">Ocena... ...</span>, <span class="vi">: IMDB - 6.3/10 (24)</span>),
 (<span class="i">Produkcja.. ...</span>, <span class="vi">: Francja</span>),
 # and so on
]

Now that we have the spans, we can get the texts from them:

>>> texts = [(span_i.text, span_vi.text) for span_i, span_vi in spans]
>>> texts
[(u'Tytu\u0142............................................', u': La mer \xe0 boire'),
 (u'Ocena.............................................', u': IMDB - 6.3/10 (24)'),
 (u'Produkcja.........................................', u': Francja'), 
  # and so on
]

Those texts are still not quite right, but it is easy to correct them. To remove the dots from the first one, we can use rstrip():

>>> u'Produkcja.........................................'.rstrip('.')
u'Produkcja'

The leading ': ' can be removed with lstrip():

>>> u': Francja'.lstrip(': ')
u'Francja'
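One subtlety worth noting (not spelled out above): lstrip() takes a set of characters, not a literal prefix, so it removes any leading run of ':' and ' ' characters, however they are arranged:

```python
# lstrip(': ') treats its argument as a character set, not a prefix:
# any leading run of ':' and ' ' characters is removed.
print(u': Francja'.lstrip(': '))    # extra leading spaces would be removed too
# Characters from the set that appear later in the string are untouched:
print(u': IMDB - 6.3/10 (24)'.lstrip(': '))
```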

To apply this to all the content, we just need another list comprehension:

>>> result = [(text_i.rstrip('.'), text_vi.replace(': ', '')) for text_i, text_vi in texts]
>>> result
[(u'Tytu\u0142', u'La mer \xe0 boire'),
 (u'Ocena', u'IMDB - 6.3/10 (24)'),
 (u'Produkcja', u'Francja'),
 (u'Gatunek', u'Dramat'),
 (u'Czas trwania', u'98 min.'),
 (u'Premiera', u'22.02.2012 - \u015awiat'),
 (u'Re\u017cyseria', u'Jacques Maillot'),
 (u'Scenariusz', u'Pierre Chosson, Jacques Maillot'),
 (u'Aktorzy', u'Daniel Auteuil, Maud Wyler, Yann Tr&eacute;gou&euml;t, Alain Beigel'),
 (u'Wi\u0119cej na', u':'),
 (u'Trailer', u':Obejrzyj zwiastun')]

And that is it. I hope this step-by-step example makes the use of BeautifulSoup clearer for you.
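Putting the steps above together, the whole extraction can be sketched as one function (a sketch, not the answer's exact code: `extract_info` is a hypothetical name, and it is demonstrated on a small inline snippet instead of the live page):

```python
from bs4 import BeautifulSoup

def extract_info(html):
    # Hypothetical helper: return [label, value] pairs from the
    # <p class="i"> / <span class="i"> / <span class="vi"> pattern.
    soup = BeautifulSoup(html, 'html.parser')
    result = []
    for p in soup.findAll('p', {'class': 'i'}):
        span_i = p.find('span', {'class': 'i'})
        span_vi = p.find('span', {'class': 'vi'})
        if span_i and span_vi:
            result.append([span_i.text.rstrip('.'),
                           span_vi.text.lstrip(': ')])
    return result

sample = u'''
<p class="i">
    <span class="i">Gatunek...........................................</span>
    <span class="vi">: Dramat</span>
</p>
'''
print(extract_info(sample))
```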

This will get you the list you want; you'll have to write some code to get rid of the trailing '....'s and to convert the character strings.

import urllib2
from bs4 import BeautifulSoup

try:
    web_page = urllib2.urlopen("http://release24.pl/wpis/23714/%22La+mer+a+boire%22+%282011%29+FRENCH.DVDRip.XviD-AYMO").read()
    soup = BeautifulSoup(web_page)
    LIST = []
    for p in soup.findAll('p'):
        s = p.find('span', {"class": 'i'})
        t = p.find('span', {"class": 'vi'})
        if s and t:
            p_list = [s.string, t.string]
            LIST.append(p_list)
except urllib2.HTTPError:
    print("HTTPERROR!")
except urllib2.URLError:
    print("URLERROR!")
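The cleanup this answer mentions (stripping the trailing dots and the leading ': ') can be sketched with plain string methods; `clean_pair` is a hypothetical helper name:

```python
def clean_pair(label, value):
    # Hypothetical helper: strip the dotted padding from the label
    # and the leading ': ' from the value.
    return [label.rstrip('.'), value.lstrip(': ')]

print(clean_pair(u'Gatunek...........................................',
                 u': Dramat'))
```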
