Using Beautiful Soup, grabbing stuff between <li> and </li>

Here is the code I have so far:

import urllib
from bs4 import BeautifulSoup

lis = []
webpage = urllib.urlopen('http://facts.randomhistory.com/interesting-facts-about-cats.html')
soup = BeautifulSoup(webpage)
for ul in soup:
    for li in soup.findAll('li'):
        lis.append(li)
    for li in lis:
        print li.text.encode("utf-8")

I'm just trying to get the cat facts from between the opening and closing "li" tags and output them in a way that doesn't look messed up. Currently, the output from this code repeats all of the facts about four times, and the word "can't" comes out as "can’t".

I'd appreciate any help.

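You don't need the outer loop (for ul in soup). Every pass of that loop searches the whole document again with soup.findAll('li'), so the same items are collected and printed repeatedly. Remove it and each fact is output once.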

soup = BeautifulSoup(webpage)
for li in soup.findAll('li'):
    lis.append(li)
for li in lis:
    print li.text.encode("utf-8")
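
To see why the outer loop multiplies the output, here is a minimal sketch that uses plain lists instead of a parsed page. The names top_level_nodes, all_items and find_all_items are hypothetical stand-ins, not part of Beautiful Soup, and the sketch is written for Python 3 rather than the Python 2 used in the thread.

```python
# Hypothetical stand-ins for the parsed page: three top-level nodes
# and two list items somewhere inside the document.
top_level_nodes = ["doctype", "html", "trailing newline"]
all_items = ["fact one", "fact two"]


def find_all_items():
    """Stand-in for soup.findAll('li'): always searches the whole page."""
    return list(all_items)


# Shape of the question's code: the full search runs once per top-level node.
nested = []
for _node in top_level_nodes:
    for item in find_all_items():
        nested.append(item)

# Shape of the fix: search the document a single time.
flat = list(find_all_items())

print(len(nested), len(flat))
```

The nested version collects every item once per top-level node, which is why removing the outer loop is enough to stop the repetition.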

The page's Content-Type header says its encoding is ISO-8859-1, but it is lying; the content is really UTF-8. Tell Beautiful Soup to ignore the lie by passing from_encoding. You can also make Beautiful Soup do less work by giving it a SoupStrainer as parse_only that selects only elements with the content-td class. Finally, you can simplify your for loops. All together:

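import urllib2
import bs4

webpage = urllib2.urlopen('http://facts.randomhistory.com/interesting-facts-about-cats.html')
# Override the declared ISO-8859-1 encoding and parse only the
# elements whose class is content-td.
soup = bs4.BeautifulSoup(webpage, from_encoding='UTF-8',
                         parse_only=bs4.SoupStrainer(attrs='content-td'))
# Calling soup('li') is shorthand for soup.find_all('li').
for li in soup('li'):
    print li.text.encode('utf-8')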
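
The encoding mismatch itself can be reproduced without fetching the page. The sketch below is for Python 3 and rests on two assumptions: that the page stores a curly apostrophe (U+2019) as UTF-8, and that the mislabeled bytes end up decoded as Windows-1252, which is how the ISO-8859-1 label is commonly treated in practice.

```python
# Assumption: the page contains a curly apostrophe (U+2019) stored as UTF-8.
original = "can\u2019t"
raw_bytes = original.encode("utf-8")

# Decoding with the wrong charset turns the single three-byte character
# into three separate characters.
garbled = raw_bytes.decode("cp1252")

# Decoding with the charset the bytes were really written in restores the text.
restored = raw_bytes.decode("utf-8")

print(repr(garbled))
print(repr(restored))
```

Passing from_encoding='UTF-8' makes Beautiful Soup take the second path instead of trusting the declared charset.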

You can further improve the output by replacing consecutive whitespace with a single space and removing the superscripts.
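
The whitespace part of that cleanup could look roughly like the following Python 3 sketch. The helper name collapse_whitespace and the sample string are made up for illustration; the function would be applied to each li.text value before printing.

```python
import re


def collapse_whitespace(text):
    """Replace every run of whitespace with one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


# Made-up sample standing in for the text of one list item.
sample = "A  made-up \n\t cat fact   with   gaps. "
print(collapse_whitespace(sample))
```

Removing the superscripts needs the parsed tree rather than the extracted text. Assuming they are sup tags, one option is to call decompose() on each of them before reading li.text; that step is not shown here.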

Notice: the technical posts on this site are published under the CC BY-SA 4.0 license. If you republish one, please credit this site's URL or the address of the original post.

 