
Beautiful Soup code gives unexpected results (edited)

(The question was edited based on feedback received. I will continue to edit it based on the input I receive until the issue is resolved.)

I am learning Python, and Beautiful Soup in particular, and I am doing the Google exercise on regex that uses a set of HTML files containing popular baby names for different years (e.g. baby1990.html). If you are interested, you can find the dataset here: https://developers.google.com/edu/python/exercises/baby-names

Each HTML file contains a table of baby-name data that looks like this:

[image: screenshot of the baby names table]

Before the table with the baby names there is another table. The opening tags of the two tables are, respectively:

<table width="100%" border="0" cellspacing="0" cellpadding="4"> # Unwanted table
<table width="100%" border="0" cellspacing="0" cellpadding="4" summary="formatting">  # targeted table

Note that the targeted table differs from the unwanted one by the attribute summary="formatting".

The first table, the one we must skip, has the following HTML code:

<table width="100%" border="0" cellspacing="0" cellpadding="4">
  <tbody>
  <tr><td class="sstop" valign="bottom" align="left" width="25%">
      Social Security Online
    </td><td valign="bottom" class="titletext">
      <!-- sitetitle -->Popular Baby Names
    </td>
  </tr>
  <tr bgcolor="#333366"><td colspan="2" height="2"></td></tr>
  <tr><td class="graystars" width="25%" valign="top">
       <a href="../OACT/babynames/">Popular Baby Names</a></td><td valign="top"> 
      <a href="http://www.ssa.gov/"><img src="/templateimages/tinylogo.gif"
      width="52" height="47" align="left"
      alt="SSA logo: link to Social Security home page" border="0"></a><a name="content"></a>
      <h1>Popular Names by Birth Year</h1>September 12, 2007</td>
  </tr>
  <tr bgcolor="#333366"><td colspan="2" height="1"></td></tr>
</tbody></table>

Within the targeted table, the code is the following:

<table width="100%" border="0" cellspacing="0" cellpadding="4" summary="formatting">
<tr valign="top"><td width="25%" class="greycell">
<a href="../OACT/babynames/background.html">Background information</a>
<p><br />
&nbsp; Select another <label for="yob">year of birth</label>?<br />      
<form method="post" action="/cgi-bin/popularnames.cgi">
&nbsp; <input type="text" name="year" id="yob" size="4" value="1990">
<input type="hidden" name="top" value="1000">
<input type="hidden" name="number" value="">
&nbsp; <input type="submit" value="   Go  "></form>
</td><td>
<h3 align="center">Popularity in 1990</h3>
<p align="center">
<table width="48%" border="1" bordercolor="#aaabbb"
 cellpadding="2" cellspacing="0" summary="Popularity for top 1000">
<tr align="center" valign="bottom">
<th scope="col" width="12%" bgcolor="#efefef">Rank</th>
<th scope="col" width="41%" bgcolor="#99ccff">Male name</th>
<th scope="col" bgcolor="pink" width="41%">Female name</th></tr>
<tr align="right"><td>1</td><td>Michael</td><td>Jessica</td> # Targeted row
<tr align="right"><td>2</td><td>Christopher</td><td>Ashley</td> # Targeted row
etc...

You can see that the distinctive attribute of the targeted rows is align="right".

The code to extract the content of the targeted cells is the following:

with open("C:/Users/ALEX/MyFiles/JUPYTER NOTEBOOKS/google-python-exercises/babynames/baby1990.html","r") \
as f: soup = bs(f.read(), 'html.parser') 

print soup.tr
print "number of elemenents in the soup:" , len(soup)

right_table = soup.find("table", summary = "formatting")

print(right_table.prettify())

print "right_table" , len(right_table)

print(right_table[0].prettify())

for row in right_table[1].find_all("tr", allign = "right"):

     cells = row.find_all("td")

     try:
                            print "cells[0]: " , cells[0]
     except:
                            print "cells[0] : NaN"
     try:
                            print "cells[1]: " , cells[1]
     except:
                            print "cells[1] : NaN"    
     try:
                            print "cells[2]: " , cells[2]
     except:
                            print "cells[2] : NaN"

The output is an error message:

    <tr><td align="left" class="sstop" valign="bottom" width="25%">
      Social Security Online
    </td><td class="titletext" valign="bottom">
<!-- sitetitle -->Popular Baby Names
    </td>
</tr>
number of elemenents in the soup: 4
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-116-3ec77a65b5ad> in <module>()
      6 right_table = soup.find("table", summary = "formatting")
      7 
----> 8 print(right_table.prettify())
      9 
     10 print "right_table" , len(right_table)

C:\users\alex\Anaconda2\lib\site-packages\bs4\element.pyc in prettify(self, encoding, formatter)
   1198     def prettify(self, encoding=None, formatter="minimal"):
   1199         if encoding is None:
-> 1200             return self.decode(True, formatter=formatter)
   1201         else:
   1202             return self.encode(encoding, True, formatter=formatter)

C:\users\alex\Anaconda2\lib\site-packages\bs4\element.pyc in decode(self, indent_level, eventual_encoding, formatter)
   1164             indent_contents = None
   1165         contents = self.decode_contents(
-> 1166             indent_contents, eventual_encoding, formatter)
   1167 
   1168         if self.hidden:

C:\users\alex\Anaconda2\lib\site-packages\bs4\element.pyc in decode_contents(self, indent_level, eventual_encoding, formatter)
   1233             elif isinstance(c, Tag):
   1234                 s.append(c.decode(indent_level, eventual_encoding,
-> 1235                                   formatter))
   1236             if text and indent_level and not self.name == 'pre':
   1237                 text = text.strip()

... last 2 frames repeated, from the frame below ...

C:\users\alex\Anaconda2\lib\site-packages\bs4\element.pyc in decode(self, indent_level, eventual_encoding, formatter)
   1164             indent_contents = None
   1165         contents = self.decode_contents(
-> 1166             indent_contents, eventual_encoding, formatter)
   1167 
   1168         if self.hidden:

RuntimeError: maximum recursion depth exceeded while calling a Python object

The questions are the following:

  1. Why does the code return the first table, the unwanted one, given that we passed the argument summary = "formatting"?

  2. What does the error message mean, and why is it raised?

  3. What other errors, if any, can you see in the code?

Your advice will be appreciated.

I think you're misreading how attribute searching works.

If you're looking for the table whose summary equals "Popularity for top 1000", you should use:

soup.find('table', summary="Popularity for top 1000")

Hopefully that works for you!
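
A minimal sketch of that lookup, assuming a heavily trimmed stand-in for the page (the `html` string below is made up for illustration and is not the real file; its rows are closed with `</tr>` to keep the example simple). The inner table carries its own summary, so searching for that value returns the names table directly instead of the outer layout table.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed stand-in for baby1990.html (not the real file).
html = """
<table width="100%" summary="formatting">
<tr><td>
<table width="48%" summary="Popularity for top 1000">
<tr><th>Rank</th><th>Male name</th><th>Female name</th></tr>
<tr align="right"><td>1</td><td>Michael</td><td>Jessica</td></tr>
<tr align="right"><td>2</td><td>Christopher</td><td>Ashley</td></tr>
</table>
</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Match on the inner table's own summary attribute.
names_table = soup.find("table", summary="Popularity for top 1000")

# Only the data rows carry align="right" (spelled with a single "l").
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in names_table.find_all("tr", align="right")
]
print(rows)
```

The header row is skipped because it has no align="right" attribute, so no extra filtering is needed.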

summary_ = "formatting"
allign_ = "right"

Delete the _; only class_ takes a trailing underscore.

It's very useful to search for a tag that has a certain CSS class, but the name of the CSS attribute, "class", is a reserved word in Python. Using class as a keyword argument will give you a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class using the keyword argument class_.
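
To illustrate the point, here is a small sketch on invented markup. Ordinary attribute names are passed as keyword arguments unchanged, only class needs the trailing underscore, and a suffixed or misspelled name such as summary_ or allign is treated as a different attribute name that matches nothing.

```python
from bs4 import BeautifulSoup

# Invented markup, for illustration only.
html = (
    '<table summary="formatting">'
    '<tr align="right" class="data"><td>1</td></tr>'
    '</table>'
)
soup = BeautifulSoup(html, "html.parser")

# Plain attribute names are used as-is.
by_summary = soup.find("table", summary="formatting")
by_align = soup.find_all("tr", align="right")

# class is a Python keyword, so Beautiful Soup accepts class_ instead.
by_class = soup.find_all("tr", class_="data")

# A suffixed or misspelled name is looked up literally and finds nothing.
wrong_suffix = soup.find("table", summary_="formatting")
wrong_spelling = soup.find_all("tr", allign="right")

# An attrs dict works for any attribute name, including class.
by_attrs = soup.find_all("tr", attrs={"class": "data", "align": "right"})

print(by_summary is not None, len(by_align), len(by_class))
print(wrong_suffix, len(wrong_spelling), len(by_attrs))
```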

import bs4

with open('/home/li/Downloads/google-python-exercises/babynames/baby2006.html') as f:
    soup = bs4.BeautifulSoup(f, 'lxml')
    table = soup.find(summary="Popularity for top 1000")
    for tr in table.find_all('tr'):
        tds = list(tr.stripped_strings)
        print(tds)

out:

['Rank', 'Male name', 'Female name']
['1', 'Jacob', 'Emily']
['2', 'Michael', 'Emma']
['3', 'Joshua', 'Madison']
['4', 'Ethan', 'Isabella']
['5', 'Matthew', 'Ava']
['6', 'Daniel', 'Abigail']
['7', 'Christopher', 'Olivia']
['8', 'Andrew', 'Hannah']
['9', 'Anthony', 'Sophia']
['10', 'William', 'Samantha']
['11', 'Joseph', 'Elizabeth']
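
On the RuntimeError in the question: a plausible cause, offered as an assumption rather than a confirmed diagnosis, is that the data rows shown in the question have no closing `</tr>`. `html.parser` does not close them implicitly, so each row ends up nested inside the previous one, and with around 1000 rows `prettify()` can recurse past Python's default limit. The sketch below uses invented markup to show that nesting on a small scale. The answer above parses with `lxml` instead, which is presumably why it does not run into the error.

```python
from bs4 import BeautifulSoup

# Invented markup: two data rows, neither closed with </tr>.
html = (
    '<table summary="Popularity for top 1000">'
    '<tr align="right"><td>1</td><td>Michael</td><td>Jessica</td>'
    '<tr align="right"><td>2</td><td>Christopher</td><td>Ashley</td>'
    '</table>'
)

soup = BeautifulSoup(html, "html.parser")
rows = soup.find_all("tr")


def depth(tag):
    """Count how many ancestors a tag has."""
    return len(list(tag.parents))


# With html.parser the second row becomes a child of the first one,
# so the nesting depth grows by one per unclosed row.
depths = [depth(tr) for tr in rows]
print(depths)

# recursive=False keeps each row's own cells apart from the nested rows.
cells = [
    [td.get_text() for td in tr.find_all("td", recursive=False)]
    for tr in rows
]
print(cells)
```

If switching parsers is not an option, looking only at direct children (recursive=False) as above is one way to read each row without descending into the rows nested beneath it.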
