Python findAll not working on BeautifulSoup 3

Question

I am trying to parse an HTML file and write the results to a CSV file. The HTML file is:

<table BORDER='1' CELLSPACING='0' CELLPADDING='0'>
<tr>
    <td><small>15</small></td >
    <td><small><small>Cat</small></small></td>
</tr>
<tr>
    <td><small><small>16</small></small></td>       
    <td><small><small>&nbsp;</small></small></td>       
</tr>
<tr>
    <td><small>17</small></td >
    <td><small><small>Dog</small></small></td>
</tr>
</table>

My code is:

import csv
from BeautifulSoup import BeautifulSoup as bs

soup = bs (open("Animals.html"))
for i in soup.findAll('small'):
    if "&nbsp;" in i.text:
        i.string = '-'

print soup

f = csv.writer(open("Animals.csv", "a"))   # Open the output file for writing before the loop

trs = soup.findAll('tr')

for tr in trs:

    tds = tr.findAll("td")

    try: #we are using "try" because the table is not well formatted. This allows the program to continue after encountering an error.
        id = str(tds[0].get_text()) # This structure isolate the item by its column in the table and converts it into a string.
        animal = str(tds[1].get_text())

    except:
        print "Bad tr string"
        continue #This tells the computer to move on to the next item after it encounters an error

    f.writerow([id, animal])

When I print the soup after replacing &nbsp;, I get:

<table BORDER='1' CELLSPACING='0' CELLPADDING='0'>
<tr>
    <td><small>15</small></td >
    <td><small></small><small>Cat</small></td > 
</tr>
<tr>
    <td><small><small>16</small></small></td >      
    <td><small></small><small>-</small></td >       
</tr>
<tr>
    <td><small>17</small></td >
    <td><small></small><small>Dog</small></td >
</tr>
</table>

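But when I look at the .csv file, it is empty. However, if I change the code to use BeautifulSoup 4, I can't replace the &nbsp;, but the results are saved to the .csv file. The other code I used is: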

import csv
from bs4 import BeautifulSoup as bs

soup = bs (open("Animals.html"))

f = csv.writer(open("Animals.csv", "w"))   # Open the output file for writing before the loop

trs = soup.find_all('tr')

for tr in trs:

    tds = tr.find_all("td")

    try: #we are using "try" because the table is not well formatted. This allows the program to continue after encountering an error.
        id = str(tds[0].get_text()) # This structure isolate the item by its column in the table and converts it into a string.
        animal = str(tds[1].get_text())

    except:
        print "Bad tr string"
        continue #This tells the computer to move on to the next item after it encounters an error

    f.writerow([id, animal])

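The reason that version won't do is that I want &nbsp; replaced with -, and I can't get that to work (with find_all()) in BeautifulSoup 4. What is preventing the information from being saved to the csv file, and how can I fix it (and/or get it working with BeautifulSoup 4)?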

Answer 1

BeautifulSoup 4 converts '&nbsp;' to the Unicode character \xa0 when it constructs the soup. If you search for and replace that Unicode character, it will work:

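from bs4 import BeautifulSoup

soup = BeautifulSoup(html)  # html is assumed to hold the page source as a string

for i in soup.find_all('small'):
    if i.string is not None:  # .string is None when a tag has no single text child
        i.string.replace_with(i.string.replace(u'\xa0', '-'))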

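The syntax is a little verbose. That is because i.string is not a plain string but a bs4.element.NavigableString. You can't edit these in place with a simple i.string.replace(...); instead, you have to call BeautifulSoup's own replace_with method.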

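replace_with takes only one argument, so we have to build the new version of the string and pass it in. For that we can use Python's built-in string replace method, which takes out the u'\xa0' character and puts '-' in its place.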
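The string-level half of this can be checked without BeautifulSoup. The sketch below is only an illustration under stated assumptions: it uses the standard library's html.unescape as a stand-in for the entity decoding that an HTML parser performs, clean_cell is a hypothetical helper name rather than part of any library, and it is written for Python 3.

```python
import html

NBSP = "\xa0"  # U+00A0 NO-BREAK SPACE, the character that &nbsp; decodes to


def clean_cell(text, placeholder="-"):
    """Return a copy of text with every non-breaking space replaced.

    Hypothetical helper for illustration. str.replace never modifies the
    original string; it builds a new one, which is the kind of value
    that replace_with expects to receive.
    """
    return text.replace(NBSP, placeholder)


# Assumption: html.unescape is used here only to mimic the decoding step
# that a parser applies to the entity while building the tree.
decoded = html.unescape("&nbsp;")

print(repr(decoded))          # the decoded text is a single character
print("&nbsp;" in decoded)    # searching for the entity text finds nothing
print(clean_cell(decoded))    # searching for U+00A0 does find it
```

This is why the `"&nbsp;" in i.text` test from the question has nothing to match once the document has been parsed by BeautifulSoup 4.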

Alternatively, you could use a regular expression on the original HTML. If you just need to replace all instances of &nbsp; with -:

import re

newhtml = re.sub(r'&nbsp;', '-', html)

You could customise this further so that it only affects <small> tags. Let me know if you would like that added to the answer.
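A rough sketch of what that narrower variant might look like (an assumption, not code from the original answer): the pattern below rewrites an &nbsp; only when it is the sole content of a <small> element. SMALL_NBSP and replace_nbsp_in_small are hypothetical names. Regular expressions are fragile on HTML, so this is only suitable for simple, predictable markup like the table in the question, and it does not handle <small> tags that carry attributes.

```python
import re

# Matches "&nbsp;" only when it is the entire content of a <small> element,
# allowing optional whitespace around it. Hypothetical pattern for illustration.
SMALL_NBSP = re.compile(r'(<small>)\s*&nbsp;\s*(</small>)', re.IGNORECASE)


def replace_nbsp_in_small(html_text, placeholder='-'):
    """Replace &nbsp; with placeholder inside <small>...</small> only."""
    # A function is used as the replacement so that the placeholder is
    # inserted literally, without backslash or group-reference processing.
    return SMALL_NBSP.sub(
        lambda m: m.group(1) + placeholder + m.group(2), html_text
    )


# The first cell is rewritten; the bare &nbsp; in the second cell is left alone.
row = "<td><small><small>&nbsp;</small></small></td><td>&nbsp;</td>"
print(replace_nbsp_in_small(row))
```

The cleaned string can then be passed to BeautifulSoup in place of the original HTML.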

Answer 1
2 accepted 2014-01-26 11:15:31
