
Python findAll not working on BeautifulSoup 3

I am trying to parse an HTML file and write the results to a CSV file. The HTML file is:

<table BORDER='1' CELLSPACING='0' CELLPADDING='0'>
<tr>
    <td><small>15</small></td >
    <td><small><small>Cat</small></small></td>
</tr>
<tr>
    <td><small><small>16</small></small></td>       
    <td><small><small>&nbsp;</small></small></td>       
</tr>
<tr>
    <td><small>17</small></td >
    <td><small><small>Dog</small></small></td>
</tr>
</table>

My code is:

import csv
from BeautifulSoup import BeautifulSoup as bs

soup = bs(open("Animals.html"))
for i in soup.findAll('small'):
    if "&nbsp;" in i.text:
        i.string = '-'

print soup

f = csv.writer(open("Animals.csv", "a"))   # Open the output file for writing before the loop

trs = soup.findAll('tr')

for tr in trs:

    tds = tr.findAll("td")

    try:  # We use "try" because the table is not well formatted. This lets the program continue after encountering an error.
        id = str(tds[0].get_text())  # This isolates the item by its column in the table and converts it into a string.
        animal = str(tds[1].get_text())

    except:
        print "Bad tr string"
        continue  # Move on to the next row after encountering an error

    f.writerow([id, animal])

When I print the soup after replacing the &nbsp;, I get:

<table BORDER='1' CELLSPACING='0' CELLPADDING='0'>
<tr>
    <td><small>15</small></td >
    <td><small></small><small>Cat</small></td > 
</tr>
<tr>
    <td><small><small>16</small></small></td >      
    <td><small></small><small>-</small></td >       
</tr>
<tr>
    <td><small>17</small></td >
    <td><small></small><small>Dog</small></td >
</tr>
</table>

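But when I look at the .csv file, it is empty. However, if I change the code to use BeautifulSoup 4, I cannot replace the &nbsp;, but the results do get saved to the .csv file. The other code I used is: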

import csv
from bs4 import BeautifulSoup as bs

soup = bs(open("Animals.html"))

f = csv.writer(open("Animals.csv", "w"))   # Open the output file for writing before the loop

trs = soup.find_all('tr')

for tr in trs:

    tds = tr.find_all("td")

    try:  # We use "try" because the table is not well formatted. This lets the program continue after encountering an error.
        id = str(tds[0].get_text())  # This isolates the item by its column in the table and converts it into a string.
        animal = str(tds[1].get_text())

    except:
        print "Bad tr string"
        continue  # Move on to the next row after encountering an error

    f.writerow([id, animal])

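The reason that version won't do is that I want the &nbsp; to be replaced with -, and I can't get the replacement loop (with find_all()) to work with BeautifulSoup 4. What is causing the information not to be saved to the csv file, and how do I fix it (and/or get it to work with BeautifulSoup 4)?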

BeautifulSoup 4 converts '&nbsp;' to the Unicode character \xa0 when it constructs the soup. If you search for and replace that Unicode character instead, it works:

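from bs4 import BeautifulSoup

soup = BeautifulSoup(html)  # html holds the contents of Animals.html

for i in soup.find_all('small'):
    if i.string is not None:  # .string is None unless the tag wraps exactly one string
        i.string.replace_with(i.string.replace(u'\xa0', '-'))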

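The syntax is a little verbose. That is because i.string is not a plain string but a bs4.element.NavigableString. You cannot edit one of these with a simple i.string.replace(...), because that call only returns a new string and leaves the tree untouched; instead, you have to call BeautifulSoup's own replace_with method.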
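The difference between the two calls can be checked on a single cell from the question. This is a minimal sketch, assuming Python 3 with bs4 installed and the standard-library "html.parser" backend (the original answer does not name a parser); the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Assumption: "html.parser" is chosen here only because it ships with Python.
soup = BeautifulSoup("<td><small><small>&nbsp;</small></small></td>", "html.parser")
inner = soup.find_all("small")[1]

# The entity was decoded while parsing, so the literal text "&nbsp;" is gone.
has_entity = "&nbsp;" in inner.string
has_nbsp_char = u"\xa0" in inner.string

# str.replace builds a new string; the node in the tree is not modified.
detached = inner.string.replace(u"\xa0", "-")
unchanged = inner.string == u"\xa0"

# replace_with swaps the node inside the tree for the new text.
inner.string.replace_with(detached)
after = inner.get_text()
```

Here has_entity is False and has_nbsp_char is True, which is why the question's check for "&nbsp;" never matches under BeautifulSoup 4.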

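replace_with takes the replacement itself as its argument, so we have to build the new version of the string and pass it in. To do that, we can use Python's built-in string replace method to take the u'\xa0' characters and replace them with '-'.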
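Putting the replacement together with the CSV step from the question might look like the sketch below. It assumes Python 3, bs4 with the "html.parser" backend, and an in-memory buffer in place of Animals.csv so it can be run anywhere; table_to_rows is a hypothetical helper name, and the length check stands in for the question's bare except.

```python
import csv
import io

from bs4 import BeautifulSoup

# Sample rows copied from the question's table.
HTML = """<table>
<tr><td><small>15</small></td><td><small><small>Cat</small></small></td></tr>
<tr><td><small><small>16</small></small></td><td><small><small>&nbsp;</small></small></td></tr>
<tr><td><small>17</small></td><td><small><small>Dog</small></small></td></tr>
</table>"""


def table_to_rows(html):
    """Return [id, animal] rows, with non-breaking spaces shown as '-'."""
    soup = BeautifulSoup(html, "html.parser")  # parser choice is an assumption
    for small in soup.find_all("small"):
        if small.string is not None:  # None unless the tag wraps exactly one string
            small.string.replace_with(small.string.replace(u"\xa0", "-"))
    rows = []
    for tr in soup.find_all("tr"):
        tds = tr.find_all("td")
        if len(tds) >= 2:  # skip malformed rows
            rows.append([tds[0].get_text(), tds[1].get_text()])
    return rows


rows = table_to_rows(HTML)

# Write to an in-memory buffer; a real script would open "Animals.csv" instead.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)
csv_text = buffer.getvalue()
```

When writing to a real file under Python 3, opening it with newline="" is the usual way to keep the csv module from emitting blank lines on Windows.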

Alternatively, you could use a regular expression on the raw HTML, if all you need is to replace every instance of &nbsp; with -:

import re

newhtml = re.sub(r'&nbsp;', '-', html)

You could customise this further so that it only affects the <small> tags; let me know if you would like that added to the answer.
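One possible way to restrict the regular expression to <small> tags is sketched below. This is not from the original answer: the pattern and the name dash_empty_small are hypothetical, and it only handles an &nbsp; that is the entire content of a <small> element with no attributes.

```python
import re

# Hypothetical pattern: an &nbsp; that is the whole content of a <small> tag.
SMALL_NBSP = re.compile(r"(<small>)\s*&nbsp;\s*(</small>)", re.IGNORECASE)


def dash_empty_small(html):
    """Replace &nbsp; with '-' only where it is the sole content of <small>."""
    return SMALL_NBSP.sub(r"\1-\2", html)


sample = "<p>a&nbsp;b</p><td><small><small>&nbsp;</small></small></td>"
result = dash_empty_small(sample)
```

The &nbsp; inside the <p> is left alone, while the one inside <small> becomes -. For anything less regular than this table, doing the replacement on the parsed tree is likely the safer route.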

