Python findAll not working on beautifulsoup 3
I am trying to parse an HTML file and write the results to a CSV file. The HTML file is:
<table BORDER='1' CELLSPACING='0' CELLPADDING='0'>
<tr>
<td><small>15</small></td >
<td><small><small>Cat</small></small></td>
</tr>
<tr>
<td><small><small>16</small></small></td>
<td><small><small>&nbsp;</small></small></td>
</tr>
<tr>
<td><small>17</small></td >
<td><small><small>Dog</small></small></td>
</tr>
</table>
My code is:
import csv
from BeautifulSoup import BeautifulSoup as bs

soup = bs(open("Animals.html"))
for i in soup.findAll('small'):
    if "&nbsp;" in i.text:
        i.string = '-'
print soup

f = csv.writer(open("Animals.csv", "a"))  # Open the output file for writing before the loop
trs = soup.findAll('tr')
for tr in trs:
    tds = tr.findAll("td")
    try:  # We use "try" because the table is not well formatted. This lets the program continue after an error.
        id = str(tds[0].get_text())  # This isolates the item by its column in the table and converts it to a string.
        animal = str(tds[1].get_text())
    except:
        print "Bad tr string"
        continue  # Move on to the next row after an error
    f.writerow([id, animal])
When I print the contents of soup after replacing the &nbsp;, I get:
<table BORDER='1' CELLSPACING='0' CELLPADDING='0'>
<tr>
<td><small>15</small></td >
<td><small></small><small>Cat</small></td >
</tr>
<tr>
<td><small><small>16</small></small></td >
<td><small></small><small>-</small></td >
</tr>
<tr>
<td><small>17</small></td >
<td><small></small><small>Dog</small></td >
</tr>
</table>
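But when I look at the .csv file, it is empty. However, if I change the code to use BeautifulSoup 4, I can't replace the &nbsp;, but the results are saved to the .csv file. The other code I use is: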
import csv
from bs4 import BeautifulSoup as bs

soup = bs(open("Animals.html"))
f = csv.writer(open("Animals.csv", "w"))  # Open the output file for writing before the loop
trs = soup.find_all('tr')
for tr in trs:
    tds = tr.find_all("td")
    try:  # We use "try" because the table is not well formatted. This lets the program continue after an error.
        id = str(tds[0].get_text())  # This isolates the item by its column in the table and converts it to a string.
        animal = str(tds[1].get_text())
    except:
        print "Bad tr string"
        continue  # Move on to the next row after an error
    f.writerow([id, animal])
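The reason I don't just use that version is that I want the &nbsp; replaced with -, and I can't get find_all() to do that with BeautifulSoup 4. What is causing the information not to be saved to the CSV file, and how can I fix it (and/or get it working with BeautifulSoup 4)?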
When it builds the soup, BeautifulSoup 4 converts '&nbsp;' to the Unicode character \xa0. If you search for and replace that Unicode character, it works:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html)
for i in soup.find_all('small'):
    i.string.replace_with(i.string.replace(u'\xa0', '-'))
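The syntax is a bit verbose. That is because i.string is not a plain string but a bs4.element.NavigableString. You can't edit these with a simple i.string.replace(...), which only returns a new string and leaves the tree unchanged; instead, you have to call BeautifulSoup's own replace_with method.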
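replace_with takes the replacement as its argument, so we have to build the new version of the string and pass it in. For that we can use Python's built-in string replace method, which finds the u'\xa0' character and replaces it with '-'.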
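To tie this back to the original goal of writing the table to a CSV file, here is a minimal end-to-end sketch. It assumes Python 3, BeautifulSoup 4, and the standard-library "html.parser" backend; the helper names replace_nbsp, table_rows and rows_to_csv are made up for illustration and are not part of BeautifulSoup. It writes to an in-memory buffer rather than Animals.csv so it can be run as is.

```python
import csv
import io

from bs4 import BeautifulSoup

HTML = """
<table BORDER='1' CELLSPACING='0' CELLPADDING='0'>
<tr>
<td><small>15</small></td>
<td><small><small>Cat</small></small></td>
</tr>
<tr>
<td><small><small>16</small></small></td>
<td><small><small>&nbsp;</small></small></td>
</tr>
<tr>
<td><small>17</small></td>
<td><small><small>Dog</small></small></td>
</tr>
</table>
"""


def replace_nbsp(soup, filler="-"):
    """Replace non-breaking spaces inside <small> tags, editing the tree in place."""
    for small in soup.find_all("small"):
        text = small.string  # None unless the tag wraps a single string
        if text is not None and "\xa0" in text:
            # str.replace builds the new text; replace_with swaps it into the tree
            text.replace_with(text.replace("\xa0", filler))


def table_rows(html):
    """Return [id, animal] pairs for every row that has at least two cells."""
    soup = BeautifulSoup(html, "html.parser")
    replace_nbsp(soup)
    rows = []
    for tr in soup.find_all("tr"):
        tds = tr.find_all("td")
        if len(tds) >= 2:  # skip malformed rows explicitly instead of a bare except
            rows.append([tds[0].get_text(), tds[1].get_text()])
    return rows


def rows_to_csv(rows):
    """Serialize rows to CSV text (hypothetical helper for this sketch)."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


rows = table_rows(HTML)
print(rows_to_csv(rows))
```

To write a real file instead, the same rows could be passed to csv.writer inside a with open("Animals.csv", "w", newline="") block, which also makes sure the file is closed and flushed.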
Alternatively, you can use a regular expression on the raw HTML, if you just need to replace all instances of &nbsp; with -:
import re
newhtml = re.sub(r'&nbsp;', '-', html)
You could customize this further so that it only affects <small> tags. Let me know if you would like that added to the answer.
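One way such a restriction might look is sketched below. This is an illustration, not part of the original answer: it assumes Python 3, and it assumes the placeholder is the only content of the innermost <small> element, as in the sample table. The names SMALL_NBSP and dash_empty_small are hypothetical.

```python
import re

# Match a <small> element whose entire content is one &nbsp; entity
# (or an already-decoded \xa0 character). Assumption: no attributes on the tag.
SMALL_NBSP = re.compile(r"(<small>)(?:&nbsp;|\xa0)(</small>)", re.IGNORECASE)


def dash_empty_small(html, filler="-"):
    """Replace the placeholder inside <small> tags only, leaving other &nbsp; alone."""
    return SMALL_NBSP.sub(lambda m: m.group(1) + filler + m.group(2), html)


sample = "<td><small><small>&nbsp;</small></small></td><p>&nbsp;</p>"
print(dash_empty_small(sample))
```

A regular expression like this is brittle against attributes or extra whitespace inside the tag, so the parser-based approach is usually the safer choice for messy HTML.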