Python findAll not working on BeautifulSoup 3

I am trying to parse an HTML file and write the results to a CSV file. The HTML file is:

<table BORDER='1' CELLSPACING='0' CELLPADDING='0'>
<tr>
    <td><small>15</small></td >
    <td><small><small>Cat</small></small></td>
</tr>
<tr>
    <td><small><small>16</small></small></td>       
    <td><small><small>&nbsp;</small></small></td>       
</tr>
<tr>
    <td><small>17</small></td >
    <td><small><small>Dog</small></small></td>
</tr>
</table>

The code I have at the moment is:

import csv
from BeautifulSoup import BeautifulSoup as bs

soup = bs(open("Animals.html"))
for i in soup.findAll('small'):
    if "&nbsp;" in i.text:
        i.string = '-'

print soup

f = csv.writer(open("Animals.csv", "a"))   # Open the output file for writing before the loop

trs = soup.findAll('tr')

for tr in trs:

    tds = tr.findAll("td")

    try:  # We use "try" because the table is not well formatted. This lets the program continue after encountering an error.
        id = str(tds[0].get_text())  # This isolates the item by its column in the table and converts it into a string.
        animal = str(tds[1].get_text())

    except:
        print "Bad tr string"
        continue  # Move on to the next row after encountering an error

    f.writerow([id, animal])

When I print out the contents of soup after replacing the &nbsp;, I get:

<table BORDER='1' CELLSPACING='0' CELLPADDING='0'>
<tr>
    <td><small>15</small></td >
    <td><small></small><small>Cat</small></td > 
</tr>
<tr>
    <td><small><small>16</small></small></td >      
    <td><small></small><small>-</small></td >       
</tr>
<tr>
    <td><small>17</small></td >
    <td><small></small><small>Dog</small></td >
</tr>
</table>

But when I look at the .csv file, it is empty. However, if I change the code to use BeautifulSoup 4, then I can't replace the &nbsp;, but the results are saved to the .csv file. The other code that I use is:

import csv
from bs4 import BeautifulSoup as bs

soup = bs(open("Animals.html"))

f = csv.writer(open("Animals.csv", "w"))   # Open the output file for writing before the loop

trs = soup.find_all('tr')

for tr in trs:

    tds = tr.find_all("td")

    try:  # We use "try" because the table is not well formatted. This lets the program continue after encountering an error.
        id = str(tds[0].get_text())  # This isolates the item by its column in the table and converts it into a string.
        animal = str(tds[1].get_text())

    except:
        print "Bad tr string"
        continue  # Move on to the next row after encountering an error

    f.writerow([id, animal])

That version won't do because I want each &nbsp; to be replaced with -, and I haven't been able to get that replacement (the find_all() loop) to work with BeautifulSoup 4. What is stopping the information from being saved to the CSV file in the BeautifulSoup 3 version, and how can I fix it (and/or get the replacement working with BeautifulSoup 4)?

BeautifulSoup 4 converts each &nbsp; entity into the Unicode character \xa0 (a non-breaking space) when the soup is constructed, so a search for the literal text "&nbsp;" finds nothing. If you search and replace on that Unicode character instead, it works:

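from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")  # html holds the page source as a string

for i in soup.find_all('small'):
    if i.string is not None:  # .string is None when a tag has no single string inside it
        i.string.replace_with(i.string.replace(u'\xa0', '-'))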

The syntax there is a little verbose. This is because i.string is not a plain string, but a bs4.element.NavigableString that lives inside the parse tree. You can't edit these in place with a straightforward i.string.replace(...), which only returns a new string and leaves the tree untouched; instead you must call BeautifulSoup's own replace_with method.
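To make the difference concrete, here is a minimal sketch, assuming Python 3 with the bs4 package installed. The one-tag document and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# A one-tag document, invented for this illustration.
soup = BeautifulSoup("<small>&nbsp;</small>", "html.parser")
tag = soup.small

# The entity is already decoded, so the literal text "&nbsp;" is not in the string.
has_entity_text = "&nbsp;" in tag.string
has_nbsp_char = "\xa0" in tag.string

# str.replace builds a new, detached string; the tree still holds the old one.
detached = tag.string.replace("\xa0", "-")
markup_before = str(soup)

# replace_with swaps the node inside the tree, so the markup changes.
tag.string.replace_with(detached)
markup_after = str(soup)
```

Comparing markup_before with markup_after shows that only the replace_with call changes what the soup prints.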

replace_with accepts just one argument, so we have to generate the new version of the string and pass it in. For this we can use Python's built-in replace method for strings to swap each u'\xa0' character for '-'.
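Putting the pieces together, the following is one possible end-to-end sketch, not a definitive implementation. It assumes Python 3 with bs4 installed, inlines the table from the question instead of reading Animals.html, and writes the CSV to an in-memory buffer. The function name table_to_rows is hypothetical.

```python
import csv
import io

from bs4 import BeautifulSoup

# Sample table copied from the question, inlined so the sketch is self-contained.
HTML = """
<table BORDER='1' CELLSPACING='0' CELLPADDING='0'>
<tr>
    <td><small>15</small></td >
    <td><small><small>Cat</small></small></td>
</tr>
<tr>
    <td><small><small>16</small></small></td>
    <td><small><small>&nbsp;</small></small></td>
</tr>
<tr>
    <td><small>17</small></td >
    <td><small><small>Dog</small></small></td>
</tr>
</table>
"""


def table_to_rows(html):
    """Return [id, animal] rows with non-breaking spaces replaced by '-'."""
    soup = BeautifulSoup(html, "html.parser")

    # Replace the decoded non-breaking space inside the tree itself.
    for small in soup.find_all("small"):
        if small.string is not None:
            small.string.replace_with(small.string.replace("\xa0", "-"))

    rows = []
    for tr in soup.find_all("tr"):
        tds = tr.find_all("td")
        if len(tds) < 2:
            continue  # skip malformed rows explicitly instead of a bare except
        rows.append([tds[0].get_text().strip(), tds[1].get_text().strip()])
    return rows


rows = table_to_rows(HTML)

# In-memory buffer for the sketch; for a real file, use
# open("Animals.csv", "w", newline="") instead.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()
```

Checking the number of cells explicitly, rather than wrapping the row in a bare except, keeps unrelated errors visible instead of silently skipping every row.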

Alternatively, you could just use regular expressions on the original HTML, if all you need is to replace every &nbsp; instance with a -:

import re

newhtml = re.sub(r'&nbsp;', '-', html)

You could customise this further so it only affects <small> tags. Let me know if you'd like this added to the answer.
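As a rough sketch of that customisation, the pattern below only touches a <small> element whose entire content is &nbsp;. This is an assumption about what "only affects <small> tags" should mean; the names SMALL_NBSP and replace_nbsp_in_small are hypothetical, and a regex like this will not cope with attributes on the tag or with mixed content.

```python
import re

# Matches <small>&nbsp;</small>, allowing whitespace around the entity.
SMALL_NBSP = re.compile(r"(<small>)\s*&nbsp;\s*(</small>)", re.IGNORECASE)


def replace_nbsp_in_small(html, placeholder="-"):
    """Replace &nbsp; with placeholder, but only directly inside <small>."""
    # A function replacement keeps the placeholder from being read as a
    # regex backreference.
    return SMALL_NBSP.sub(lambda m: m.group(1) + placeholder + m.group(2), html)


sample = "<td><small><small>&nbsp;</small></small></td><p>&nbsp;</p>"
result = replace_nbsp_in_small(sample)
```

In this sample the &nbsp; inside the <p> element is left alone, because the pattern requires the surrounding <small> tags.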
