BeautifulSoup: Stripping an html element from files in a directory and writing the contents to a file
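I'm trying to open all the html files in a directory (so far so good), find the footer element in each one (also fine), remove the footer (no dice), and then write the result back to the html file without the footer (also no dice).
Here's what I've got: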
from BeautifulSoup import BeautifulSoup
from HTMLParser import HTMLParser
from os import listdir
from os import chdir

def main():
    # move into the nohead directory
    chdir('nohead')
    # get a list of the files in nohead
    filenames = listdir('.')
    for files in filenames:
        soup = BeautifulSoup(open(files))
        bottom = soup.findAll("footer")
        nothing = ""
        bottom.replaceWith(nothing)
        # and then I'd like to save each separate html file with its <footer> removed

if __name__ == "__main__":
    main()
This gives me the following error:
AttributeError: 'list' object has no attribute 'replaceWith'
I have also tried:
for files in filenames:
    soup = BeautifulSoup(open(files, "w+"))
    bottom = soup.findAll("footer")
    decompose(bottom)
This gives me the following error:
NameError: global name 'decompose' is not defined
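I'd be happy with either a BeautifulSoup3 or a bs4 solution to this problem, especially if there's a way to save each html file as a separate file with its footer removed.
You need to change it to: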
for files in filenames:
    soup = BeautifulSoup(open(files))
    bottom = soup.findAll("footer")
    # findAll returns a list of tags, so remove each one individually
    for single_footer in bottom:
        single_footer.decompose()
    # Then save
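The "Then save" step is left open above. A minimal sketch of one way to do it, assuming bs4 with the built-in "html.parser", Python 3, and a separate output directory so the originals stay untouched. The helper names strip_footers and strip_directory are hypothetical, not part of BeautifulSoup:

```python
from pathlib import Path

from bs4 import BeautifulSoup


def strip_footers(html):
    """Return the markup with every <footer> element removed."""
    soup = BeautifulSoup(html, "html.parser")
    for footer in soup.find_all("footer"):
        footer.decompose()
    return str(soup)


def strip_directory(src_dir, dst_dir):
    """Write a footer-free copy of each .html file in src_dir into dst_dir."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for path in sorted(src.glob("*.html")):
        cleaned = strip_footers(path.read_text(encoding="utf-8"))
        out = dst / path.name  # same file name, different directory
        out.write_text(cleaned, encoding="utf-8")
        written.append(out)
    return written
```

For example, strip_directory("nohead", "nofooter") would leave nohead as it is and put the cleaned copies in nofooter. The UTF-8 encoding is an assumption about the input files.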
How about using os.walk to traverse a directory and strip the footer from every file in it, as follows:
from bs4 import BeautifulSoup as bs
import os

input_dir = r"C:\Users\User\Desktop\test"
for root, dirs, files in os.walk(input_dir):
    for single_file in files:
        # 'r+' opens the file for both reading and writing (Python 3)
        with open(os.path.join(root, single_file), 'r+', encoding='utf-8') as inpt:
            soup = bs(inpt.read(), 'lxml')
            footers = soup.findAll('footer')
            if footers:
                for footer in footers:
                    footer.decompose()
                inpt.seek(0)  # rewind
                inpt.write(str(soup))  # text-mode file, so write str rather than bytes
                inpt.truncate()  # drop what is left of the longer original
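One detail worth checking when a file is rewritten in place through an 'r+' handle: the markup gets shorter once the footer is gone, and writing from position 0 does not shrink the file by itself. The sketch below uses only the standard library to show the effect. The helper overwrite_in_place and its truncate flag are hypothetical names for illustration:

```python
import os
import tempfile


def overwrite_in_place(path, new_text, truncate=True):
    """Replace the contents of path through a single 'r+' handle."""
    with open(path, "r+", encoding="utf-8") as fh:
        fh.read()           # read the old contents, as the loop above does
        fh.seek(0)          # rewind to the start
        fh.write(new_text)  # overwrite from the beginning
        if truncate:
            fh.truncate()   # cut the file off at the current position


def demo(truncate):
    """Overwrite a longer file with shorter text and return what is on disk."""
    fd, path = tempfile.mkstemp(suffix=".html")
    os.close(fd)
    try:
        with open(path, "w", encoding="utf-8") as fh:
            fh.write("<p>old</p><footer>bye</footer>")
        overwrite_in_place(path, "<p>new</p>", truncate=truncate)
        with open(path, encoding="utf-8") as fh:
            return fh.read()
    finally:
        os.remove(path)
```

Here demo(False) still ends with the old footer markup, because only the first characters were overwritten, while demo(True) leaves just the new text.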
To remove a tag in BeautifulSoup, you should use decompose. In your case, it would be:
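import codecs

for files in filenames:
    soup = BeautifulSoup(open(files))
    # soup.footer is None when the file has no <footer>
    if soup.footer is not None:
        soup.footer.decompose()
    # note: "abc1.html" is overwritten on every iteration;
    # use a per-file name to keep each result
    f = codecs.open("abc1.html", mode="w", encoding="utf-8")
    f.write(soup.prettify())
    f.close()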
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#decompose
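A short sketch of how soup.footer behaves, assuming bs4 and the built-in "html.parser": the attribute returns only the first matching tag, or None when there is no match, so calling decompose on it removes at most one footer. The function name remove_first_footer is hypothetical:

```python
from bs4 import BeautifulSoup


def remove_first_footer(html):
    """Remove only the first <footer>, if any, and return the markup."""
    soup = BeautifulSoup(html, "html.parser")
    first = soup.footer  # first <footer> tag, or None if there is none
    if first is not None:
        first.decompose()
    return str(soup)
```

If a page may contain several footers, looping over find_all("footer") is probably the safer choice.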