
BeautifulSoup: Stripping an html element from files in a directory and writing the contents to a file

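I'm trying to open all the html files in a directory (so far so good), find the footer element in each one (also good), remove the footer (no dice), and then write the result back to an html file without the footer (also no dice).

Here's what I've got: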

from BeautifulSoup import BeautifulSoup
from HTMLParser import HTMLParser
from os import listdir
from os import chdir

def main():
    # move into the nohead directory
    chdir('nohead')

    # get a list of the files in nohead
    filenames=listdir('.')


    for files in filenames:              
        soup = BeautifulSoup (open(files))
        bottom = soup.findAll("footer")  
            nothing = ""
            bottom.replaceWith(nothing)
    # and then I'd like to save each separate html file with its <footer> removed

if __name__ == "__main__":
  main()                                                               

This gives me the following error:

    AttributeError: 'list' object has no attribute 'replaceWith'

I have also tried:

  for files in filenames:                       
      soup = BeautifulSoup (open(files, "w+"))  
      bottom = soup.findAll("footer")           
      decompose(bottom) 

This gives me the following error:

    NameError: global name 'decompose' is not defined 

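I'm happy with either a BeautifulSoup3 or a bs4 solution to this problem, especially if there's a way to save each html file as a separate file with its footer removed.

You need to change it to: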

for files in filenames:              
    soup = BeautifulSoup (open(files))
    bottom = soup.findAll("footer")
    for single_footer in bottom:
        single_footer.decompose()
        #Then save
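
The save step left as a comment above could be sketched roughly as follows. This is only a minimal sketch assuming bs4 and Python 3: the helper names `strip_footers` and `strip_footers_in_dir`, the built-in `html.parser` parser, the `.html` filter, and the UTF-8 encoding are all assumptions, not part of the original answer.

```python
import os
from bs4 import BeautifulSoup


def strip_footers(html):
    """Return the given HTML text with every <footer> element removed."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all returns a list-like result set, so decompose each tag in turn
    for footer in soup.find_all("footer"):
        footer.decompose()
    return str(soup)


def strip_footers_in_dir(directory):
    """Rewrite every .html file in `directory` in place, minus its footers."""
    for name in os.listdir(directory):
        if not name.endswith(".html"):
            continue  # assumption: only .html files should be touched
        path = os.path.join(directory, name)
        with open(path, encoding="utf-8") as f:
            cleaned = strip_footers(f.read())
        # Reopening in "w" mode truncates the file before the new text is written
        with open(path, "w", encoding="utf-8") as f:
            f.write(cleaned)
```

Calling `strip_footers_in_dir('nohead')` would then process the directory from the question. Reading and writing in two separate steps avoids leaving stale bytes at the end of a file that became shorter.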

How about using os.walk to traverse the directory and change the footer in every file, as follows:

from bs4 import BeautifulSoup as bs
import os

input_dir = r"C:\Users\User\Desktop\test"

for root, dirs, files in os.walk(input_dir):
    for single_file in files:
        with open(os.path.join(root, single_file), 'r+') as inpt:
            soup = bs(inpt.read(), 'lxml')  # the 'lxml' parser must be installed
            footers = soup.find_all('footer')
            if footers:
                for footer in footers:
                    footer.decompose()
                inpt.seek(0)  # rewind to the start of the file
                inpt.write(str(soup))
                inpt.truncate()  # drop what is left of the longer original

To remove a tag in BeautifulSoup, you should use decompose. In your case, it would be:

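import codecs
for files in filenames:
    soup = BeautifulSoup(open(files))
    if soup.footer is not None:  # skip files that have no <footer>
        soup.footer.decompose()
    # one output per input, so results do not overwrite each other;
    # the "nofooter_" prefix is only an example name
    f = codecs.open("nofooter_" + files, mode="w", encoding="utf-8")
    f.write(soup.prettify())
    f.close()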

http://www.crummy.com/software/BeautifulSoup/bs4/doc/#decompose
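
To illustrate the difference between `soup.footer` and `find_all`, here is a small sketch on an in-memory string. The sample HTML is made up and `html.parser` is an assumed parser choice. `soup.footer` refers only to the first `<footer>` in the document, so a page with several footers still needs a loop.

```python
from bs4 import BeautifulSoup

# Made-up sample document with two footers
html = "<body><p>content</p><footer>one</footer><footer>two</footer></body>"
soup = BeautifulSoup(html, "html.parser")

# soup.footer is the first <footer> only; removing it leaves the second one
soup.footer.decompose()
remaining = len(soup.find_all("footer"))

# decompose is a method of each tag, so call it on every item in the result
for tag in soup.find_all("footer"):
    tag.decompose()

result = str(soup)
```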

