BeautifulSoup: Strip an HTML element from files in a directory and write the contents to file

Question

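I'm trying to open all the HTML files in a directory (so far so good), find the footer element in each one (also fine), remove the footer (no dice), and then write the result back to the HTML file without the footer (also no dice).

Here's what I've got: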

from BeautifulSoup import BeautifulSoup
from HTMLParser import HTMLParser
from os import listdir
from os import chdir

def main():
    # move into the nohead directory
    chdir('nohead')

    # get a list of the files in nohead
    filenames=listdir('.')


    for files in filenames:              
        soup = BeautifulSoup (open(files))
        bottom = soup.findAll("footer")  
        nothing = ""
        bottom.replaceWith(nothing)
    # and then I'd like to save each separate html file with its <footer> removed

if __name__ == "__main__":
  main()

This gives me the following error:

    AttributeError: 'list' object has no attribute 'replaceWith'

I've also tried

  for files in filenames:                       
      soup = BeautifulSoup (open(files, "w+"))  
      bottom = soup.findAll("footer")           
      decompose(bottom)

which gives me the following error:

    NameError: global name 'decompose' is not defined

I'd be happy with either a BeautifulSoup 3 or a bs4 solution, especially if there is a way to save each HTML file as a separate file with its footer removed.

Answer 1

You need to change it to:

for files in filenames:              
    soup = BeautifulSoup (open(files))
    bottom = soup.findAll("footer")
    for single_footer in bottom:
        single_footer.decompose()
        #Then save
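
Both errors in the question come from the same misunderstanding: findAll (find_all in bs4) returns a list-like ResultSet rather than a single tag, and decompose is a method of each tag, not a standalone function. Note also that opening a file with "w+" truncates it before BeautifulSoup can read anything. The following is a minimal sketch using bs4; the inline HTML string and the built-in "html.parser" are assumptions made only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with two footers
html = "<html><body><p>content</p><footer>one</footer><footer>two</footer></body></html>"
soup = BeautifulSoup(html, "html.parser")

footers = soup.find_all("footer")  # a list-like ResultSet, not a single Tag
for footer in footers:
    footer.decompose()  # called on each individual Tag

result = str(soup)
print(result)
```

After the loop, the serialized document no longer contains any footer element, while the rest of the markup is untouched.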

As for using os.walk to traverse a directory and strip the footer from every file, it goes like this:

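from bs4 import BeautifulSoup as bs
import os

input_dir = r"C:\Users\User\Desktop\test"

# Written for Python 3; the 'lxml' parser must be installed separately
for root, dirs, files in os.walk(input_dir):
    for single_file in files:
        # 'r+' opens the file for reading and writing without truncating it
        with open(os.path.join(root, single_file), 'r+', encoding='utf-8') as inpt:
            soup = bs(inpt.read(), 'lxml')
            footers = soup.find_all('footer')
            if footers:
                for footer in footers:
                    footer.decompose()
                inpt.seek(0)  # rewind
                inpt.write(str(soup))
                inpt.truncate()  # the new content is shorter, so drop the leftover tail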
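
The question also asks about saving each file separately instead of overwriting the original. One possible sketch is below. The function name strip_footers, the output directory, and the choice of "html.parser" (to avoid the lxml dependency) are all assumptions for illustration, not part of the original answer.

```python
from pathlib import Path

from bs4 import BeautifulSoup


def strip_footers(src_dir, dst_dir):
    """Write a footer-free copy of every .html file in src_dir into dst_dir."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for path in sorted(src.glob("*.html")):
        soup = BeautifulSoup(path.read_text(encoding="utf-8"), "html.parser")
        for footer in soup.find_all("footer"):
            footer.decompose()
        out_path = dst / path.name  # same file name, different directory
        out_path.write_text(str(soup), encoding="utf-8")
        written.append(out_path)
    return written
```

With the directory from the question, a call might look like strip_footers("nohead", "nofooter"), where "nofooter" is a made-up output directory name. The originals stay untouched, so a bad run can simply be repeated.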

Answer 2

To remove a tag in BeautifulSoup, you should use decompose. In your case, that would be:

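import codecs
for files in filenames:
    soup = BeautifulSoup(open(files))
    # soup.footer is None when the file has no <footer>
    if soup.footer is not None:
        soup.footer.decompose()
    # one output file per input; the "nofooter_" prefix is only an example name
    f = codecs.open("nofooter_" + files, mode="w", encoding="utf-8")
    f.write(soup.prettify())
    f.close()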

http://www.crummy.com/software/BeautifulSoup/bs4/doc/#decompose
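
One caveat about the soup.footer shorthand: it behaves like soup.find("footer"), so it yields only the first footer and is None when there is none. A small sketch of both cases follows; the sample markup and the "html.parser" choice are assumptions for illustration.

```python
from bs4 import BeautifulSoup

with_footer = BeautifulSoup("<div><footer>a</footer><footer>b</footer></div>", "html.parser")
without_footer = BeautifulSoup("<div><p>no footer here</p></div>", "html.parser")

# soup.footer is the first <footer> only; the second one survives
with_footer.footer.decompose()
remaining = len(with_footer.find_all("footer"))

# soup.footer is None when no <footer> exists, so guard before decomposing
missing = without_footer.footer
if missing is not None:
    missing.decompose()
```

If a page may contain several footers, looping over find_all("footer") is the safer choice.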


Answer 1
0 2016-02-09 05:18:07

Answer 2
0 2016-02-09 05:29:44
