
Extracting only words from HTML pages

I am using Python 2.7 and have a folder of HTML pages from which I would like to extract only the words. My current process is to open each HTML file, run it through the Beautiful Soup library, get the text, and write it to a new file. The problem is that the output still contains JavaScript, CSS (body, colour, #000000, etc.), symbols (|, `, ~, [], etc.), and random numbers.

How do I get rid of the unwanted output and get text only?

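import os
from bs4 import BeautifulSoup

path = "FOLDER_PATH"  # placeholder: folder containing the HTML pages
files = os.listdir(path)  # list the inputs before raw.txt is created in the same folder
raw = open(os.path.join(path, "raw.txt"), "w")
for name in files:
    fname = os.path.join(path, name)
    try:
        with open(fname) as f:
            b = f.read()
        soup = BeautifulSoup(b, "html.parser")
        txt = soup.body.getText().encode("UTF-8")  # Python 2.7: encode before writing
        raw.write(txt)
    except (IOError, AttributeError):
        continue  # skip unreadable entries and pages without a <body>
raw.close()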

You could strip out the script and style tags before extracting the text:

import requests
from bs4 import BeautifulSoup

session = requests.session()

soup = BeautifulSoup(session.get('http://stackoverflow.com/questions/27684020/extracting-only-words-from-html-pages').text, 'html.parser')

# Strip out the script and style tags so their contents do not end up in the text.
for script in soup(["script", "style"]):
    script.extract()

print(soup.get_text())
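
Removing the script and style tags takes care of the JavaScript and CSS, but the symbols and numbers mentioned in the question are part of the visible text and will still be there. One possible follow-up, sketched below, is to run the extracted text through a regular expression that keeps only alphabetic tokens. The helper name `words_only` and the definition of a "word" (a run of ASCII letters, optionally with an internal apostrophe) are assumptions made for illustration, not part of the original answer.

```python
import re

# Assumption: a "word" is a run of ASCII letters, optionally containing
# internal apostrophes (e.g. "don't"). Widen the pattern for other alphabets.
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")


def words_only(text):
    """Return the alphabetic words in `text`, dropping symbols and numbers."""
    return WORD_RE.findall(text)


# Stand-in for the string returned by soup.get_text().
sample = "Price: 42 | don't [click] ~ #000000 here"
print(" ".join(words_only(sample)))
```

In the answer's code this would be applied as `words_only(soup.get_text())`. Note that a mixed token such as `abc123` is reduced to `abc` rather than dropped, so tighten the pattern if such tokens should be excluded entirely.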
