Function to open a file and extract text from HTML in Python

Question

I'm very new to Python and I'm trying to write a program that extracts the text inside HTML tags (without the tags themselves) and writes it to a separate text file for future analysis. I also referred to this and this. I was able to come up with the code below. But how can I write this as a separate function? Something like

def read_list('file1.txt')

and then do the same scraping? The reason I'm asking is that the output of this code (op1.txt) will be used for stemming and then for further calculations afterwards. The output of this code also isn't written line by line as intended. Thank you very much for any input!

f = open('file1.txt', 'r')
for line in f:
    url = line
    html = urlopen(url)
    bs = BeautifulSoup(html, "html.parser")
    content = bs.find_all(['title','h1', 'h2','h3','h4','h5','h6','p'])

    with open('op1.txt', 'w', encoding='utf-8') as file:
        file.write(f'{content}\n\n')
        file.close()

Answer 1

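Try it like this. The output file is opened once, outside the loop, so the text from each URL is added to op1.txt instead of overwriting what was written for the previous URL. Joining the extracted elements with '\n' puts each one on its own line.

from urllib.request import urlopen
from bs4 import BeautifulSoup

def read_list(fl):
    # Tags whose text should be kept: title, p, and h1..h6
    tags = ['title', 'p'] + [f'h{n}' for n in range(1, 7)]
    # Open the input list and the output file once, before looping over the URLs
    with open(fl, 'r') as f, open('op1.txt', 'w', encoding='utf-8') as out:
        for line in f:
            url = line.strip()
            if not url:
                continue  # skip blank lines in the URL list
            html = urlopen(url).read().decode('utf8')
            bs = BeautifulSoup(html, 'html.parser')
            # One extracted element per line, tags removed
            content = '\n'.join(x.text for x in bs.find_all(tags))
            out.write(f'{content}\n\n')

read_list('file1.txt')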
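
Since the output is meant to feed a stemming step, it can help to check the "one element per line" extraction on a fixed HTML string, without any network access. The sketch below is not the BeautifulSoup approach from the answer: it swaps in the standard library's html.parser so it runs with no third-party install. The names TagTextParser, extract_lines and WANTED_TAGS are made up for this illustration, and it assumes every wanted tag is explicitly closed (an unclosed <p> would swallow the text that follows it).

```python
from html.parser import HTMLParser

# Same tags the question collects: title, p, and h1..h6.
WANTED_TAGS = {"title", "p"} | {f"h{n}" for n in range(1, 7)}


class TagTextParser(HTMLParser):
    """Collects the text inside the wanted tags, one entry per element."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of wanted tags currently open
        self.parts = []   # text fragments of the element being read
        self.lines = []   # finished text, one string per element

    def handle_starttag(self, tag, attrs):
        if tag in WANTED_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in WANTED_TAGS and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # Collapse runs of whitespace so each element fits on one line.
                text = " ".join("".join(self.parts).split())
                if text:
                    self.lines.append(text)
                self.parts = []

    def handle_data(self, data):
        # Text outside the wanted tags (e.g. in a <div>) is ignored.
        if self.depth > 0:
            self.parts.append(data)


def extract_lines(html):
    """Return the text of each wanted element in document order."""
    parser = TagTextParser()
    parser.feed(html)
    parser.close()
    return parser.lines


sample = (
    "<html><head><title>Demo</title></head><body>"
    "<h1>Head  ing</h1><p>Some <b>bold</b> text</p><div>skip</div>"
    "</body></html>"
)
print("\n".join(extract_lines(sample)))
```

Each string returned by extract_lines can be written to the output file followed by '\n', which gives the line-by-line layout the question asks for.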

Answer 1 (accepted), 2020-11-26 09:12:47