Function to open a file and extract text from HTML in Python

I'm very new to Python and I'm trying to write a program that extracts the text inside HTML tags (without the tags themselves) and writes it to a separate text file for later analysis. I referred to this and this as well, and was able to come up with the code below. But how can I write this as a separate function? Something like

def read_list('file1.txt')

and then do the same scraping? The reason I'm asking is that the output of this code (op1.txt) will be used for stemming and then for further calculations afterwards. The output of this code also isn't written line by line as intended. Thank you very much for any input!

f = open('file1.txt', 'r')
for line in f:
    url = line
    html = urlopen(url)
    bs = BeautifulSoup(html, "html.parser")
    content = bs.find_all(['title','h1', 'h2','h3','h4','h5','h6','p'])

    with open('op1.txt', 'w', encoding='utf-8') as file:
        file.write(f'{content}\n\n')
        file.close()

Try it like this:

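from urllib.request import urlopen
from bs4 import BeautifulSoup

def read_list(fl):
    # Tags whose text we want: title, p, and h1..h6
    tags = ['title', 'p'] + [f'h{n}' for n in range(1, 7)]
    # Open the output file once, so each URL's text is added rather than overwritten
    with open(fl, 'r') as f, open('op1.txt', 'w', encoding='utf-8') as out:
        for line in f:
            url = line.strip()
            if not url:
                continue  # skip blank lines in the URL list
            html = urlopen(url).read().decode('utf8')
            bs = BeautifulSoup(html, 'html.parser')
            # .text drops the tags; one extracted element per line
            content = '\n'.join(x.text for x in bs.find_all(tags))
            out.write(f'{content}\n\n')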
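
Since op1.txt is going to be fed into stemming later, it may help to keep the "get text out of HTML" step separate from the downloading and file writing, so it can be checked without any network access. Below is a minimal sketch of that idea. Note that it swaps BeautifulSoup for the standard library's html.parser, purely so it runs with no extra install. The names WANTED, TagTextParser and extract_text are made up for this illustration and are not part of any library.

```python
from html.parser import HTMLParser

# Tags whose text we keep, mirroring the tag list used in the question
WANTED = {'title', 'p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'}


class TagTextParser(HTMLParser):
    """Collects the text found inside the wanted tags, one entry per tag."""

    def __init__(self):
        super().__init__()
        self.inside = False  # True while we are inside a wanted tag
        self.parts = []      # text fragments of the tag being read
        self.blocks = []     # finished text, one string per wanted tag

    def _flush(self):
        # Join the fragments and collapse runs of whitespace
        text = ' '.join(''.join(self.parts).split())
        if text:
            self.blocks.append(text)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in WANTED:
            self._flush()  # a new wanted tag implicitly ends the previous one
            self.inside = True

    def handle_endtag(self, tag):
        if tag in WANTED:
            self._flush()
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.parts.append(data)


def extract_text(html):
    """Return the text of the wanted tags, one tag per line."""
    parser = TagTextParser()
    parser.feed(html)
    parser.close()
    parser._flush()  # pick up text from a wanted tag that was never closed
    return '\n'.join(parser.blocks)


if __name__ == '__main__':
    sample = '<title>T</title><h1>Head</h1><div>skip</div><p>Hello <b>world</b></p>'
    print(extract_text(sample))
```

With something like this, the loop in read_list would only need to download each page and write extract_text(html) to the output file. This is a simplification: it does not try to handle wanted tags nested inside each other, and BeautifulSoup is likely to cope better with messy real-world pages.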

