Function to open a file and extract text from HTML in Python
I'm very new to Python and I'm trying to write a program that extracts the text inside HTML tags (without the tags themselves) and writes it to a separate text file for later analysis. I referred to this and this as well, and was able to come up with the code below. But how can I write this as a separate function, something like def read_list('file1.txt'), and then do the same scraping? The reason I'm asking is that the output of this code (op1.txt) will be used for stemming and then for further calculations afterwards. The output of this code also doesn't get written line by line as intended. Thank you very much for any input!
f = open('file1.txt', 'r')
for line in f:
    url = line
    html = urlopen(url)
    bs = BeautifulSoup(html, "html.parser")
    content = bs.find_all(['title','h1', 'h2','h3','h4','h5','h6','p'])
    with open('op1.txt', 'w', encoding='utf-8') as file:
        file.write(f'{content}\n\n')
        file.close()
Try it like this:
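from urllib.request import urlopen
from bs4 import BeautifulSoup

def read_list(fl):
    # Tags whose text is kept: title, p, and h1 through h6
    tags = ['title', 'p'] + [f'h{n}' for n in range(1, 7)]
    # Open the output file once; reopening it with 'w' for every URL
    # would overwrite the text written for the previous URL
    with open(fl, 'r') as f, open('op1.txt', 'w', encoding='utf-8') as file:
        for line in f:
            html = urlopen(line.strip()).read().decode("utf8")
            bs = BeautifulSoup(html, "html.parser")
            # .text drops the tags; join so each element lands on its own line
            content = '\n'.join(x.text for x in bs.find_all(tags))
            file.write(f'{content}\n\n')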
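Since the output is meant for stemming later, it may help to keep the parsing step separate from the downloading and file writing, so it can be checked on a small HTML string without any network access. The following is only a minimal sketch: the names extract_text and TAGS and the sample page are illustrative assumptions, not part of the original question or answer.

```python
from bs4 import BeautifulSoup

# Tags whose text should be kept (same set as in the question).
TAGS = ['title', 'p'] + [f'h{n}' for n in range(1, 7)]

def extract_text(html):
    """Return the text of each matching tag as a list of strings.

    Hypothetical helper, shown for illustration only.
    """
    bs = BeautifulSoup(html, "html.parser")
    # get_text() drops the markup; strip() trims surrounding whitespace
    texts = [tag.get_text().strip() for tag in bs.find_all(TAGS)]
    # Skip tags that contain no text at all
    return [t for t in texts if t]

# Made-up sample page, used only to show the behaviour.
sample = (
    "<html><head><title>Demo</title></head><body>"
    "<h1>Heading</h1><p>First <b>bold</b> paragraph.</p>"
    "<div>ignored</div></body></html>"
)

lines = extract_text(sample)
print('\n'.join(lines))  # one extracted element per line
```

Writing f'{content}' where content is the raw result of find_all puts the whole list, tags included, on a single line, which is likely why the original output was not line by line. Joining the extracted strings with '\n' gives one element per line, and the returned list can also be passed straight to a stemmer without going through a file first.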