Scrape websites and export only the visible text to a text document (Python 3, Beautiful Soup)
Question: I am trying to scrape multiple websites with BeautifulSoup, keep only the visible text, and export all of the data into a single text file. The file will be used as a corpus for finding collocations with NLTK. So far I am using something like the following, but any help would be greatly appreciated!
import requests
from bs4 import BeautifulSoup
from collections import Counter

urls = ["http://en.wikipedia.org/wiki/Wolfgang_Amadeus_Mozart", "http://en.wikipedia.org/wiki/Golf"]

for url in urls:
    website = requests.get(url)
    soup = BeautifulSoup(website.content)
    text = [''.join(s.findAll(text=True)) for s in soup.findAll('p')]

with open('thisisanew.txt', 'w') as file:
    for item in text:
        print(file, item)
Unfortunately, there are two problems with this: when I try to export the data to a .txt file, the file comes out completely blank. Any ideas?
Answer: print(file, item) should be print(item, file=file). But don't name the variable file, since that shadows the built-in file in Python 2 and is a confusing name in any case. This is better:
with open('thisisanew.txt', 'w') as outfile:
    for item in text:
        print(item, file=outfile)
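The difference is easy to see with an in-memory file; io.StringIO stands in here for the real output file:

```python
import io

buf = io.StringIO()
# print(buf, "hello") would send the file object's repr plus "hello" to stdout
# and write nothing into buf -- which is why the output file was blank.
print("hello", file=buf)  # the file= keyword directs output into buf instead
print(buf.getvalue())  # → hello
```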
To fix the next problem, overwriting the data from the first URL, move the file-writing code inside the loop and open the file once, before entering the loop:
import requests
from bs4 import BeautifulSoup
from collections import Counter

urls = ["http://en.wikipedia.org/wiki/Wolfgang_Amadeus_Mozart", "http://en.wikipedia.org/wiki/Golf"]

with open('thisisanew.txt', 'w', encoding='utf-8') as outfile:
    for url in urls:
        website = requests.get(url)
        soup = BeautifulSoup(website.content, 'html.parser')  # name the parser explicitly
        text = [''.join(s.findAll(text=True)) for s in soup.findAll('p')]
        for item in text:
            print(item, file=outfile)
There is one more problem: you are collecting the text from the last URL only, because you reassign the text variable on every iteration. Define text as an empty list before the loop and add the new data to it:
text = []
for url in urls:
    website = requests.get(url)
    soup = BeautifulSoup(website.content, 'html.parser')
    text += [''.join(s.findAll(text=True)) for s in soup.findAll('p')]
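To see what that list comprehension produces without fetching anything over the network, here is a small sketch on a literal HTML string (the snippet itself is made up for illustration):

```python
from bs4 import BeautifulSoup

# A tiny stand-in for a downloaded page.
html = "<html><body><p>First <b>paragraph</b>.</p><p>Second.</p></body></html>"

soup = BeautifulSoup(html, "html.parser")
# Join the text nodes inside each <p>, exactly as in the scraper above.
text = ["".join(p.findAll(text=True)) for p in soup.findAll("p")]
print(text)  # → ['First paragraph.', 'Second.']
```

Because only <p> elements are visited, script and style content elsewhere on the page never reaches the corpus, which is what keeps the output limited to visible text.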