
Scrape websites and export only the visible text to a text document Python 3 (Beautiful Soup)

Question: I am trying to scrape multiple websites with BeautifulSoup, keep only the visible text from each, and then export all of the data to a single text file.

The file will serve as a corpus for finding collocations with NLTK. So far I am working with something like the code below, but any help would be appreciated!

import requests
from bs4 import BeautifulSoup
from collections import Counter
urls = ["http://en.wikipedia.org/wiki/Wolfgang_Amadeus_Mozart","http://en.wikipedia.org/wiki/Golf"]
for url in urls:
    website = requests.get(url)
    soup = BeautifulSoup(website.content)
    text = [''.join(s.findAll(text=True)) for s in soup.findAll('p')]
with open('thisisanew.txt','w') as file:
    for item in text:
        print(file, item)

Unfortunately, there are two problems with this: when I try to export to a .txt file, it comes out completely blank.

Any ideas?

print(file, item) should be print(item, file=file)

But don't name your file object file, because that shadows the built-in file; this is better:

with open('thisisanew.txt','w') as outfile:
    for item in text:
        print(item, file=outfile)
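As a quick illustration of how the file= keyword redirects print()'s output (a minimal stdlib-only sketch using io.StringIO as a stand-in for an open file; not part of the original answer):

```python
import io

# print() writes to sys.stdout by default; file= redirects it to anything
# with a .write() method, such as an open file or a StringIO buffer.
buffer = io.StringIO()
print("first line", file=buffer)
print("second line", file=buffer)

# Each print() call appends its arguments plus a trailing newline.
print(buffer.getvalue())
```

With the original `print(file, item)`, the string representation of the file object itself is printed to stdout instead, which is why nothing ends up in the .txt file.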

To fix the next problem, overwriting the data from the first URL, you can move the file-writing code into your loop and open the file once before entering it:

import requests
from bs4 import BeautifulSoup
from collections import Counter
urls = ["http://en.wikipedia.org/wiki/Wolfgang_Amadeus_Mozart","http://en.wikipedia.org/wiki/Golf"]

with open('thisisanew.txt', 'w', encoding='utf-8') as outfile:
    for url in urls:
        website = requests.get(url)
        soup = BeautifulSoup(website.content)
        text = [''.join(s.findAll(text=True)) for s in soup.findAll('p')]
        for item in text:
            print(item, file=outfile)
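The ''.join(s.findAll(text=True)) step is what keeps only the visible text inside each paragraph, discarding nested markup. A stdlib-only sketch of the same idea, using html.parser on a hardcoded snippet rather than the answer's BeautifulSoup approach:

```python
from html.parser import HTMLParser

class ParagraphText(HTMLParser):
    """Collect the text content of every <p> element, ignoring tags."""
    def __init__(self):
        super().__init__()
        self.in_p = False
        self.chunks = []      # text pieces of the current paragraph
        self.paragraphs = []  # one joined string per <p>

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.in_p = True
            self.chunks = []

    def handle_endtag(self, tag):
        if tag == 'p' and self.in_p:
            self.in_p = False
            self.paragraphs.append(''.join(self.chunks))

    def handle_data(self, data):
        # Text nodes inside nested tags (like <b>) are still visible text.
        if self.in_p:
            self.chunks.append(data)

html = "<p>Mozart was a <b>prolific</b> composer.</p><p>Golf is a sport.</p>"
parser = ParagraphText()
parser.feed(html)
print(parser.paragraphs)
```

BeautifulSoup handles malformed real-world HTML far more robustly, so this is only meant to show what "visible text" filtering does.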

There is yet another problem: you are only collecting the text from the last URL, because you reassign the text variable over and over again.

Define text as an empty list before the loop and add the new data to it inside:

text = []
for url in urls:
    website = requests.get(url)
    soup = BeautifulSoup(website.content)
    text += [''.join(s.findAll(text=True)) for s in soup.findAll('p')]
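Once the corpus file is written, collocation finding is where NLTK comes in. The question's unused Counter import suggests a simpler first pass, though: counting bigram frequencies with collections.Counter. A hedged sketch, assuming a toy sample string in place of the real scraped corpus:

```python
from collections import Counter

# Toy stand-in for the scraped corpus; the real text would come from
# reading 'thisisanew.txt' back in.
corpus = "the quick brown fox jumps over the quick brown dog"
words = corpus.split()

# Pair each word with its successor and count how often each pair occurs;
# frequently repeated pairs are rough collocation candidates.
bigrams = Counter(zip(words, words[1:]))
print(bigrams.most_common(2))
```

NLTK's collocation finders go further by ranking pairs with association measures (so that merely common words don't dominate), but raw bigram counts are a reasonable sanity check on the corpus.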

