Scrape websites and export only the visible text to a text document Python 3 (Beautiful Soup)

Problem: I am trying to scrape multiple websites with BeautifulSoup for only the visible text, and then export all of the data to a single text file.

This file will be used as a corpus for finding collocations using NLTK; a sketch of that step appears at the end of the answer below. I'm working with something like this so far, but any help would be much appreciated!

import requests
from bs4 import BeautifulSoup
from collections import Counter
urls = ["http://en.wikipedia.org/wiki/Wolfgang_Amadeus_Mozart","http://en.wikipedia.org/wiki/Golf"]
for url in urls:
    website = requests.get(url)
    soup = BeautifulSoup(website.content)
    text = [''.join(s.findAll(text=True)) for s in soup.findAll('p')]
with open('thisisanew.txt','w') as file:
    for item in text:
        print(file, item)

Unfortunately, there are two issues with this: when I try to export the data to a .txt file, it is completely blank.

Any ideas?

print(file, item) should be print(item, file=file).

But don't name your file object file: that shadows the file builtin in Python 2, and it is a confusingly generic name in Python 3 as well. Something like this is better:

with open('thisisanew.txt','w') as outfile:
    for item in text:
        print(item, file=outfile)
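
To see why the original call left the file blank: with print(file, item), the file object is just the first thing printed to stdout, so nothing is written to disk. A minimal demonstration (demo.txt is a hypothetical filename):

with open('demo.txt', 'w') as f:
    print(f, 'hello')       # prints "<_io.TextIOWrapper ...> hello" to stdout; demo.txt gets nothing
    print('hello', file=f)  # writes "hello" plus a newline into demo.txt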

To solve the next problem, the data from the first URL being overwritten, you can move the file-writing code into your loop and open the file once before entering it. Opening with encoding='utf-8' also sidesteps encoding errors, since pages like the Mozart article contain non-ASCII characters:

import requests
from bs4 import BeautifulSoup
from collections import Counter
urls = ["http://en.wikipedia.org/wiki/Wolfgang_Amadeus_Mozart","http://en.wikipedia.org/wiki/Golf"]

with open('thisisanew.txt', 'w', encoding='utf-8') as outfile:
    for url in urls:
        website = requests.get(url)
        soup = BeautifulSoup(website.content, 'html.parser')  # an explicit parser avoids a bs4 warning
        text = [''.join(s.findAll(text=True)) for s in soup.findAll('p')]
        for item in text:
            print(item, file=outfile)

There is another problem: you are collecting the text only from the last URL, because the text variable is reassigned on every iteration.

Define text as an empty list before the loop and add the new data to it inside:

text = []
for url in urls:
    website = requests.get(url)
    soup = BeautifulSoup(website.content, 'html.parser')
    text += [''.join(s.findAll(text=True)) for s in soup.findAll('p')]  # accumulate paragraphs from every page
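
With the corpus written, the collocation step mentioned in the question could look something like this. This is a minimal sketch, not part of the original answer: it assumes NLTK is installed and the punkt tokenizer data has been downloaded, and the frequency filter of 3 is an arbitrary choice.

import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# nltk.download('punkt')  # may be needed once before word_tokenize will run

with open('thisisanew.txt', encoding='utf-8') as f:
    words = nltk.word_tokenize(f.read().lower())

finder = BigramCollocationFinder.from_words(words)
finder.apply_freq_filter(3)  # ignore bigrams seen fewer than 3 times
print(finder.nbest(BigramAssocMeasures().pmi, 10))  # the 10 highest-PMI bigrams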
