Scraping websites and exporting only the visible text to a text document in Python 3 (Beautiful Soup)

Question

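Question: I am trying to use Beautiful Soup to scrape several websites, keep only the visible text from each, and then export all of the data to a single text file.

The file will serve as a corpus for finding collocations with NLTK. So far I am using something like the following, but any help would be much appreciated!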

import requests
from bs4 import BeautifulSoup
from collections import Counter
urls = ["http://en.wikipedia.org/wiki/Wolfgang_Amadeus_Mozart","http://en.wikipedia.org/wiki/Golf"]
for url in urls:
    website = requests.get(url)
    soup = BeautifulSoup(website.content)
    text = [''.join(s.findAll(text=True))for s in soup.findAll('p')]
with open('thisisanew.txt','w') as file:
    for item in text:
        print(file, item)

Unfortunately, there are two issues with this: when I try to export the output to a .txt file, the file comes out completely blank.

Any ideas?

Answer 1

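print(file, item) should be print(item, file=file). The first form prints the file object and the item to standard output and writes nothing to the file.

However, don't name your file object file. That name shadowed the built-in file type in Python 2, and although Python 3 no longer has that built-in, a distinct name such as outfile is clearer: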

with open('thisisanew.txt','w') as outfile:
    for item in text:
        print(item, file=outfile)
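
To see why the original file came out blank, here is a small sketch that uses io.StringIO objects as stand-ins for the output file and the console. The stand-ins and the sample strings are assumptions made for illustration, not part of the original code:

```python
import io
from contextlib import redirect_stdout

items = ["first paragraph", "second paragraph"]  # made-up sample data

# Wrong: both arguments are treated as values to print on standard output.
wrong_target = io.StringIO()   # stand-in for the output file
console = io.StringIO()        # captures what would appear on screen
with redirect_stdout(console):
    for item in items:
        print(wrong_target, item)

# Right: the file= keyword tells print where to write.
right_target = io.StringIO()
for item in items:
    print(item, file=right_target)

wrong_written = wrong_target.getvalue()   # nothing was written here
right_written = right_target.getvalue()   # one item per line
```

In the first loop the items end up on the captured console and the stand-in file stays empty, which matches the blank .txt file described in the question.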

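To fix the next problem, where the data from the first URL gets overwritten, open the file once before entering the loop and move the file-writing code inside the loop: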

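import requests
from bs4 import BeautifulSoup
from collections import Counter

urls = [
    "http://en.wikipedia.org/wiki/Wolfgang_Amadeus_Mozart",
    "http://en.wikipedia.org/wiki/Golf",
]

# Open the output file once so the text from every URL lands in the same file.
with open('thisisanew.txt', 'w', encoding='utf-8') as outfile:
    for url in urls:
        website = requests.get(url)
        # Name the parser explicitly so bs4 does not have to guess one.
        soup = BeautifulSoup(website.content, 'html.parser')
        # One string per <p> element, made by joining all of its text nodes.
        text = [''.join(s.findAll(text=True)) for s in soup.findAll('p')]
        for item in text:
            print(item, file=outfile)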
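
The code above keeps only the text found inside <p> elements. If the goal is all visible text on a page, one common approach is to collect every text node and skip those whose direct parent is a non-rendered tag, as well as HTML comments. The following is a minimal sketch under that assumption. The names is_visible and visible_text, the tag list, and the sample HTML are illustrative and not part of the original answer, and the check only looks at the direct parent of each text node:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Parents whose text is not shown on the page (an illustrative, non-exhaustive list).
HIDDEN_PARENTS = {"style", "script", "head", "title", "meta", "[document]"}

def is_visible(node):
    """Return True for text nodes that a reader would see on the page."""
    if node.parent.name in HIDDEN_PARENTS:
        return False
    if isinstance(node, Comment):
        return False
    return True

def visible_text(html):
    """Return the visible text of an HTML document as one whitespace-normalized string."""
    soup = BeautifulSoup(html, "html.parser")
    nodes = soup.find_all(string=True)          # every text node in the tree
    kept = [str(node) for node in nodes if is_visible(node)]
    return " ".join(" ".join(kept).split())     # collapse runs of whitespace

sample = """<html><head><title>Demo</title><style>p {color: red}</style></head>
<body><p>Hello <b>world</b>.</p><!-- hidden note --><script>var x = 1;</script></body></html>"""
result = visible_text(sample)
```

In the loop above, visible_text(website.content) could take the place of the list comprehension, with the returned string written to outfile in the same way.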

Answer 2

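There is another problem: you only collect the text from the last URL, because the text variable is reassigned on every pass through the loop.

Define text as an empty list before the loop and add the new data to it inside the loop: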

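text = []
for url in urls:
    website = requests.get(url)
    # Name the parser explicitly so bs4 does not have to guess one.
    soup = BeautifulSoup(website.content, 'html.parser')
    # Extend the list instead of replacing it, so earlier URLs are kept.
    text += [''.join(s.findAll(text=True)) for s in soup.findAll('p')]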
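
The difference between reassigning and extending can be checked without any network access. In this sketch the pages dictionary is made-up stand-in data for the paragraphs extracted from each URL:

```python
# Made-up stand-in for the paragraphs extracted from each URL.
pages = {
    "url_a": ["a1", "a2"],
    "url_b": ["b1"],
}

# Reassigning inside the loop keeps only the last page's paragraphs.
for url in pages:
    text = list(pages[url])
overwritten = text

# Extending a list created before the loop keeps every page's paragraphs.
text = []
for url in pages:
    text += pages[url]
accumulated = text
```

After the first loop only the paragraphs from the last page remain, while the second loop ends with the paragraphs from both pages in order.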

Answer 1: score 3, accepted, 2014-09-02 02:42:00

Answer 2: score 1, 2014-09-02 02:45:16
