Scraping text from multiple web pages in Python

Question

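I've been tasked with scraping all the text off any web page that a certain client of ours hosts. I've managed to write a script that scrapes the text from a single web page, and you can manually replace the URL in the code each time you want to scrape a different page. But obviously this is very inefficient. Ideally, I could point Python at some list containing all the URLs I need, and it would iterate through the list and write all the scraped text to a single CSV. I tried to write a "test" version of this by creating a list of 2 URLs and getting my code to scrape both of them. However, as you can see, my code only keeps the most recent URL in the list and does not hold on to the first page it scraped. I think this is due to a deficiency in my write step, since it always writes over itself. Is there a way to have everything I scraped held somewhere until the loop has gone through the entire list, and then print everything?

Feel free to totally dismantle my code. I know nothing about programming languages. I just keep getting assigned these tasks and use Google to do my best.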

import urllib.request
import re
from bs4 import BeautifulSoup

data_file_name = 'C:\\Users\\confusedanalyst\\Desktop\\python_test.csv'
urlTable = ['url1','url2']

def extractText(string):
    page = urllib.request.urlopen(string)
    soup = BeautifulSoup(page, 'html.parser')

##Extracts all paragraph and header variables from URL as GroupObjects
    text = soup.find_all("p")
    headers1 = soup.find_all("h1")
    headers2 = soup.find_all("h2")
    headers3 = soup.find_all("h3")

##Forces GroupObjects into str
    text = str(text)
    headers1 = str(headers1)
    headers2 = str(headers2)
    headers3 = str(headers3)

##Strips HTML tags and brackets from extracted strings
    text = text.strip('[')
    text = text.strip(']')
    text = re.sub('<[^<]+?>', '', text)

    headers1 = headers1.strip('[')
    headers1 = headers1.strip(']')
    headers1 = re.sub('<[^<]+?>', '', headers1)

    headers2 = headers2.strip('[')
    headers2 = headers2.strip(']')
    headers2 = re.sub('<[^<]+?>', '', headers2)

    headers3 = headers3.strip('[')
    headers3 = headers3.strip(']')
    headers3 = re.sub('<[^<]+?>', '', headers3)

    print_to_file = open (data_file_name, 'w' , encoding = 'utf')
    print_to_file.write(text + headers1 + headers2 + headers3)
    print_to_file.close()


for i in urlTable:
    extractText (i)

Answer 1

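Try this. Opening the file with 'w' truncates it and positions the pointer at the beginning, so every call to extractText overwrites what the previous call wrote. You want the pointer at the end of the file, which is what append mode 'a' gives you: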

print_to_file = open(data_file_name, 'a', encoding='utf-8')
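
To make the difference concrete, here is a minimal sketch that reopens a file once per page, the way the loop in the question does, first with 'w' and then with 'a'. The helper names `write_pages` and `read_back` and the two sample strings are made up for illustration; they stand in for the scraped text.

```python
import os
import tempfile


def write_pages(path, pages, mode):
    """Write each page with a fresh open() call, as the question's loop does."""
    for page in pages:
        with open(path, mode, encoding='utf-8') as f:
            f.write(page + '\n')


def read_back(path):
    """Return the whole file as one string."""
    with open(path, encoding='utf-8') as f:
        return f.read()


# Hypothetical stand-ins for the text scraped from two URLs.
pages = ['text of page 1', 'text of page 2']

with tempfile.TemporaryDirectory() as tmp:
    w_path = os.path.join(tmp, 'w_mode.txt')
    a_path = os.path.join(tmp, 'a_mode.txt')
    write_pages(w_path, pages, 'w')  # each open() truncates the file
    write_pages(a_path, pages, 'a')  # each open() continues at the end
    w_result = read_back(w_path)
    a_result = read_back(a_path)

print(repr(w_result))  # only the last page survives
print(repr(a_result))  # both pages are present
```

One thing to keep in mind: 'a' also appends across separate runs of the script, so running it twice leaves two copies of everything unless you delete the file first.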

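Here are all the different read and write modes, for future reference. The text is from the C fopen(3) man page; Python's open() accepts the same mode strings with analogous behavior.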

The argument mode points to a string beginning with one of the following
 sequences (Additional characters may follow these sequences.):

 ``r''   Open text file for reading.  The stream is positioned at the
         beginning of the file.

 ``r+''  Open for reading and writing.  The stream is positioned at the
         beginning of the file.

 ``w''   Truncate file to zero length or create text file for writing.
         The stream is positioned at the beginning of the file.

 ``w+''  Open for reading and writing.  The file is created if it does not
         exist, otherwise it is truncated.  The stream is positioned at
         the beginning of the file.

 ``a''   Open for writing.  The file is created if it does not exist.  The
         stream is positioned at the end of the file.  Subsequent writes
         to the file will always end up at the then current end of file,
         irrespective of any intervening fseek(3) or similar.

 ``a+''  Open for reading and writing.  The file is created if it does not
         exist.  The stream is positioned at the end of the file.
         Subsequent writes to the file will always end up at the then
         current end of file, irrespective of any intervening fseek(3) or
         similar.
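
The question also asks whether the scraped text can be held somewhere until the loop finishes and then written out once. One possible way, sketched below, is to collect one row per URL in a list and hand the list to the standard library's csv module at the end. The function name `scrape_to_csv`, the `extract_text` parameter and the `url`/`text` column names are assumptions for illustration; `extract_text` stands for a version of extractText that returns its text instead of writing it to a file. The `fake_pages` dictionary is a stub so the sketch runs without network access.

```python
import csv
import os
import tempfile


def scrape_to_csv(urls, extract_text, csv_path):
    """Collect the text for every URL first, then write the CSV in one go."""
    rows = []
    for url in urls:
        # extract_text is any callable that maps a URL to a string.
        rows.append((url, extract_text(url)))

    # The csv module documentation recommends newline='' for files
    # passed to csv.writer; the writer handles quoting of commas.
    with open(csv_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['url', 'text'])  # header row (hypothetical names)
        writer.writerows(rows)
    return rows


# Stub in place of real scraping, so the example runs offline.
fake_pages = {'url1': 'first page', 'url2': 'second page, with a comma'}

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, 'python_test.csv')
    collected = scrape_to_csv(['url1', 'url2'], fake_pages.get, out_path)
    with open(out_path, newline='', encoding='utf-8') as f:
        written = list(csv.reader(f))

print(written)
```

Because the file is opened only once, after the loop, 'w' is safe here: there is nothing earlier in the same run to overwrite, and rerunning the script replaces the file rather than growing it.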

Solution 1
0 Accepted 2016-08-04 19:52:25