
Python passing a list from a file to requests.get()

I'm trying to scrape a corpus of news articles for analysis. I have a text file with a list of URLs, and I'm trying to pass these to requests so that each page can be scraped with BeautifulSoup. I can pull the URLs from the text file. However, I'm not properly passing that output to requests.get(). When I give requests.get() an explicit URL, the script works fine. How do I properly pass requests.get() a list of links from a text file? Here is what I have working:

import requests
from bs4 import BeautifulSoup
r = requests.get("https://examplewebsite.org/page1")
coverpage = r.content
soup = BeautifulSoup(coverpage, 'html5lib')
file = open("output.txt", "w")
file.write("ITEM:")
paragraphs = soup.find_all("p")[11:-10]
for paragraph in paragraphs:
    file.write(paragraph.get_text())
    file.write("\n")
    file.write("\n")
file.close()

However, when I try to read from a text file containing a list of links, there seems to be a problem with how I'm passing the links to requests.get(). With one URL per line, the text file looks like:

https://examplewebsite.org/page1
https://examplewebsite.org/page2
https://examplewebsite.org/page3
https://examplewebsite.org/page4

Here is how I'm trying to work through the list of links:

f = open('article-list.txt', 'r')
urls = list(f)
for url in urls:
    import requests
    from bs4 import BeautifulSoup
    r = requests.get(url)
    coverpage = r.content
    soup = BeautifulSoup(coverpage, 'html5lib')
    file = open("output.txt", "w")
    file.write("ITEM:")
    paragraphs = soup.find_all("p")[11:-10]
    for paragraph in paragraphs:
        file.write(paragraph.get_text())
        file.write("\n")
        file.write("\n")
        print(paragraph.get_text())
file.close()

What I get is an error saying:

AttributeError: 'NoneType' object has no attribute 'get_text'

This suggests to me that I'm not properly passing the request. If I simply swap in an explicitly defined URL like "https://somewebsite.org/page1", the script works and writes paragraphs to the file. Yet when I put a print(urls) statement at the top and give requests.get() an explicit link so it does not break, I get a list of URLs. However, that list is formatted as:

['http://examplewebsite.org/page1\n', 'http://examplewebsite.org/page2\n', 'http://examplewebsite.org/page3\n']

I think the \n is the problem. I tried running the links all together, and that didn't work. Also, for readability, I'd much prefer to have each link on a separate line. Any suggestions for how to address this would be deeply appreciated. Thanks.
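For reference, printing the repr() of each line makes the trailing newline visible. A minimal check, reading the same article-list.txt:

with open('article-list.txt', 'r') as f:
    for line in f:
        print(repr(line))  # prints e.g. 'https://examplewebsite.org/page1\n'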

In order to get the lines just as they appear in the file, this line

urls = list(f)

should look like this:

urls = f.read().splitlines()

It will return a list of every line in the txt file without any trailing "\n". (Note that f.readlines() would not help here: like list(f), it keeps the newline at the end of each line.)
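For comparison, a minimal sketch of the two behaviours, assuming the article-list.txt from the question:

with open('article-list.txt', 'r') as f:
    print(f.readlines())          # ['https://examplewebsite.org/page1\n', ...]

with open('article-list.txt', 'r') as f:
    print(f.read().splitlines())  # ['https://examplewebsite.org/page1', ...]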

Removing "\n" with the use of.rstrip() solved the issue.使用 of.rstrip() 删除“\n”解决了这个问题。 The code below is working and properly writes a group of news items to a single text file.下面的代码正在运行,并将一组新闻项正确写入单个文本文件。

import requests
from bs4 import BeautifulSoup

# Read the URL list and strip the trailing newline from each line.
with open('article-list.txt', 'r') as f:
    urls = [url.rstrip("\n") for url in f]

# Open the output file once, in append mode, and keep it open for the whole run.
with open("output.txt", "a") as file:
    for url in urls:
        r = requests.get(url)
        coverpage = r.content
        soup = BeautifulSoup(coverpage, 'html5lib')
        file.write("ITEM:")
        # Slice off the site's boilerplate paragraphs at the top and bottom.
        paragraphs = soup.find_all("p")[11:-10]
        for paragraph in paragraphs:
            file.write(paragraph.get_text())
            file.write("\n")
            file.write("\n")
            print(paragraph.get_text())
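One refinement worth considering: if any link in the file is dead or malformed, requests.get() will raise and abort the whole run. Below is a minimal sketch of per-URL error handling; the 10-second timeout is an illustrative choice, and requests.RequestException is the library's base exception for request failures.

import requests

urls = ["https://examplewebsite.org/page1"]  # as read from article-list.txt above

for url in urls:
    try:
        r = requests.get(url, timeout=10)  # don't let one slow server hang the run
        r.raise_for_status()               # turn HTTP 4xx/5xx responses into exceptions
    except requests.RequestException as e:
        print(f"Skipping {url}: {e}")
        continue
    # ... parse r.content with BeautifulSoup as above ...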
