
Python passing a list from a file to requests.get()

I'm trying to scrape a corpus of news articles for analysis. I have a text file with a list of URLs, and I'm trying to pass these to requests so that each page can be scraped with BeautifulSoup. I can pull the URLs from the text file. However, I'm not properly passing that output to requests.get(). When I give requests.get() an explicit URL, the script works fine. How do I properly pass requests.get() a list of links from a text file? Here is what I have working:

import requests
from bs4 import BeautifulSoup
r = requests.get("https://examplewebsite.org/page1")
coverpage = r.content
soup = BeautifulSoup(coverpage, 'html5lib')
file = open("output.txt", "w")
file.write("ITEM:")
paragraphs = soup.find_all("p")[11:-10]
for paragraph in paragraphs:
    file.write(paragraph.get_text())
    file.write("\n")
    file.write("\n")
file.close()

However, when I try to read from a text file containing a list of links, there seems to be a problem with how I'm passing the links to requests.get(). With one URL per line, the text file looks like:

https://examplewebsite.org/page1
https://examplewebsite.org/page2
https://examplewebsite.org/page3
https://examplewebsite.org/page4

Here is how I'm trying to work through the list of links:

f = open('article-list.txt', 'r')
urls = list(f)
for url in urls:
    import requests
    from bs4 import BeautifulSoup
    r = requests.get(url)
    coverpage = r.content
    soup = BeautifulSoup(coverpage, 'html5lib')
    file = open("output.txt", "w")
    file.write("ITEM:")
    paragraphs = soup.find_all("p")[11:-10]
    for paragraph in paragraphs:
        file.write(paragraph.get_text())
        file.write("\n")
        file.write("\n")
        print(paragraph.get_text())
file.close()

What I get is an error saying:

AttributeError: 'NoneType' object has no attribute 'get_text'

This suggests to me that I'm not properly passing the request. If I simply swap in an explicitly defined URL like "https://somewebsite.org/page1", the script works and writes paragraphs to the file. Yet when I put a print(urls) statement at the top and give requests.get() an explicit link so it does not break, I get a list of URLs. However, that list is formatted as:

['http://examplewebsite.org/page1\n', 'http://examplewebsite.org/page2\n', 'http://examplewebsite.org/page3\n']

I think the \n is the problem. I tried running the links all together, and that didn't work. Also, for readability, I'd much prefer to have each link on a separate line. Any suggestions for how to address this would be deeply appreciated. Thanks.
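For reference, printing the repr() of each line makes the trailing newline visible. A minimal check, reading the same article-list.txt:

with open('article-list.txt', 'r') as f:
    for line in f:
        print(repr(line))  # prints e.g. 'https://examplewebsite.org/page1\n'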

In order to get the lines just as they appear in the file, this line

urls = list(f)

should look like this:

urls = f.read().splitlines()

It will return a list of every line in the txt file without any trailing "\n". (Note that f.readlines() would not help here: like list(f), it keeps the newline at the end of each line.)
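For comparison, a minimal sketch of the two behaviours, assuming the article-list.txt from the question:

with open('article-list.txt', 'r') as f:
    print(f.readlines())          # ['https://examplewebsite.org/page1\n', ...]

with open('article-list.txt', 'r') as f:
    print(f.read().splitlines())  # ['https://examplewebsite.org/page1', ...]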

Removing "\n" with the use of.rstrip() solved the issue.使用 of.rstrip() 删除“\n”解决了这个问题。 The code below is working and properly writes a group of news items to a single text file.下面的代码正在运行,并将一组新闻项正确写入单个文本文件。

import requests
from bs4 import BeautifulSoup

# Read the URL list and strip the trailing newline from each line.
with open('article-list.txt', 'r') as f:
    urls = [url.rstrip("\n") for url in f]

# Open the output file once, in append mode, and keep it open for the whole run.
with open("output.txt", "a") as file:
    for url in urls:
        r = requests.get(url)
        coverpage = r.content
        soup = BeautifulSoup(coverpage, 'html5lib')
        file.write("ITEM:")
        # Slice off the site's boilerplate paragraphs at the top and bottom.
        paragraphs = soup.find_all("p")[11:-10]
        for paragraph in paragraphs:
            file.write(paragraph.get_text())
            file.write("\n")
            file.write("\n")
            print(paragraph.get_text())
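One refinement worth considering: if any link in the file is dead or malformed, requests.get() will raise and abort the whole run. Below is a minimal sketch of per-URL error handling; the 10-second timeout is an illustrative choice, and requests.RequestException is the library's base exception for request failures.

import requests

urls = ["https://examplewebsite.org/page1"]  # as read from article-list.txt above

for url in urls:
    try:
        r = requests.get(url, timeout=10)  # don't let one slow server hang the run
        r.raise_for_status()               # turn HTTP 4xx/5xx responses into exceptions
    except requests.RequestException as e:
        print(f"Skipping {url}: {e}")
        continue
    # ... parse r.content with BeautifulSoup as above ...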
