Feeding input to a variable from a txt file line by line in Python
I have a variable DOMAIN which takes a URL as input. I want to feed it a list of URLs, one by one, from a txt file.

My txt file looks like this:
www.yahoo.com
www.google.com
www.bing.com
I am doing this:
    with open('list.txt') as f:
        content = f.readlines()
    content = [x.strip() for x in content]
    DOMAIN = content
But the variable DOMAIN takes all the URLs at once, not one at a time. It must process one URL as a whole, then the next one in a separate operation.
On a side note, this DOMAIN variable is fed to Scrapy for crawling. Part of the codebase:
    from scrapy.selector import HtmlXPathSelector
    from scrapy.spider import BaseSpider
    from scrapy.http import Request

    with open('list.txt') as f:
        content = f.readlines()
    # you may also want to remove whitespace characters like `\n` at the end of each line
    content = [x.strip() for x in content]
    DOMAIN = content
    URL = 'http://%s' % DOMAIN

    class MySpider(BaseSpider):
        name = DOMAIN
        allowed_domains = [DOMAIN]
        start_urls = [URL]
Errors:

    scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://['www.google.com', 'www.yahoo.com', 'www.bing.com']>
I am executing it as scrapy runspider spider.py
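(The mangled request URL in the retry log comes from %-formatting a list: when the right-hand side of % is a list, Python interpolates the list's repr as a single value. A quick standalone demonstration:)

```python
# %-formatting a list interpolates the list's repr as one value,
# which produces exactly the broken URL seen in the retry log.
domains = ['www.google.com', 'www.yahoo.com', 'www.bing.com']
url = 'http://%s' % domains
print(url)  # http://['www.google.com', 'www.yahoo.com', 'www.bing.com']
```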
Full working script for a single URL:
    from scrapy.selector import HtmlXPathSelector
    from scrapy.spider import BaseSpider
    from scrapy.http import Request

    DOMAIN = 'google.com'
    URL = 'http://%s' % DOMAIN

    class MySpider(BaseSpider):
        name = DOMAIN
        allowed_domains = [DOMAIN]
        start_urls = [URL]

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            for url in hxs.select('//a/@href').extract():
                if not (url.startswith('http://') or url.startswith('https://')):
                    url = URL + url
                print url
                yield Request(url, callback=self.parse)
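(As an aside, the `URL + url` concatenation in `parse` only works for root-relative links such as `/about`; `urllib.parse.urljoin` handles relative paths and absolute links as well. A small sketch with made-up example links:)

```python
from urllib.parse import urljoin  # urlparse.urljoin on Python 2

base = 'http://google.com'
for link in ['/about', 'contact.html', 'http://example.com/x']:
    # urljoin resolves relative links against the base URL and
    # leaves absolute links untouched.
    print(urljoin(base, link))
```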
OK, so you are assigning the list of domains you create to DOMAIN:

    DOMAIN = content
You then need to concatenate 'http://' to each of these:
    with open('list.txt') as f:
        content = f.readlines()
    content = [x.strip() for x in content]
    domain_list = content
    web = 'http://'
    start_url = [web + s for s in domain_list]
Then you have a list of all your URLs, which you could use to connect. I'm not sure what you are doing after this, but I think it should involve iterating over the list of start URLs:
    import scrapy

    for url in start_url:
        scrapy.Request(url)
Hope this helps.
With these lines:

    DOMAIN = content
    URL = 'http://%s' % DOMAIN

you made DOMAIN point to the list you have just created from your file, and then concatenated 'http://' with a string representation of that list, so you get this:

    http://['www.google.com', 'www.yahoo.com', 'www.bing.com']
Hence your error. You need to concatenate the 'http://' to each entry of the list. You can do that while you read the file, by iterating over the file object directly in a list comprehension rather than using readlines():

    with open('list.txt', 'r') as f:
        url_list = ['http://' + line.strip() for line in f]
This will yield a list you can then iterate over with Scrapy:

    ['http://www.google.com', 'http://www.yahoo.com', 'http://www.bing.com']
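For example, simulating the file with io.StringIO so the snippet runs standalone:

```python
import io

# Stand-in for open('list.txt', 'r'); iterating a file object yields lines.
f = io.StringIO("www.yahoo.com\nwww.google.com\nwww.bing.com\n")
url_list = ['http://' + line.strip() for line in f]
print(url_list)
# ['http://www.yahoo.com', 'http://www.google.com', 'http://www.bing.com']
```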
Note that reading the whole file in at once can be inefficient if it's a really big file. In that case you can avoid reading the whole file into a list and just issue the requests as you process the file line by line:
    with open('list.txt', 'r') as f:
        for line in f:
            url = 'http://' + line.strip()  # strip() removes the trailing '\n'
            request = scrapy.http.Request(url)
            # Do something with request here
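The same line-by-line idea can be packaged as a generator, so each URL is fully handled before the next line is read (the file is simulated with io.StringIO here, and the helper name is made up):

```python
import io

def iter_urls(fobj):
    # Yield one full URL per line, skipping blank lines; without strip()
    # the trailing '\n' would end up inside the URL.
    for line in fobj:
        line = line.strip()
        if line:
            yield 'http://' + line

fake_file = io.StringIO("www.yahoo.com\nwww.google.com\n\nwww.bing.com\n")
for url in iter_urls(fake_file):
    print(url)
```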
Also, note that you should not use all-uppercase variable names; these are generally reserved for constants. Have a look at PEP8, the Python style guide, for more guidance on naming conventions. Of course these are guidelines, not rules, but following them will make it easier for others to follow your code later on.
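For instance, the module-level names from the question would read like this under PEP8 (a hypothetical rename, same behaviour):

```python
# lowercase_with_underscores for ordinary variables; UPPER_CASE is
# conventionally reserved for constants that never change.
domain = 'google.com'
base_url = 'http://%s' % domain
print(base_url)  # http://google.com
```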