
Feeding input to a variable from a txt file line by line in Python

I have a variable DOMAIN which takes a URL as input. I want to feed it a list of URLs, one by one, from a txt file.

My txt file looks like this:

www.yahoo.com
www.google.com
www.bing.com 

I am doing this:

with open('list.txt') as f:
    content = f.readlines()
content = [x.strip() for x in content] 
DOMAIN = content

But the variable DOMAIN takes all the URLs at once, not separately. It must process one whole URL in one operation, then the next URL in another operation.

On a side note, this DOMAIN variable is fed to scrapy for crawling. Part of the codebase:

from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
from scrapy.http import Request
with open('list.txt') as f:
    content = f.readlines()
# you may also want to remove whitespace characters like `\n` at the end of each line
content = [x.strip() for x in content] 
DOMAIN = content
URL = 'http://%s' % DOMAIN

class MySpider(BaseSpider):
    name = DOMAIN
    allowed_domains = [DOMAIN]
    start_urls = [
        URL
    ]

Errors:

scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://['www.google.com', 'www.yahoo.com', 'www.bing.com']>
(executing as scrapy runspider spider.py)
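The malformed GET target in that log comes from %-interpolating the whole list into the URL template instead of interpolating each element. A minimal sketch of the difference:

```python
# Interpolating the entire list embeds its repr() in the URL string,
# which is exactly the malformed GET target seen in the log above.
domains = ['www.google.com', 'www.yahoo.com', 'www.bing.com']

broken = 'http://%s' % domains               # one bogus URL containing the list's repr
fixed = ['http://%s' % d for d in domains]   # one valid URL per domain

print(broken)    # http://['www.google.com', 'www.yahoo.com', 'www.bing.com']
print(fixed[0])  # http://www.google.com
```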

Full working script for a single URL:

from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
from scrapy.http import Request

DOMAIN = 'google.com'
URL = 'http://%s' % DOMAIN

class MySpider(BaseSpider):
    name = DOMAIN
    allowed_domains = [DOMAIN]
    start_urls = [
        URL
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for url in hxs.select('//a/@href').extract():
            if not ( url.startswith('http://') or url.startswith('https://') ):
                url= URL + url 
            print url
            yield Request(url, callback=self.parse)

OK, so you are assigning the list of domains you created to DOMAIN:

DOMAIN = content

You then need to concatenate 'http://' to each of these:

with open('list.txt') as f:
    content = f.readlines()
content = [x.strip() for x in content]
domain_list = content
web = 'http://'
start_url = [web + s for s in domain_list]

Then you have a list of all your URLs, which you can use to connect. I'm not sure what you are doing after this, but I think it should involve iterating over the list of start URLs?

for url in start_url:
    scrapy.Request(url)  
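Putting that together, one way (a sketch, assuming list.txt sits next to the spider and holds one bare domain per non-blank line) is to build the URL list with a small helper and assign it to the spider's start_urls, keeping name and allowed_domains as plain strings rather than the list:

```python
def load_start_urls(path):
    """Read one domain per line, skip blank lines, and prefix each with the scheme."""
    with open(path) as f:
        return ['http://' + line.strip() for line in f if line.strip()]

# Inside the spider class you would then write something like:
#
# class MySpider(BaseSpider):
#     name = 'myspider'                # a plain string, not the list
#     allowed_domains = ['google.com'] # plain domain strings
#     start_urls = load_start_urls('list.txt')
```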

Hope this helps,

With these lines:

DOMAIN = content
URL = 'http://%s' % DOMAIN

You made DOMAIN point to the list you just created from your file, and then concatenated 'http://' with the string representation of that list, so you get this:

http://['www.google.com','www.yahoo.com', 'www.bing.com']

Hence your error. You need to concatenate the 'http://' to each entry of the list. You can do this while reading the file, by iterating over the file object directly in a list comprehension rather than using readlines():

with open('list.txt','r') as f:
    url_list = ['http://'+line.strip() for line in f]

This will yield a list you can then iterate over with scrapy:

['http://www.google.com','http://www.yahoo.com', 'http://www.bing.com']

Note that reading the whole file in at once can be inefficient if it's a really big file. In that case you can avoid reading the whole file into a list, and just create the requests as you process the file line by line:

with open('list.txt','r') as f:
    for line in f:
        url = 'http://' + line.strip()  # strip the trailing newline
        request = scrapy.http.Request(url)
        # Do something with request here
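If you prefer to keep the file handling separate from scrapy, the same line-by-line idea can be written as a generator (a sketch; each yielded string would then be wrapped in a scrapy.Request, for example from the spider's start_requests() method):

```python
def iter_start_urls(path):
    """Lazily yield one fully qualified URL per non-blank line of the file."""
    with open(path) as f:
        for line in f:
            domain = line.strip()   # drop the trailing newline and any spaces
            if domain:              # skip blank lines
                yield 'http://' + domain
```

Because the file is consumed lazily, only one line is held in memory at a time, which matches the point above about very big files.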

Also, note that you should not use all-UPPERCASE variable names; by convention these are reserved for constants. Have a look at PEP8, the Python style guide, for more guidance on naming conventions. Of course these are guidelines, not rules, but following them makes it easier for others to follow your code later on.
