Feeding input to a variable from a txt file line by line in Python
I have a variable DOMAIN which takes a URL as input. I want to feed it a list of URLs, one by one, from a txt file.

My txt file looks like this:
www.yahoo.com
www.google.com
www.bing.com
I am doing this:
    with open('list.txt') as f:
        content = f.readlines()
    content = [x.strip() for x in content]
    DOMAIN = content
But the variable DOMAIN takes all the URLs at once, not one at a time. It must process one URL as a whole, then the next one in a separate operation.
On a side note, this DOMAIN variable is fed to Scrapy for crawling. Part of the codebase:
    from scrapy.selector import HtmlXPathSelector
    from scrapy.spider import BaseSpider
    from scrapy.http import Request

    with open('list.txt') as f:
        content = f.readlines()
    # you may also want to remove whitespace characters like `\n` at the end of each line
    content = [x.strip() for x in content]
    DOMAIN = content
    URL = 'http://%s' % DOMAIN

    class MySpider(BaseSpider):
        name = DOMAIN
        allowed_domains = [DOMAIN]
        start_urls = [URL]
Errors:

    scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://['www.google.com', 'www.yahoo.com', 'www.bing.com']>
I am executing it as scrapy runspider spider.py
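(The mangled request URL in the retry log comes from %-formatting a list: when the right-hand side of % is a list, Python interpolates the list's repr as a single value. A quick standalone demonstration:)

```python
# %-formatting a list interpolates the list's repr as one value,
# which produces exactly the broken URL seen in the retry log.
domains = ['www.google.com', 'www.yahoo.com', 'www.bing.com']
url = 'http://%s' % domains
print(url)  # http://['www.google.com', 'www.yahoo.com', 'www.bing.com']
```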
Full working script for a single URL:
    from scrapy.selector import HtmlXPathSelector
    from scrapy.spider import BaseSpider
    from scrapy.http import Request

    DOMAIN = 'google.com'
    URL = 'http://%s' % DOMAIN

    class MySpider(BaseSpider):
        name = DOMAIN
        allowed_domains = [DOMAIN]
        start_urls = [URL]

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            for url in hxs.select('//a/@href').extract():
                if not (url.startswith('http://') or url.startswith('https://')):
                    url = URL + url
                print url
                yield Request(url, callback=self.parse)
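(As an aside, the `URL + url` concatenation in `parse` only works for root-relative links such as `/about`; `urllib.parse.urljoin` handles relative paths and absolute links as well. A small sketch with made-up example links:)

```python
from urllib.parse import urljoin  # urlparse.urljoin on Python 2

base = 'http://google.com'
for link in ['/about', 'contact.html', 'http://example.com/x']:
    # urljoin resolves relative links against the base URL and
    # leaves absolute links untouched.
    print(urljoin(base, link))
```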
OK, so you are assigning the list of domains you create to DOMAIN:

    DOMAIN = content
You then need to concatenate 'http://' to each of these:
    with open('list.txt') as f:
        content = f.readlines()
    content = [x.strip() for x in content]
    domain_list = content
    web = 'http://'
    start_url = [web + s for s in domain_list]
Then you have a list of all your URLs, which you could use to connect. I'm not sure what you are doing after this, but I think it should involve iterating over the list of start URLs:
    import scrapy

    for url in start_url:
        scrapy.Request(url)
Hope this helps.
With these lines:

    DOMAIN = content
    URL = 'http://%s' % DOMAIN

you made DOMAIN point to the list you have just created from your file, and then concatenated 'http://' with a string representation of that list, so you get this:

    http://['www.google.com', 'www.yahoo.com', 'www.bing.com']
Hence your error. You need to concatenate the 'http://' to each entry of the list. You can do that while you read the file, by iterating over the file object directly in a list comprehension rather than using readlines():

    with open('list.txt', 'r') as f:
        url_list = ['http://' + line.strip() for line in f]
This will yield a list you can then iterate over with Scrapy:

    ['http://www.google.com', 'http://www.yahoo.com', 'http://www.bing.com']
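For example, simulating the file with io.StringIO so the snippet runs standalone:

```python
import io

# Stand-in for open('list.txt', 'r'); iterating a file object yields lines.
f = io.StringIO("www.yahoo.com\nwww.google.com\nwww.bing.com\n")
url_list = ['http://' + line.strip() for line in f]
print(url_list)
# ['http://www.yahoo.com', 'http://www.google.com', 'http://www.bing.com']
```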
Note that reading the whole file in at once can be inefficient if it's a really big file. In that case you can avoid reading the whole file into a list and just issue the requests as you process the file line by line:
    with open('list.txt', 'r') as f:
        for line in f:
            url = 'http://' + line.strip()  # strip() removes the trailing '\n'
            request = scrapy.http.Request(url)
            # Do something with request here
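The same line-by-line idea can be packaged as a generator, so each URL is fully handled before the next line is read (the file is simulated with io.StringIO here, and the helper name is made up):

```python
import io

def iter_urls(fobj):
    # Yield one full URL per line, skipping blank lines; without strip()
    # the trailing '\n' would end up inside the URL.
    for line in fobj:
        line = line.strip()
        if line:
            yield 'http://' + line

fake_file = io.StringIO("www.yahoo.com\nwww.google.com\n\nwww.bing.com\n")
for url in iter_urls(fake_file):
    print(url)
```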
Also, note that you should not use all-uppercase variable names; these are generally reserved for constants. Have a look at PEP8, the Python style guide, for more guidance on naming conventions. Of course these are guidelines, not rules, but following them will make it easier for others to follow your code later on.
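For instance, the module-level names from the question would read like this under PEP8 (a hypothetical rename, same behaviour):

```python
# lowercase_with_underscores for ordinary variables; UPPER_CASE is
# conventionally reserved for constants that never change.
domain = 'google.com'
base_url = 'http://%s' % domain
print(base_url)  # http://google.com
```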