Which is quicker? Memcache or file query? (using the MaxMind GeoIP.dat file)

I'm using Python on App Engine and am looking up the geolocation of an IP address like this:

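import pygeoip
# Load the MaxMind country database from disk.
gi = pygeoip.GeoIP('GeoIP.dat')
# Look up the country code for the requesting client's IP address.
Location = gi.country_code_by_addr(self.request.remote_addr)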

(pygeoip can be found here: http://code.google.com/p/pygeoip/)

I want to geolocate each page of my app for a user, so currently I look up the IP address once and then store the result in memcache.

My question: which is quicker? Looking up the IP address each time from the .dat file, or fetching it from memcache? Are there any other pros/cons I need to be aware of?

For general queries like this, is there a good guide to teach me how to optimise my code and run speed tests myself? I'm new to Python and coding in general, so apologies if this is a basic concept.

Thanks!

Tom

EDIT: Thanks for the responses, memcache seems to be the right answer. I think that Nick and Lennart are suggesting that I add the whole gi variable to memcache. I think this is possible. FYI, the whole GeoIP.dat file is just over 1MB, so not that large.

What takes time is loading the database from the .dat file. Once you have it in memory, the lookup time is not significant. So if you can keep the gi variable in memory, that seems the best solution.

If you can't, you probably can't use memcached either.
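
A minimal sketch of what keeping the database in memory could look like: load it lazily once per process and reuse the same object for every later lookup. FakeGeoDatabase and get_database are hypothetical names, used so the example runs without the .dat file; in a real app the stand-in would presumably be replaced by pygeoip.GeoIP('GeoIP.dat').

```python
class FakeGeoDatabase:
    """Hypothetical stand-in for pygeoip.GeoIP so the sketch runs without GeoIP.dat."""

    load_count = 0  # how many times the "file" has been loaded in this process

    def __init__(self, path):
        # The real loader would read and parse the file here (the slow part).
        FakeGeoDatabase.load_count += 1
        self.path = path
        self._countries = {"203.0.113.7": "SE"}  # made-up sample data

    def country_code_by_addr(self, addr):
        return self._countries.get(addr, "")


_database = None  # module-level, so it survives between requests in one process


def get_database():
    """Load the database on first use and return the same object afterwards."""
    global _database
    if _database is None:
        _database = FakeGeoDatabase("GeoIP.dat")
    return _database


first = get_database().country_code_by_addr("203.0.113.7")
second = get_database().country_code_by_addr("203.0.113.7")
print(first, second, FakeGeoDatabase.load_count)
```

The load cost is paid only on the first call; whether this helps in practice depends on how long each process lives and how many requests it serves.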

If you need to do lookups across multiple processes (which you almost certainly do on App Engine), and you are likely to encounter the same IP address many times in a short time span (which you probably are), then using memcache is probably a good idea for speed.

More details, since you said you were relatively new to coding:

As Lennart Regebro correctly says, the slow part is reading the geoip file from disk and parsing it. Individual queries will then be fast. However, if any given process is only serving one request (which, from your perspective, on App Engine, it is), then this price gets paid on each request. Caching recently used lookups in memcache will let you share this information across processes, but only for recently encountered data points. However, since any given IP is likely to show up in bursts (because it is one user interacting with your site), this is exactly what you want.
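
The look-up-then-cache pattern described here might be sketched as follows. DictCache is a hypothetical in-process stand-in with a memcache-like get/set shape, and slow_lookup stands in for loading the file and querying pygeoip; both are assumptions so the example runs anywhere.

```python
class DictCache:
    """Hypothetical stand-in for a memcache client (get returns None on a miss)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


slow_calls = []  # records every address that needed the slow path


def slow_lookup(addr):
    """Stand-in for loading GeoIP.dat and querying it (the expensive step)."""
    slow_calls.append(addr)
    return {"203.0.113.7": "SE"}.get(addr, "")


def country_for(addr, cache):
    """Return the country code, trying the cache first and filling it on a miss."""
    key = "geoip:" + addr
    code = cache.get(key)
    if code is None:  # miss: do the slow lookup once, then remember the result
        code = slow_lookup(addr)
        cache.set(key, code)
    return code


cache = DictCache()
results = [country_for("203.0.113.7", cache) for _ in range(3)]
print(results, len(slow_calls))
```

Three requests from the same address trigger only one slow lookup. On App Engine the cache would be the shared memcache service rather than a dict, so the saving would also apply across processes.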

Another alternative is to pre-load all the data points into memcache. You probably don't want to do this, since you have a limited amount of memory available, and you won't end up using most of the data. (Also, memcache will throw parts of it away if you hit your memory limit, which means you'd need to write fallback code to read from the geoip database live anyway.) In general, lazy caching (look up a value the slow way when you first need it, then keep it around for re-use) is a very effective mechanism. Memcache is specifically geared for this, since it throws away data that hasn't been used recently when it encounters memory pressure.
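
To illustrate why fallback code is still needed, here is a toy least-recently-used cache with a fixed capacity. It is only a rough model of the eviction behaviour described above, not how memcache is actually implemented; TinyLRUCache and its capacity of 2 are made up for the example.

```python
from collections import OrderedDict


class TinyLRUCache:
    """Toy cache that evicts the least recently used entry when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def set(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # drop the least recently used entry


cache = TinyLRUCache(capacity=2)
cache.set("198.51.100.1", "US")
cache.set("203.0.113.7", "SE")
cache.get("198.51.100.1")      # touch the first entry so it stays fresh
cache.set("192.0.2.44", "DE")  # over capacity: "203.0.113.7" is evicted

evicted = cache.get("203.0.113.7")
kept = cache.get("198.51.100.1")
print(evicted, kept)
```

A miss like the one for the evicted address is where the code has to fall back to the slow database lookup.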

Another alternative in general (although not on App Engine) is to run a separate process that handles just location queries, and have all your front-end processes talk to it (e.g. via Thrift). Then you could follow the suggestion of loading the geoip database once in that process and querying it live for each request.

Hope that helps some.

For individual IP addresses that you have already looked up in the database, I would put them in memcache for sure. I am assuming the database file is relatively large, and you don't want to load that from memcache every time you need to look up one address.

One tool I know people use to help track the speed of API calls is Appstats. It can help you see how long various calls to the APIs are taking.

Since you are new to programming in general, I will mention that Appstats is a very App Engine-specific tool. If you were just writing a basic Python application that was going to run on your own computer, you could time things by simply subtracting two timestamps:

import time
t1 = time.time()
#do whatever it is you want to time here.
t2 = time.time()
elapsed_time = t2-t1
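
For more repeatable measurements, the standard library's timeit module runs a callable many times and returns the total elapsed time. A rough sketch comparing the two strategies from the question, where load_and_lookup and cached_lookup are hypothetical stand-ins (real measurements would need to call pygeoip and memcache instead):

```python
import timeit

SAMPLE_TABLE = {"203.0.113.%d" % i: "SE" for i in range(256)}  # made-up data


def load_and_lookup(addr):
    """Stand-in for re-loading the database on every call, then querying it."""
    table = dict(SAMPLE_TABLE)  # the copy simulates the per-call loading cost
    return table.get(addr, "")


_cached = {}


def cached_lookup(addr):
    """Stand-in for fetching a previously stored result from a cache."""
    if addr not in _cached:
        _cached[addr] = load_and_lookup(addr)
    return _cached[addr]


runs = 2000
reload_seconds = timeit.timeit(lambda: load_and_lookup("203.0.113.7"), number=runs)
cached_seconds = timeit.timeit(lambda: cached_lookup("203.0.113.7"), number=runs)

print("reload each time: %.6f s for %d runs" % (reload_seconds, runs))
print("cached:           %.6f s for %d runs" % (cached_seconds, runs))
```

The numbers from these stand-ins say nothing about real memcache or file performance; the point is only the measuring technique.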
