What's the fastest way to expand URLs in Python?
I have a checkin list which contains about 600,000 checkins, and there is a URL in each checkin. I need to expand them back to the original long ones. I do so by:
import re
import time
import urllib2

now = time.time()
files_without_url = 0
for i, checkin in enumerate(NYC_checkins):
    try:
        foursquare_url = urllib2.urlopen(re.search("(?P<url>https?://[^\s]+)", checkin[5]).group("url")).url
    except:
        files_without_url += 1
    if i % 1000 == 0:
        print("from %d to %d: %2.5f seconds" % (i - 1000, i, time.time() - now))
        now = time.time()
But this takes too long: from 0 to 1000 checkins, it takes 3241 seconds! Is this normal? What's the most efficient way to expand URLs in Python?
MODIFIED: Some URLs are from Bitly while others are not, and I am not sure where they come from. In this case, I want to simply use the urllib2 module.
For your information, here is an example of checkin[5]:
I'm at The Diner (2453 18th Street NW, Columbia Rd., Washington) w/ 4 others. http...... (this is the short url)
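(For reference, the regex in the code above pulls the short link out of text of exactly that shape. A minimal sketch follows; the URL below is a made-up placeholder, since the real one is elided:)
import re

sample = ("I'm at The Diner (2453 18th Street NW, Columbia Rd., Washington) "
          "w/ 4 others. http://4sq.com/abc123")  # placeholder short url, not the real one
match = re.search(r"(?P<url>https?://[^\s]+)", sample)
print(match.group("url"))  # -> http://4sq.com/abc123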
I thought I would expand on my comment regarding the use of multiprocessing to speed up this task.
Let's start with a simple function that will take a url and resolve it as far as possible (following redirects until it gets a 200 response code):
import requests

def resolve_url(url):
    try:
        r = requests.get(url)
    except requests.exceptions.RequestException:
        return (url, None)

    if r.status_code != 200:
        longurl = None
    else:
        longurl = r.url

    return (url, longurl)
This will either return a (shorturl, longurl) tuple, or it will return (shorturl, None) in the event of a failure.
Now, we create a pool of workers:
import multiprocessing
pool = multiprocessing.Pool(10)
And then ask our pool to resolve a list of urls:
resolved_urls = []
for shorturl, longurl in pool.map(resolve_url, urls):
    resolved_urls.append((shorturl, longurl))
Using the above code...
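As a rough sketch of what that usage might look like end to end (the URL-extraction step, the variable names, and the 1000-URL benchmark slice below are assumptions for illustration, not part of the original answer):
import re
import time

# Pull the short urls out of the checkins first (illustrative; adapt to your data).
url_pat = re.compile(r"(?P<url>https?://[^\s]+)")
urls = [m.group("url") for m in (url_pat.search(c[5]) for c in NYC_checkins) if m]

start = time.time()
resolved_urls = pool.map(resolve_url, urls[:1000])  # benchmark on the first 1000
print("resolved %d urls in %.2f seconds" % (len(resolved_urls), time.time() - start))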
This is hopefully enough to get you started. 希望这足以使您入门。
(NB: you could write a similar solution using the threading module rather than multiprocessing. I usually just grab for multiprocessing first, but in this case either would work, and threading might even be slightly more efficient.)
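For instance, a thread-based variant might look like the sketch below; it reuses the same resolve_url function and the urls list from the sketch above (an assumption), and relies on multiprocessing.dummy, which exposes the same Pool interface backed by threads:
from multiprocessing.dummy import Pool as ThreadPool  # same Pool API, backed by threads

thread_pool = ThreadPool(10)
resolved_urls = thread_pool.map(resolve_url, urls)  # `urls` as in the sketch above
thread_pool.close()
thread_pool.join()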
Threads are most appropriate in the case of network I/O. But you could try the following first.
import re
import urllib2

pat = re.compile(r"(?P<url>https?://[^\s]+)")  # always compile it
missing_urls = 0
bad_urls = 0

def check(checkin):
    match = pat.search(checkin[5])
    if not match:
        global missing_urls
        missing_urls += 1
    else:
        url = match.group("url")
        try:
            urllib2.urlopen(url)  # don't look up .url if you don't need it later
        except urllib2.URLError:  # or just Exception
            global bad_urls
            bad_urls += 1

for i, checkin in enumerate(NYC_checkins):
    check(checkin)

print(bad_urls, missing_urls)
If you get no improvement, now that we have a nice check function, create a thread pool and feed it. Speedup is guaranteed. Using processes for network I/O is pointless.
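A minimal sketch of that thread-pool step. It assumes the counters are better tallied in the main thread than shared as globals across threads; the wrapper name, pool size, and use of multiprocessing.dummy are illustrative choices, not from the answer:
from multiprocessing.dummy import Pool as ThreadPool  # thread-backed Pool

def check_status(checkin):
    # Same logic as check(), but returns a status instead of mutating globals,
    # so nothing needs to be locked while many threads run it at once.
    match = pat.search(checkin[5])
    if not match:
        return "missing"
    try:
        urllib2.urlopen(match.group("url"))
        return "ok"
    except Exception:
        return "bad"

thread_pool = ThreadPool(20)
statuses = thread_pool.map(check_status, NYC_checkins)
print(statuses.count("bad"), statuses.count("missing"))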