Get status code of url efficiently in python, alternative for for-loop
I want to check a list of URLs (in a column of a dataframe df) for their status codes (404, 403 and 200 seem to be the interesting ones). I defined a function which does the job. However, it uses a for-loop, which is inefficient (I have a long list of URLs!).
Does anyone have a hint on how to do it more efficiently? Ideally the returned status code would also be stored in a new column of the dataframe, e.g. df['status_code_url'].
import requests

def url_access(df, column):
    e_404 = 0
    e_403 = 0
    e_200 = 0
    for i in range(len(df)):
        status = requests.head(df[column][i]).status_code
        if status == 404:
            e_404 = e_404 + 1
        elif status == 403:
            e_403 = e_403 + 1
        elif status == 200:
            e_200 = e_200 + 1
        else:
            print(status)
    return ("Statistics about " + column,
            '{:.1%}'.format(e_404 / len(df)) + " of links to Instagram posts return 404",
            '{:.1%}'.format(e_403 / len(df)) + " of links to Instagram posts return 403",
            '{:.1%}'.format(e_200 / len(df)) + " of links to Instagram posts return 200")
Thank you a lot!
Use Pandas apply and groupby -
def url_access(x):
    return requests.head(x).status_code

df['Status'] = df['url'].apply(url_access)
dfcount = df.groupby('Status')['url'].count().reset_index()
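The counting step can be checked without any network calls. This is a minimal sketch with made-up status codes standing in for the fetched ones, just to show what the groupby produces:

```python
import pandas as pd

# Made-up status codes, standing in for what requests.head() would return,
# so the counting logic can be demonstrated without network access.
df = pd.DataFrame({
    "url": ["u1", "u2", "u3", "u4", "u5"],
    "Status": [200, 404, 200, 403, 404],
})

# Count how many URLs fall under each status code.
dfcount = df.groupby("Status")["url"].count().reset_index()
print(dfcount)
#    Status  url
# 0     200    2
# 1     403    1
# 2     404    2
```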
Basically, your task seems to be:
1. Get the status code for a single URL.
2. Apply that function to every URL in the column.
3. Compute statistics over the resulting codes.
For the first step you use:
def get_code(url):
    return requests.head(url).status_code
For the second step you apply this function to the dataframe column, see https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html
resp_df = df[column].apply(get_code)  # Series.apply takes no axis argument
For the third step you can use operations over the column to calculate percentages:
(resp_df == 404).sum() / len(resp_df)
(note: code not run)
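A runnable sketch of the third step, again with made-up status codes in place of real responses; value_counts(normalize=True) gives the share of each code directly:

```python
import pandas as pd

# Made-up status codes, standing in for the result of the apply step.
resp_df = pd.Series([404, 200, 200, 403, 404, 200])

# Fraction of 404s: the mean of a boolean mask is the share of True values.
frac_404 = (resp_df == 404).mean()

# Or the share of every code at once:
shares = resp_df.value_counts(normalize=True)
```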
pandas.DataFrame.apply (or rather, the normal requests library) will only be able to make one request at a time. To do multiple requests in parallel, you can use requests_futures (install it with pip install requests-futures):
import pandas as pd
from requests_futures.sessions import FuturesSession

def get_request(url):
    session = FuturesSession()
    return session.head(url)

def get_status_code(r):
    return r.result().status_code

if __name__ == "__main__":
    urls = ['http://python-requests.org',
            'http://httpbin.org',
            'http://python-guide.org',
            'http://kennethreitz.com']
    df = pd.DataFrame({"url": urls})
    df["status_code"] = df["url"].apply(get_request).apply(get_status_code)
Afterwards you can use for example groupby, as suggested by @Aritesh in their answer:
stats = df.groupby('status_code')['url'].count().reset_index()
print(stats)
#    status_code  url
# 0          200    1
# 1          301    3
With this you probably also want to add some protection against connection errors and a timeout:
import numpy as np
import requests

def get_request(url):
    session = FuturesSession()
    return session.head(url, timeout=1)

def get_status_code(r):
    try:
        return r.result().status_code
    except (requests.exceptions.ConnectionError, requests.exceptions.ReadTimeout):
        return 408  # Request Timeout

ips = np.random.randint(0, 256, (1000, 4))
df = pd.DataFrame({"url": ["http://" + ".".join(map(str, ip)) for ip in ips]})
df["status_code"] = df["url"].apply(get_request).apply(get_status_code)
df.groupby('status_code')['url'].count().reset_index()
# status_code url
# 0 200 3
# 1 302 2
# 2 400 2
# 3 401 1
# 4 403 1
# 5 404 1
# 6 408 990
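As a standard-library alternative to requests_futures (an addition to the answers above, assuming a stdlib-only route is acceptable), concurrent.futures.ThreadPoolExecutor can run the requests concurrently. Here a canned fetch function stands in for requests.head so the pattern runs without network access:

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a real fetch such as requests.head(url, timeout=1).status_code;
# it returns canned codes so the example needs no network.
def fetch_status(url):
    return {"http://a": 200, "http://b": 404}.get(url, 403)

urls = ["http://a", "http://b", "http://c"]

# pool.map preserves input order even though the calls run concurrently,
# so the result can be assigned straight back to a dataframe column.
with ThreadPoolExecutor(max_workers=8) as pool:
    codes = list(pool.map(fetch_status, urls))

print(codes)  # [200, 404, 403]
```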