
How to optimize API calls for a large dataset using Python?

Objective: Send a list of addresses to an API and extract certain information (e.g. a flag that indicates whether an address is in a flood zone).

Solution: A working Python script for small data.

Problem: I want to optimize my current solution for large input. How can I improve the performance of the API calls? If I have 100,000 addresses, will my current solution fail? Will this slow down the HTTP calls? Will I get a request timeout? Does the API limit the number of calls being made?

  • Input: a list of addresses

Sample input

777 Brockton Avenue, Abington MA 2351

30 Memorial Drive, Avon MA 2322

My current solution works well for a small dataset.

import json
import pandas as pd
import requests as req
from pandas import json_normalize
# Assumption: geocode() comes from the ArcGIS API for Python and a GIS connection is already set up
from arcgis.geocoding import geocode

geolocate = pd.DataFrame()  # accumulates the results for all addresses

# Geocode an address, then query the FEMA NFHL layer for the flood zone at that point
def zonedetect(addrs):
    global geolocate
    geocode_result = geocode(address=addrs, as_featureset=True)
    # In a point geometry, x is the longitude and y is the latitude
    longitude = geocode_result.features[0].geometry.x
    latitude = geocode_result.features[0].geometry.y
    url = "https://hazards.fema.gov/gis/nfhl/rest/services/public/NFHL/MapServer/28/query?where=1%3D1&text=&objectIds=&time=&geometry="+str(longitude)+"%2C"+str(latitude)+"&geometryType=esriGeometryPoint&inSR=4326&spatialRel=esriSpatialRelIntersects&relationParam=&outFields=*&returnGeometry=true&returnTrueCurves=false&maxAllowableOffset=&geometryPrecision=&outSR=&returnIdsOnly=false&returnCountOnly=false&orderByFields=&groupByFieldsForStatistics=&outStatistics=&returnZ=false&returnM=false&gdbVersion=&returnDistinctValues=false&resultOffset=&resultRecordCount=&queryByDistance=&returnExtentsOnly=false&datumTransformation=&parameterValues=&rangeValues=&f=json"
    response = req.get(url)

    # Exception handling: only parse the body when the request succeeded
    if response.status_code == 200:
        parsed_data = json.loads(response.text)
        formatted_data = json_normalize(parsed_data["features"])
        formatted_data["Address_1"] = addrs
        # DataFrame.append was removed in pandas 2.0, so use pd.concat
        geolocate = pd.concat([geolocate, formatted_data], ignore_index=True)
    else:
        print("Request for {} failed".format(addrs))
# Reading every address from the existing dataframe
for address in df["Address"]:
    zonedetect(address)

Is there an alternative to the for loop above? Can I process this logic in a batch?

Sending 100,000 requests to the hazards.fema.gov server will definitely cause some slowdown on their server, but it will mostly impact your script, because you need to wait for every single HTTP request to be queued and answered, which could take an extremely long time.
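To put a rough number on "an extremely long time", here is a back-of-the-envelope estimate. The 0.5 seconds per request is purely an assumed figure for illustration, not a measured latency of the FEMA service, and the script above makes two calls per address (geocoding plus the flood-zone query), so the real total could be higher.

```python
def sequential_hours(n_requests, seconds_per_request):
    """Total wall-clock time, in hours, when requests run one after another."""
    return n_requests * seconds_per_request / 3600


# Assumed latency of 0.5 s per request (hypothetical, for illustration only)
hours = sequential_hours(100_000, 0.5)
print(round(hours, 1))  # 13.9
```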

What would be better is to send one REST query for everything you need and then handle the logic afterwards. Looking at the REST API docs, you can find that the geometry URL parameter can also accept a multipoint (geometryType=esriGeometryMultipoint). Here is an example of a multipoint:

{
  "points" : [[-97.06138,32.837],[-97.06133,32.836],[-97.06124,32.834],[-97.06127,32.832]],
  "spatialReference" : {"wkid" : 4326}
}

So what you can do is make an object to store all the points you want to query:

multipoint = {"points": [], "spatialReference": {"wkid": 4326}}

And when you loop, append each geocoded point to the multipoint list:

for address in df["Address"]:
    geocode_result = geocode(address=address, as_featureset=True)
    # x is the longitude and y is the latitude; multipoint entries are [x, y]
    longitude = geocode_result.features[0].geometry.x
    latitude = geocode_result.features[0].geometry.y
    multipoint["points"].append([longitude, latitude])

Then you can set the multipoint as the geometry in your query, which results in just one API request instead of one for each point.
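A minimal sketch of what that batched query could look like, under a few assumptions: the service accepts the query as a POST (100,000 points will not fit in a GET URL), and very large batches may need to be split, so the points are sent in chunks. CHUNK_SIZE, chunked, build_query_params, query_flood_zones and query_all are hypothetical names introduced here, and the chunk size of 500 is a guess rather than a documented limit. Note also that the response lists the features hit by the batch as a whole, so matching each address back to its zone (for example with a point-in-polygon test on the returned geometries) is part of the logic you handle afterwards.

```python
import json
import urllib.parse
import urllib.request

# Same layer as the per-address query in the question
QUERY_URL = "https://hazards.fema.gov/gis/nfhl/rest/services/public/NFHL/MapServer/28/query"
CHUNK_SIZE = 500  # hypothetical value; tune to what the service accepts


def chunked(points, size):
    """Yield successive slices of at most `size` points."""
    for start in range(0, len(points), size):
        yield points[start:start + size]


def build_query_params(points):
    """Build the form fields for one multipoint query; points are [lon, lat]."""
    multipoint = {"points": points, "spatialReference": {"wkid": 4326}}
    return {
        "geometry": json.dumps(multipoint),
        "geometryType": "esriGeometryMultipoint",
        "inSR": "4326",
        "spatialRel": "esriSpatialRelIntersects",
        "outFields": "*",
        "returnGeometry": "true",
        "f": "json",
    }


def query_flood_zones(points, timeout=60):
    """POST one multipoint query and return the list of matching features."""
    body = urllib.parse.urlencode(build_query_params(points)).encode("utf-8")
    # Passing data= makes urlopen send a POST request
    with urllib.request.urlopen(QUERY_URL, data=body, timeout=timeout) as resp:
        return json.load(resp).get("features", [])


def query_all(points):
    """Query every chunk in turn and collect the returned features."""
    features = []
    for chunk in chunked(points, CHUNK_SIZE):
        features.extend(query_flood_zones(chunk))
    return features
```

With the multipoint built in the loop above, this would be called as query_all(multipoint["points"]). The same form fields could be sent with requests.post(QUERY_URL, data=params) if you prefer to stay with the requests library used in the question.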
