
How can I improve the performance of my script?

I have a "seed" GeoDataFrame (GDF) (RED) which contains a 0.5-degree global grid ((180*2)*(360*2) = 259200 cells). Each cell contains an absolute population estimate. In addition, I have a "leech" GDF (GREEN) with roughly 8250 adjoining non-regular shapes of various sizes (watersheds).

I wrote a script to allocate the population estimates to the geometries in the leech GDF based on the overlapping area between the grid cells (seed GDF) and the geometries in the leech GDF. The script works perfectly fine for my sample data (see below). However, once I run it on my actual data, it is very slow. I ran it overnight and the next morning only 27% of the calculations had been performed. I will have to run this script many times, and waiting two days each time is simply not an option.

After doing a bit of literature research, I already replaced plain for loops with for index, row in df.iterrows() (or is this effectively the same as a "conventional" Python for loop?), but it didn't bring about the performance improvement I had hoped for.

Any suggestions on how I can speed up my code? In twelve hours, my script processed only ~30000 out of ~200000 rows.

My expected output is the column leech_df['leeched_value'].


import geopandas as gpd
import time
from datetime import datetime
from shapely.geometry import Polygon

# =============================================================================
# Geometries for testing
# =============================================================================

polys1 = gpd.GeoSeries([Polygon([(0.00,0.00), (0.00,0.25), (0.25,0.25), (0.25,0.00)]),
                        Polygon([(0.00,0.25), (0.00,0.50), (0.25,0.50), (0.25,0.25)]),
                        Polygon([(0.00,0.50), (0.00,0.75), (0.25,0.75), (0.25,0.50)]),
                        Polygon([(0.25,0.00), (0.25,0.25), (0.50,0.25), (0.50,0.00)]),
                        Polygon([(0.25,0.25), (0.25,0.50), (0.50,0.50), (0.50,0.25)]),
                        Polygon([(0.25,0.50), (0.25,0.75), (0.50,0.75), (0.50,0.50)]),
                        Polygon([(0.50,0.00), (0.50,0.25), (0.75,0.25), (0.75,0.00)]),
                        Polygon([(0.50,0.25), (0.50,0.50), (0.75,0.50), (0.75,0.25)]),
                        Polygon([(0.50,0.50), (0.50,0.75), (0.75,0.75), (0.75,0.50)]),
                        ])

polys2 = gpd.GeoSeries([Polygon([(0.125,0.125), (0.125,0.375), (0.375,0.375), (0.375,0.125)]),
                        Polygon([(0.050,0.550), (0.050,0.700), (0.200,0.700), (0.200,0.550)]),
                        Polygon([(0.25,0.625), (0.25,0.375), (0.750,0.375), (0.750,0.625)]),
                        ])

seed_df = gpd.GeoDataFrame({'geometry': polys1, 'seed_value':[10,10,10,10,10,10,10,10,10]})
leech_df = gpd.GeoDataFrame({'geometry': polys2})

del polys1, polys2

seed_value = 'seed_value'

# =============================================================================
# Prepare DataFrames
# =============================================================================

start = time.time()

print('\n\nPrepare DataFrames ... ')

# Create a new index for the seed DF
# The old index will be copied into the seed DF as a new column
# and transferred into the merged DF
seed_df = seed_df.reset_index()
seed_df = seed_df.rename(columns={'index': 'seed_index'})

# Create a new index for the leech DF
# The old index will be copied into the leech DF as a new column
# and transferred into the merged DF
leech_df = leech_df.reset_index()
leech_df = leech_df.rename(columns={'index': 'leech_index'})

end = time.time()
print(end - start)

# Add the area to the geometries

start = time.time()

print('Calculating the area of the leech DF geometries ...')
leech_df['leech_area'] = leech_df['geometry'].area

print('Calculating the area of the seed DF geometries ...')
seed_df['seed_area'] = seed_df['geometry'].area

leech_df['leeched_value'] = 0.0

end = time.time()
print(end - start)

# =============================================================================
# Merge seed and leech data and count overlaps
# =============================================================================

start = time.time()

print('Merging DataFrames ... ')

merged_df = gpd.overlay(leech_df, seed_df, how='union')
# Drop NaNs
merged_df = merged_df.dropna(axis='rows')

# =============================================================================
# Allocate seed values
# =============================================================================

# Count how many leech geometries each seed geometry overlaps

print('Count overlaps ... ')

overlaps = merged_df['seed_index'].value_counts()

neglected_values = list()
one_overlaps_values = list()
more_overlaps_values = list()

no_overlaps = list()
one_overlaps = list()
more_overlaps = list()

end = time.time()
print(end - start)

start = time.time()

print('Allocate seed values ... ')

i = 0

for index, row in seed_df.iterrows(): # For each row in seed_df

    if row['seed_index'] % 1000 == 0:
        print(str(row['seed_index'])+' rows at '+str(datetime.now().strftime('%Y-%m-%d %H:%M:%S')))

    # overlaps maps seed_index -> number of overlapping leech geometries.
    # overlaps.get(..., 0) returns the fallback value 0 whenever the
    # seed_index does not appear in overlaps, i.e. no overlap exists.

    if overlaps.get(row['seed_index'],0) == 0: # If seed geometry does not have an overlap
        #print('Grid cell ' + str(i) + ' - ' + str(row['seed_index']) + ': No overlap')
        no_overlaps.append(1) # Count the values which have not been considered
        neglected_values.append(row[seed_value]) # Count the values which have not been considered
        i = i + 1
    elif overlaps.get(row['seed_index'],0) == 1:
        #print('Grid cell ' + str(i) + ' - ' + str(row['seed_index']) + ': One overlap')
        one_overlaps.append(1)
        # What is for row the leech index (with which leech geometry does an overlap exist)?
        temp_int = int(merged_df[merged_df['seed_index'] == row['seed_index']]['leech_index'])
        # For this leech index replace leeched_value with leeched_value + leeched_value
        seed_value_amount = int(seed_df[seed_df['seed_index'] == row['seed_index']][seed_value])
        leech_df.loc[temp_int,'leeched_value'] = leech_df.loc[temp_int,'leeched_value'] + seed_value_amount
        one_overlaps_values.append(row[seed_value]) # Track the values allocated via exactly one overlap
        i = i + 1
    elif overlaps.get(row['seed_index'],0) > 1:
        #print('Grid cell ' + str(i) + ' - ' + str(row['seed_index']) + ': More than one overlap')
        more_overlaps.append(1)
        # Create a DF which contains the overlaps of the seed geometry with geometries of the leech DF
        # (.copy() avoids pandas' SettingWithCopyWarning when adding columns below)
        temp_df = merged_df[merged_df['seed_index'] == row['seed_index']].copy()
        # Calculate the overlaying area
        temp_df['overlay_area'] = temp_df['geometry'].area
        # And compare this to the total overlapping area of this grid cell
        temp_df['rel_contribution'] = 0.0
        temp_df['rel_contribution'] = temp_df['overlay_area']/sum(temp_df.area)
        # Calculate the absolute distribution of the seed value to each leech row ('leech_index')
        temp_df['abs_contribution'] = temp_df[seed_value]*temp_df['rel_contribution']
        # Reset index
        temp_df = temp_df.reset_index(drop=True)
        more_overlaps_values.append(row[seed_value]) # Track the values allocated via more than one overlap
        i = i + 1
        #For each overlap between grid cell (seed) and leech geometry:
        for j in range(len(leech_df)):
            #print('==   LEECH ' + str(j) + '.' +str(i))
            # Create a DF which only contains the temp_df row for the specific leech geometry
            contribution = temp_df[temp_df['leech_index'] == j]
            # Save the contribution to the leech_df
            leech_df.loc[j,'leeched_value'] = leech_df.loc[j,'leeched_value'] + sum(contribution['abs_contribution'])

end = time.time()
print(end - start)
print('Done')

# =============================================================================
# Data check
# =============================================================================

print('>1 Overlaps: ' + str(sum(more_overlaps)) + ' (sum of values ' + str(sum(more_overlaps_values)) + ')' )
print('=1 Overlaps: ' + str(sum(one_overlaps)) + ' (sum of values ' + str(sum(one_overlaps_values)) + ')' )
print('No Overlaps: ' + str(sum(no_overlaps)) + ' (sum of neglected values ' + str(sum(neglected_values)) + ')' )

print('Unconsidered: ' + str(sum(seed_df[seed_value]) - (sum(more_overlaps_values)+sum(one_overlaps_values)+sum(neglected_values))))


# =============================================================================
# Plot
# =============================================================================

# Plot
base = seed_df.plot(color='red', edgecolor='black');
leech_df.plot(ax=base, color='green', alpha=0.5);

EDIT: pip freeze > requirements_output.txt returns:

affine==2.3.0
alabaster==0.7.12
appdirs==1.4.3
argh==0.26.2
asn1crypto==1.3.0
astroid==2.3.3
atomicwrites==1.3.0
attrs==19.3.0
autopep8==1.4.4
Babel==2.8.0
backcall==0.1.0
bcrypt==3.1.7
bleach==3.1.0
cachetools==4.0.0
Cartopy==0.17.0
certifi==2019.11.28
cffi==1.13.2
cftime==1.0.4.2
chardet==3.0.4
Click==7.0
click-plugins==1.1.1
cligj==0.5.0
cloudpickle==1.2.2
colorama==0.4.3
country-converter==0.6.6
cryptography==2.8
cycler==0.10.0
dask==2.10.0
datacube==1.7
decorator==4.4.1
defusedxml==0.6.0
Deprecated==1.2.7
descartes==1.1.0
diff-match-patch==20181111
docutils==0.16
entrypoints==0.3
Fiona==1.8.13
flake8==3.7.9
future==0.18.2
GDAL==3.0.2
geocube==0.0.10
geopandas==0.6.2
h5py==2.9.0
idna==2.8
imageio==2.6.1
imagesize==1.2.0
importlib-metadata==1.4.0
intervaltree==3.0.2
ipykernel==5.1.4
ipython==7.11.1
ipython-genutils==0.2.0
isort==4.3.21
jedi==0.14.1
Jinja2==2.10.3
joblib==0.14.1
jsonschema==3.2.0
jupyter-client==5.3.4
jupyter-core==4.6.1
keyring==21.1.0
kiwisolver==1.1.0
lark-parser==0.8.1
lazy-object-proxy==1.4.3
lxml==4.4.2
mapclassify==2.2.0
MarkupSafe==1.1.1
matplotlib==3.1.1
mccabe==0.6.1
mistune==0.8.4
mkl-fft==1.0.15
mkl-random==1.1.0
mkl-service==2.3.0
more-itertools==8.0.2
munch==2.5.0
nbconvert==5.6.1
nbformat==5.0.4
netCDF4==1.5.3
notebook==6.0.0
numpy==1.16.4
numpydoc==0.9.2
olefile==0.46
OWSLib==0.19.1
packaging==20.1
pandas==0.25.0
pandocfilters==1.4.2
paramiko==2.6.0
parso==0.6.0
pathtools==0.1.2
pexpect==4.8.0
pickleshare==0.7.5
Pillow==7.0.0
pluggy==0.13.1
prometheus-client==0.7.1
prompt-toolkit==3.0.3
psutil==5.6.7
psycopg2==2.8.4
pycodestyle==2.5.0
pycparser==2.19
pydocstyle==4.0.1
pyepsg==0.4.0
pyflakes==2.1.1
Pygments==2.5.2
pykdtree==1.3.1
pylint==2.4.4
pymatreader==0.0.20
Pympler==0.7
pymrio==0.4.0
PyNaCl==1.3.0
pyOpenSSL==19.1.0
pyparsing==2.4.6
pyPEG2==2.15.2
pyproj==2.4.2.post1
PyQt5-sip==4.19.19
pyrsistent==0.15.7
pyshp==2.1.0
PySocks==1.7.1
python-dateutil==2.8.1
python-jsonrpc-server==0.3.4
python-language-server==0.31.7
pytz==2019.3
pywin32==227
pywin32-ctypes==0.2.0
pywinpty==0.5.7
PyYAML==5.2
pyzmq==18.1.0
QDarkStyle==2.8
QtAwesome==0.6.1
qtconsole==4.6.0
QtPy==1.9.0
rasterio==1.1.2
requests==2.22.0
rioxarray==0.0.19
rope==0.16.0
Rtree==0.9.3
scikit-learn==0.22.1
scipy==1.3.2
Send2Trash==1.5.0
Shapely==1.7.0
singledispatch==3.4.0.3
six==1.14.0
snowballstemmer==2.0.0
snuggs==1.4.7
sortedcontainers==2.1.0
Sphinx==2.3.1
sphinxcontrib-applehelp==1.0.1
sphinxcontrib-devhelp==1.0.1
sphinxcontrib-htmlhelp==1.0.2
sphinxcontrib-jsmath==1.0.1
sphinxcontrib-qthelp==1.0.2
sphinxcontrib-serializinghtml==1.1.3
spyder==4.0.0
spyder-kernels==1.8.1
spyder-notebook==0.1.4
SQLAlchemy==1.3.13
terminado==0.8.3
testpath==0.4.4
toolz==0.10.0
tornado==6.0.3
tqdm==4.43.0
traitlets==4.3.3
ujson==1.35
urllib3==1.25.8
watchdog==0.9.0
wcwidth==0.1.7
webencodings==0.5.1
win-inet-pton==1.1.0
wincertstore==0.2
wrapt==1.11.2
xarray==0.14.1
xlrd==1.2.0
xmltodict==0.12.0
yapf==0.28.0
zipp==0.6.0

Introduction

It might be worth profiling your code in detail to get precise insight into where your bottleneck actually is.
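
A minimal profiling sketch, assuming the allocation logic is wrapped in a function (run_allocation is a hypothetical name used for illustration):

import cProfile
import pstats

# Profile the hypothetical run_allocation() wrapper and store the raw stats:
cProfile.run('run_allocation()', 'profile_stats')

# Show the ten most expensive calls by cumulative time:
stats = pstats.Stats('profile_stats')
stats.sort_stats('cumulative').print_stats(10)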

Below are some suggestions to improve your script's performance:

  • Avoid creating a column only to overwrite it immediately; the first of the following two assignments is dead code:

temp_df['rel_contribution'] = 0.0
temp_df['rel_contribution'] = temp_df['overlay_area']/sum(temp_df.area)
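
A single assignment does the job (a sketch using the column names from the question; note that overlay_area already holds the per-row areas, so recomputing temp_df.area is unnecessary):

temp_df['rel_contribution'] = temp_df['overlay_area'] / temp_df['overlay_area'].sum()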

Probable bottleneck

  • Boolean indexing is a great capability, but having it nested in a loop is probably the biggest performance issue in your algorithm. Your bottleneck most likely comes from there (see the sketch after this list).

Eg. this line performs boolean indexing within a loop:

temp_df = merged_df[merged_df['seed_index'] == row['seed_index']]
  • Make use of indexes, as they generally lower the time complexity of algorithms drastically (such as the rtree index suggested in the comments; choosing the best index requires experience);
  • When using Pandas, try to avoid loops: there is often an existing "optimized" shortcut that performs your aggregation without resorting to an explicit loop. It will ease maintenance and make your code more readable.
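
To illustrate both points: a single groupby visits each group of rows once, whereas the boolean mask above rescans all of merged_df on every iteration of the outer loop. A minimal sketch, reusing the column names from the question:

# One pass over merged_df instead of one full scan per seed geometry:
for seed_index, group in merged_df.groupby('seed_index'):
    # 'group' contains exactly the rows the boolean mask would have selected
    overlay_area = group.geometry.area
    rel_contribution = overlay_area / overlay_area.sum()
    # ... allocate the seed value to the leech geometries here ...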

Towards a solution

Also notice that GeoPandas has a spatial join method, geopandas.sjoin (which relies on rtree), that can join rows based on spatial operations (the default is intersection).

I have the feeling that you can solve your problem with a left intersection join followed by a group-by. You could then apply different logic to the different groups of rows and aggregate.

For the sake of the example, let's say we want to count the number of intersections and the intersected area for all polygons in seed_df, and distribute seed_value based on the ratio of intersecting areas. We could achieve it like this:

# Merge datafarmes on geometry intersection
# keep geometries in seed_df which do not intersect anything at all:
df = gpd.sjoin(seed_df, leech_df, how='left').reset_index()
df = df.rename(columns={'index': 'seed_id', 'geometry': 'seed_geom', 'index_right': 'leech_id'})

# Add leech_df geometry to merged dataframe:
df = df.merge(leech_df, left_on='leech_id', right_index=True, how='left')
df = df.rename(columns={'geometry': 'leech_geom'})

# Create a function computing the intersection area (fault tolerant:
# rows without a matching leech geometry return None, i.e. NaN):
def func(x):
    if x['leech_geom']:
        return x['seed_geom'].intersection(x['leech_geom']).area

# Apply function (the column is named int_area, matching the aggregation below):
df['int_area'] = df.apply(func, axis=1)
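
As a side note, a row-wise apply is itself a Python-level loop. If it proves slow on the full dataset, a vectorized variant is possible (a sketch, assuming GeoPandas' element-wise set operations, which align the two series on their index):

# Compute intersection areas only for rows that matched a leech polygon;
# unmatched rows are left as NaN by the index-aligned assignment:
matched = df.dropna(subset=['leech_id'])
seed_geoms = gpd.GeoSeries(matched['seed_geom'])
leech_geoms = gpd.GeoSeries(matched['leech_geom'])
df['int_area'] = seed_geoms.intersection(leech_geoms).area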

The results look like this (df.tail(4)):

    seed_id                                          seed_geom  seed_value  \
8         5  POLYGON ((0.25000 0.50000, 0.25000 0.75000, 0....          10   
9         6  POLYGON ((0.50000 0.00000, 0.50000 0.25000, 0....          10   
10        7  POLYGON ((0.50000 0.25000, 0.50000 0.50000, 0....          10   
11        8  POLYGON ((0.50000 0.50000, 0.50000 0.75000, 0....          10   

    leech_id                                         leech_geom      int_area  
8        2.0  POLYGON ((0.25000 0.62500, 0.25000 0.37500, 0....       0.03125  
9        NaN                                               None           NaN  
10       2.0  POLYGON ((0.25000 0.62500, 0.25000 0.37500, 0....       0.03125  
11       2.0  POLYGON ((0.25000 0.62500, 0.25000 0.37500, 0....       0.03125  

At this point, we group by and aggregate:

# Group by and aggregate:
agg = df.groupby('seed_id')['int_area'].agg(['count', 'sum']).reset_index()
agg = agg.rename(columns={'count': 'int_count', 'sum': 'int_sum'})

It leads to:

   seed_id  int_count   int_sum
0        0          1  0.015625
1        1          2  0.015625
2        2          2  0.022500
3        3          1  0.015625
4        4          2  0.046875
5        5          1  0.031250
6        6          0  0.000000
7        7          1  0.031250
8        8          1  0.031250

Then, we merge the aggregate back onto our original dataframe and perform the final computation:

final = df.merge(agg)
final['leech_value'] = final['seed_value']*final['int_area']/final['int_sum']

The final result is:

    seed_id  seed_value  leech_id  int_area  int_count   int_sum  leech_value
0         0          10       0.0  0.015625          1  0.015625    10.000000
1         1          10       0.0  0.015625          2  0.015625    10.000000
2         1          10       2.0  0.000000          2  0.015625     0.000000
3         2          10       1.0  0.022500          2  0.022500    10.000000
4         2          10       2.0  0.000000          2  0.022500     0.000000
5         3          10       0.0  0.015625          1  0.015625    10.000000
6         4          10       0.0  0.015625          2  0.046875     3.333333
7         4          10       2.0  0.031250          2  0.046875     6.666667
8         5          10       2.0  0.031250          1  0.031250    10.000000
9         6          10       NaN       NaN          0  0.000000          NaN
10        7          10       2.0  0.031250          1  0.031250    10.000000
11        8          10       2.0  0.031250          1  0.031250    10.000000

This allocates seed_value to the polygons in leech_df based on the ratio of the overlapping areas.

If you wish to know how leech_value is distributed over the leech polygons, aggregate once more:

final.groupby('leech_id')['leech_value'].sum()

It returns:

leech_id
0.0    33.333333
1.0    10.000000
2.0    36.666667
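
As a quick consistency check, mirroring the "Data check" block in the question (a sketch; final and seed_df as defined above), the difference between the seed total and the allocated total should equal the value of the seed cells without any overlap:

allocated = final['leech_value'].sum()     # 80.0 in this example
total = seed_df['seed_value'].sum()        # 90
print('Unconsidered:', total - allocated)  # 10.0: the single cell with no overlap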

Graphically: [figure]

Note

To make sjoin work you need to install rtree, which in turn requires the libspatialindex-dev library. On Debian it boils down to:

apt-get update && apt-get install libspatialindex-dev
python3 -m pip install rtree
python3 -m pip install --upgrade geopandas
