How can I improve the performance of my script?
I have a "seed" GeoDataFrame (GDF) (RED) which contains a 0.5 arc minutes global grid ((180*2)*(360*2) = 259200 cells). Each cell contains an absolute population estimate. In addition, I have a "leech" GDF (GREEN) with roughly 8250 adjoining non-regular shapes of various sizes (watersheds).
I wrote a script to allocate the population estimates to the geometries in the leech GDF based on the overlapping area between the grid cells (seed GDF) and the geometries in the leech GDF. The script works perfectly fine for my sample data (see below). However, once I run it on my actual data, it is very slow. I ran it overnight and the next morning only 27% of the calculations had been performed. I will have to run this script many times, and waiting two days each time is simply not an option.
After doing a bit of research, I already replaced plain for loops with for index, row in df.iterrows() (or is this the same as a "conventional" Python for loop?), but it didn't bring the performance improvement I had hoped for.
Any suggestions on how I can speed up my code? In twelve hours, my script processed only ~30000 rows out of ~200000.
My expected output is the column leech_df['leeched_value'].
import geopandas as gpd
import time
from datetime import datetime
from shapely.geometry import Polygon
# =============================================================================
# Geometries for testing
# =============================================================================
polys1 = gpd.GeoSeries([Polygon([(0.00,0.00), (0.00,0.25), (0.25,0.25), (0.25,0.00)]),
Polygon([(0.00,0.25), (0.00,0.50), (0.25,0.50), (0.25,0.25)]),
Polygon([(0.00,0.50), (0.00,0.75), (0.25,0.75), (0.25,0.50)]),
Polygon([(0.25,0.00), (0.25,0.25), (0.50,0.25), (0.50,0.00)]),
Polygon([(0.25,0.25), (0.25,0.50), (0.50,0.50), (0.50,0.25)]),
Polygon([(0.25,0.50), (0.25,0.75), (0.50,0.75), (0.50,0.50)]),
Polygon([(0.50,0.00), (0.50,0.25), (0.75,0.25), (0.75,0.00)]),
Polygon([(0.50,0.25), (0.50,0.50), (0.75,0.50), (0.75,0.25)]),
Polygon([(0.50,0.50), (0.50,0.75), (0.75,0.75), (0.75,0.50)]),
])
polys2 = gpd.GeoSeries([Polygon([(0.125,0.125), (0.125,0.375), (0.375,0.375), (0.375,0.125)]),
Polygon([(0.050,0.550), (0.050,0.700), (0.200,0.700), (0.200,0.550)]),
Polygon([(0.25,0.625), (0.25,0.375), (0.750,0.375), (0.750,0.625)]),
])
seed_df = gpd.GeoDataFrame({'geometry': polys1, 'seed_value':[10,10,10,10,10,10,10,10,10]})
leech_df = gpd.GeoDataFrame({'geometry': polys2})
del polys1, polys2
seed_value = 'seed_value'
# =============================================================================
# Prepare DataFrames
# =============================================================================
start = time.time()
print('\n\nPrepare DataFrames ... ')
# Create a new index for the seed DF
# The old index will be copied into the seed DF as a new column
# and transferred into the merged DF
seed_df = seed_df.reset_index()
seed_df = seed_df.rename(columns={'index': 'seed_index'})
# Create a new index for the leech DF
# The old index will be copied into the leech DF as a new column
# and transferred into the merged DF
leech_df = leech_df.reset_index()
leech_df = leech_df.rename(columns={'index': 'leech_index'})
end = time.time()
print(end - start)
# Add the area to the geometries
start = time.time()
print('Calculating the area of the leech DF geometries ...')
leech_df['leech_area'] = leech_df['geometry'].area
print('Calculating the area of the seed DF geometries ...')
seed_df['seed_area'] = seed_df['geometry'].area
leech_df['leeched_value'] = 0.0
end = time.time()
print(end - start)
# =============================================================================
# Merge seed and leech data and count overlaps
# =============================================================================
start = time.time()
print('Merging DataFrames ... ')
merged_df = gpd.overlay(leech_df, seed_df, how='union')
# Drop NaNs
merged_df = merged_df.dropna(axis='rows')
# =============================================================================
# Allocate seed values
# =============================================================================
# Count with how many leech geometries each seed geometry overlaps
print('Count overlaps ... ')
overlaps = merged_df['seed_index'].value_counts()
neglected_values = list()
one_overlaps_values = list()
more_overlaps_values = list()
no_overlaps = list()
one_overlaps = list()
more_overlaps = list()
end = time.time()
print(end - start)
start = time.time()
print('Allocate seed values ... ')
i = 0
for index, row in seed_df.iterrows(): # For each row in seed_df MAX 70123
    if row['seed_index'] % 1000 == 0:
        print(str(row['seed_index'])+'k at '+str(datetime.now().strftime('%Y-%m-%d %H:%M:%S')))
    # If the seed index does not turn up in overlaps, get() returns 0,
    # i.e. the seed geometry has no overlap at all
    if overlaps.get(row['seed_index'], 0) == 0: # Seed geometry has no overlap
        #print('Grid cell ' + str(i) + ' - ' + str(row['seed_index']) + ': No overlap')
        no_overlaps.append(1) # Count the cells which have not been considered
        neglected_values.append(row[seed_value]) # Track the values which have not been considered
        i = i + 1
    elif overlaps.get(row['seed_index'], 0) == 1: # Exactly one overlap
        #print('Grid cell ' + str(i) + ' - ' + str(row['seed_index']) + ': One overlap')
        one_overlaps.append(1)
        # With which leech geometry does the overlap exist?
        temp_int = int(merged_df[merged_df['seed_index'] == row['seed_index']]['leech_index'])
        # For this leech index, add the full seed value to the leeched value
        seed_value_amount = int(seed_df[seed_df['seed_index'] == row['seed_index']][seed_value])
        leech_df.loc[temp_int,'leeched_value'] = leech_df.loc[temp_int,'leeched_value'] + seed_value_amount
        one_overlaps_values.append(row[seed_value]) # Track the values which have been considered
        i = i + 1
    elif overlaps.get(row['seed_index'], 0) > 1: # More than one overlap
        #print('Grid cell ' + str(i) + ' - ' + str(row['seed_index']) + ': More than one overlap')
        more_overlaps.append(1)
        # Create a DF which contains the overlaps of this seed geometry with geometries of the leech DF
        temp_df = merged_df[merged_df['seed_index'] == row['seed_index']]
        # Calculate the overlapping area
        temp_df['overlay_area'] = temp_df['geometry'].area
        # And compare this to the total overlapping area of this grid cell
        temp_df['rel_contribution'] = 0.0
        temp_df['rel_contribution'] = temp_df['overlay_area']/sum(temp_df.area)
        # Calculate the absolute share of the seed value for each leech row ('leech_index')
        temp_df['abs_contribution'] = temp_df[seed_value]*temp_df['rel_contribution']
        # Reset index
        temp_df = temp_df.reset_index(drop=True)
        more_overlaps_values.append(row[seed_value]) # Track the values which have been considered
        i = i + 1
        # For each overlap between this grid cell (seed) and a leech geometry:
        for j in range(len(leech_df)):
            #print('== LEECH ' + str(j) + '.' +str(i))
            # Select the temp_df rows for this specific leech geometry
            contribution = temp_df[temp_df['leech_index'] == j]
            # Save the contribution to the leech_df
            leech_df.loc[j,'leeched_value'] = leech_df.loc[j,'leeched_value'] + sum(contribution['abs_contribution'])
end = time.time()
print(end - start)
print('Done')
# =============================================================================
# Data check
# =============================================================================
print('>1 Overlaps: ' + str(sum(more_overlaps)) + ' (sum of values ' + str(sum(more_overlaps_values)) + ')' )
print('=1 Overlaps: ' + str(sum(one_overlaps)) + ' (sum of values ' + str(sum(one_overlaps_values)) + ')' )
print('No Overlaps: ' + str(sum(no_overlaps)) + ' (sum of neglected values ' + str(sum(neglected_values)) + ')' )
print('Unconsidered: ' + str(sum(seed_df[seed_value]) - (sum(more_overlaps_values)+sum(one_overlaps_values)+sum(neglected_values))))
# =============================================================================
# Plot
# =============================================================================
# Plot
base = seed_df.plot(color='red', edgecolor='black');
leech_df.plot(ax=base, color='green', alpha=0.5);
EDIT: pip freeze > requirements_output.txt returns:
affine==2.3.0
alabaster==0.7.12
appdirs==1.4.3
argh==0.26.2
asn1crypto==1.3.0
astroid==2.3.3
atomicwrites==1.3.0
attrs==19.3.0
autopep8==1.4.4
Babel==2.8.0
backcall==0.1.0
bcrypt==3.1.7
bleach==3.1.0
cachetools==4.0.0
Cartopy==0.17.0
certifi==2019.11.28
cffi==1.13.2
cftime==1.0.4.2
chardet==3.0.4
Click==7.0
click-plugins==1.1.1
cligj==0.5.0
cloudpickle==1.2.2
colorama==0.4.3
country-converter==0.6.6
cryptography==2.8
cycler==0.10.0
dask==2.10.0
datacube==1.7
decorator==4.4.1
defusedxml==0.6.0
Deprecated==1.2.7
descartes==1.1.0
diff-match-patch==20181111
docutils==0.16
entrypoints==0.3
Fiona==1.8.13
flake8==3.7.9
future==0.18.2
GDAL==3.0.2
geocube==0.0.10
geopandas==0.6.2
h5py==2.9.0
idna==2.8
imageio==2.6.1
imagesize==1.2.0
importlib-metadata==1.4.0
intervaltree==3.0.2
ipykernel==5.1.4
ipython==7.11.1
ipython-genutils==0.2.0
isort==4.3.21
jedi==0.14.1
Jinja2==2.10.3
joblib==0.14.1
jsonschema==3.2.0
jupyter-client==5.3.4
jupyter-core==4.6.1
keyring==21.1.0
kiwisolver==1.1.0
lark-parser==0.8.1
lazy-object-proxy==1.4.3
lxml==4.4.2
mapclassify==2.2.0
MarkupSafe==1.1.1
matplotlib==3.1.1
mccabe==0.6.1
mistune==0.8.4
mkl-fft==1.0.15
mkl-random==1.1.0
mkl-service==2.3.0
more-itertools==8.0.2
munch==2.5.0
nbconvert==5.6.1
nbformat==5.0.4
netCDF4==1.5.3
notebook==6.0.0
numpy==1.16.4
numpydoc==0.9.2
olefile==0.46
OWSLib==0.19.1
packaging==20.1
pandas==0.25.0
pandocfilters==1.4.2
paramiko==2.6.0
parso==0.6.0
pathtools==0.1.2
pexpect==4.8.0
pickleshare==0.7.5
Pillow==7.0.0
pluggy==0.13.1
prometheus-client==0.7.1
prompt-toolkit==3.0.3
psutil==5.6.7
psycopg2==2.8.4
pycodestyle==2.5.0
pycparser==2.19
pydocstyle==4.0.1
pyepsg==0.4.0
pyflakes==2.1.1
Pygments==2.5.2
pykdtree==1.3.1
pylint==2.4.4
pymatreader==0.0.20
Pympler==0.7
pymrio==0.4.0
PyNaCl==1.3.0
pyOpenSSL==19.1.0
pyparsing==2.4.6
pyPEG2==2.15.2
pyproj==2.4.2.post1
PyQt5-sip==4.19.19
pyrsistent==0.15.7
pyshp==2.1.0
PySocks==1.7.1
python-dateutil==2.8.1
python-jsonrpc-server==0.3.4
python-language-server==0.31.7
pytz==2019.3
pywin32==227
pywin32-ctypes==0.2.0
pywinpty==0.5.7
PyYAML==5.2
pyzmq==18.1.0
QDarkStyle==2.8
QtAwesome==0.6.1
qtconsole==4.6.0
QtPy==1.9.0
rasterio==1.1.2
requests==2.22.0
rioxarray==0.0.19
rope==0.16.0
Rtree==0.9.3
scikit-learn==0.22.1
scipy==1.3.2
Send2Trash==1.5.0
Shapely==1.7.0
singledispatch==3.4.0.3
six==1.14.0
snowballstemmer==2.0.0
snuggs==1.4.7
sortedcontainers==2.1.0
Sphinx==2.3.1
sphinxcontrib-applehelp==1.0.1
sphinxcontrib-devhelp==1.0.1
sphinxcontrib-htmlhelp==1.0.2
sphinxcontrib-jsmath==1.0.1
sphinxcontrib-qthelp==1.0.2
sphinxcontrib-serializinghtml==1.1.3
spyder==4.0.0
spyder-kernels==1.8.1
spyder-notebook==0.1.4
SQLAlchemy==1.3.13
terminado==0.8.3
testpath==0.4.4
toolz==0.10.0
tornado==6.0.3
tqdm==4.43.0
traitlets==4.3.3
ujson==1.35
urllib3==1.25.8
watchdog==0.9.0
wcwidth==0.1.7
webencodings==0.5.1
win-inet-pton==1.1.0
wincertstore==0.2
wrapt==1.11.2
xarray==0.14.1
xlrd==1.2.0
xmltodict==0.12.0
yapf==0.28.0
zipp==0.6.0
It might be worthwhile to profile your code in detail to get precise insight into your bottleneck. Below are some suggestions to already improve your script's performance:
- Avoid list.append(1) to count occurrences; use collections.Counter instead.
- Avoid pandas.DataFrame.iterrows; use pandas.DataFrame.itertuples instead.
- Avoid initializing a column only to overwrite it on the next line (use pandas.DataFrame.fillna instead where defaults are needed). Eg. these lines:

temp_df['rel_contribution'] = 0.0
temp_df['rel_contribution'] = temp_df['overlay_area']/sum(temp_df.area)

- Avoid boolean indexing within loops. Eg. this line performs boolean indexing within a loop:

temp_df = merged_df[merged_df['seed_index'] == row['seed_index']]

- Use a spatial index (rtree, as suggested in the comments; choosing the best index requires experience).
- Also notice that GeoPandas has a spatial join method, geopandas.sjoin (which relies on rtree), that can join rows based on spatial operations (the default is intersection).
I have the feeling that you can solve your problem by making a left intersect join followed by a group-by. Then you could apply different logic to different groups of rows and aggregate.
For the sake of example, let's say we want to count the number of intersections and the area covered for all polygons in seed_df, and distribute seed_value based on the ratio of intersecting areas. We could achieve it like this:
# Merge datafarmes on geometry intersection
# keep geometries in seed_df which does not intersects at all:
df = gpd.sjoin(seed_df, leech_df, how='left').reset_index()
df = df.rename(columns={'index': 'seed_id', 'geometry': 'seed_geom', 'index_right': 'leech_id'})
# Add leech_df geometry to merged dataframe:
df = df.merge(leech_df, left_on='leech_id', right_index=True, how='left')
df = df.rename(columns={'geometry': 'leech_geom'})
# Create a function computing the intersection area
# (fault tolerant towards rows without a matching leech geometry):
def func(x):
    if x['leech_geom']:
        return x['seed_geom'].intersection(x['leech_geom']).area

# Apply function (the column is named 'int_area', as used below):
df['int_area'] = df.apply(func, axis=1)
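As an aside, the if x['leech_geom']: guard is what keeps the function from crashing on unmatched rows, since the left join leaves None geometries there. The same fault-tolerant pattern on plain pandas data, as a minimal sketch (pd.notna stands in for the geometry truthiness test, which would also treat empty geometries as falsy):

```python
import pandas as pd

df = pd.DataFrame({'a': [2.0, None, 3.0]})

def safe_square(x):
    # Skip rows where the value is missing (mirrors the None-geometry check)
    if pd.notna(x['a']):
        return x['a'] ** 2

df['sq'] = df.apply(safe_square, axis=1)
# rows with a missing 'a' end up as NaN in 'sq'
```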
The result looks like this ( df.tail(4) ):
seed_id seed_geom seed_value \
8 5 POLYGON ((0.25000 0.50000, 0.25000 0.75000, 0.... 10
9 6 POLYGON ((0.50000 0.00000, 0.50000 0.25000, 0.... 10
10 7 POLYGON ((0.50000 0.25000, 0.50000 0.50000, 0.... 10
11 8 POLYGON ((0.50000 0.50000, 0.50000 0.75000, 0.... 10
leech_id leech_geom int_area
8 2.0 POLYGON ((0.25000 0.62500, 0.25000 0.37500, 0.... 0.03125
9 NaN None NaN
10 2.0 POLYGON ((0.25000 0.62500, 0.25000 0.37500, 0.... 0.03125
11 2.0 POLYGON ((0.25000 0.62500, 0.25000 0.37500, 0.... 0.03125
At this point, we group by and aggregate:
# Group by and aggregate:
agg = df.groupby('seed_id')['int_area'].agg(['count', 'sum']).reset_index()
agg = agg.rename(columns={'count': 'int_count', 'sum': 'int_sum'})
It leads to:
seed_id int_count int_sum
0 0 1 0.015625
1 1 2 0.015625
2 2 2 0.022500
3 3 1 0.015625
4 4 2 0.046875
5 5 1 0.031250
6 6 0 0.000000
7 7 1 0.031250
8 8 1 0.031250
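Note how the seed with no intersection at all (seed_id 6) ends up with int_count 0 and int_sum 0.0: count ignores NaN, and a sum over only-NaN values defaults to 0.0. A minimal pandas sketch of that behaviour (with made-up values):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'seed_id':  [0, 1, 1, 2],
                   'int_area': [0.1, 0.2, 0.05, np.nan]})

agg = df.groupby('seed_id')['int_area'].agg(['count', 'sum']).reset_index()
# seed_id 2 only has a NaN row: its count is 0 and its sum is 0.0
```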
Then, we merge the aggregate back into our original dataframe and perform the final computations:
final = df.merge(agg)
final['leech_value'] = final['seed_value']*final['int_area']/final['int_sum']
The final result is:
seed_id seed_value leech_id int_area int_count int_sum leech_value
0 0 10 0.0 0.015625 1 0.015625 10.000000
1 1 10 0.0 0.015625 2 0.015625 10.000000
2 1 10 2.0 0.000000 2 0.015625 0.000000
3 2 10 1.0 0.022500 2 0.022500 10.000000
4 2 10 2.0 0.000000 2 0.022500 0.000000
5 3 10 0.0 0.015625 1 0.015625 10.000000
6 4 10 0.0 0.015625 2 0.046875 3.333333
7 4 10 2.0 0.031250 2 0.046875 6.666667
8 5 10 2.0 0.031250 1 0.031250 10.000000
9 6 10 NaN NaN 0 0.000000 NaN
10 7 10 2.0 0.031250 1 0.031250 10.000000
11 8 10 2.0 0.031250 1 0.031250 10.000000
This allocates seed_value to the polygons in leech_df based on the ratio of overlapping intersection.
If you wish to know how leech_value is distributed to the leech polygons, aggregate once more:
final.groupby('leech_id')['leech_value'].sum()
It returns:
leech_id
0.0 33.333333
1.0 10.000000
2.0 36.666667
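Worth noting: because the ratios int_area/int_sum add up to 1 within each seed, the allocation conserves each seed's value. A minimal check of the formula, using the two rows for seed_id 4 from the table above:

```python
import pandas as pd

final = pd.DataFrame({'seed_id':    [4, 4],
                      'seed_value': [10, 10],
                      'int_area':   [0.015625, 0.031250]})

# Total intersecting area per seed:
final['int_sum'] = final.groupby('seed_id')['int_area'].transform('sum')
# Proportional allocation of the seed value:
final['leech_value'] = final['seed_value'] * final['int_area'] / final['int_sum']
# the two shares (1/3 and 2/3 of 10) sum back to the original seed value
```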
Graphically: (figure not reproduced)
To make sjoin work you need to install rtree, and before that you need to install the libspatialindex-dev library. On Debian it boils down to:
apt-get update && apt-get install libspatialindex-dev
python3 -m pip install --upgrade geopandas