
Fast and efficient way of joining a csv file to a shapefile

I'm trying to join a csv file with millions of rows to a shapefile using JoinField, but it's taking forever, and when the join completes, every row of the joined field contains 0. I also tried using a dictionary with UpdateCursor, but the join didn't happen. Is there a better way to do this?

The JoinField code I used is:

arcpy.MakeFeatureLayer_management("mukey.shp", "mapunit")
arcpy.CopyRows_management(kvalues_path, "kvalues")   # to give the table OIDs
arcpy.JoinField_management("mapunit", "mukey", "kvalues", "mukey", "ksat_mday")

Here "mukey" is the common field between the csv file and the shapefile, and "ksat_mday" is the field I want to join to the shapefile.

The dictionary-with-UpdateCursor code I used was written to replace a join between two feature classes. Maybe it didn't work because I was joining a csv file to a shapefile rather than two feature classes. The code was taken from https://community.esri.com/t5/python-blog/turbo-charging-data-manipulation-with-python/ba-p/884079 .
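For reference, the dictionary approach adapts to a csv source roughly like this (a sketch, not the blog post's exact code; the field names `mukey` and `ksat_mday` come from the question, and the csv is read with Python's built-in `csv` module so arcpy never has to treat it as a table):

```python
import csv

def build_lookup(csv_path):
    """Build a lookup dictionary from the csv: mukey -> ksat_mday."""
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)
        return {row["mukey"]: float(row["ksat_mday"]) for row in reader}

# With the dictionary in hand, a single pass with arcpy.da.UpdateCursor
# writes the values into the shapefile (uncomment in an ArcGIS environment;
# the ksat_mday field must already exist on the shapefile):
#
# import arcpy
# lookup = build_lookup(kvalues_path)
# with arcpy.da.UpdateCursor("mukey.shp", ["mukey", "ksat_mday"]) as cursor:
#     for mukey, _ in cursor:
#         if mukey in lookup:
#             cursor.updateRow([mukey, lookup[mukey]])
```

One common reason the cursor join silently "doesn't happen" is a type mismatch: if `mukey` is a string in the csv but a number in the shapefile (or vice versa), the `in lookup` test never matches, so normalizing both sides to strings is worth checking.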

Usually such joins are faster with GeoPandas than with arcpy, but since you said you are working with millions of rows, which implies very heavy data frames, here is my suggestion:

GeoPandas and Pandas are relatively slow libraries because they use only one CPU core at a time. The best new alternative is Polars, whose data-frame API works much like Pandas; however, by parallelizing the workflow across all CPU cores, it can return results in seconds rather than days. All you need to do is:

Install Polars in your environment:

$ pip install polars

or

$ conda install polars

Then use this syntax to convert your tables to an Excel sheet in the same environment where you are running arcpy:

arcpy.conversion.TableToExcel(Input_Table, Output_Excel_File, {Use_field_alias_as_column_header}, {Use_domain_and_subtype_description})

Then you can easily read the tables you have and join them with Polars, which works much like Pandas and GeoPandas.

In the end, make sure to regenerate your shapefile by converting the joined sheet back to a table:

arcpy.conversion.ExcelToTable(Input_Excel_File, Output_Table, {Sheet}, {field_names_row}, {cell_range})

Please let me know if you run into any issues with this solution.
