
Convert GeoJSON with nested lists to a pandas DataFrame

I have a massive GeoJSON in this form:

 {'features': [{'properties': {'MARKET': 'Albany',
    'geometry': {'coordinates': [[[-74.264948, 42.419877, 0],
       [-74.262041, 42.425856, 0],
       [-74.261175, 42.427631, 0],
       [-74.260384, 42.429253, 0]]],
     'type': 'Polygon'}}},
  {'properties': {'MARKET': 'Albany',
    'geometry': {'coordinates': [[[-73.929627, 42.078788, 0],
       [-73.929114, 42.081658, 0]]],
     'type': 'Polygon'}}},
  {'properties': {'MARKET': 'Albuquerque',
    'geometry': {'coordinates': [[[-74.769198, 43.114089, 0],
       [-74.76786, 43.114496, 0],
       [-74.766474, 43.114656, 0]]],
     'type': 'Polygon'}}}],
 'type': 'FeatureCollection'}

After reading the json:

import json
with open('x.json') as f:
    data = json.load(f)

I read the values into a list and then into a dataframe:

import pandas as pd
from itertools import chain

# to get a set of all unique markets
mkt = set(f['properties']['MARKET'] for f in data['features'])

# to create a list of (market, flattened coordinate list) pairs;
# note 'geometry' sits inside 'properties' in this file
markets = [(f['properties']['MARKET'],
            list(chain.from_iterable(f['properties']['geometry']['coordinates'])))
           for f in data['features']]

df = pd.DataFrame(markets, columns=['a', 'b'])

First few rows of df are:

      a       b
0   Albany  [[-74.264948, 42.419877, 0], [-74.262041, 42.4...
1   Albany  [[-73.929627, 42.078788, 0], [-73.929114, 42.0...
2   Albuquerque  [[-74.769198, 43.114089, 0], [-74.76786, 43.11...

Then to unnest the nested list in column b, I used pandas concat:

df1 = pd.concat([df.iloc[:,0:1], df['b'].apply(pd.Series)], axis=1)

But this is creating 8070 columns with many NaNs. Is there a way to group all the latitudes and longitudes by the Market (column a)? A long-format DataFrame of roughly a million rows is desired.

Desired output is:

mkt         lat         long 
Albany      42.419877   -74.264948
Albany      42.078788   -73.929627
..
Albuquerque  35.105361   -106.640342

Please note that the trailing zero in each list element ([-74.769198, 43.114089, 0]) needs to be ignored.
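For reference, with pandas >= 1.1 the nested column can be flattened directly with DataFrame.explode, which produces exactly this long format (the trailing zero is simply not selected). A minimal sketch using two of the sample rows above:

```python
import pandas as pd

# sample data in the same shape as columns 'a' and 'b' above
df = pd.DataFrame({
    'a': ['Albany', 'Albany'],
    'b': [[[-74.264948, 42.419877, 0], [-74.262041, 42.425856, 0]],
          [[-73.929627, 42.078788, 0]]],
})

# one row per [long, lat, 0] triple (ignore_index needs pandas >= 1.1)
long_df = df.explode('b', ignore_index=True)

# GeoJSON stores [longitude, latitude, elevation]; the elevation column is dropped
coords = pd.DataFrame(long_df['b'].tolist(), columns=['long', 'lat', 'elev'])
result = pd.concat([long_df['a'].rename('mkt'), coords[['lat', 'long']]], axis=1)
```

This yields one (mkt, lat, long) row per coordinate triple, which scales to the million-row case without creating wide NaN-filled columns.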

Something like this??

import pandas as pd

# json_normalize is a top-level pandas function since pandas 1.0
df = pd.json_normalize(geojson["features"])

coords = 'properties.geometry.coordinates'

df2 = (df[coords].apply(lambda r: [(i[0],i[1]) for i in r[0]])
           .apply(pd.Series).stack()
           .reset_index(level=1).rename(columns={0:coords,"level_1":"point"})
           .join(df.drop(columns=coords), how='left')).reset_index(level=0)

# GeoJSON stores [longitude, latitude], so unpack in that order
df2[['long','lat']] = df2[coords].apply(pd.Series)

df2

Outputs:

   index  point properties.geometry.coordinates properties.MARKET  \
0      0      0         (-74.264948, 42.419877)            Albany   
1      0      1         (-74.262041, 42.425856)            Albany   
2      0      2         (-74.261175, 42.427631)            Albany   
3      0      3         (-74.260384, 42.429253)            Albany   
4      1      0         (-73.929627, 42.078788)            Albany   
5      1      1         (-73.929114, 42.081658)            Albany   
6      2      0         (-74.769198, 43.114089)       Albuquerque   
7      2      1          (-74.76786, 43.114496)       Albuquerque   
8      2      2         (-74.766474, 43.114656)       Albuquerque   

  properties.geometry.type       long        lat  
0                  Polygon -74.264948  42.419877  
1                  Polygon -74.262041  42.425856  
2                  Polygon -74.261175  42.427631  
3                  Polygon -74.260384  42.429253  
4                  Polygon -73.929627  42.078788  
5                  Polygon -73.929114  42.081658  
6                  Polygon -74.769198  43.114089  
7                  Polygon -74.767860  43.114496  
8                  Polygon -74.766474  43.114656 

If:

geojson = {'features': [{'properties': {'MARKET': 'Albany',
    'geometry': {'coordinates': [[[-74.264948, 42.419877, 0],
       [-74.262041, 42.425856, 0],
       [-74.261175, 42.427631, 0],
       [-74.260384, 42.429253, 0]]],
     'type': 'Polygon'}}},
  {'properties': {'MARKET': 'Albany',
    'geometry': {'coordinates': [[[-73.929627, 42.078788, 0],
       [-73.929114, 42.081658, 0]]],
     'type': 'Polygon'}}},
  {'properties': {'MARKET': 'Albuquerque',
    'geometry': {'coordinates': [[[-74.769198, 43.114089, 0],
       [-74.76786, 43.114496, 0],
       [-74.766474, 43.114656, 0]]],
     'type': 'Polygon'}}}],
 'type': 'FeatureCollection'}

@Anton_vBR gave a great answer!

However, also consider the "geopandas" library as an alternative:

import geopandas

df = geopandas.read_file("yourfile.geojson")

where df will be a geopandas.GeoDataFrame, which lets you manipulate the GeoJSON (including its nested structures) just like an ordinary pandas DataFrame.

The answers above are excellent, but here's something a little different. The Awkward Array library (note: I'm the author) is meant for working with nested data structures like this at large scale. As a coincidence, I used a GeoJSON file as a motivating example in the documentation, though I'm working on a few more tutorials that take larger Parquet files as example data, unrelated to geography.

(That's how this is different from @kamal-barshevich's geopandas answer: geopandas is a domain-specific library that "knows about" geography and will likely have functionality relevant to domain experts in that field. Awkward Array is a generic library for manipulating data structures that doesn't know anything about geography.)

The documentation I've linked above has some examples of manipulating a GeoJSON file with the array functions themselves, without Pandas, starting from here:

>>> import urllib.request
>>> import awkward as ak
>>> 
>>> url = "https://raw.githubusercontent.com/Chicago/osd-bike-routes/master/data/Bikeroutes.geojson"
>>> bikeroutes_json = urllib.request.urlopen(url).read()
>>> bikeroutes = ak.from_json(bikeroutes_json)
>>> bikeroutes
<Record ... [-87.7, 42], [-87.7, 42]]]}}]} type='{"type": string, "crs": {"type"...'>

But in this answer, I'll make the Pandas structure that you want. The ak.to_pandas function turns nested lists into a MultiIndex. Applying it to just the "coordinates" inside "geometry" inside "features":

>>> bikeroutes.features.geometry.coordinates
<Array [[[[-87.8, 41.9], ... [-87.7, 42]]]] type='1061 * var * var * var * float64'>
>>> 
>>> ak.to_pandas(bikeroutes.features.geometry.coordinates)
                                              values
entry subentry subsubentry subsubsubentry           
0     0        0           0              -87.788573
                           1               41.923652
               1           0              -87.788646
                           1               41.923651
               2           0              -87.788845
...                                              ...
1060  0        8           1               41.950493
               9           0              -87.714819
                           1               41.950724
               10          0              -87.715284
                           1               41.951042

[96724 rows x 1 columns]

The list nesting is three levels deep, the last of which are longitude, latitude pairs (e.g. [-87.788573, 41.923652]). You want these in separate columns:

>>> bikeroutes.features.geometry.coordinates[..., 0]
<Array [[[-87.8, -87.8, ... -87.7, -87.7]]] type='1061 * var * var * float64'>
>>> bikeroutes.features.geometry.coordinates[..., 1]
<Array [[[41.9, 41.9, 41.9, ... 42, 42, 42]]] type='1061 * var * var * float64'>

This uses a slice like NumPy's (Awkward Array is a generalization of NumPy), taking everything in all dimensions except the last (...); the first expression extracts item 0 (longitude) and the second extracts item 1 (latitude).
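For readers unfamiliar with the `...` (Ellipsis) slice, here is the same idea in plain NumPy on a regular (non-ragged) array; Awkward Array extends it to variable-length lists:

```python
import numpy as np

# a regular stand-in for one polygon: three [longitude, latitude] pairs
coords = np.array([[[-87.788573, 41.923652],
                    [-87.788646, 41.923651],
                    [-87.788845, 41.923650]]])

lons = coords[..., 0]  # everything in all dimensions except the last, item 0
lats = coords[..., 1]  # item 1 of the last dimension
```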

We can combine these in a new record type using ak.zip to give them column names:

>>> ak.to_pandas(ak.zip({
...     "longitude": bikeroutes.features.geometry.coordinates[..., 0],
...     "latitude": bikeroutes.features.geometry.coordinates[..., 1],
... }))
                            longitude   latitude
entry subentry subsubentry                      
0     0        0           -87.788573  41.923652
               1           -87.788646  41.923651
               2           -87.788845  41.923650
               3           -87.788951  41.923649
               4           -87.789092  41.923648
...                               ...        ...
1060  0        6           -87.714026  41.950199
               7           -87.714335  41.950388
               8           -87.714486  41.950493
               9           -87.714819  41.950724
               10          -87.715284  41.951042

[48362 rows x 2 columns]

and this is pretty close to what you're looking for. The one last thing you wanted was to match each of these with one of the "properties" from "features". My GeoJSON file doesn't have "MARKET":

>>> bikeroutes.features.properties.type
1061 * {"STREET": string, "TYPE": string, "BIKEROUTE": string, "F_STREET": string, "T_STREET": option[string]}

but "STREET" may be a good stand-in. These properties are at a different level of nesting than the coordinates:

>>> bikeroutes.features.geometry.coordinates[..., 0].type
1061 * var * var * float64
>>> bikeroutes.features.properties.STREET.type
1061 * string

The longitude points are two levels of nested lists deeper than the street names, but ak.zip broadcasts them down (similar to NumPy's concept of broadcasting, with necessary extensions for variable-length lists).
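A NumPy analogue of that broadcast, using a regular 2-d array as a stand-in for the ragged coordinate lists (the values here are just illustrative):

```python
import numpy as np

streets = np.array(['W FULLERTON AVE', 'N ELSTON AVE'])  # one name per feature
lons = np.array([[-87.788573, -87.788646],
                 [-87.714026, -87.714335]])              # regular stand-in for ragged lists

# repeat each street name down to the point level, as ak.zip does for ragged data
names_per_point = np.broadcast_to(streets[:, None], lons.shape)
```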

The final expression is:

>>> ak.to_pandas(ak.zip({
...     "longitude": bikeroutes.features.geometry.coordinates[..., 0],
...     "latitude": bikeroutes.features.geometry.coordinates[..., 1],
...     "street": bikeroutes.features.properties.STREET,
... }))
                            longitude   latitude           street
entry subentry subsubentry                                       
0     0        0           -87.788573  41.923652  W FULLERTON AVE
               1           -87.788646  41.923651  W FULLERTON AVE
               2           -87.788845  41.923650  W FULLERTON AVE
               3           -87.788951  41.923649  W FULLERTON AVE
               4           -87.789092  41.923648  W FULLERTON AVE
...                               ...        ...              ...
1060  0        6           -87.714026  41.950199     N ELSTON AVE
               7           -87.714335  41.950388     N ELSTON AVE
               8           -87.714486  41.950493     N ELSTON AVE
               9           -87.714819  41.950724     N ELSTON AVE
               10          -87.715284  41.951042     N ELSTON AVE

[48362 rows x 3 columns]

Since you just want to relate market with longitude, latitude points, you can ignore the MultiIndex, or you can use Pandas functions to turn the components of that MultiIndex into columns.
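For example, reset_index turns the MultiIndex levels into ordinary columns; a small sketch with a hand-built index mimicking the output above:

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [(0, 0, 0), (0, 0, 1), (0, 0, 2)],
    names=['entry', 'subentry', 'subsubentry'])
df = pd.DataFrame({'longitude': [-87.788573, -87.788646, -87.788845],
                   'latitude': [41.923652, 41.923651, 41.923650],
                   'street': ['W FULLERTON AVE'] * 3}, index=idx)

# 'entry', 'subentry', 'subsubentry' become ordinary columns
flat = df.reset_index()
```

From there the index columns can be kept, dropped, or used for grouping like any other column.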
