Running Analysis on Multiple geojson files

Question

I have about 113 geojson files that I've previously mainly dealt with in QGIS. My goal now is to be able to simultaneously import all files into R and conduct analyses on the underlying attribute tables attached to each respective layer. I have already figured out the best way to import one file and conduct any needed analysis after converting into a data frame. The files that I have in the folder all look like this: 0cfb16c1-90c2-412d-bb60-2fec34c75e9a.geojson

The code I used for this step was:

library(rgdal)
map1 <- readOGR(dsn = "/Users/chris/Documents/GeorgetownMPPMSFS/McCourtMPP/BIGWork/BIGDataFiles/maps/sampled_maps/0cfb16c1-90c2-412d-bb60-2fec34c75e9a.geojson", layer = "0cfb16c1-90c2-412d-bb60-2fec34c75e9a")
summary(map1)
map1 <- as.data.frame(map1)

I want to run the same analysis I did on that map on all geojson files without having to do so one by one. The analysis I conducted related to electoral redistricting metrics and is included here:

cfbdata$reptotal <- (cfbdata$surveyed_republican_percentage/100)*cfbdata$surveyed_total
cfbdata$demtotal <- (cfbdata$surveyed_democrat_percentage/100)*cfbdata$surveyed_total
cfbdata$NAME <- NULL
aggdata <-aggregate(cfbdata, by=list(cfbdata$cluster), 
                    FUN=sum, na.rm=TRUE)
# Rep district victory is 1 and Dem district victory is 0
aggdata$result <- ifelse(aggdata$reptotal > aggdata$demtotal,1, ifelse(aggdata$demtotal > aggdata$reptotal,0, NA))

EffGapCalc <- subset(aggdata, select=c("cluster","reptotal","demtotal","surveyed_total", "result"))

# Step 1: Calculate Dem Wasted, Rep Wasted, and Net Wasted

EffGapCalc$repwasted <- ifelse(EffGapCalc$result == 1, EffGapCalc$reptotal - (.51*EffGapCalc$surveyed_total), ifelse(EffGapCalc$result == 0, EffGapCalc$reptotal, NA))

EffGapCalc$demwasted <- ifelse(EffGapCalc$result == 0, EffGapCalc$demtotal - (.51 * EffGapCalc$surveyed_total), ifelse(EffGapCalc$result == 1, EffGapCalc$demtotal, NA))

EffGapCalc$netwasted <- abs(EffGapCalc$repwasted - EffGapCalc$demwasted)

# Step 2: Sum Total Wasted Rep and Dem Votes
totrepwasted <- sum(EffGapCalc$repwasted)
totdemwasted <- sum(EffGapCalc$demwasted)
netwaste <- ifelse(totrepwasted>totdemwasted, totrepwasted-totdemwasted, ifelse(totrepwasted<totdemwasted, totdemwasted-totrepwasted))
netwaste
# Democrats had a net waste (more wasted votes) of 74289.6

# Step 3: Divide Net Wasted by Total Number of Votes Case
sum(EffGapCalc$surveyed_total)
totalsurvtot <- sum(EffGapCalc$surveyed_total)
netwaste/totalsurvtot
# Efficiency Gap = .0359 [3.60%]

The goal is to run that same analysis for all 113 GEOJSON files and get a list of 113 "Efficiency Gap" numbers like the .0359 above.

I've searched through a number of questions on stackoverflow and elsewhere but have not found a suitable solution. While I initially thought a for loop would be best for this, based on what I've read elsewhere, it appears that lapply() actually might be the better route to go. The challenge I am having is ensuring the right import as part of 'lapply()'

The code I tried using that failed was:

library(rgdal)
fileNames <- list.files(path = "/Users/chris/Documents/GeorgetownMPPMSFS/McCourtMPP/BIGWork/BIGDataFiles/maps/sampled_maps", pattern="*.geojson", full.names = TRUE)

lapply(fileNames, function(x) {
  map1 <- readOGR(dsn = x, layer = x)
  map1 <- as.data.frame(map1)
  out <- map1$reptotal <- (map1$surveyed_republican_percentage/100)*map1$surveyed_total;
  map1$demtotal <- (map1$surveyed_democrat_percentage/100)*map1$surveyed_total;
  map1$NAME <- NULL;
  aggdata <-aggregate(map1, by=list(map1$cluster), 
                      FUN=sum, na.rm=TRUE);
  aggdata$result <- ifelse(aggdata$reptotal > aggdata$demtotal,1, ifelse(aggdata$demtotal > aggdata$reptotal,0, NA));

  EffGapCalc <- subset(aggdata, select=c("cluster","reptotal","demtotal","surveyed_total", "result"));
  # Step 1: Calculate Dem Wasted, Rep Wasted, and Net Wasted
  EffGapCalc$repwasted <- ifelse(EffGapCalc$result == 1, EffGapCalc$reptotal - (.51*EffGapCalc$surveyed_total), ifelse(EffGapCalc$result == 0, EffGapCalc$reptotal, NA));

  EffGapCalc$demwasted <- ifelse(EffGapCalc$result == 0, EffGapCalc$demtotal - (.51 * EffGapCalc$surveyed_total), ifelse(EffGapCalc$result == 1, EffGapCalc$demtotal, NA));

  EffGapCalc$netwasted <- abs(EffGapCalc$repwasted - EffGapCalc$demwasted);

  # Step 2: Sum Total Wasted Rep and Dem Votes
  totrepwasted <- sum(EffGapCalc$repwasted);
  totdemwasted <- sum(EffGapCalc$demwasted);
  netwaste <- ifelse(totrepwasted>totdemwasted, totrepwasted-totdemwasted, ifelse(totrepwasted<totdemwasted, totdemwasted-totrepwasted));
  netwaste

  # Step 3: Divide Net Wasted by Total Number of Votes Case
  totalsurvtot <- sum(EffGapCalc$surveyed_total);
  netwaste/totalsurvtot;

  write.table(out, "/Users/chris/Documents/GeorgetownMPPMSFS/McCourtMPP/BIGWork/BIGDataFiles", sep="\t", quote=F, row.names=F, col.names=T)
})

I've been trying to figure this out for two days at this point and am only getting more confused. Any help will be much appreciated!

Answer 1

Simple test code:

lapply(fileNames, function(x) {
  map1 <- readOGR(dsn = x, layer = x)
}

Assuming that fails for your case, we know the problem is in that one line. That makes it easier for someone here to see its a simpler problem. Please always try and minimise your problems, it will help us help you and in many cases it lets you solve it yourself. Proceeding...

readOGR for a geoJSON needs a file path and a layer name, and that code is going to feed the file path as the layer name, like this, using a test file from the geojson package::

> testfile <- list.files(path = path, pattern="*.geojson", full.names = TRUE)[5]

quick check we've got it:

> file.exists(testfile)
[1] TRUE

Then try and read:

> d = readOGR(dsn=testfile, layer=testfile)
Error in ogrInfo(dsn = dsn, layer = layer, encoding = encoding, use_iconv = use_iconv,  : 
  Cannot open layer

So how do we get the layer name from the file path? We have ogrListLayers for that:

> ogrListLayers(testfile)
[1] "OGRGeoJSON"
attr(,"driver")
[1] "GeoJSON"
attr(,"nlayers")
[1] 1

Now that looks pretty weird but its a vector of layer names and some extra attributes that you can ignore for this purpose. The layer name of this test layer is "OGRGeoJSON". Assuming your geoJSONs are known to only be one layer you can do:

> d = readOGR(dsn=testfile, layer=ogrListLayers(testfile))
OGR data source with driver: GeoJSON 
Source: "/home/rowlings/R/x86_64-pc-linux-gnu-library/3.4/geojson/examples/linestring_one.geojson", layer: "OGRGeoJSON"
with 1 features
It has 2 fields
Warning message:
In readOGR(dsn = testfile, layer = ogrListLayers(testfile)) :
  Z-dimension discarded

Now I think that either geoJSONs can only have one layer, or that readOGR defaults to the first layer, so if you know there's only one layer in your geoJSONs you can leave out the layer= argument and get an identical object returned:

> d2 = readOGR(dsn=testfile)
OGR data source with driver: GeoJSON 
Source: "/home/rowlings/R/x86_64-pc-linux-gnu-library/3.4/geojson/examples/linestring_one.geojson", layer: "OGRGeoJSON"
with 1 features
It has 2 fields
Warning message:
In readOGR(dsn = testfile) : Z-dimension discarded

Running Analysis on Multiple geojson files

Question

1 answers

solution1
1 ACCPTED 2018-01-08 08:06:10

Running Analysis on Multiple geojson files

Question

1 answers

solution1 1 ACCPTED 2018-01-08 08:06:10

solution1
1 ACCPTED 2018-01-08 08:06:10