Subsetting in readOGR

The function readOGR in the rgdal package is used to bring vector spatial data sources into R. readOGR() relies upon OGR (part of the GDAL/OGR library) for format conversion. Unfortunately, while OGR supports the ability to subset columns (with the -select switch) or rows (with the -where switch), or even to request a layer using a full SQL statement, none of that functionality is available in readOGR. Every so often this gets discussed in the mailing lists, but the functionality has not yet been implemented.

If your data is already in a SQL database though, you’re in luck. You can accomplish the same thing within R by creating a spatial view in-database, loading the spatial view with readOGR, then dropping the view. I’ve created a function that does just that for PostGIS data sources.

readOgrSql = function (dsn, sql, ...) {

  require(rgdal)
  require(RPostgreSQL)
  require(stringr)

  # check dsn starts "PG:" and strip
  if (str_sub(dsn, 1, 3) != "PG:") {
    stop("readOgrSql only works with PostgreSQL DSNs")
  }
  dsnParamList = str_trim(str_split(dsn, ":")[[1]][2])

  # Build dbConnect expression, quote DSN parameter values 
  # if not already quoted
  if (str_count(dsnParamList, "=") 
      == str_count(dsnParamList, "='[[:alnum:]]+'")) {
    strExpression = str_c(
      "dbConnect(dbDriver('PostgreSQL'), ", 
      str_replace_all(dsnParamList, " ", ", "), 
      ")"
      )
  }
  else {
    dsnArgs = word(str_split(dsnParamList, " ")[[1]], 1, sep="=")
    dsnVals = sapply(
      word(str_split(dsnParamList, " ")[[1]], 2, sep="="), 
      function(x) str_c("'", str_replace_all(x, "'", ""), "'")
      )
    strExpression = str_c(
      "dbConnect(dbDriver('PostgreSQL'), ", 
      str_c(dsnArgs, "=", dsnVals, collapse=", "), 
      ")"
      )
  }

  # Connect, create spatial view, read spatial view, drop spatial view
  conn = eval(parse(text=strExpression))
  strCreateView = paste("CREATE VIEW vw_tmp_read_ogr AS", sql)
  dbSendQuery(conn, strCreateView)
  temp = readOGR(dsn = dsn, layer = "vw_tmp_read_ogr", ...)
  dbSendQuery(conn, "DROP VIEW vw_tmp_read_ogr;")
  dbDisconnect(conn)
  return(temp)
}

Since readOGR can’t send arbitrary SQL to the data source, the function also requires the RPostgreSQL package. Because I wanted the readOgrSql function to conform to readOGR as closely as possible, the function takes the DSN as a character string, then chops it up to pass the DSN parameters to dbConnect. readOGR expects the DSN in this format:

"PG:dbname='db' host='host' port='5432' user='user' password='***'"

The user requires privileges to create objects in the public schema.

Then, readOgrSql can be called as follows:

strSQL = "
  SELECT gid, name,
    ST_Transform(geom, 4326)::geometry(MultiPolygon, 4326) AS geom
  FROM geo_states"
spdfStates = readOgrSql(dsn, strSQL, stringsAsFactors=FALSE)
row.names(spdfStates) = spdfStates$name

(The last line is not strictly necessary, but is a little trick I use to make it easier to refer to specific features by name.)

At the moment this only works for PostGIS 2.0+. It relies on the fact that in recent versions of PostGIS, spatial views are automatically registered because the geometry_columns “table” is actually a view which finds all geometry columns in all tables. But for the view to work as a full spatial layer, the geometry column in the SELECT statement must be explicitly cast using a typmodded geometry. In the example, this is accomplished with ::geometry(MultiPolygon, 2263), but in general ::geometry(<geometry type>, <srid>). If there are no transformations on the geometry, the explicit cast can be avoided as long as the underlying table is defined using the PostGIS 2.0+ typmodded geometry columns:

SELECT gid, geom FROM my_spatial_table;

will work, but

SELECT gid, ST_Transform(geom, 4326) AS geom
FROM my_spatial_table;

will only work partially, because the transformed geometry is generic. See Manually Registering Geometry Columns in geometry_columns for details. Note that rgdal is smart enough to load the layer into an appropriate Spatial*DataFrame (e.g., SpatialPolygonsDataFrame for an OGC Polygon or MultiPolygon), but will lose the spatial reference information (i.e., the projection or coordinate system). It will plot OK, but unless you fix it in R, you won’t be able to combine it with layers in different coordinate reference systems.

Extending this to work for PostGIS pre-2.0 or for SpatiaLite shouldn’t be too difficult, but will require manually registering the spatial view in geometry_columns.