Today’s Adventure: Geospatial Data Wrangling with R! Using the sf package, let’s learn the ropes of handling geospatial data in R: from importing to projection, we’ve got you covered 💪 You might even learn a geoprocessing tip or two 😄
Before we even start, you might be wondering: why geospatial analytics? Why are we learning about it, and how is it relevant to a business or a governmental/academic institution?
A quick Google search brings up thousands upon thousands of answers, articles, and research papers: I recommend Deloitte’s 3-Minute Guide on why geospatial analytics matters and the value it brings to a business and its data.
In today’s Hands-On Exercise, we’ll be doing an introduction to geospatial analytics in R:
The R packages we’ll be introducing today are:
- sf: used for importing, managing, and processing geospatial data
- tidyverse: used for importing, wrangling and visualising data (and other data science tasks!)
Note: Tidyverse consists of a family of R packages, such as readr, tidyr, and dplyr - these are used for the different steps of the data wrangling + visualisation process!
Here’s a handy code chunk that we’ll likely be putting at the start of every file: it (a) creates a list of packages, (b) checks if they have been installed (and installs them for us if they haven’t), and lastly (c) launches them in the RStudio environment.
packages = c('sf', 'tidyverse')  # (a) the list of packages we need
for (p in packages){
  # (b) install the package if it hasn't been installed yet
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  # (c) launch the package in the RStudio environment
  library(p, character.only = T)
}
The datasets used for this exercise are:
A good practice would be to put the data sets in a ‘data’ folder, sorted accordingly into ‘geospatial’ and ‘aspatial’ folders.
Now that we’ve got our data, let’s take a closer look at the formats they’re in and how to import them into R. The geospatial data we have are:
- MP14_SUBZONE_WEB_PL, a polygon feature layer in ESRI shapefile format,
- CyclingPath, a line feature layer in ESRI shapefile format, and
- PreSchool, a point feature layer in kml file format.

To import, we’ll be using a handy function called st_read() from the sf package. The arguments it takes in depend on the file format. For the shapefile format, two arguments are provided: dsn to define the data path, and layer to provide the shapefile name. Bonus: we don’t need to specify the file extension for shapefiles 😄 On the other hand, for the kml file format, our argument is the complete path with the kml file extension.
Dataset used: MP14_SUBZONE_WEB_PL
File format: shapefile
Data frame type: polygon feature
mpsz = st_read(dsn = "data/geospatial",
layer = "MP14_SUBZONE_WEB_PL")
Reading layer `MP14_SUBZONE_WEB_PL' from data source
`C:\IS415\IS415_msty\_posts\2021-08-30-hands-on-exercise-2\data\geospatial'
using driver `ESRI Shapefile'
Simple feature collection with 323 features and 15 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 2667.538 ymin: 15748.72 xmax: 56396.44 ymax: 50256.33
Projected CRS: SVY21
From the output message, we can see that our mpsz simple feature data frame contains 323 multipolygon features and 15 fields, and is in the svy21 projected coordinate system.
Dataset used: CyclingPath
File format: shapefile
Data frame type: line feature
cyclingpath = st_read(dsn = "data/geospatial",
layer = "CyclingPath")
Reading layer `CyclingPath' from data source
`C:\IS415\IS415_msty\_posts\2021-08-30-hands-on-exercise-2\data\geospatial'
using driver `ESRI Shapefile'
Simple feature collection with 3336 features and 2 fields
Geometry type: MULTILINESTRING
Dimension: XY
Bounding box: xmin: 12831.45 ymin: 28347.98 xmax: 42799.89 ymax: 48948.15
Projected CRS: SVY21
From the output message, we can see that our cyclingpath linestring feature data frame contains 3336 multilinestring features and 2 fields, and is in the svy21 projected coordinate system.
Dataset used: pre-schools-location-kml
File format: kml
Data frame type: point feature
preschool = st_read("data/geospatial/pre-schools-location-kml.kml")
Reading layer `PRESCHOOLS_LOCATION' from data source
`C:\IS415\IS415_msty\_posts\2021-08-30-hands-on-exercise-2\data\geospatial\pre-schools-location-kml.kml'
using driver `KML'
Simple feature collection with 1925 features and 2 fields
Geometry type: POINT
Dimension: XYZ
Bounding box: xmin: 103.6824 ymin: 1.247759 xmax: 103.9897 ymax: 1.462134
z_range: zmin: 0 zmax: 0
Geodetic CRS: WGS 84
From the output message, we can see that our preschool point feature data frame contains 1925 point features and 2 fields, and is in the wgs84 geodetic coordinate system.
For aspatial data, such as the listings Airbnb dataset, there’s an extra step in the importing process. We’ll import it into a tibble data frame, then convert it into a simple feature data frame.
Since our listings data set is in a csv file format, we’ll use the read_csv() function from the readr package, like so:
listings <- read_csv("data/aspatial/listings.csv")
glimpse(listings)
Rows: 4,252
Columns: 16
$ id <dbl> 50646, 71609, 71896, 71903, 2~
$ name <chr> "Pleasant Room along Bukit Ti~
$ host_id <dbl> 227796, 367042, 367042, 36704~
$ host_name <chr> "Sujatha", "Belinda", "Belind~
$ neighbourhood_group <chr> "Central Region", "East Regio~
$ neighbourhood <chr> "Bukit Timah", "Tampines", "T~
$ latitude <dbl> 1.33432, 1.34537, 1.34754, 1.~
$ longitude <dbl> 103.7852, 103.9589, 103.9596,~
$ room_type <chr> "Private room", "Private room~
$ price <dbl> 80, 178, 81, 81, 52, 40, 72, ~
$ minimum_nights <dbl> 90, 90, 90, 90, 14, 14, 90, 8~
$ number_of_reviews <dbl> 18, 20, 24, 48, 20, 13, 133, ~
$ last_review <date> 2014-07-08, 2019-12-28, 2014~
$ reviews_per_month <dbl> 0.22, 0.28, 0.33, 0.67, 0.20,~
$ calculated_host_listings_count <dbl> 1, 4, 4, 4, 50, 50, 7, 1, 50,~
$ availability_365 <dbl> 365, 365, 365, 365, 353, 364,~
From the output message, we can see that our listings tibble data frame has 4,252 rows and 16 columns (not features and fields like in our simple feature data frame!). Take note of the latitude and longitude fields - we’ll be using them in the next phase.
Assumption: The data is in the wgs84 Geographic Coordinate System on account of its latitude/longitude fields.
Now, let’s convert our listings tibble data frame into a simple feature data frame by using the st_as_sf() function from the sf package.
listings_sf <- st_as_sf(listings,
coords = c("longitude", "latitude"),
crs=4326) %>%
st_transform(crs = 3414)
Let’s explain the code chunk!
- coords: specifies the columns holding the coordinates - the x (longitude) column first, then the y (latitude) column
- crs: the coordinate system in EPSG format - 4326 is the EPSG code of wgs84
- st_transform(): reprojects the data into the svy21 projected coordinate system (EPSG: 3414)
This gives us the new simple feature data frame, listings_sf:
glimpse(listings_sf)
Rows: 4,252
Columns: 15
$ id <dbl> 50646, 71609, 71896, 71903, 2~
$ name <chr> "Pleasant Room along Bukit Ti~
$ host_id <dbl> 227796, 367042, 367042, 36704~
$ host_name <chr> "Sujatha", "Belinda", "Belind~
$ neighbourhood_group <chr> "Central Region", "East Regio~
$ neighbourhood <chr> "Bukit Timah", "Tampines", "T~
$ room_type <chr> "Private room", "Private room~
$ price <dbl> 80, 178, 81, 81, 52, 40, 72, ~
$ minimum_nights <dbl> 90, 90, 90, 90, 14, 14, 90, 8~
$ number_of_reviews <dbl> 18, 20, 24, 48, 20, 13, 133, ~
$ last_review <date> 2014-07-08, 2019-12-28, 2014~
$ reviews_per_month <dbl> 0.22, 0.28, 0.33, 0.67, 0.20,~
$ calculated_host_listings_count <dbl> 1, 4, 4, 4, 50, 50, 7, 1, 50,~
$ availability_365 <dbl> 365, 365, 365, 365, 353, 364,~
$ geometry <POINT [m]> POINT (22646.02 35167.9~
Note that a new column called geometry has been added! In addition, longitude and latitude have both been dropped.
Now that we’ve got our data frames, this begs the question: what exactly is inside them? Let’s learn how to retrieve the information of the dataframe with 3 simple methods!
Let’s say that we want a preliminary look at our data - just seeing the basic feature information is sufficient. In this case, st_geometry() is most appropriate:
st_geometry(mpsz)
Geometry set for 323 features
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 2667.538 ymin: 15748.72 xmax: 56396.44 ymax: 50256.33
Projected CRS: SVY21
First 5 geometries:
As we can see, only the basic information of the feature class (type of geometry, geographic extent of features and CRS) is displayed.
However, basic information won’t take us very far. Let’s dig a little deeper and learn about the associated attribute information in the data frame. This is where glimpse() comes in:
glimpse(mpsz)
Rows: 323
Columns: 16
$ OBJECTID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15~
$ SUBZONE_NO <int> 1, 1, 3, 8, 3, 7, 9, 2, 13, 7, 12, 6, 1, 5, 1, 1,~
$ SUBZONE_N <chr> "MARINA SOUTH", "PEARL'S HILL", "BOAT QUAY", "HEN~
$ SUBZONE_C <chr> "MSSZ01", "OTSZ01", "SRSZ03", "BMSZ08", "BMSZ03",~
$ CA_IND <chr> "Y", "Y", "Y", "N", "N", "N", "N", "Y", "N", "N",~
$ PLN_AREA_N <chr> "MARINA SOUTH", "OUTRAM", "SINGAPORE RIVER", "BUK~
$ PLN_AREA_C <chr> "MS", "OT", "SR", "BM", "BM", "BM", "BM", "SR", "~
$ REGION_N <chr> "CENTRAL REGION", "CENTRAL REGION", "CENTRAL REGI~
$ REGION_C <chr> "CR", "CR", "CR", "CR", "CR", "CR", "CR", "CR", "~
$ INC_CRC <chr> "5ED7EB253F99252E", "8C7149B9EB32EEFC", "C35FEFF0~
$ FMEL_UPD_D <date> 2014-12-05, 2014-12-05, 2014-12-05, 2014-12-05, ~
$ X_ADDR <dbl> 31595.84, 28679.06, 29654.96, 26782.83, 26201.96,~
$ Y_ADDR <dbl> 29220.19, 29782.05, 29974.66, 29933.77, 30005.70,~
$ SHAPE_Leng <dbl> 5267.381, 3506.107, 1740.926, 3313.625, 2825.594,~
$ SHAPE_Area <dbl> 1630379.3, 559816.2, 160807.5, 595428.9, 387429.4~
$ geometry <MULTIPOLYGON [m]> MULTIPOLYGON (((31495.56 30..., MULT~
With glimpse(), we now know the data type of each field - for example, the FMEL_UPD_D field is of the date data type.
Sometimes, we want to dig as deep as we can and reveal all the information of a feature object. In this case, head() is best:
head(mpsz)
Simple feature collection with 6 features and 15 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 24468.89 ymin: 28369.47 xmax: 32362.39 ymax: 30542.74
Projected CRS: SVY21
OBJECTID SUBZONE_NO SUBZONE_N SUBZONE_C CA_IND PLN_AREA_N
1 1 1 MARINA SOUTH MSSZ01 Y MARINA SOUTH
2 2 1 PEARL'S HILL OTSZ01 Y OUTRAM
3 3 3 BOAT QUAY SRSZ03 Y SINGAPORE RIVER
4 4 8 HENDERSON HILL BMSZ08 N BUKIT MERAH
5 5 3 REDHILL BMSZ03 N BUKIT MERAH
6 6 7 ALEXANDRA HILL BMSZ07 N BUKIT MERAH
PLN_AREA_C REGION_N REGION_C INC_CRC FMEL_UPD_D
1 MS CENTRAL REGION CR 5ED7EB253F99252E 2014-12-05
2 OT CENTRAL REGION CR 8C7149B9EB32EEFC 2014-12-05
3 SR CENTRAL REGION CR C35FEFF02B13E0E5 2014-12-05
4 BM CENTRAL REGION CR 3775D82C5DDBEFBD 2014-12-05
5 BM CENTRAL REGION CR 85D9ABEF0A40678F 2014-12-05
6 BM CENTRAL REGION CR 9D286521EF5E3B59 2014-12-05
X_ADDR Y_ADDR SHAPE_Leng SHAPE_Area
1 31595.84 29220.19 5267.381 1630379.3
2 28679.06 29782.05 3506.107 559816.2
3 29654.96 29974.66 1740.926 160807.5
4 26782.83 29933.77 3313.625 595428.9
5 26201.96 30005.70 2825.594 387429.4
6 25358.82 29991.38 4428.913 1030378.8
geometry
1 MULTIPOLYGON (((31495.56 30...
2 MULTIPOLYGON (((29092.28 30...
3 MULTIPOLYGON (((29932.33 29...
4 MULTIPOLYGON (((27131.28 30...
5 MULTIPOLYGON (((26451.03 30...
6 MULTIPOLYGON (((25899.7 297...
Note that the ‘n’ argument allows us to select how many records to display!
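For instance, a quick sketch of the ‘n’ argument in action - any small number works here:

```r
# display only the first 3 records of the mpsz simple feature data frame
head(mpsz, n = 3)
```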
Now that we’ve got our data, we get into the meat of the matter: visualisation! Having a dataframe of fields and data points is taxing for human eyes to process, but with a bit of plotting and mapping, our geospatial data becomes a gold mine of great insights. Let’s try to plot() it out:
plot(mpsz)
This is a handy multi-plot of all attributes (up to a reasonable amount). But maybe we just want to plot a specific attribute:
plot(mpsz["PLN_AREA_N"])
There are also times where we need just the geometry - in other words, the map outline:
plot(st_geometry(mpsz))
Projection transformation is when we project a simple feature data frame from one coordinate system to another. There are two common issues that require projection transformation: (a) missing or inaccurate coordinate system, or (b) inappropriate coordinate systems for the analysis chosen.
To tackle the first issue, we’ll need to assign an EPSG code to the data frame. Firstly, let’s check the coordinate system of our mpsz data frame:
st_crs(mpsz)
Coordinate Reference System:
User input: SVY21
wkt:
PROJCRS["SVY21",
BASEGEOGCRS["SVY21[WGS84]",
DATUM["World Geodetic System 1984",
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]],
ID["EPSG",6326]],
PRIMEM["Greenwich",0,
ANGLEUNIT["Degree",0.0174532925199433]]],
CONVERSION["unnamed",
METHOD["Transverse Mercator",
ID["EPSG",9807]],
PARAMETER["Latitude of natural origin",1.36666666666667,
ANGLEUNIT["Degree",0.0174532925199433],
ID["EPSG",8801]],
PARAMETER["Longitude of natural origin",103.833333333333,
ANGLEUNIT["Degree",0.0174532925199433],
ID["EPSG",8802]],
PARAMETER["Scale factor at natural origin",1,
SCALEUNIT["unity",1],
ID["EPSG",8805]],
PARAMETER["False easting",28001.642,
LENGTHUNIT["metre",1],
ID["EPSG",8806]],
PARAMETER["False northing",38744.572,
LENGTHUNIT["metre",1],
ID["EPSG",8807]]],
CS[Cartesian,2],
AXIS["(E)",east,
ORDER[1],
LENGTHUNIT["metre",1,
ID["EPSG",9001]]],
AXIS["(N)",north,
ORDER[2],
LENGTHUNIT["metre",1,
ID["EPSG",9001]]]]
Although the mpsz data frame is projected in svy21, the output doesn’t show the correct EPSG code for svy21, which should be 3414 (the EPSG 9001 codes in the printout actually refer to the metre length unit). Let’s change it:
mpsz3414 <- st_set_crs(mpsz, 3414)
st_crs(mpsz3414)
Coordinate Reference System:
User input: EPSG:3414
wkt:
PROJCRS["SVY21 / Singapore TM",
BASEGEOGCRS["SVY21",
DATUM["SVY21",
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
ID["EPSG",4757]],
CONVERSION["Singapore Transverse Mercator",
METHOD["Transverse Mercator",
ID["EPSG",9807]],
PARAMETER["Latitude of natural origin",1.36666666666667,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8801]],
PARAMETER["Longitude of natural origin",103.833333333333,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8802]],
PARAMETER["Scale factor at natural origin",1,
SCALEUNIT["unity",1],
ID["EPSG",8805]],
PARAMETER["False easting",28001.642,
LENGTHUNIT["metre",1],
ID["EPSG",8806]],
PARAMETER["False northing",38744.572,
LENGTHUNIT["metre",1],
ID["EPSG",8807]]],
CS[Cartesian,2],
AXIS["northing (N)",north,
ORDER[1],
LENGTHUNIT["metre",1]],
AXIS["easting (E)",east,
ORDER[2],
LENGTHUNIT["metre",1]],
USAGE[
SCOPE["Cadastre, engineering survey, topographic mapping."],
AREA["Singapore - onshore and offshore."],
BBOX[1.13,103.59,1.47,104.07]],
ID["EPSG",3414]]
Still remember our geospatial data overview in 3.0? You might have noticed that the coordinate system differed among our data frames: mpsz and cyclingpath are in svy21, while preschool is in wgs84. preschool might run into issues when we’re performing geoprocessing, because a geographic coordinate system is not appropriate if our analysis needs distance and/or area measurements.
preschool3414 <- st_transform(preschool,
crs = 3414)
st_geometry(preschool3414)
Geometry set for 1925 features
Geometry type: POINT
Dimension: XYZ
Bounding box: xmin: 11203.01 ymin: 25596.33 xmax: 45404.24 ymax: 49300.88
z_range: zmin: 0 zmax: 0
Projected CRS: SVY21 / Singapore TM
First 5 geometries:
Notice the change to the svy21 projected coordinate system!
You might notice that the values in the bounding box have changed, too! They are now well beyond the decimal-degree range used by the majority of geographic coordinate systems - our coordinates are now expressed in metres.
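A quick way to see this for yourself is to compare the bounding boxes side by side (just a sketch - st_bbox() extracts the extent we saw in the printouts above):

```r
# extent before transformation: decimal degrees (wgs84)
st_bbox(preschool)

# extent after transformation: metres (svy21 / Singapore TM)
st_bbox(preschool3414)
```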
Other than handling data, the handy sf package also has functions for geoprocessing (otherwise known as GIS analysis). Let’s explore some commonly-used geoprocessing functions!
Let’s say that in a bid to promote healthier living, the authorities are planning to upgrade the existing cycling paths, making it safer and more appealing to cyclists all around! To do that, they’ll need to acquire 5 metres of reserved land on both sides of the cycling path. Our task is to find the total area of land that’s to be acquired.
Firstly, let’s compute the 5-meter buffers around the cycling paths:
buffer_cycling <- st_buffer(cyclingpath,
dist=5, nQuadSegs = 30)
Then, let’s actually calculate the area of the buffers:
buffer_cycling$AREA <- st_area(buffer_cycling)
Lastly, let’s find the total area with sum():
sum(buffer_cycling$AREA)
1642750 [m^2]
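Bonus tip: st_area() returns a value with units attached (square metres here). If square kilometres read better in a report, a small sketch using the units package (assuming it’s installed) converts it for us:

```r
library(units)

total_area <- sum(buffer_cycling$AREA)
# convert the total from square metres to square kilometres for readability
set_units(total_area, km^2)
```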
Let’s say that a preschool service group is organising a nation-wide event, and wants to find out the number of preschools in each Planning Subzone.
We can kill two birds with one stone with this handy combination of st_intersects() and lengths(). st_intersects() helps us identify the preschools located in each Planning Subzone, while lengths() counts the number of preschools per Planning Subzone.
mpsz3414$`PreSch Count`<- lengths(st_intersects(mpsz3414, preschool3414))
Let’s check a summary of PreSch Count:
summary(mpsz3414$`PreSch Count`)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 0.00 3.00 5.96 9.00 58.00
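The minimum of 0 tells us some subzones have no pre-schools at all. As a side exercise (a sketch using dplyr, which we already have loaded via tidyverse), we could list those subzones by name:

```r
# which planning subzones have no pre-schools at all?
mpsz3414 %>%
  filter(`PreSch Count` == 0) %>%
  select(SUBZONE_N, `PreSch Count`)
```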
Now, the preschool service group wants to hold a subzone-specific event! They want to target the Planning Subzone with the most number of preschools to maximise their outreach, so let’s help them out with top_n():
top_n(mpsz3414, 1, `PreSch Count`)
Simple feature collection with 1 feature and 16 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 39655.33 ymin: 35966 xmax: 42940.57 ymax: 38622.37
Projected CRS: SVY21 / Singapore TM
OBJECTID SUBZONE_NO SUBZONE_N SUBZONE_C CA_IND PLN_AREA_N
1 189 2 TAMPINES EAST TMSZ02 N TAMPINES
PLN_AREA_C REGION_N REGION_C INC_CRC FMEL_UPD_D
1 TM EAST REGION ER 21658EAAF84F4D8D 2014-12-05
X_ADDR Y_ADDR SHAPE_Leng SHAPE_Area
1 41122.55 37392.39 10180.62 4339824
geometry PreSch Count
1 MULTIPOLYGON (((42196.76 38... 58
Lastly, the preschool service group wants to know the density of preschools by planning subzone for future events. We’ll need to derive the area of each planning subzone:
mpsz3414$Area <- mpsz3414 %>%
st_area()
Now, introducing a new function, mutate()! It helps us add the density column to mpsz3414:
mpsz3414 <- mpsz3414 %>%
mutate(`PreSch Density` = `PreSch Count`/Area * 1000000)
While they look similar and have a similar function (to add new variables), there’s a difference between mutate() and transmute()! mutate() adds new variables and preserves existing ones while transmute() adds new variables and drops existing ones. Use the appropriate one for the situation!
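A minimal side-by-side sketch of the difference (the variable names here are just for illustration):

```r
# mutate() keeps every existing column and adds the new one
mutated <- mpsz3414 %>%
  mutate(`PreSch Density` = `PreSch Count`/Area * 1000000)

# transmute() keeps ONLY the new column (sf objects also retain
# their geometry column, since geometry is "sticky")
transmuted <- mpsz3414 %>%
  transmute(`PreSch Density` = `PreSch Count`/Area * 1000000)

ncol(mutated)     # all original columns plus PreSch Density
ncol(transmuted)  # far fewer columns
```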
Why do we do EDA? Here’s a great Medium article that introduces EDA: from its significance to its components, it’s got you covered!
For this EDA section, we’ll be introducing various ggplot2 functions that’ll help us create functional and yet truthful statistical graphs that’ll help us visualise and understand our data!
Firstly, let’s use hist() to plot a histogram and reveal the distribution of PreSch Density:
hist(mpsz3414$`PreSch Density`)
That looks good - but it’s not something you’ll publish in your reports! Let’s add a little customisation:
ggplot(data=mpsz3414,
aes(x= as.numeric(`PreSch Density`)))+
geom_histogram(bins=20,
color="black",
fill="light blue") +
labs(title = "Are pre-schools evenly distributed in Singapore?",
subtitle= "There are many planning sub-zones with a single pre-school, on the other hand, \nthere are two planning sub-zones with at least 20 pre-schools",
x = "Pre-school density (per km sq)",
y = "Frequency")
DIY: Using ggplot2, plot a scatterplot showing the relationship between Pre-school Density and Pre-school Count.
Let’s approach this step-by-step. Firstly, we’ll need the variables Pre-school Density and Pre-school Count. Luckily, we’ve already created both in 2.5.2 Point-In-Polygon Count.
Next, we need to plot a scatterplot. For our histogram, we used geom_histogram - and for a scatterplot, we’ll use geom_point. We can further customise this by adjusting the colour, size and fill of the points, like so:
ggplot(data=mpsz3414,
aes(y = `PreSch Count`,
x= as.numeric(`PreSch Density`)))+
geom_point(color="black",
fill="light blue") +
labs(title = "",
x = "Pre-school density (per km sq)",
y = "Pre-school count")
Seem finished? There’s actually one more step: ensuring that your data is presented as fairly as possible - we don’t want to mislead our readers 😱 In this case, notice that the aspect ratio of our graph is off - its height is greater than its width! Let’s correct that by setting limits with the xlim() and ylim() functions inside ggplot2:
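Here’s a sketch of the corrected plot - the exact limits are a judgment call, but running both axes from 0 to 40 gives matching scales:

```r
ggplot(data=mpsz3414,
       aes(y = `PreSch Count`,
           x= as.numeric(`PreSch Density`)))+
  geom_point(color="black",
             fill="light blue") +
  # matching axis limits keep the aspect ratio honest
  xlim(0, 40) +
  ylim(0, 40) +
  labs(title = "",
      x = "Pre-school density (per km sq)",
      y = "Pre-school count")
```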
And there we have it! A representative scatterplot visualisation ✨
This is the end of this week’s exercise! I’m really excited to learn more about geospatial analytics: we’ve just begun to scratch the surface, and there’s so much more to explore and experiment with!
On a related sidenote: I’ve learned a number of things myself when creating this blog post! For example, I wanted to have a table of contents, but googling ‘sidebar navigation pane’ proved to be far less effective than just reading the Distill Basics - which is a lesson learnt on reading the provided materials!
On a funny note, here’s a mini-comic on my experience with tabsets:
Tabsets are meant for html documents, but this output is a distill article… Of course it didn’t work 😠Well, that’s just another learning point in our Geospatial Analytics Journey !
Walk with me as we dive deeper in the coming weeks! 🤗 💖 ✨
P.S. I found my old tablet so I could illustrate the points better! Handwriting will be more consistent from this blog post onwards 😄