This analysis aims to visualise and understand the spatio-temporal patterns of COVID-19 cases in DKI Jakarta and its sub-districts.
On 2nd March 2020, alarm bells rang across Indonesia: newspapers countrywide printed the same headline, radio stations blared, and worried eyes turned to a televised statement from President Joko Widodo.
He had just confirmed the first two cases of COVID-19 in the country.
This would mark the start of COVID-19’s influence on Indonesia: over 4 million confirmed cases and over 130,000 reported cumulative deaths as of the time of writing, making Indonesia the country with the highest number of reported COVID-19 cases in Southeast Asia. A quick look at Indonesia’s official COVID-19 Distribution Map gives a statistical overview and an approximate geographical estimation of the hardest-hit areas. Yet the breadth of its content belies a level of generalisation that makes finer-grained analysis difficult.
Why does this matter? Out of the 34 provinces in Indonesia, DKI Jakarta made up close to 24% of cumulative confirmed cases - a shocking statistic that bears looking into. As such, our goal is to dive deeper into sub-district levels of data and visualise the spatio-temporal patterns in DKI Jakarta.
This analysis aims to visualise and understand the spatio-temporal patterns of COVID-19 cases in DKI Jakarta and its sub-districts, with our goals being to compute the monthly cumulative confirmed cases rate and cumulative death rate on the sub-district level, and to map how their distributions change over time.
The R packages we’ll use for this analysis are:
In addition, the following tidyverse packages will be used:
# initialise a list of required packages
# note that 'readxl' is in the list of packages despite being part of tidyverse: as readxl is not a core tidyverse package, it needs to be loaded explicitly
packages = c('plyr', 'sf', 'tidyverse', 'readxl', 'tmap', 'maptools', 'kableExtra')
# for each package, check if installed and if not, install it
for (p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p, character.only = T)
}
# initialise a dataframe of our geospatial and aspatial data details
datasets <- data.frame(
Type=c("Geospatial",
"Aspatial"),
Name=c("[Batas Desa Provinsi DKI Jakarta](https://www.indonesia-geospasial.com/2020/04/download-shapefile-shp-batas-desa.html)",
"[Standar Kelurahan Data Corona (Monthly)](https://riwayat-file-covid-19-dki-jakarta-jakartagis.hub.arcgis.com/)"),
Format=c("Shapefile",
".xlsx"),
Description=c("Sub-districts in DKI Jakarta",
"Sub-district level data of daily COVID-19 cases in DKI Jakarta
between March 2020~July 2021")
)
# with reference to this guide on kableExtra:
# https://cran.r-project.org/web/packages/kableExtra/vignettes/awesome_table_in_html.html
# kable_material is the name of the kable theme
# 'hover' to highlight a row when hovering, 'scale_down' to adjust the table to fit the page width
library(knitr)
library(kableExtra)
kable(head(datasets), caption="Datasets Used") %>%
kable_material("hover", latex_options="scale_down")
Type | Name | Format | Description |
---|---|---|---|
Geospatial | Batas Desa Provinsi DKI Jakarta | Shapefile | Sub-districts in DKI Jakarta |
Aspatial | Standar Kelurahan Data Corona (Monthly) | .xlsx | Sub-district level data of daily COVID-19 cases in DKI Jakarta between March 2020~July 2021 |
To retrieve the monthly cumulative records for the COVID-19 cases in Jakarta, we took the data compiled on the last day of each month, with an exception: there was a broken link for the 31st January 2021 data file, hence the 30th January 2021 data file was used instead.
# bd stands for "batas desa", translated as "village boundary"
# reads in geospatial data and stores into bd_jakarta dataframe
bd_jakarta <- st_read(dsn="data/geospatial",
layer="BATAS_DESA_DESEMBER_2019_DUKCAPIL_DKI_JAKARTA")
Reading layer `BATAS_DESA_DESEMBER_2019_DUKCAPIL_DKI_JAKARTA' from data source `C:\IS415\IS415_msty\_posts\2021-09-10-take-home-exercise-1\data\geospatial'
using driver `ESRI Shapefile'
Simple feature collection with 269 features and 161 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 106.3831 ymin: -6.370815 xmax: 106.9728 ymax: -5.184322
Geodetic CRS: WGS 84
From the output message, we learn that:

- there are 269 features and 161 fields
- the geometry type is MULTIPOLYGON
- the assigned geodetic CRS is WGS 84
Before we even start visualising our data, we have to first check for two things: invalid geometries and missing values, which could impact future calculations and representations.
The code for this section takes reference from the senior sample submissions, with credit to Xiao Rong Wong’s Geographic Analysis of the Supply & Demand of Childcare Services in Singapore.
Let us first check for invalid geometries:
# function breakdown:
# st_is_valid() checks whether each geometry is valid
# which() returns the indices of the values meeting the logical condition
# length() returns the number of such values
# altogether, this counts the geometries that are NOT valid
length(which(st_is_valid(bd_jakarta) == FALSE))
[1] 0
# Alternative Method
# test <- st_is_valid(bd_jakarta,reason=TRUE)
# length(which(test!= "Valid Geometry"))
# credit to Rajiv Abraham Xavier https://rpubs.com/rax/Take_Home_Ex01
There are no invalid geometries.
Now, let us check for missing values:
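The chunk behind the output below isn’t shown; a minimal sketch of such a check, assuming we simply flag any row that contains at least one NA, could be:
# rowSums(is.na(...)) counts the NA fields in each row
# we keep only the rows with at least one NA field
bd_jakarta[rowSums(is.na(bd_jakarta)) > 0,]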
Simple feature collection with 2 features and 161 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 106.8412 ymin: -6.154036 xmax: 106.8612 ymax: -6.144973
Geodetic CRS: WGS 84
OBJECT_ID KODE_DESA DESA KODE PROVINSI KAB_KOTA
243 25645 31888888 DANAU SUNTER 318888 DKI JAKARTA <NA>
244 25646 31888888 DANAU SUNTER DLL 318888 DKI JAKARTA <NA>
KECAMATAN DESA_KELUR JUMLAH_PEN JUMLAH_KK LUAS_WILAY KEPADATAN
243 <NA> <NA> 0 0 0 0
244 <NA> <NA> 0 0 0 0
PERPINDAHA JUMLAH_MEN PERUBAHAN WAJIB_KTP SILAM KRISTEN KHATOLIK
243 0 0 0 0 0 0 0
244 0 0 0 0 0 0 0
HINDU BUDHA KONGHUCU KEPERCAYAA PRIA WANITA BELUM_KAWI KAWIN
243 0 0 0 0 0 0 0 0
244 0 0 0 0 0 0 0 0
CERAI_HIDU CERAI_MATI U0 U5 U10 U15 U20 U25 U30 U35 U40 U45 U50
243 0 0 0 0 0 0 0 0 0 0 0 0 0
244 0 0 0 0 0 0 0 0 0 0 0 0 0
U55 U60 U65 U70 U75 TIDAK_BELU BELUM_TAMA TAMAT_SD SLTP SLTA
243 0 0 0 0 0 0 0 0 0 0
244 0 0 0 0 0 0 0 0 0 0
DIPLOMA_I DIPLOMA_II DIPLOMA_IV STRATA_II STRATA_III BELUM_TIDA
243 0 0 0 0 0 0
244 0 0 0 0 0 0
APARATUR_P TENAGA_PEN WIRASWASTA PERTANIAN NELAYAN AGAMA_DAN
243 0 0 0 0 0 0
244 0 0 0 0 0 0
PELAJAR_MA TENAGA_KES PENSIUNAN LAINNYA GENERATED KODE_DES_1
243 0 0 0 0 <NA> <NA>
244 0 0 0 0 <NA> <NA>
BELUM_ MENGUR_ PELAJAR_ PENSIUNA_1 PEGAWAI_ TENTARA KEPOLISIAN
243 0 0 0 0 0 0 0
244 0 0 0 0 0 0 0
PERDAG_ PETANI PETERN_ NELAYAN_1 INDUSTR_ KONSTR_ TRANSP_ KARYAW_
243 0 0 0 0 0 0 0 0
244 0 0 0 0 0 0 0 0
KARYAW1 KARYAW1_1 KARYAW1_12 BURUH BURUH_ BURUH1 BURUH1_1
243 0 0 0 0 0 0 0
244 0 0 0 0 0 0 0
PEMBANT_ TUKANG TUKANG_1 TUKANG_12 TUKANG__13 TUKANG__14
243 0 0 0 0 0 0
244 0 0 0 0 0 0
TUKANG__15 TUKANG__16 TUKANG__17 PENATA PENATA_ PENATA1_1 MEKANIK
243 0 0 0 0 0 0 0
244 0 0 0 0 0 0 0
SENIMAN_ TABIB PARAJI_ PERANCA_ PENTER_ IMAM_M PENDETA PASTOR
243 0 0 0 0 0 0 0 0
244 0 0 0 0 0 0 0 0
WARTAWAN USTADZ JURU_M PROMOT ANGGOTA_ ANGGOTA1 ANGGOTA1_1
243 0 0 0 0 0 0 0
244 0 0 0 0 0 0 0
PRESIDEN WAKIL_PRES ANGGOTA1_2 ANGGOTA1_3 DUTA_B GUBERNUR
243 0 0 0 0 0 0
244 0 0 0 0 0 0
WAKIL_GUBE BUPATI WAKIL_BUPA WALIKOTA WAKIL_WALI ANGGOTA1_4
243 0 0 0 0 0 0
244 0 0 0 0 0 0
ANGGOTA1_5 DOSEN GURU PILOT PENGACARA_ NOTARIS ARSITEK AKUNTA_
243 0 0 0 0 0 0 0 0
244 0 0 0 0 0 0 0 0
KONSUL_ DOKTER BIDAN PERAWAT APOTEK_ PSIKIATER PENYIA_ PENYIA1
243 0 0 0 0 0 0 0 0
244 0 0 0 0 0 0 0 0
PELAUT PENELITI SOPIR PIALAN PARANORMAL PEDAGA_ PERANG_ KEPALA_
243 0 0 0 0 0 0 0 0
244 0 0 0 0 0 0 0 0
BIARAW_ WIRASWAST_ LAINNYA_12 LUAS_DESA KODE_DES_3 DESA_KEL_1
243 0 0 0 0 <NA> <NA>
244 0 0 0 0 <NA> <NA>
KODE_12 geometry
243 0 MULTIPOLYGON (((106.8612 -6...
244 0 MULTIPOLYGON (((106.8504 -6...
That’s a long output! There are too many columns, and it’ll clutter up future analysis - which we’ll take care of in 3.5. As we can see, there are two particular rows with missing values for KAB_KOTA (City), KECAMATAN (District) and DESA_KELUR (Village), as well as a few other fields - specifically, OBJECT_ID 25645 and 25646. We’ll remove them:
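A minimal sketch of the removal, assuming we drop the rows flagged above by filtering on one of their NA fields:
# keeps only the rows where KAB_KOTA is present
# (drops OBJECT_ID 25645 and 25646)
bd_jakarta <- filter(bd_jakarta, !is.na(KAB_KOTA))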
Let us check the CRS of bd_jakarta:
# retrieves the coordinate system of bd_jakarta
st_crs(bd_jakarta)
Coordinate Reference System:
User input: WGS 84
wkt:
GEOGCRS["WGS 84",
DATUM["World Geodetic System 1984",
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
CS[ellipsoidal,2],
AXIS["latitude",north,
ORDER[1],
ANGLEUNIT["degree",0.0174532925199433]],
AXIS["longitude",east,
ORDER[2],
ANGLEUNIT["degree",0.0174532925199433]],
ID["EPSG",4326]]
As we can see, the assigned coordinate system is WGS 84, the ‘World Geodetic System 1984’. In the context of this dataset, this isn’t appropriate: as this is an Indonesia-specific geospatial dataset, we should be using the national CRS of Indonesia, DGN95, the ‘Datum Geodesi Nasional 1995’, EPSG code 23845. Let’s rectify that:
# transforms the CRS to DGN95, EPSG code 23845
bd_jakarta <- st_transform(bd_jakarta, 23845)
Let us check if the CRS has been properly assigned:
st_crs(bd_jakarta)
Coordinate Reference System:
User input: EPSG:23845
wkt:
PROJCRS["DGN95 / Indonesia TM-3 zone 54.1",
BASEGEOGCRS["DGN95",
DATUM["Datum Geodesi Nasional 1995",
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
ID["EPSG",4755]],
CONVERSION["Indonesia TM-3 zone 54.1",
METHOD["Transverse Mercator",
ID["EPSG",9807]],
PARAMETER["Latitude of natural origin",0,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8801]],
PARAMETER["Longitude of natural origin",139.5,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8802]],
PARAMETER["Scale factor at natural origin",0.9999,
SCALEUNIT["unity",1],
ID["EPSG",8805]],
PARAMETER["False easting",200000,
LENGTHUNIT["metre",1],
ID["EPSG",8806]],
PARAMETER["False northing",1500000,
LENGTHUNIT["metre",1],
ID["EPSG",8807]]],
CS[Cartesian,2],
AXIS["easting (X)",east,
ORDER[1],
LENGTHUNIT["metre",1]],
AXIS["northing (Y)",north,
ORDER[2],
LENGTHUNIT["metre",1]],
USAGE[
SCOPE["Cadastre."],
AREA["Indonesia - onshore east of 138°E."],
BBOX[-9.19,138,-1.49,141.01]],
ID["EPSG",23845]]
Success! 😄
We’ve finished our standard pre-processing! Let’s quickly visualise the data…
# plots the geometry only
# alternatively, we can use plot(bd_jakarta$geometry)
plot(st_geometry(bd_jakarta))
As we can see, bd_jakarta includes both the mainland and the outer islands. As per the assignment requirements, since the outer islands aren’t relevant to our analysis, we’ll have to remove them.
There are a few ways we can do this - but the best way would be to explore our data. From 3.2.2, we saw an output of all the fields - and drew attention to KAB_KOTA (City), KECAMATAN (District) and DESA_KELUR (Village).
Of these three, KAB_KOTA would be the logical choice: it’s the coarsest-grained level of distinction. Let’s check for its unique values:
# outputs unique values of the KAB_KOTA field
unique(bd_jakarta$"KAB_KOTA")
[1] "JAKARTA BARAT" "JAKARTA PUSAT" "KEPULAUAN SERIBU"
[4] "JAKARTA UTARA" "JAKARTA TIMUR" "JAKARTA SELATAN"
As we can see, all cities within Jakarta have a JAKARTA prefix, while KEPULAUAN SERIBU (translated as ‘Thousand Islands’) refers to the outer islands. Just to check, we can visualise KAB_KOTA:
# with bd_jakarta as the input data (setting the 'base')
# draw the KAB_KOTA (city) polygons
# essentially shades the map according to the city divisions
tm_shape(bd_jakarta) +
tm_polygons("KAB_KOTA")
Now that we know how to identify the outer islands, it’s time to remove them:
# filters out the outer islands by accepting only if the value of KAB_KOTA is NOT KEPULAUAN SERIBU
bd_jakarta <- filter(bd_jakarta, KAB_KOTA != "KEPULAUAN SERIBU")
bd_jakarta
In addition, the assignment requires us to retain only the fields relevant to our analysis - in other words, the first 9 fields in the bd_jakarta data frame.
# retains only the first 9 fields; the geometry column stays attached, as sf keeps it 'sticky'
bd_jakarta <- bd_jakarta[, 1:9]
Lastly, let’s translate the column names of bd_jakarta into English for ease of comprehension:
# with reference to: https://www.codegrepper.com/code-examples/r/rename+column+name+in+r
# renames the columns in the style New_Name = OLD_NAME
bd_jakarta <- bd_jakarta %>%
dplyr::rename(
Object_ID=OBJECT_ID,
Province=PROVINSI,
City=KAB_KOTA,
District=KECAMATAN,
Village_Code=KODE_DESA,
Village=DESA,
Sub_District=DESA_KELUR,
Code=KODE,
Total_Population=JUMLAH_PEN
)
Before we move on to importing the aspatial data and into the meat of our geovisualisation, it’s important to get a feel for the data that we’re working with, especially on the sub-district level. As such, let’s take a quick glimpse:
# reveals the data type of all fields + some values
glimpse(bd_jakarta)
Rows: 261
Columns: 10
$ Object_ID <dbl> 25477, 25478, 25397, 25400, 25390, 25391, 2~
$ Village_Code <chr> "3173031006", "3173031007", "3171031003", "~
$ Village <chr> "KEAGUNGAN", "GLODOK", "HARAPAN MULIA", "CE~
$ Code <dbl> 317303, 317303, 317103, 317103, 317102, 317~
$ Province <chr> "DKI JAKARTA", "DKI JAKARTA", "DKI JAKARTA"~
$ City <chr> "JAKARTA BARAT", "JAKARTA BARAT", "JAKARTA ~
$ District <chr> "TAMAN SARI", "TAMAN SARI", "KEMAYORAN", "K~
$ Sub_District <chr> "KEAGUNGAN", "GLODOK", "HARAPAN MULIA", "CE~
$ Total_Population <dbl> 21609, 9069, 29085, 41913, 15793, 33383, 35~
$ geometry <MULTIPOLYGON [m]> MULTIPOLYGON (((-3626874 69...~
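The two counts below were likely produced by a chunk along these lines (a sketch, as the original code isn’t shown):
# counts the unique sub-districts and districts respectively
length(unique(bd_jakarta$Sub_District))
length(unique(bd_jakarta$District))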
[1] 261
[1] 42
Note that there are 261 unique sub-districts and 42 unique districts. The maximum number of categories for mapping with tmap is 30 - and even though we can adjust max.categories in tmap_options, 261 segmented sections (and even 42 sections) on a single map is too fragmented and minute for human eyes. As such, the only level which is human-readable is the ‘City’ level:
# shades the map according to the city divisions
tm_shape(bd_jakarta) +
tm_polygons("City")
In our ‘data/aspatial’ folder, we have multiple .xlsx files ranging from 31st March 2020 to 31st July 2021. However, before we start compiling all of our data, it’s important to understand what we’re working with and to check for any discrepancies, so let’s perform a brief EDA:
# reads the 31st March 2020 .xlsx aspatial data and stores it as mar2020
mar2020 <- read_xlsx("data/aspatial/Standar Kelurahan Data Corona (31 Maret 2020 Pukul 08.00).xlsx")
glimpse(mar2020)
Rows: 270
Columns: 18
$ ID_KEL...1 <chr> NA, "BELUM DIKETAHUI", "LUAR DKI JAKART~
$ ID_KEL...2 <chr> NA, "BELUM DIKETAHUI", "LUAR DKI JAKART~
$ Nama_provinsi <chr> NA, "BELUM DIKETAHUI", "LUAR DKI JAKART~
$ nama_kota <chr> NA, "BELUM DIKETAHUI", "LUAR DKI JAKART~
$ nama_kecamatan <chr> NA, "BELUM DIKETAHUI", "LUAR DKI JAKART~
$ nama_kelurahan <chr> "TOTAL", "BELUM DIKETAHUI", "LUAR DKI J~
$ ODP <dbl> 2372, 1005, 274, 56, 37, 26, 24, 17, 17~
$ `Proses Pemantauan` <dbl> 499, 94, 63, 50, 0, 14, 9, 14, 15, 11, ~
$ `Selesai Pemantauan` <dbl> 1873, 911, 211, 6, 37, 12, 15, 3, 2, 5,~
$ PDP <dbl> 1189, 213, 176, 3, 6, 10, 2, 8, 2, 9, 6~
$ `Masih Dirawat` <dbl> 822, 126, 123, 3, 5, 9, 2, 8, 2, 8, 6, ~
$ `Pulang dan Sehat` <dbl> 367, 87, 53, 0, 1, 1, 0, 0, 0, 1, 0, 4,~
$ POSITIF <dbl> 794, 154, 106, 4, 3, 6, 2, 13, 0, 8, 0,~
$ Dirawat <dbl> 481, 130, 62, 3, 1, 3, 2, 12, 0, 5, 0, ~
$ Sembuh <dbl> 51, 2, 5, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1,~
$ Meninggal <dbl> 88, 8, 10, 0, 2, 1, 0, 0, 0, 1, 0, 1, 0~
$ `Self Isolation` <dbl> 174, 14, 29, 0, 0, 2, 0, 1, 0, 2, 0, 1,~
$ Keterangan <dbl> 0, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
glimpse() reveals to us that there is a duplicate ID_KEL column - which our read_xlsx() function auto-renames with a ‘…n’ suffix, as seen in the ID_KEL…1 and ID_KEL…2 columns above.
This duplication extends from March 2020 to June 2020, after which another duplicate occurs in July 2020:
# reads the 31st July 2020 .xlsx aspatial data and stores it as jul2020
jul2020 <- read_xlsx("data/aspatial/Standar Kelurahan Data Corona (31 Juli 2020 Pukul 09.00).xlsx")
glimpse(jul2020)
Rows: 270
Columns: 28
$ ID_KEL <chr> NA, "3172051003", "3173041007", "317~
$ Nama_provinsi <chr> NA, "DKI JAKARTA", "DKI JAKARTA", "D~
$ nama_kota <chr> NA, "JAKARTA UTARA", "JAKARTA BARAT"~
$ nama_kecamatan <chr> NA, "PADEMANGAN", "TAMBORA", "KRAMAT~
$ nama_kelurahan <chr> "TOTAL", "ANCOL", "ANGKE", "BALE KAM~
$ SUSPEK <dbl> 58603, 164, 202, 74, 34, 111, 91, 63~
$ `Perawatan RS` <dbl> 1617, 8, 4, 9, 0, 0, 1, 1, 6, 0, 10,~
$ `Isolasi di Rumah...8` <dbl> 2229, 28, 7, 8, 0, 1, 0, 0, 11, 0, 1~
$ `Suspek Meninggal` <dbl> 2226, 0, 2, 1, 1, 1, 2, 1, 0, 1626, ~
$ `Selesai Isolasi...10` <dbl> 52531, 128, 189, 56, 33, 109, 88, 61~
$ PROBABLE <dbl> 628, 0, 0, 0, 0, 0, 0, 0, 1, 587, 0,~
$ `Probable Meninggal` <dbl> 590, 0, 0, 0, 0, 0, 0, 0, 1, 586, 0,~
$ `Selesai Isolasi...13` <dbl> 38, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,~
$ `PELAKU PERJALANAN` <dbl> 2004, 7, 0, 3, 0, 2, 18, 2, 1, 116, ~
$ `Isolasi di Rumah...15` <dbl> 134, 2, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0~
$ `Selesai Isolasi...16` <dbl> 1870, 5, 0, 2, 0, 2, 18, 2, 1, 116, ~
$ `KONTAK ERAT` <dbl> 96920, 311, 296, 177, 32, 146, 376, ~
$ `Isolasi di Rumah...18` <dbl> 10389, 34, 16, 81, 1, 26, 7, 11, 112~
$ `Selesai Isolasi...19` <dbl> 86531, 277, 280, 96, 31, 120, 369, 6~
$ DISCARDED <dbl> 5818, 4, 22, 22, 0, 14, 6, 7, 26, 87~
$ Meninggal...21 <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, ~
$ `Selesai Isolasi...22` <dbl> 5817, 4, 22, 22, 0, 14, 6, 7, 26, 86~
$ POSITIF <dbl> 21201, 46, 49, 31, 7, 45, 30, 16, 42~
$ Dirawat <dbl> 2183, 5, 6, 3, 1, 3, 5, 1, 0, 1209, ~
$ Sembuh <dbl> 13208, 38, 36, 21, 5, 21, 23, 13, 31~
$ Meninggal...26 <dbl> 836, 0, 2, 1, 1, 0, 2, 0, 3, 88, 2, ~
$ `Self Isolation` <dbl> 4974, 3, 5, 6, 0, 21, 0, 2, 8, 2396,~
$ Keterangan <dbl> 0, NA, NA, NA, NA, NA, NA, NA, NA, N~
This time, there are duplicates of the Isolasi di Rumah, Selesai Isolasi and Meninggal columns, and this duplication extends till August 2021. While this might seem concerning, we’ll address this in the following sections.
Firstly, let us list down the requirements for our aspatial data:

- each record needs a Date field, so that the monthly files can still be told apart once merged
- only the columns relevant to our analysis should be retained
- for the duplicated Meninggal column, we should keep the LAST duplicate, which properly reflects the cumulative death cases

Our columns of interest (those that are relevant to our analysis) are: Date, ID_KEL, Nama_provinsi, nama_kota, nama_kecamatan, nama_kelurahan, POSITIF and Meninggal.
Now that we know our requirements, we can do this step-by-step: importing all the files into one dataframe, retaining the necessary columns (and deleting duplicate columns), then adding the Date column. Alternatively, we can combine all of these into a function!
# takes in an aspatial data filepath and returns a processed output
aspatial_preprocess <- function(filepath){
  # .name_repair="minimal" tells read_xlsx NOT to replace duplicate column names with 'NAME...n' like we saw above!
  # reference: https://readxl.tidyverse.org/articles/articles/column-names.html
  result_file <- read_xlsx(filepath, .name_repair="minimal")

  # Remove Duplicate Columns
  # keep the columns that aren't duplicated; where duplicates exist, keep the LAST one
  # (fromLast = TRUE marks the earlier occurrences as the duplicates)
  # reference: https://www.marsja.se/how-to-remove-duplicates-in-r-rows-columns-dplyr/
  # reference: https://stackoverflow.com/questions/16905425/find-duplicate-values-in-r
  result_file <- result_file[, !duplicated(colnames(result_file), fromLast = TRUE)]

  # Create the Date Column
  # the filenames follow the format: Standar Kelurahan Data Corona (DD Month YYYY Pukul Ti.me)
  # the date technically starts after "(", but "(" is a special character in regular expressions
  # and leads to a warning message, so we anchor on "Corona" instead; the [[1]] refers to the
  # first element in the list returned by gregexpr()
  # the date substring (in DD Month YYYY form) starts 8 characters after "Corona" begins, and
  # ends 2 characters before "Pukul" (one for the 'P', one for the space); substr() extracts it
  # reference: https://stackoverflow.com/questions/14249562/find-the-location-of-a-character-in-string
  startpoint <- gregexpr(pattern="Corona", filepath)[[1]] + 8
  endpoint <- gregexpr(pattern="Pukul", filepath)[[1]] - 2
  result_file$Date <- substr(filepath, startpoint, endpoint)

  # Retain the Relevant Columns
  result_file <- result_file %>%
    select("Date",
           "ID_KEL",
           "Nama_provinsi",
           "nama_kota",
           "nama_kecamatan",
           "nama_kelurahan",
           "POSITIF",
           "Meninggal")
  return(result_file)
}
Now that we have our custom aspatial_preprocess function, we need to feed the files into it. There’s the option of doing it manually, inputting our file names line by line - but with the handy functions list.files() and lapply() (which applies a function to every element of a list), the process can be cut down to a few lines!
# in the folder 'data/aspatial', find files with the extension '.xlsx' and add it to our fileslist
# the full.names=TRUE prepends the directory path to the file names, giving a relative file path - otherwise, only the file names (not the paths) would be returned
# reference: https://stat.ethz.ch/R-manual/R-devel/library/base/html/list.files.html
fileslist <-list.files(path = "data/aspatial", pattern = "*.xlsx", full.names=TRUE)
# afterwards, for every element in fileslist, apply aspatial_process function
dflist <- lapply(seq_along(fileslist), function(x) aspatial_preprocess(fileslist[x]))
Lastly, we’ll need to convert dflist into an actual dataframe with ldply(), like so:
cases_jakarta <- ldply(dflist, data.frame)
Let’s check what cases_jakarta looks like, and make sure the columns are correct:
glimpse(cases_jakarta)
Rows: 4,590
Columns: 8
$ Date <chr> "28 Februari 2021", "28 Februari 2021", "28 F~
$ ID_KEL <chr> NA, "3172051003", "3173041007", "3175041005",~
$ Nama_provinsi <chr> NA, "DKI JAKARTA", "DKI JAKARTA", "DKI JAKART~
$ nama_kota <chr> NA, "JAKARTA UTARA", "JAKARTA BARAT", "JAKART~
$ nama_kecamatan <chr> NA, "PADEMANGAN", "TAMBORA", "KRAMAT JATI", "~
$ nama_kelurahan <chr> "TOTAL", "ANCOL", "ANGKE", "BALE KAMBANG", "B~
$ POSITIF <dbl> 339735, 834, 617, 755, 358, 870, 811, 1038, 1~
$ Meninggal <dbl> 5478, 9, 8, 15, 8, 13, 10, 10, 25, 12, 22, 30~
Since the values in the Date column were derived from substrings, they’re naturally in string format. We should convert them into datetime format, keeping in mind that the values in Date are in Bahasa Indonesia:
# the 'Date' strings use Indonesian month names, so the locale has to be set before parsing
# reference: https://stackoverflow.com/questions/53380650/b-y-date-conversion-gives-na
# locale="ind" sets the locale to Indonesian
Sys.setlocale(locale="ind")
[1] "LC_COLLATE=Indonesian_Indonesia.1252;LC_CTYPE=Indonesian_Indonesia.1252;LC_MONETARY=Indonesian_Indonesia.1252;LC_NUMERIC=C;LC_TIME=Indonesian_Indonesia.1252"
Rows: 4,590
Columns: 8
$ Date <date> 2021-02-28, 2021-02-28, 2021-02-28, 2021-02-~
$ ID_KEL <chr> NA, "3172051003", "3173041007", "3175041005",~
$ Nama_provinsi <chr> NA, "DKI JAKARTA", "DKI JAKARTA", "DKI JAKART~
$ nama_kota <chr> NA, "JAKARTA UTARA", "JAKARTA BARAT", "JAKART~
$ nama_kecamatan <chr> NA, "PADEMANGAN", "TAMBORA", "KRAMAT JATI", "~
$ nama_kelurahan <chr> "TOTAL", "ANCOL", "ANGKE", "BALE KAMBANG", "B~
$ POSITIF <dbl> 339735, 834, 617, 755, 358, 870, 811, 1038, 1~
$ Meninggal <dbl> 5478, 9, 8, 15, 8, 13, 10, 10, 25, 12, 22, 30~
Just like what we did in 3.6, we’ll translate the column names to English for ease of comprehension:
# renames the columns in the style New_Name = OLD_NAME
cases_jakarta <- cases_jakarta %>%
  dplyr::rename(
    ID=ID_KEL,
    Province=Nama_provinsi,
    City=nama_kota,
    District=nama_kecamatan,
    Sub_District=nama_kelurahan,
    Cases=POSITIF,
    Deaths=Meninggal
  )
Now that we have our confirmed dataframe, let’s execute any pre-processing we might have missed. Firstly, let’s check for missing values:
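A sketch of the check, mirroring what we did for the geospatial data:
# returns the rows of cases_jakarta with at least one NA field
cases_jakarta[rowSums(is.na(cases_jakarta)) > 0,]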
Date ID Province City District Sub_District Cases
1 2021-02-28 <NA> <NA> <NA> <NA> TOTAL 339735
271 2020-04-30 <NA> <NA> <NA> <NA> TOTAL 4138
541 2021-04-30 <NA> <NA> <NA> <NA> TOTAL 408620
811 2021-01-30 <NA> <NA> <NA> <NA> TOTAL 266244
1081 2020-06-30 <NA> <NA> <NA> <NA> TOTAL 11276
1351 2021-06-30 <NA> <NA> <NA> <NA> TOTAL 543468
1621 2020-11-30 <NA> <NA> <NA> <NA> TOTAL 136861
1891 2020-09-30 <NA> <NA> <NA> <NA> TOTAL 74368
2161 2020-08-31 <NA> <NA> <NA> <NA> <NA> 40309
2431 2020-12-31 <NA> <NA> <NA> <NA> TOTAL 183735
2701 2020-07-31 <NA> <NA> <NA> <NA> TOTAL 21201
2971 2021-07-31 <NA> <NA> <NA> <NA> TOTAL 814653
3241 2020-03-31 <NA> <NA> <NA> <NA> TOTAL 794
3511 2021-03-31 <NA> <NA> <NA> <NA> TOTAL 382055
3781 2020-05-31 <NA> <NA> <NA> <NA> TOTAL 7272
4051 2021-05-31 <NA> <NA> <NA> <NA> TOTAL 430059
4321 2020-10-31 <NA> <NA> <NA> <NA> TOTAL 105597
Deaths
1 5478
271 381
541 6733
811 4254
1081 641
1351 8486
1621 2671
1891 1731
2161 1202
2431 3287
2701 836
2971 12135
3241 88
3511 6341
3781 520
4051 7327
4321 2255
Since those rows have missing values from their ID all the way to their sub-districts, they should be removed accordingly:
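A minimal sketch of the removal, assuming we drop every row containing NAs (which covers the TOTAL rows above):
# removes all rows with missing values
cases_jakarta <- na.omit(cases_jakarta)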
In addition, a brief inspection of cases_jakarta reveals abnormal values in the ID column: specifically, BELUM DIKETAHUI, PROSES UPDATE DATA and LUAR DKI JAKARTA. Let’s remove them:
# removes the rows where ID is BELUM DIKETAHUI, PROSES UPDATE DATA and LUAR DKI JAKARTA
cases_jakarta <- cases_jakarta[!(cases_jakarta$ID=="BELUM DIKETAHUI" |
cases_jakarta$ID=="PROSES UPDATE DATA" |
cases_jakarta$ID=="LUAR DKI JAKARTA") ,]
Lastly, we want to remove any records from the Outer Islands - that is, those where the City is KAB.ADM.KEP.SERIBU:
# filters out the outer islands by accepting only if the value of City is NOT KAB.ADM.KEP.SERIBU
cases_jakarta <- filter(cases_jakarta, City != "KAB.ADM.KEP.SERIBU")
With that, we’re done with the data importing and pre-processing! Let’s move on to the next section 💪
Now that we have both the geospatial and aspatial data frames, we’ll need to join them. A quick look at their headers tells us what their common fields are:
# checks for column names of the dataframes
colnames(bd_jakarta)
[1] "Object_ID" "Village_Code" "Village"
[4] "Code" "Province" "City"
[7] "District" "Sub_District" "Total_Population"
[10] "geometry"
colnames(cases_jakarta)
[1] "Date" "ID" "Province" "City"
[5] "District" "Sub_District" "Cases" "Deaths"
It seems that the Province, City, District and Sub_District fields should match up. Let’s try doing that first:
# joins cases_jakarta to bd_jakarta based on Province, Sub_District and City
combined_jakarta <- left_join(bd_jakarta, cases_jakarta,
by=c(
"Province"="Province",
"Sub_District"="Sub_District",
"City"="City")
)
Now, let’s visualise our current combined_jakarta in terms of cases and deaths:
# maps the Cases and Deaths side-by-side
prelim_cases = tm_shape(combined_jakarta)+
tm_fill("Cases") +
tm_borders(alpha = 0.5) +
tm_layout(main.title="Preliminary Cases Count")
prelim_deaths = tm_shape(combined_jakarta)+
tm_fill("Deaths") +
tm_borders(alpha = 0.5) +
tm_layout(main.title="Preliminary Death Count")
tmap_arrange(prelim_cases, prelim_deaths)
Hmm, that’s not quite right. There are still some ‘missing’ values, even though we’ve taken care to remove NA records from both sets of data. The most likely reason is mismatched records (one says Berenstein while the other says Berenstain, for example) - which requires additional tweaking and investigation.
A brief comparison of the geospatial vs aspatial values reveals the mismatch: the Sub_District field isn’t standardised, which leads to the same sub-district being referred to by slightly different names, such as KRAMATJATI vs KRAMAT JATI. This is why these records appear as ‘missing values’. To find all of these discrepancies, we should look for sub-district names that appear in cases_jakarta but not in bd_jakarta, and vice versa, like so:
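The chunk producing the first list below isn’t shown; a sketch of it, using the same variable names as the reverse check that follows:
# extracts the sub-district names from both dataframes
cases_subdistrict <- c(cases_jakarta$Sub_District)
bd_subdistrict <- c(bd_jakarta$Sub_District)
# sub-district names in the aspatial data that are absent from the geospatial data
unique(cases_subdistrict[!(cases_subdistrict %in% bd_subdistrict)])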
[1] "BALE KAMBANG" "HALIM PERDANA KUSUMAH"
[3] "JATI PULO" "KALI BARU"
[5] "KAMPUNG TENGAH" "KERENDANG"
[7] "KRAMAT JATI" "PAL MERIAM"
[9] "PINANG RANTI" "RAWA JATI"
unique(bd_subdistrict[!(bd_subdistrict %in% cases_subdistrict)])
[1] "KALIBARU" "KRENDANG"
[3] "RAWAJATI" "TENGAH"
[5] "BALEKAMBANG" "PINANGRANTI"
[7] "JATIPULO" "PALMERIAM"
[9] "KRAMATJATI" "HALIM PERDANA KUSUMA"
# initialise a dataframe of our cases vs bd subdistrict spelling
spelling <- data.frame(
Aspatial_Cases=c("BALE KAMBANG", "HALIM PERDANA KUSUMAH", "JATI PULO", "KALI BARU", "KAMPUNG TENGAH", "KRAMAT JATI", "KERENDANG", "PAL MERIAM", "PINANG RANTI", "RAWA JATI"),
Geospatial_BD=c("BALEKAMBAG", "HALIM PERDANA KUSUMA", "JATIPULO", "KALIBARU", "TENGAH", "KRAMATJATI", "KRENDANG", "PALMERIAM", "PINANGRANTI", "RAWAJATI")
)
# with the dataframe as input, outputs a kable
library(knitr)
library(kableExtra)
kable(spelling, caption="Mismatched Records") %>%
kable_material("hover", latex_options="scale_down")
Aspatial_Cases | Geospatial_BD |
---|---|
BALE KAMBANG | BALEKAMBANG |
HALIM PERDANA KUSUMAH | HALIM PERDANA KUSUMA |
JATI PULO | JATIPULO |
KALI BARU | KALIBARU |
KAMPUNG TENGAH | TENGAH |
KRAMAT JATI | KRAMATJATI |
KERENDANG | KRENDANG |
PAL MERIAM | PALMERIAM |
PINANG RANTI | PINANGRANTI |
RAWA JATI | RAWAJATI |
Now that we know which sub-district records are mismatched, we need to rectify the mismatches by renaming them:
# where bd_jakarta is a mismatched value, replace with the correct value
bd_jakarta$Sub_District[bd_jakarta$Sub_District == 'BALEKAMBANG'] <- 'BALE KAMBANG'
bd_jakarta$Sub_District[bd_jakarta$Sub_District == 'HALIM PERDANA KUSUMA'] <- 'HALIM PERDANA KUSUMAH'
bd_jakarta$Sub_District[bd_jakarta$Sub_District == 'JATIPULO'] <- 'JATI PULO'
bd_jakarta$Sub_District[bd_jakarta$Sub_District == 'KALIBARU'] <- 'KALI BARU'
bd_jakarta$Sub_District[bd_jakarta$Sub_District == 'TENGAH'] <- 'KAMPUNG TENGAH'
bd_jakarta$Sub_District[bd_jakarta$Sub_District == 'KRAMATJATI'] <- 'KRAMAT JATI'
bd_jakarta$Sub_District[bd_jakarta$Sub_District == 'KRENDANG'] <- 'KERENDANG'
bd_jakarta$Sub_District[bd_jakarta$Sub_District == 'PALMERIAM'] <- 'PAL MERIAM'
bd_jakarta$Sub_District[bd_jakarta$Sub_District == 'PINANGRANTI'] <- 'PINANG RANTI'
bd_jakarta$Sub_District[bd_jakarta$Sub_District == 'RAWAJATI'] <- 'RAWA JATI'
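As a quick sanity check (a sketch), re-running the mismatch test should now return no names:
# should return character(0) now that the spellings are standardised
unique(cases_jakarta$Sub_District[!(cases_jakarta$Sub_District %in% bd_jakarta$Sub_District)])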
Now, we have a standardised common identifier among our geospatial and aspatial dataframes. Let’s join them once more:
# joins cases_jakarta to bd_jakarta based on Sub_District
combined_jakarta <- left_join(bd_jakarta, cases_jakarta,
by=c("Sub_District"="Sub_District")
)
Now, let’s once again visualise our updated combined_jakarta in terms of cases and deaths:
# maps the Cases and Deaths side-by-side
updated_cases = tm_shape(combined_jakarta)+
tm_fill("Cases") +
tm_borders(alpha = 0.5) +
tm_layout(main.title="Updated Cases Count")
updated_deaths = tm_shape(combined_jakarta)+
tm_fill("Deaths") +
tm_borders(alpha = 0.5) +
tm_layout(main.title="Updated Death Count")
tmap_arrange(updated_cases, updated_deaths)
Before we move on to EDA and thematic mapping, we need to calculate the cumulative confirmed cases rate (i.e. cases per 10,000 population) and the cumulative death rate by month, as per assignment requirements.
The cumulative confirmed cases rate can be calculated by summing up the number of cases (grouped by sub-district and date), dividing by the total population, and multiplying by 10,000 - in other words, this is the number of cases for every 10,000 people. The result can then easily be reshaped into one column per month with pivot_wider().
# grouping based on the sub-district and date
# the cumulative_case_rate is based on the sum of cases over the total population
cases_rate <- cases_jakarta %>%
inner_join(bd_jakarta, by=c("Sub_District" = "Sub_District")) %>%
group_by(Sub_District, Date) %>%
dplyr::summarise(`cumulative_case_rate` = ((sum(Cases)/(Total_Population))*10000)) %>%
#afterwards, pivots the table based on the Dates, using the cumulative case rate as the values
ungroup() %>% pivot_wider(names_from = Date,
values_from = cumulative_case_rate)
Our cases_rate dataframe will look like this:
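To verify the pivoted structure - one row per sub-district, one column per month - we can take a quick glimpse (a sketch, as the original output isn’t shown):
# inspects the pivoted cases_rate dataframe
glimpse(cases_rate)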
Assumption: Cumulative Death Rate refers to the total number of deaths, divided by population. For total number of deaths over confirmed cases, also known as the Case Fatality Ratio, please refer to section 5.5.3 below!
# grouping based on the sub-district and date
# the cumulative_death_rate is based on the sum of deaths over the total population
death_rate <- cases_jakarta %>%
inner_join(bd_jakarta, by=c("Sub_District" = "Sub_District")) %>%
group_by(Sub_District, Date) %>%
dplyr::summarise(`cumulative_death_rate` = ((sum(Deaths)/(Total_Population))*10000)) %>%
#afterwards, pivots the table based on the Dates, showing the cumulative death rate
ungroup() %>% pivot_wider(names_from = Date,
values_from = cumulative_death_rate)
Case Fatality Ratio, also known as CFR, is the proportion of individuals diagnosed with a disease who die from that disease. Within this context, it refers to the proportion of people who contracted COVID-19 and passed away. The CFR is used as a measure of severity among detected cases.
The Case Fatality Ratio can be represented as such: \[ \text{Case Fatality Ratio (CFR, in \%)} = \frac{\text{Number of deaths from disease}}{\text{Number of confirmed cases of disease}} \times 100 \]
# grouping based on the sub-district and date
# the case_fatality_ratio is based on the sum of deaths over confirmed cases
# note that the CFR is in % form, which is why we *100
CFR <- cases_jakarta %>%
inner_join(bd_jakarta, by=c("Sub_District" = "Sub_District")) %>%
group_by(Sub_District, Date) %>%
dplyr::summarise(`case_fatality_ratio` = ((sum(Deaths)/sum(Cases))*100)) %>%
#afterwards, pivots the table based on the Dates, showing the case fatality ratio
ungroup() %>% pivot_wider(names_from = Date,
values_from = case_fatality_ratio)
Note that for CFR, there are some NaN values on account of the number of confirmed cases (our denominator) being zero. To accurately reflect the CFR, we will not remove those rows or transform NaN to 0: on a map, they should appear as ‘missing values’.
Before we move on into the mapping, we should convert these dataframes into sf objects.
combined_jakarta <- st_as_sf(combined_jakarta)
# need to join our previous dataframes with the geospatial data to ensure that geometry column is present
cases_rate <- cases_rate%>% left_join(bd_jakarta, by=c("Sub_District"="Sub_District"))
cases_rate <- st_as_sf(cases_rate)
death_rate <- death_rate%>% left_join(bd_jakarta, by=c("Sub_District"="Sub_District"))
death_rate <- st_as_sf(death_rate)
CFR <- CFR%>% left_join(bd_jakarta, by=c("Sub_District"="Sub_District"))
CFR <- st_as_sf(CFR)
I’m planning to use the jenks classification method. It ‘seeks to minimize the average deviation from the class mean while maximizing the deviation from the means of the other groups’ (source) and tends to identify real, natural classes within the data. However, jenks will not work as well if the data has a low variance, so let’s check the variance:
# most of the months are commented out for brevity: the variance strictly increases over time, so the first and last months suffice
var(cases_rate$`2020-03-31`)
[1] 10.14244
# var(cases_rate$`2020-04-30`)
# var(cases_rate$`2020-05-31`)
# var(cases_rate$`2020-06-30`)
# var(cases_rate$`2020-07-31`)
# var(cases_rate$`2020-08-31`)
# var(cases_rate$`2020-09-30`)
# var(cases_rate$`2020-10-31`)
# var(cases_rate$`2020-11-30`)
# var(cases_rate$`2020-12-31`)
# var(cases_rate$`2021-01-30`)
# var(cases_rate$`2021-02-28`)
# var(cases_rate$`2021-03-31`)
# var(cases_rate$`2021-04-30`)
# var(cases_rate$`2021-05-31`)
# var(cases_rate$`2021-06-30`)
var(cases_rate$`2021-07-31`)
[1] 105409.9
Since the variance is increasing over time and seems significant, we can continue with jenks classification. After some testing, 6 classes seems to be the optimum: with too many classes, it becomes hard for the human eye to differentiate between the gradients, while with too few classes, it is hard for any differentiation to be seen at all.
library(tmap)
# using the jenks method, with 6 classes
tmap_mode("plot")
tm_shape(cases_rate)+
tm_fill("2020-03-31",
n= 6,
style = "jenks",
title = "Case Rate") +
tm_layout(main.title = "Distribution of COVID-19 Case Rate in March 2020",
main.title.position = "center",
main.title.size = 1,
legend.height = 0.5,
legend.width = 0.4,
frame = TRUE) +
tm_borders(alpha = 0.5)
We should plot it for all the months - so let’s have a helper function for plotting this!
# input: the dataframe and the variable name - in this case, the month
# with style="jenks" for the jenks classification method
jenks_plot <- function(df, varname) {
tm_shape(cases_rate) +
tm_polygons() +
tm_shape(df) +
tm_fill(varname,
n= 6,
style = "jenks",
title = "Case Rate") +
tm_layout(main.title = varname,
main.title.position = "center",
main.title.size = 1.2,
legend.height = 0.45,
legend.width = 0.35,
frame = TRUE) +
tm_borders(alpha = 0.5)
}
Now, let’s visualise the jenks plots for all months:
# split it up into multiple arranges to make it easier to see
library(tmap)
tmap_mode("plot")
tmap_arrange(jenks_plot(cases_rate, "2020-03-31"),
jenks_plot(cases_rate, "2020-04-30"),
jenks_plot(cases_rate, "2020-05-31"),
jenks_plot(cases_rate, "2020-06-30"))
tmap_arrange(jenks_plot(cases_rate, "2020-07-31"),
jenks_plot(cases_rate, "2020-08-31"),
jenks_plot(cases_rate, "2020-09-30"),
jenks_plot(cases_rate, "2020-10-31"))
tmap_arrange(jenks_plot(cases_rate, "2020-11-30"),
jenks_plot(cases_rate, "2020-12-31"),
jenks_plot(cases_rate, "2021-01-30"),
jenks_plot(cases_rate, "2021-02-28"))
tmap_arrange(jenks_plot(cases_rate, "2021-03-31"),
jenks_plot(cases_rate, "2021-04-30"),
jenks_plot(cases_rate, "2021-05-31"),
jenks_plot(cases_rate, "2021-06-30"))
tmap_arrange(jenks_plot(cases_rate, "2021-07-31"))
Do note that each map has its own relative case rate: the ranges gradually grow larger over time with the greater influx of cases. By comparing the change of case rates over the months, there are a number of observations we can make:

- in the early months, the SENAYAN sub-district had the highest cases_rate, which is likely where the first cases initially appeared
- the highest cases_rate later shifted to GAMBIR, which would then remain as the sub-district with the highest cases_rate until the end of our data (July 2021)
- the spread seems to originate from SENAYAN, circulating within central Jakarta before taking root in GAMBIR
- the most intense case rates stay concentrated in central Jakarta (around GAMBIR)

Let us check for the sub-districts with the highest cases rate at the various stages:
# to check for darkest sub-district in early stages
cases_rate$Village[which.max(cases_rate$`2020-03-31`)]
[1] "SENAYAN"
# to check for darkest sub-district in the later stages
cases_rate$Village[which.max(cases_rate$`2020-08-31`)]
[1] "GAMBIR"
# reorder based on decreasing values for Feb 2021, and assign to a new df
cases_rate_descending <- cases_rate[order(cases_rate$`2021-02-28`,decreasing = TRUE),]
cases_rate_descending_top5 <- cases_rate_descending[1:5,]
cases_rate_descending_top5[,"Sub_District"]
Simple feature collection with 5 features and 1 field
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -3630194 ymin: 679978.2 xmax: -3623456 ymax: 690083.2
Projected CRS: DGN95 / Indonesia TM-3 zone 54.1
# A tibble: 5 x 2
Sub_District geometry
<chr> <MULTIPOLYGON [m]>
1 GAMBIR (((-3624851 689984.3, -3624853 689959.1, -3624852 68~
2 SENAYAN (((-3626081 682271.9, -3626092 682263.2, -3626128 68~
3 KUNINGAN TIMUR (((-3623728 682537.6, -3623715 682486.2, -3623685 68~
4 GUNUNG (((-3628388 681746.1, -3628389 681733.1, -3628387 68~
5 SETIA BUDI (((-3624919 685319, -3624920 685274.2, -3624915 6852~
As we can see, the top 5 sub-districts with the highest cases rate are:

- GAMBIR, which is consistently the top sub-district in that regard,
- SENAYAN, where the first batch of cases likely originated (in the sense that the area had multiple cases that increased and spread to surrounding sub-districts, not that the very first case occurred there),
- KUNINGAN TIMUR, GUNUNG and SETIA BUDI
In the above section, we saw each map with its own relative case rate - but to see the spatio-temporal progression of case rates, we should set a fixed range. In this case, customising our breakpoints would be ideal - just like above, we’ll be breaking into 6 classes.
Firstly, let’s find the maximum of the case_rates - since the highest case rate is logically in the latest month (July 2021), we’ll find the maximum case_rate from there:
# gets maximum value (case rate) of July 2021 from cases_rate df
max(cases_rate$`2021-07-31`)
[1] 3808.29
Now that we know our maximum number, let’s set our breakpoints:
# these breakpoints are chosen based on the jenks breaks ranges from the previous section
breakpoints = c(0, 4, 20, 49, 112, 550, 3808)
This time, let’s test it out on July 2021:
library(tmap)
tmap_mode("plot")
tm_shape(cases_rate)+
tm_fill("2021-07-31",
breaks = breakpoints,
title = "Case Rate") +
tm_layout(main.title = "Distribution of COVID-19 Case Rate in July 2021 (Spatio-Temporal)",
main.title.position = "center",
main.title.size = 1,
legend.height = 0.5,
legend.width = 0.4,
frame = TRUE) +
tm_borders(alpha = 0.5)
It might seem pretty skewed towards the ‘heavier’ end, but this is to be expected: after all, this is the end of our dataset, by which point COVID-19 has had over a year of presence in Jakarta. Once again, let’s create a helper function:
break_plot <- function(df, varname) {
tm_shape(cases_rate) +
tm_polygons() +
tm_shape(df) +
tm_fill(varname,
breaks= c(0, 4, 20, 49, 112, 550, 3808),
title = "Case Rate") +
tm_layout(main.title = varname,
main.title.position = "center",
main.title.size = 1.2,
legend.height = 0.45,
legend.width = 0.35,
frame = TRUE) +
tm_borders(alpha = 0.5)
}
Now, let’s visualise the custom break plots for all months:
# split it up into multiple arranges to make it easier to see
library(tmap)
tmap_mode("plot")
tmap_arrange(break_plot(cases_rate, "2020-03-31"),
break_plot(cases_rate, "2020-04-30"),
break_plot(cases_rate, "2020-05-31"),
break_plot(cases_rate, "2020-06-30"))
tmap_arrange(break_plot(cases_rate, "2020-07-31"),
break_plot(cases_rate, "2020-08-31"),
break_plot(cases_rate, "2020-09-30"),
break_plot(cases_rate, "2020-10-31"))
tmap_arrange(break_plot(cases_rate, "2020-11-30"),
break_plot(cases_rate, "2020-12-31"),
break_plot(cases_rate, "2021-01-30"),
break_plot(cases_rate, "2021-02-28"))
tmap_arrange(break_plot(cases_rate, "2021-03-31"),
break_plot(cases_rate, "2021-04-30"),
break_plot(cases_rate, "2021-05-31"),
break_plot(cases_rate, "2021-06-30"))
tmap_arrange(break_plot(cases_rate, "2021-07-31"))
Made by exporting .pngs of individual plots into an online gifmaker
Now, we can supplement our observations in 6.2 with a spatio-temporal dimension! Here are the observations made:

- SENAYAN, the top sub-district with the highest cases rate, reached the highest class of case rate first. Accordingly, the other top sub-districts eventually joined the highest class of case rate by the end of the dataset.

Like in our monthly cumulative cases in 6.1, we’ll be using the jenks classification method for our choropleth map, but we need to check the variance first:
var(death_rate$`2020-03-31`)
[1] 0.07458528
var(death_rate$`2020-04-30`)
[1] 0.3333565
var(death_rate$`2020-05-31`)
[1] 0.387508
var(death_rate$`2020-06-30`)
[1] 0.4573607
var(death_rate$`2020-07-31`)
[1] 0.5275186
var(death_rate$`2020-08-31`)
[1] 0.7535532
var(death_rate$`2020-09-30`)
[1] 0.990763
var(death_rate$`2020-10-31`)
[1] 1.156742
var(death_rate$`2020-11-30`)
[1] 1.313623
var(death_rate$`2020-12-31`)
[1] 1.787521
var(death_rate$`2021-01-30`)
[1] 3.001775
var(death_rate$`2021-02-28`)
[1] 4.046581
var(death_rate$`2021-03-31`)
[1] 5.269923
var(death_rate$`2021-04-30`)
[1] 5.513535
var(death_rate$`2021-05-31`)
[1] 6.851983
var(death_rate$`2021-06-30`)
[1] 7.674798
var(death_rate$`2021-07-31`)
[1] 13.12748
The low variance in the initial stages might seem concerning, but it increases over time to a significant level for the majority of the data, so we’ll continue with this. In addition, we’ll be using custom breakpoints in the next section for our spatio-temporal analysis, which will supplement our observations.
We’ll need to tweak our jenks_plot helper function a bit, changing the legend title and palette:
# input: the dataframe and the variable name - in this case, the month
# with style="jenks" for the jenks classification method
# minor tweak to tm_fill - changed title and palette
jenks_plot <- function(df, varname) {
tm_shape(cases_rate) +
tm_polygons() +
tm_shape(df) +
tm_fill(varname,
n= 6,
style = "jenks",
title = "Death Rate",
palette="Blues") +
tm_layout(main.title = varname,
main.title.position = "center",
main.title.size = 1.2,
legend.height = 0.45,
legend.width = 0.35,
frame = TRUE) +
tm_borders(alpha = 0.5)
}
Now, let’s visualise the jenks plots for all months:
# split it up into multiple arranges to make it easier to see
library(tmap)
tmap_mode("plot")
tmap_arrange(jenks_plot(death_rate, "2020-03-31"),
jenks_plot(death_rate, "2020-04-30"),
jenks_plot(death_rate, "2020-05-31"),
jenks_plot(death_rate, "2020-06-30"))
tmap_arrange(jenks_plot(death_rate, "2020-07-31"),
jenks_plot(death_rate, "2020-08-31"),
jenks_plot(death_rate, "2020-09-30"),
jenks_plot(death_rate, "2020-10-31"))
tmap_arrange(jenks_plot(death_rate, "2020-11-30"),
jenks_plot(death_rate, "2020-12-31"),
jenks_plot(death_rate, "2021-01-30"),
jenks_plot(death_rate, "2021-02-28"))
tmap_arrange(jenks_plot(death_rate, "2021-03-31"),
jenks_plot(death_rate, "2021-04-30"),
jenks_plot(death_rate, "2021-05-31"),
jenks_plot(death_rate, "2021-06-30"))
tmap_arrange(jenks_plot(death_rate, "2021-07-31"))
Like before, do note that each map has its own relative death rate: the ranges gradually grow larger over time with the greater influx of deaths. By comparing the change of death rates over the months, there are a number of observations we can make:

- SENAYAN, the sub-district with the highest cases rate, similarly has a high death rate
- the highest death rate belongs to KARET SEMANGGI, a position it holds from March 2020 to December 2020
- GAMBIR, the sub-district that would see the highest case rates till the end of our data (July 2021), actually had a relatively pale colouring compared to its fellow sub-districts. It was only from October 2020 onwards that it started turning dark, before attaining the highest death rate in Jan 2021

Let us check for the sub-districts with the highest death rate at the various stages:
# to check for darkest sub-district in early stages
death_rate$Village[which.max(death_rate$`2020-03-31`)]
[1] "KARET SEMANGGI"
# to check for darkest sub-district in the later stages
death_rate$Village[which.max(death_rate$`2021-02-28`)]
[1] "GAMBIR"
# reorder based on decreasing values for June 2021, and assign to a new df
death_rate_descending <- death_rate[order(death_rate$`2021-06-30`,decreasing = TRUE),]
death_rate_descending_top5 <- death_rate_descending[1:5,]
death_rate_descending_top5[,"Sub_District"]
Simple feature collection with 5 features and 1 field
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -3630194 ymin: 679978.2 xmax: -3623648 ymax: 690083.2
Projected CRS: DGN95 / Indonesia TM-3 zone 54.1
# A tibble: 5 x 2
Sub_District geometry
<chr> <MULTIPOLYGON [m]>
1 GAMBIR (((-3624851 689984.3, -3624853 689959.1, -3624852 68~
2 SELONG (((-3626336 681381.3, -3626335 681357.8, -3626355 68~
3 GUNUNG (((-3628388 681746.1, -3628389 681733.1, -3628387 68~
4 KARET KUNINGAN (((-3624392 683871, -3624371 683747.4, -3624220 6837~
5 SETIA BUDI (((-3624919 685319, -3624920 685274.2, -3624915 6852~
As we can see, the top 5 sub-districts with the highest death rate are:

- GAMBIR, which is consistently the top sub-district in both death rate and cases rate
- GUNUNG and SETIA BUDI, both of which appeared in the top 5 sub-districts with the highest cases rate, which seems to reinforce the idea that the case rate is highly associated with the death rate
- SELONG and KARET KUNINGAN, which take the places of the cases-rate top 5’s SENAYAN and KUNINGAN TIMUR
- notably, KARET SEMANGGI, which had high death rates from March 2020 till the end of the year, does not appear in the top 5 sub-districts

Interestingly, despite SENAYAN having one of the highest case rates, it does not appear to be ranked as highly in the death rate. This could indicate two things: either (a) SENAYAN’s local government has pushed out and enforced successful measures involving the local community (or distribution of aid), which has helped the COVID-19-affected patients recover, or (b) the case rate is not as tightly coupled to the death rate as we initially thought. We do not have enough information to vouch for either, but it is worth it for local governments on the district or provincial level to survey and check on the measures in place in SENAYAN.
In the above section, we saw each map with its own relative death rate - but to see the spatio-temporal progression of death rates, we should set a fixed range. Just like for our cumulative case rate in 6.3, we should customise our breakpoints and break into 6 classes.
Firstly, let’s find the maximum of the death_rate - with reference to the previous section’s graphs, the highest death rate is in the latest month (July 2021), so we’ll find the maximum death_rate from there:
# gets maximum value (death rate) of July 2021 from the death_rate df
max(death_rate$`2021-07-31`)
[1] 42.09845
Now that we know our maximum number, let’s set our breakpoints:
# these breakpoints are chosen based on the jenks breaks ranges from the previous section
breakpoints = c(0, 0.4, 1.3, 3.4, 6.5, 12, 42)
Once again, let’s tweak our previous helper function:
# minor tweak to tm_fill - change title and palette
break_plot <- function(df, varname) {
tm_shape(cases_rate) +
tm_polygons() +
tm_shape(df) +
tm_fill(varname,
breaks= c(0, 0.4, 1.3, 3.4, 6.5, 12, 42),
title = "Death Rate",
palette="Blues") +
tm_layout(main.title = varname,
main.title.position = "center",
main.title.size = 1.2,
legend.height = 0.45,
legend.width = 0.35,
frame = TRUE) +
tm_borders(alpha = 0.5)
}
Now, let’s visualise the custom break plots for all months:
# split it up into multiple arranges to make it easier to see
library(tmap)
tmap_mode("plot")
tmap_arrange(break_plot(death_rate, "2020-03-31"),
break_plot(death_rate, "2020-04-30"),
break_plot(death_rate, "2020-05-31"),
break_plot(death_rate, "2020-06-30"))
tmap_arrange(break_plot(death_rate, "2020-07-31"),
break_plot(death_rate, "2020-08-31"),
break_plot(death_rate, "2020-09-30"),
break_plot(death_rate, "2020-10-31"))
tmap_arrange(break_plot(death_rate, "2020-11-30"),
break_plot(death_rate, "2020-12-31"),
break_plot(death_rate, "2021-01-30"),
break_plot(death_rate, "2021-02-28"))
tmap_arrange(break_plot(death_rate, "2021-03-31"),
break_plot(death_rate, "2021-04-30"),
break_plot(death_rate, "2021-05-31"),
break_plot(death_rate, "2021-06-30"))
tmap_arrange(break_plot(death_rate, "2021-07-31"))
Let’s supplement our observations in 7.2:

- Similar to the case rate in 6.4, the general trend of the spread seems to (a) target neighbouring districts and (b) ‘grow from the center’, albeit more patchily and with less intensity than what we saw in the case rate spatio-temporal graphs. The cross-border nature of COVID-19 naturally leads to more cases, which naturally leads to more deaths from COVID-19 complications, thus increasing the deaths-over-population rate.
- Contrasting with the cases rate map, and related to the low variances of the initial months, there are paler areas all around for the first few months (March 2020~June 2020) before a marked uptick (and darkening areas) is observed from July onwards. This is in line with real-world events: the life-threatening danger of COVID-19 was always present, but new and nastier variants, alongside increasingly stretched resources, likely had a hand in the higher death rate as time went on.
Onto our final mapping! Like the previous two, we’ll be using the jenks classification method for our choropleth map - but this time, we won’t be doing a spatio-temporal map, mainly because CFR is not cumulative and thus the map of one month does not ‘build upon’ the map of the previous.
Once again, we’ll need to tweak our jenks_plot helper function a bit, changing the legend title and palette:
# input: the dataframe and the variable name - in this case, the month
# with style="jenks" for the jenks classification method
# minor tweak to tm_fill - changed title and palette
jenks_plot <- function(df, varname) {
tm_shape(cases_rate) +
tm_polygons() +
tm_shape(df) +
tm_fill(varname,
n= 6,
style = "jenks",
title = "CFR",
palette="Reds") +
tm_layout(main.title = varname,
main.title.position = "center",
main.title.size = 1.2,
legend.height = 0.45,
legend.width = 0.35,
frame = TRUE) +
tm_borders(alpha = 0.5)
}
Now, let’s visualise the jenks plots for all months:
# split it up into multiple arranges to make it easier to see
library(tmap)
tmap_mode("plot")
tmap_arrange(jenks_plot(CFR, "2020-03-31"),
jenks_plot(CFR, "2020-04-30"),
jenks_plot(CFR, "2020-05-31"),
jenks_plot(CFR, "2020-06-30"))
tmap_arrange(jenks_plot(CFR, "2020-07-31"),
jenks_plot(CFR, "2020-08-31"),
jenks_plot(CFR, "2020-09-30"),
jenks_plot(CFR, "2020-10-31"))
tmap_arrange(jenks_plot(CFR, "2020-11-30"),
jenks_plot(CFR, "2020-12-31"),
jenks_plot(CFR, "2021-01-30"),
jenks_plot(CFR, "2021-02-28"))
tmap_arrange(jenks_plot(CFR, "2021-03-31"),
jenks_plot(CFR, "2021-04-30"),
jenks_plot(CFR, "2021-05-31"),
jenks_plot(CFR, "2021-06-30"))
tmap_arrange(jenks_plot(CFR, "2021-07-31"))
By comparing the change of CFR over the months, there are a number of observations we can make:

- SENAYAN, the top sub-district with the highest cases rate, has a low CFR - which is notable, considering that it’s near Central Jakarta, where there are a few sub-districts with intensely high CFR

As stated when we first started this analysis, our goal is to both visualise and understand the patterns shown in the data. As such, this section provides some background information on the spread of the cases and their corresponding death rate - and especially why we see such a violent surge in the cases rate and death rate.
As an article discussing Indonesia’s COVID-19 Response Strategies points out, Indonesia suffered from poor response time and inefficient strategies for mobilising health resources, an opinion that is reiterated in an article explaining the surge in Indonesia’s cases.
In addition, poor data-sharing between the upper levels of government (country-wide, provincial) and the local governments (district, sub-district) led to a stifling of local government actions, making it hard for them to act without data or authority - which in turn fed into the lack of community participation. The earlier stages, when it was easiest to suppress the cases, were not taken advantage of, causing the situation to snowball into something critical. This was aggravated by their data management, which has 3 major flaws.
As the COVID-19 pandemic continues, there is an increasing need for transparency and efficiency in data sharing, especially for a country with governmental bodies across multiple subdivisions, such as Indonesia. Increasingly, local governments find themselves needing to make information available to the public, and even the UN encourages governments around the world to embrace digital government.
This data is especially significant for the finer-grained levels of local government - our sub-districts. They are the closest to the people in terms of response time, capability of enforcing quarantines and, most importantly, distributing aid, and are oftentimes more effective and targeted than central government efforts. Their sub-district data, in turn, feeds back to the higher-ups - the districts, provinces, and finally the country-wide government - who can then adjust their regulations, send targeted aid to particular areas, and better address the situation. In other words, this data enables the different levels of government to work towards a common goal like a well-oiled machine.
Such data isn’t just limited to sharing up-to-date information: it also extends to visualising said data in an accurate and easy-to-understand format. In a situation where response time is key, it is critical to have your readers understand the data from a few key visualisations and start planning how to take action. In addition, such data is transparent to the citizens as well - giving them a gauge of the risk areas and keeping them up-to-date on the measures.
This is the significance of geospatial analytics and of this analysis: to ensure that data is accurate and understandable for all readers regardless of their background or experience, in the hopes of enabling action and keeping your readers in the loop.
While the COVID-19 pandemic is still ongoing, there are particular events that will undoubtedly lead to spikes in the cases, even with heavy government intervention. For example, a number of citizens circumvented the ban on mudik to visit their hometowns for the Hari Raya Aidilfitri holidays.
Additionally, if distinctions are made between the variants/strains of COVID-19 cases (alpha to delta), I believe we can create a more accurate analysis with more potential information to be gleaned. This is in light of the discovery of the Delta variant of COVID-19, whose fast spread between sub-districts implies that we should have a separate analysis tracking its spread, so as to best contain it and deliver aid to the affected sub-districts.
Thank you Prof. Kam for all the lecture notes and hands-on exercises - I took a great deal of reference from them, from data wrangling to mapping!
Here is the list of resources used in this analysis, as well as their links. Special thanks go to the Senior Sample Submissions that Prof. Kam recommended! ❤️