Take-Home Exercise 1: Analysing and Visualising Spatio-temporal Patterns of COVID-19 in DKI Jakarta, Indonesia

Take-Home Exercise R sf tidyverse tmap maptools kableExtra plyr

This analysis aims to visualise and understand the spatio-temporal patterns of COVID-19 cases in DKI Jakarta and its sub-districts.

Megan Sim https://www.linkedin.com/in/megan-sim-tze-yen/
09-10-2021

1.0 Overview

1.1 Background

On 2nd March 2020, alarm bells rang across Indonesia: newspapers countrywide printed the same headline, radio stations blared, and worried eyes turned to a televised statement from President Joko Widodo.

He had just confirmed the first two cases of COVID-19 in the country.

This would mark the start of COVID-19’s influence on Indonesia, spanning over 4 million confirmed cases and over 130,000 reported cumulative deaths as of the time of writing, and marking Indonesia as the country with the highest number of reported COVID-19 cases in Southeast Asia. A quick look at Indonesia’s official COVID-19 Distribution Map gives a statistical overview and an approximate geographical picture of the hardest-hit areas. Yet the breadth of its content belies a level of generalisation that makes finer-grained analysis difficult.

Why does this matter? Out of the 34 provinces in Indonesia, DKI Jakarta made up close to 24% of cumulative confirmed cases - a shocking statistic that bears looking into. As such, our goal is to dive deeper into sub-district levels of data and visualise the spatio-temporal patterns in DKI Jakarta.

1.2 Problem Statement

This analysis aims to visualise and understand the spatio-temporal patterns of COVID-19 cases in DKI Jakarta and its sub-districts, with our goals being:

- to compute the monthly cumulative confirmed cases rate, cumulative death rate and case fatality ratio (CFR) at the sub-district level, and
- to visualise and analyse their spatio-temporal patterns with choropleth maps.

2.0 Setup

2.1 Packages Used

The R packages we’ll use for this analysis are:

- sf: for importing, managing and processing geospatial data
- tmap: for plotting thematic (choropleth) maps
- maptools: a set of tools for manipulating geographic data
- kableExtra: for building and styling HTML tables
- plyr: for splitting, applying and combining data - we use its ldply() in section 4.3

In addition, the following tidyverse packages will be used:

- readxl: for importing our .xlsx aspatial data
- dplyr and tidyr: for wrangling our data (renaming, filtering, joining and pivoting)

Show code
# initialise a list of required packages
# note that 'readxl' is in the list of packages despite being part of tidyverse: as readxl is not a core tidyverse package, it needs to be loaded explicitly
packages = c('plyr', 'sf', 'tidyverse', 'readxl', 'tmap', 'maptools', 'kableExtra')

# for each package, check if installed and if not, install it
for (p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p,character.only = T)
}

2.2 Datasets Used

Show code
# initialise a dataframe of our geospatial and aspatial data details
datasets <- data.frame(
  Type=c("Geospatial",
         "Aspatial"),
  Name=c("[Batas Desa Provinsi DKI Jakarta](https://www.indonesia-geospasial.com/2020/04/download-shapefile-shp-batas-desa.html)",
         "[Standar Kelurahan Data Corona (Monthly)](https://riwayat-file-covid-19-dki-jakarta-jakartagis.hub.arcgis.com/)"),
  Format=c("Shapefile", 
           ".xlsx"),
  Description=c("Sub-districts in DKI Jakarta",
                "Sub-district level data of daily COVID-19 cases in DKI Jakarta  
                between March 2020~July 2021")
  )

# with reference to this guide on kableExtra:
# https://cran.r-project.org/web/packages/kableExtra/vignettes/awesome_table_in_html.html
# kable_material is the name of the kable theme
# 'hover' to highlight a row when hovering; 'scale_down' to adjust the table to fit the page width
library(knitr)
library(kableExtra)
kable(head(datasets), caption="Datasets Used") %>%
  kable_material("hover", latex_options="scale_down")
Table 1: Datasets Used

Type         Name                                       Format     Description
Geospatial   Batas Desa Provinsi DKI Jakarta            Shapefile  Sub-districts in DKI Jakarta
Aspatial     Standar Kelurahan Data Corona (Monthly)    .xlsx      Sub-district level data of daily COVID-19 cases in DKI Jakarta between March 2020 and July 2021

To retrieve the monthly cumulative records for the COVID-19 cases in Jakarta, we took the data compiled on the last day of each month, with one exception: the link for the 31st January 2021 data file was broken, hence the 30th January 2021 data file was used instead.
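As a quick reference (a base-R sketch, not part of the analysis pipeline), the month-end dates we’re after can be enumerated by generating the first day of each following month and stepping back one day:

# first days of April 2020 through August 2021...
month_starts <- seq(as.Date("2020-04-01"), as.Date("2021-08-01"), by = "month")
# ...minus one day gives the last days of March 2020 through July 2021
month_ends <- month_starts - 1
month_ends
# (per the exception above, the 30th January 2021 file stands in for 2021-01-31)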

3.0 Data Wrangling: Geospatial Data

3.1 Importing Geospatial Data

Show code
# bd stands for "batas desa", translated as "village boundary"
# reads in geospatial data and stores into bd_jakarta dataframe
bd_jakarta <- st_read(dsn="data/geospatial",
                      layer="BATAS_DESA_DESEMBER_2019_DUKCAPIL_DKI_JAKARTA")
Reading layer `BATAS_DESA_DESEMBER_2019_DUKCAPIL_DKI_JAKARTA' from data source `C:\IS415\IS415_msty\_posts\2021-09-10-take-home-exercise-1\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 269 features and 161 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 106.3831 ymin: -6.370815 xmax: 106.9728 ymax: -5.184322
Geodetic CRS:  WGS 84

From the output message, we learn that:

- bd_jakarta contains 269 features (sub-district boundaries) and 161 fields
- its geometry type is MULTIPOLYGON
- its assigned geodetic CRS is WGS 84

3.2 Data Pre-Processing

Before we even start visualising our data, we have to first check for two things: invalid geometries and missing values, which could impact future calculations and representations.

Reference was taken from the senior sample submissions for the code in this section, with credit to Xiao Rong Wong’s Geographic Analysis of the Supply & Demand of Childcare Services in Singapore.

3.2.1 Invalid Geometries

Let us first check for invalid geometries:

Show code
# function breakdown:
# st_is_valid() checks whether each geometry is valid
# which() returns the indices of elements that satisfy a logical condition
# length() returns the number of elements in an object

# checks for the number of geometries that are NOT valid
length(which(st_is_valid(bd_jakarta) == FALSE))
[1] 0
Show code
# Alternative Method
# test <- st_is_valid(bd_jakarta,reason=TRUE)
# length(which(test!= "Valid Geometry"))
# credit to Rajiv Abraham Xavier https://rpubs.com/rax/Take_Home_Ex01

There are no invalid geometries.

3.2.2 Missing Values

Now, let us check for missing values:

Show code
# rowSums(is.na(bd_jakarta)) != 0 flags every row that contains at least one NA value
# subsetting bd_jakarta by this logical vector prints said rows
bd_jakarta[rowSums(is.na(bd_jakarta))!=0,]
Simple feature collection with 2 features and 161 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 106.8412 ymin: -6.154036 xmax: 106.8612 ymax: -6.144973
Geodetic CRS:  WGS 84
    OBJECT_ID KODE_DESA             DESA   KODE    PROVINSI KAB_KOTA
243     25645  31888888     DANAU SUNTER 318888 DKI JAKARTA     <NA>
244     25646  31888888 DANAU SUNTER DLL 318888 DKI JAKARTA     <NA>
    KECAMATAN DESA_KELUR JUMLAH_PEN JUMLAH_KK LUAS_WILAY KEPADATAN
243      <NA>       <NA>          0         0          0         0
244      <NA>       <NA>          0         0          0         0
    PERPINDAHA JUMLAH_MEN PERUBAHAN WAJIB_KTP SILAM KRISTEN KHATOLIK
243          0          0         0         0     0       0        0
244          0          0         0         0     0       0        0
    HINDU BUDHA KONGHUCU KEPERCAYAA PRIA WANITA BELUM_KAWI KAWIN
243     0     0        0          0    0      0          0     0
244     0     0        0          0    0      0          0     0
    CERAI_HIDU CERAI_MATI U0 U5 U10 U15 U20 U25 U30 U35 U40 U45 U50
243          0          0  0  0   0   0   0   0   0   0   0   0   0
244          0          0  0  0   0   0   0   0   0   0   0   0   0
    U55 U60 U65 U70 U75 TIDAK_BELU BELUM_TAMA TAMAT_SD SLTP SLTA
243   0   0   0   0   0          0          0        0    0    0
244   0   0   0   0   0          0          0        0    0    0
    DIPLOMA_I DIPLOMA_II DIPLOMA_IV STRATA_II STRATA_III BELUM_TIDA
243         0          0          0         0          0          0
244         0          0          0         0          0          0
    APARATUR_P TENAGA_PEN WIRASWASTA PERTANIAN NELAYAN AGAMA_DAN
243          0          0          0         0       0         0
244          0          0          0         0       0         0
    PELAJAR_MA TENAGA_KES PENSIUNAN LAINNYA GENERATED KODE_DES_1
243          0          0         0       0      <NA>       <NA>
244          0          0         0       0      <NA>       <NA>
    BELUM_ MENGUR_ PELAJAR_ PENSIUNA_1 PEGAWAI_ TENTARA KEPOLISIAN
243      0       0        0          0        0       0          0
244      0       0        0          0        0       0          0
    PERDAG_ PETANI PETERN_ NELAYAN_1 INDUSTR_ KONSTR_ TRANSP_ KARYAW_
243       0      0       0         0        0       0       0       0
244       0      0       0         0        0       0       0       0
    KARYAW1 KARYAW1_1 KARYAW1_12 BURUH BURUH_ BURUH1 BURUH1_1
243       0         0          0     0      0      0        0
244       0         0          0     0      0      0        0
    PEMBANT_ TUKANG TUKANG_1 TUKANG_12 TUKANG__13 TUKANG__14
243        0      0        0         0          0          0
244        0      0        0         0          0          0
    TUKANG__15 TUKANG__16 TUKANG__17 PENATA PENATA_ PENATA1_1 MEKANIK
243          0          0          0      0       0         0       0
244          0          0          0      0       0         0       0
    SENIMAN_ TABIB PARAJI_ PERANCA_ PENTER_ IMAM_M PENDETA PASTOR
243        0     0       0        0       0      0       0      0
244        0     0       0        0       0      0       0      0
    WARTAWAN USTADZ JURU_M PROMOT ANGGOTA_ ANGGOTA1 ANGGOTA1_1
243        0      0      0      0        0        0          0
244        0      0      0      0        0        0          0
    PRESIDEN WAKIL_PRES ANGGOTA1_2 ANGGOTA1_3 DUTA_B GUBERNUR
243        0          0          0          0      0        0
244        0          0          0          0      0        0
    WAKIL_GUBE BUPATI WAKIL_BUPA WALIKOTA WAKIL_WALI ANGGOTA1_4
243          0      0          0        0          0          0
244          0      0          0        0          0          0
    ANGGOTA1_5 DOSEN GURU PILOT PENGACARA_ NOTARIS ARSITEK AKUNTA_
243          0     0    0     0          0       0       0       0
244          0     0    0     0          0       0       0       0
    KONSUL_ DOKTER BIDAN PERAWAT APOTEK_ PSIKIATER PENYIA_ PENYIA1
243       0      0     0       0       0         0       0       0
244       0      0     0       0       0         0       0       0
    PELAUT PENELITI SOPIR PIALAN PARANORMAL PEDAGA_ PERANG_ KEPALA_
243      0        0     0      0          0       0       0       0
244      0        0     0      0          0       0       0       0
    BIARAW_ WIRASWAST_ LAINNYA_12 LUAS_DESA KODE_DES_3 DESA_KEL_1
243       0          0          0         0       <NA>       <NA>
244       0          0          0         0       <NA>       <NA>
    KODE_12                       geometry
243       0 MULTIPOLYGON (((106.8612 -6...
244       0 MULTIPOLYGON (((106.8504 -6...

That’s a long output! There are too many columns, which will clutter up future analysis - we’ll take care of that in 3.5. As we can see, there are two particular rows with missing values for KAB_KOTA (City), KECAMATAN (District) and DESA_KELUR (Sub-District), as well as a few other fields - specifically, OBJECT_ID 25645 and 25646. We’ll remove them:

Show code
# na.omit() removes rows that contain an NA value in any column
# in the context of this data, the rows we want gone are those missing DESA_KELUR:
# since we're analysing at the sub-district level, rows without a sub-district are unusable
# (here, the only NA rows are the two DANAU SUNTER records above, so na.omit() is safe)
bd_jakarta <- na.omit(bd_jakarta)

3.3 Verifying + Transforming Coordinate System

Let us check the CRS of bd_jakarta:

Show code
# retrieves the coordinate system of bd_jakarta
st_crs(bd_jakarta)
Coordinate Reference System:
  User input: WGS 84 
  wkt:
GEOGCRS["WGS 84",
    DATUM["World Geodetic System 1984",
        ELLIPSOID["WGS 84",6378137,298.257223563,
            LENGTHUNIT["metre",1]]],
    PRIMEM["Greenwich",0,
        ANGLEUNIT["degree",0.0174532925199433]],
    CS[ellipsoidal,2],
        AXIS["latitude",north,
            ORDER[1],
            ANGLEUNIT["degree",0.0174532925199433]],
        AXIS["longitude",east,
            ORDER[2],
            ANGLEUNIT["degree",0.0174532925199433]],
    ID["EPSG",4326]]

As we can see, the assigned coordinate system is WGS 84, the ‘World Geodetic System 1984’. In the context of this dataset, this isn’t appropriate: as this is an Indonesia-specific geospatial dataset, we should be using the national CRS of Indonesia, DGN95, the ‘Datum Geodesi Nasional 1995’, EPSG code 23845. Let’s rectify that:

Show code
# transforms the CRS to DGN95, EPSG code 23845
bd_jakarta <- st_transform(bd_jakarta, 23845)

Let us check if the CRS has been properly assigned:

Show code
st_crs(bd_jakarta)
Coordinate Reference System:
  User input: EPSG:23845 
  wkt:
PROJCRS["DGN95 / Indonesia TM-3 zone 54.1",
    BASEGEOGCRS["DGN95",
        DATUM["Datum Geodesi Nasional 1995",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4755]],
    CONVERSION["Indonesia TM-3 zone 54.1",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",0,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",139.5,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",0.9999,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",200000,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",1500000,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["easting (X)",east,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["northing (Y)",north,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre."],
        AREA["Indonesia - onshore east of 138°E."],
        BBOX[-9.19,138,-1.49,141.01]],
    ID["EPSG",23845]]

Success! 😄

3.4 Removal of Outer Islands

We’ve finished our standard pre-processing! Let’s quickly visualise the data…

Show code
# plots the geometry only
# alternatively, we can use plot(bd_jakarta$geometry)
plot(st_geometry(bd_jakarta))

As we can see, bd_jakarta includes both the mainland and the outer islands. As per the assignment requirements, since the outer islands aren’t relevant to our analysis, we’ll have to remove them.

There are a few ways we can do this - but the best way would be to explore our data. From 3.2.2, we saw an output of all the fields - and drew attention to KAB_KOTA (City), KECAMATAN (District) and DESA_KELUR (Sub-District).

Of these three, KAB_KOTA would be the logical choice: it’s the coarsest-grained level of distinction. Let’s check for its unique values:

Show code
# outputs unique values of the KAB_KOTA field
unique(bd_jakarta$"KAB_KOTA")
[1] "JAKARTA BARAT"    "JAKARTA PUSAT"    "KEPULAUAN SERIBU"
[4] "JAKARTA UTARA"    "JAKARTA TIMUR"    "JAKARTA SELATAN" 

As we can see, all cities within Jakarta have a JAKARTA prefix, while KEPULAUAN SERIBU (translated as ‘Thousand Islands’) refers to the outer islands. Just to check, we can visualise KAB_KOTA:

Show code
# with bd_jakarta as the input data (setting the 'base')
# draw the KAB_KOTA (city) polygons
# essentially shades the map according to the city divisions
tm_shape(bd_jakarta) + 
  tm_polygons("KAB_KOTA")

Now that we know how to identify the outer islands, it’s time to remove them:

Show code
# filters out the outer islands by accepting only if the value of KAB_KOTA is NOT KEPULAUAN SERIBU
bd_jakarta <- filter(bd_jakarta, KAB_KOTA != "KEPULAUAN SERIBU")

3.5 Retaining first 9 fields of bd_jakarta

In addition, the assignment requires us to retain only the fields relevant to our analysis - in other words, the first 9 fields in the bd_jakarta data frame.

Show code
# retains only the first 9 fields (the sf geometry column is 'sticky' and is kept automatically)
bd_jakarta <- bd_jakarta[, 1:9]

3.6 Renaming Columns with Translation

Lastly, let’s translate the column names of bd_jakarta into English for ease of comprehension:

Show code
# with reference to: https://www.codegrepper.com/code-examples/r/rename+column+name+in+r
# renames the columns in the style New_Name = OLD_NAME
bd_jakarta <- bd_jakarta %>% 
  dplyr::rename(
    Object_ID=OBJECT_ID,
    Province=PROVINSI, 
    City=KAB_KOTA, 
    District=KECAMATAN, 
    Village_Code=KODE_DESA, 
    Village=DESA, 
    Sub_District=DESA_KELUR,
    Code=KODE, 
    Total_Population=JUMLAH_PEN
    )

3.7 Brief EDA

Before we move on to importing the aspatial data and into the meat of our geovisualisation, it’s important to get a feel for the data that we’re working with, especially on the sub-district level. As such, let’s take a quick glimpse:

Show code
# reveals the data type of all fields + some values
glimpse(bd_jakarta)
Rows: 261
Columns: 10
$ Object_ID        <dbl> 25477, 25478, 25397, 25400, 25390, 25391, 2~
$ Village_Code     <chr> "3173031006", "3173031007", "3171031003", "~
$ Village          <chr> "KEAGUNGAN", "GLODOK", "HARAPAN MULIA", "CE~
$ Code             <dbl> 317303, 317303, 317103, 317103, 317102, 317~
$ Province         <chr> "DKI JAKARTA", "DKI JAKARTA", "DKI JAKARTA"~
$ City             <chr> "JAKARTA BARAT", "JAKARTA BARAT", "JAKARTA ~
$ District         <chr> "TAMAN SARI", "TAMAN SARI", "KEMAYORAN", "K~
$ Sub_District     <chr> "KEAGUNGAN", "GLODOK", "HARAPAN MULIA", "CE~
$ Total_Population <dbl> 21609, 9069, 29085, 41913, 15793, 33383, 35~
$ geometry         <MULTIPOLYGON [m]> MULTIPOLYGON (((-3626874 69...~
Show code
# checks for number of unique sub-districts
length(unique(bd_jakarta$"Sub_District"))
[1] 261
Show code
length(unique(bd_jakarta$"District"))
[1] 42

Note that there are 261 unique sub-districts and 42 unique districts. The maximum number of categories for mapping with tmap is 30 - and even though we can raise max.categories in tmap_options() (see the aside after the map below), 261 segmented sections (and even 42 sections) on a single map is too fragmented and minute for human eyes. As such, the only level which is human-readable is the ‘City’ level:

Show code
# shades the map according to the city divisions
tm_shape(bd_jakarta) + 
  tm_polygons("City")

4.0 Data Wrangling: Aspatial Data

4.1 Pre-Importing EDA

In our ‘data/aspatial’ folder, we have multiple .xlsx files, ranging from 31 March 2020 to 31 July 2021. However, before we start compiling all of our data, it’s important to understand what we’re working with and to check for any discrepancies, so let’s perform a brief EDA:

Show code
# reads the 31st March 2020 .xlsx aspatial data and stores it as mar2020
mar2020 <- read_xlsx("data/aspatial/Standar Kelurahan Data Corona (31 Maret 2020 Pukul 08.00).xlsx")
glimpse(mar2020)
Rows: 270
Columns: 18
$ ID_KEL...1           <chr> NA, "BELUM DIKETAHUI", "LUAR DKI JAKART~
$ ID_KEL...2           <chr> NA, "BELUM DIKETAHUI", "LUAR DKI JAKART~
$ Nama_provinsi        <chr> NA, "BELUM DIKETAHUI", "LUAR DKI JAKART~
$ nama_kota            <chr> NA, "BELUM DIKETAHUI", "LUAR DKI JAKART~
$ nama_kecamatan       <chr> NA, "BELUM DIKETAHUI", "LUAR DKI JAKART~
$ nama_kelurahan       <chr> "TOTAL", "BELUM DIKETAHUI", "LUAR DKI J~
$ ODP                  <dbl> 2372, 1005, 274, 56, 37, 26, 24, 17, 17~
$ `Proses Pemantauan`  <dbl> 499, 94, 63, 50, 0, 14, 9, 14, 15, 11, ~
$ `Selesai Pemantauan` <dbl> 1873, 911, 211, 6, 37, 12, 15, 3, 2, 5,~
$ PDP                  <dbl> 1189, 213, 176, 3, 6, 10, 2, 8, 2, 9, 6~
$ `Masih Dirawat`      <dbl> 822, 126, 123, 3, 5, 9, 2, 8, 2, 8, 6, ~
$ `Pulang dan Sehat`   <dbl> 367, 87, 53, 0, 1, 1, 0, 0, 0, 1, 0, 4,~
$ POSITIF              <dbl> 794, 154, 106, 4, 3, 6, 2, 13, 0, 8, 0,~
$ Dirawat              <dbl> 481, 130, 62, 3, 1, 3, 2, 12, 0, 5, 0, ~
$ Sembuh               <dbl> 51, 2, 5, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1,~
$ Meninggal            <dbl> 88, 8, 10, 0, 2, 1, 0, 0, 0, 1, 0, 1, 0~
$ `Self Isolation`     <dbl> 174, 14, 29, 0, 0, 2, 0, 1, 0, 2, 0, 1,~
$ Keterangan           <dbl> 0, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~

glimpse() reveals to us that there is a duplicate ID_KEL column - which our read_xlsx() function auto-repairs by appending a ‘…n’ suffix, giving us the ID_KEL…1 and ID_KEL…2 columns seen above.
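We can demonstrate this name-repair behaviour with tibble(), which shares readxl’s name-repair machinery (a quick illustration, not part of the pipeline):

library(tibble)
# "unique" repair (read_xlsx's default) appends '...n' suffixes to duplicate names
tibble(x = 1, x = 2, .name_repair = "unique")   # columns become x...1 and x...2
# "minimal" repair leaves duplicates untouched - we'll use this in section 4.2
tibble(x = 1, x = 2, .name_repair = "minimal")  # both columns stay 'x'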

This duplication extends from March 2020 to June 2020, after which another duplicate occurs in July 2020:

Show code
# reads the 31st July 2020 .xlsx aspatial data and stores it as jul2020
jul2020 <- read_xlsx("data/aspatial/Standar Kelurahan Data Corona (31 Juli 2020 Pukul 09.00).xlsx")
glimpse(jul2020)
Rows: 270
Columns: 28
$ ID_KEL                  <chr> NA, "3172051003", "3173041007", "317~
$ Nama_provinsi           <chr> NA, "DKI JAKARTA", "DKI JAKARTA", "D~
$ nama_kota               <chr> NA, "JAKARTA UTARA", "JAKARTA BARAT"~
$ nama_kecamatan          <chr> NA, "PADEMANGAN", "TAMBORA", "KRAMAT~
$ nama_kelurahan          <chr> "TOTAL", "ANCOL", "ANGKE", "BALE KAM~
$ SUSPEK                  <dbl> 58603, 164, 202, 74, 34, 111, 91, 63~
$ `Perawatan RS`          <dbl> 1617, 8, 4, 9, 0, 0, 1, 1, 6, 0, 10,~
$ `Isolasi di Rumah...8`  <dbl> 2229, 28, 7, 8, 0, 1, 0, 0, 11, 0, 1~
$ `Suspek Meninggal`      <dbl> 2226, 0, 2, 1, 1, 1, 2, 1, 0, 1626, ~
$ `Selesai Isolasi...10`  <dbl> 52531, 128, 189, 56, 33, 109, 88, 61~
$ PROBABLE                <dbl> 628, 0, 0, 0, 0, 0, 0, 0, 1, 587, 0,~
$ `Probable Meninggal`    <dbl> 590, 0, 0, 0, 0, 0, 0, 0, 1, 586, 0,~
$ `Selesai Isolasi...13`  <dbl> 38, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,~
$ `PELAKU PERJALANAN`     <dbl> 2004, 7, 0, 3, 0, 2, 18, 2, 1, 116, ~
$ `Isolasi di Rumah...15` <dbl> 134, 2, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0~
$ `Selesai Isolasi...16`  <dbl> 1870, 5, 0, 2, 0, 2, 18, 2, 1, 116, ~
$ `KONTAK ERAT`           <dbl> 96920, 311, 296, 177, 32, 146, 376, ~
$ `Isolasi di Rumah...18` <dbl> 10389, 34, 16, 81, 1, 26, 7, 11, 112~
$ `Selesai Isolasi...19`  <dbl> 86531, 277, 280, 96, 31, 120, 369, 6~
$ DISCARDED               <dbl> 5818, 4, 22, 22, 0, 14, 6, 7, 26, 87~
$ Meninggal...21          <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, ~
$ `Selesai Isolasi...22`  <dbl> 5817, 4, 22, 22, 0, 14, 6, 7, 26, 86~
$ POSITIF                 <dbl> 21201, 46, 49, 31, 7, 45, 30, 16, 42~
$ Dirawat                 <dbl> 2183, 5, 6, 3, 1, 3, 5, 1, 0, 1209, ~
$ Sembuh                  <dbl> 13208, 38, 36, 21, 5, 21, 23, 13, 31~
$ Meninggal...26          <dbl> 836, 0, 2, 1, 1, 0, 2, 0, 3, 88, 2, ~
$ `Self Isolation`        <dbl> 4974, 3, 5, 6, 0, 21, 0, 2, 8, 2396,~
$ Keterangan              <dbl> 0, NA, NA, NA, NA, NA, NA, NA, NA, N~

This time, there are duplicates of the Isolasi di Rumah, Selesai Isolasi and Meninggal columns, and this duplication extends till August 2021. While this might seem concerning, we’ll address it in the following sections.

4.2 Creating an Aspatial Data Pre-processing Function

Firstly, let us list down the requirements for our aspatial data:

- all the monthly .xlsx files should be compiled into a single dataframe
- the duplicate columns we spotted in 4.1 should be removed
- a Date column should be added, derived from each file’s name
- only the columns relevant to our analysis should be retained

Our columns of interest (those that are relevant to our analysis) are as follows: ID_KEL (sub-district ID), Nama_provinsi (province name), nama_kota (city name), nama_kecamatan (district name), nama_kelurahan (sub-district name), POSITIF (confirmed cases) and Meninggal (deaths), plus the derived Date column.

Now that we know our requirements, we can do this step-by-step: importing all the files into one df, retaining the necessary columns (and deleting duplicate columns), then adding the date column. Alternatively, we can combine all of these into a function!

Show code
# takes in an aspatial data filepath and returns a processed output
aspatial_preprocess <- function(filepath){
  # the .name_repair="minimal" is to indicate not to replace duplicate column names with 'NAME...n' like we saw above!
  # reference: https://readxl.tidyverse.org/articles/articles/column-names.html
  result_file <- read_xlsx(filepath, .name_repair="minimal")
  
  # Remove Duplicate Columns
  # essentially, it's saying "take the columns that aren't duplicated, and if there are duplicates, take the last duplicate"
  # we use fromLast = TRUE to ensure that we keep the last duplicated column
  # reference: https://www.marsja.se/how-to-remove-duplicates-in-r-rows-columns-dplyr/
  # reference: https://stackoverflow.com/questions/16905425/find-duplicate-values-in-r
  result_file <- result_file[, !duplicated(colnames(result_file), fromLast = TRUE)] 
  
  # Create the Date Column
  # the format of our files is: Standar Kelurahan Data Corona (DD Month YYYY Pukul Ti.me)
  # while the date technically starts right after "(", "(" is a special character in regular expressions (it would need escaping), so we'll match on "Corona" instead. The [[1]] refers to the first element in the returned list.
  # we're loading it as DD-Month-YYYY format
  # the end is 1 space before "Pukul", which means we have to -2 spaces (one for P, one for space)
  # as such, the most relevant functions are substr (returns a substring) and either str_locate (returns location of substring as an integer matrix) or gregexpr (returns a list of locations of substring)
  # reference https://stackoverflow.com/questions/14249562/find-the-location-of-a-character-in-string
  startpoint <- gregexpr(pattern="Corona", filepath)[[1]] + 8
  endpoint <- gregexpr(pattern="Pukul", filepath)[[1]] - 2
  result_file$Date <- substr(filepath, startpoint, endpoint)
  
  # Retain the Relevant Columns
  result_file <- result_file %>% 
    select("Date", 
           "ID_KEL", 
           "Nama_provinsi", 
           "nama_kota", 
           "nama_kecamatan", 
           "nama_kelurahan", 
           "POSITIF", 
           "Meninggal")
  return(result_file)
}

4.3 Feeding Files into our aspatial_preprocess function

Now that we have our custom function to preprocess aspatial data, we need to feed our files into it. There’s the option of doing it manually, inputting our file names line by line - but with the handy functions list.files() and lapply() (which applies a function to every element of a list), the process can be cut down to a few lines!

Show code
# in the folder 'data/aspatial', find files ending with '.xlsx' and add them to fileslist
# full.names=TRUE prepends the directory path to the file names, giving relative file paths - otherwise, only the file names (not the paths) would be returned
# reference: https://stat.ethz.ch/R-manual/R-devel/library/base/html/list.files.html
fileslist <- list.files(path = "data/aspatial", pattern = "\\.xlsx$", full.names = TRUE)

# afterwards, apply our aspatial_preprocess function to every file in fileslist
dflist <- lapply(fileslist, aspatial_preprocess)

Lastly, we’ll need to convert the dflist into an actual dataframe with ldply(), like so:

Show code
cases_jakarta <- ldply(dflist, data.frame)
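As an aside, dplyr::bind_rows() is a tidyverse alternative to plyr’s ldply() that would let us drop the plyr dependency entirely - an equivalent one-liner, assuming every element of dflist shares the same column layout:

# equivalent alternative to ldply() above:
# cases_jakarta <- dplyr::bind_rows(dflist)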

Let’s check what cases_jakarta looks like, and make sure the columns are correct:

Show code
glimpse(cases_jakarta)
Rows: 4,590
Columns: 8
$ Date           <chr> "28 Februari 2021", "28 Februari 2021", "28 F~
$ ID_KEL         <chr> NA, "3172051003", "3173041007", "3175041005",~
$ Nama_provinsi  <chr> NA, "DKI JAKARTA", "DKI JAKARTA", "DKI JAKART~
$ nama_kota      <chr> NA, "JAKARTA UTARA", "JAKARTA BARAT", "JAKART~
$ nama_kecamatan <chr> NA, "PADEMANGAN", "TAMBORA", "KRAMAT JATI", "~
$ nama_kelurahan <chr> "TOTAL", "ANCOL", "ANGKE", "BALE KAMBANG", "B~
$ POSITIF        <dbl> 339735, 834, 617, 755, 358, 870, 811, 1038, 1~
$ Meninggal      <dbl> 5478, 9, 8, 15, 8, 13, 10, 10, 25, 12, 22, 30~

4.4 Formatting Date Column

Since the values in the Date column were derived from substrings, they’re naturally in string format. We should convert them into dates, keeping in mind that the values in Date are in Bahasa Indonesia:

Show code
# parses the 'Date' column into Month(Full Name)-YYYY datetime objects
# reference: https://stackoverflow.com/questions/53380650/b-y-date-conversion-gives-na

# locale="ind" sets the locale to Indonesian, so that Indonesian month names can be parsed
Sys.setlocale(locale="ind")
[1] "LC_COLLATE=Indonesian_Indonesia.1252;LC_CTYPE=Indonesian_Indonesia.1252;LC_MONETARY=Indonesian_Indonesia.1252;LC_NUMERIC=C;LC_TIME=Indonesian_Indonesia.1252"
Show code
cases_jakarta$Date <- as.Date(cases_jakarta$Date, format = "%d %B %Y")

glimpse(cases_jakarta)
Rows: 4,590
Columns: 8
$ Date           <date> 2021-02-28, 2021-02-28, 2021-02-28, 2021-02-~
$ ID_KEL         <chr> NA, "3172051003", "3173041007", "3175041005",~
$ Nama_provinsi  <chr> NA, "DKI JAKARTA", "DKI JAKARTA", "DKI JAKART~
$ nama_kota      <chr> NA, "JAKARTA UTARA", "JAKARTA BARAT", "JAKART~
$ nama_kecamatan <chr> NA, "PADEMANGAN", "TAMBORA", "KRAMAT JATI", "~
$ nama_kelurahan <chr> "TOTAL", "ANCOL", "ANGKE", "BALE KAMBANG", "B~
$ POSITIF        <dbl> 339735, 834, 617, 755, 358, 870, 811, 1038, 1~
$ Meninggal      <dbl> 5478, 9, 8, 15, 8, 13, 10, 10, 25, 12, 22, 30~

4.5 Renaming Columns with Translation

Just like what we did in 3.6, we’ll translate the column names to English for ease of comprehension:

Show code
# renames the columns in the style New_Name = OLD_NAME
cases_jakarta <- cases_jakarta %>% 
  dplyr::rename(
    Date=Date,
    ID=ID_KEL,
    Province=Nama_provinsi, 
    City=nama_kota, 
    District=nama_kecamatan, 
    Sub_District=nama_kelurahan,
    Cases=POSITIF, 
    Deaths=Meninggal
    )

4.6 Further Data Processing

Now that we have our confirmed dataframe, let’s execute any pre-processing we might have missed. Firstly, let’s check for missing values:

Show code
# returns rows that contain NA values
cases_jakarta[rowSums(is.na(cases_jakarta))!=0,]
           Date   ID Province City District Sub_District  Cases
1    2021-02-28 <NA>     <NA> <NA>     <NA>        TOTAL 339735
271  2020-04-30 <NA>     <NA> <NA>     <NA>        TOTAL   4138
541  2021-04-30 <NA>     <NA> <NA>     <NA>        TOTAL 408620
811  2021-01-30 <NA>     <NA> <NA>     <NA>        TOTAL 266244
1081 2020-06-30 <NA>     <NA> <NA>     <NA>        TOTAL  11276
1351 2021-06-30 <NA>     <NA> <NA>     <NA>        TOTAL 543468
1621 2020-11-30 <NA>     <NA> <NA>     <NA>        TOTAL 136861
1891 2020-09-30 <NA>     <NA> <NA>     <NA>        TOTAL  74368
2161 2020-08-31 <NA>     <NA> <NA>     <NA>         <NA>  40309
2431 2020-12-31 <NA>     <NA> <NA>     <NA>        TOTAL 183735
2701 2020-07-31 <NA>     <NA> <NA>     <NA>        TOTAL  21201
2971 2021-07-31 <NA>     <NA> <NA>     <NA>        TOTAL 814653
3241 2020-03-31 <NA>     <NA> <NA>     <NA>        TOTAL    794
3511 2021-03-31 <NA>     <NA> <NA>     <NA>        TOTAL 382055
3781 2020-05-31 <NA>     <NA> <NA>     <NA>        TOTAL   7272
4051 2021-05-31 <NA>     <NA> <NA>     <NA>        TOTAL 430059
4321 2020-10-31 <NA>     <NA> <NA>     <NA>        TOTAL 105597
     Deaths
1      5478
271     381
541    6733
811    4254
1081    641
1351   8486
1621   2671
1891   1731
2161   1202
2431   3287
2701    836
2971  12135
3241     88
3511   6341
3781    520
4051   7327
4321   2255

Since those rows have missing values from their ID all the way to their sub-districts, they should be removed accordingly:

Show code
# na.omit() removes rows that contain an NA value in any column
# here, those are exactly the province-wide TOTAL rows above, which lack an ID
cases_jakarta <- na.omit(cases_jakarta)

In addition, a brief inspection of cases_jakarta reveals abnormal values in the ID column: specifically, BELUM DIKETAHUI (‘not yet known’), PROSES UPDATE DATA (‘data update in progress’) and LUAR DKI JAKARTA (‘outside DKI Jakarta’). Let’s remove them:

Show code
# removes the rows where ID is BELUM DIKETAHUI, PROSES UPDATE DATA and LUAR DKI JAKARTA
cases_jakarta <- cases_jakarta[!(cases_jakarta$ID=="BELUM DIKETAHUI" | 
                                   cases_jakarta$ID=="PROSES UPDATE DATA" | 
                                   cases_jakarta$ID=="LUAR DKI JAKARTA") ,]

Lastly, we want to remove any records from the outer islands - that is, where the City is KAB.ADM.KEP.SERIBU:

Show code
# filters out the outer islands by accepting only if the value of City is NOT KAB.ADM.KEP.SERIBU
cases_jakarta <- filter(cases_jakarta, City != "KAB.ADM.KEP.SERIBU")

With that, we’re done with the data importing and pre-processing! Let’s move on into the next section 💪

5.0 Geospatial Data Integration

5.1 Preliminary joining + EDA

Now that we have both the geospatial and aspatial data frames, we’ll need to join them. A quick look at their headers tells us what their common fields are:

Show code
# checks for column names of the dataframes
colnames(bd_jakarta)
 [1] "Object_ID"        "Village_Code"     "Village"         
 [4] "Code"             "Province"         "City"            
 [7] "District"         "Sub_District"     "Total_Population"
[10] "geometry"        
Show code
colnames(cases_jakarta)
[1] "Date"         "ID"           "Province"     "City"        
[5] "District"     "Sub_District" "Cases"        "Deaths"      

It seems that the Province, City, District and Sub_District fields should match up. Let’s try joining on those first:

Show code
# joins cases_jakarta to bd_jakarta based on Province, Sub_District and City
combined_jakarta <- left_join(bd_jakarta, cases_jakarta,
                              by=c(
                                "Province"="Province", 
                                "Sub_District"="Sub_District",
                                "City"="City")
                              )

Now, let’s visualise our current combined_jakarta in terms of cases and deaths:

Show code
# maps the Cases and Deaths side-by-side
prelim_cases = tm_shape(combined_jakarta)+
  tm_fill("Cases") +
  tm_borders(alpha = 0.5) +
  tm_layout(main.title="Preliminary Cases Count")

prelim_deaths = tm_shape(combined_jakarta)+
  tm_fill("Deaths") +
  tm_borders(alpha = 0.5) +
  tm_layout(main.title="Preliminary Death Count")

tmap_arrange(prelim_cases, prelim_deaths)

Hmm, that’s not quite right. There are still some ‘missing’ values, even though we’ve taken care to remove NA records from both sets of data. The most likely reason for this is mismatched records (one says Berenstein while the other says Berenstain, for example) - which requires additional tweaking and investigation.

5.2 Identifying Mismatched Sub-District Records

A brief comparison of the geospatial vs aspatial values reveals the mismatch: the sub-district names aren’t standardised across the two sources, so the same sub-district can be referred to by slightly different names, such as KRAMATJATI vs KRAMAT JATI. This is why these records appear as ‘missing values’. To find all of these discrepancies, we should look for sub-district names that appear in cases_jakarta but not in bd_jakarta, and vice versa:

Show code
# checks for unique values of Sub_District in cases_jakarta that aren't already present in bd_jakarta and vice versa
cases_subdistrict <- c(cases_jakarta$Sub_District)
bd_subdistrict <- c(bd_jakarta$Sub_District)

unique(cases_subdistrict[!(cases_subdistrict %in% bd_subdistrict)])
 [1] "BALE KAMBANG"          "HALIM PERDANA KUSUMAH"
 [3] "JATI PULO"             "KALI BARU"            
 [5] "KAMPUNG TENGAH"        "KERENDANG"            
 [7] "KRAMAT JATI"           "PAL MERIAM"           
 [9] "PINANG RANTI"          "RAWA JATI"            
Show code
unique(bd_subdistrict[!(bd_subdistrict %in% cases_subdistrict)])
 [1] "KALIBARU"             "KRENDANG"            
 [3] "RAWAJATI"             "TENGAH"              
 [5] "BALEKAMBANG"          "PINANGRANTI"         
 [7] "JATIPULO"             "PALMERIAM"           
 [9] "KRAMATJATI"           "HALIM PERDANA KUSUMA"
Show code
# initialise a dataframe of our cases vs bd subdistrict spelling
spelling <- data.frame(
  Aspatial_Cases=c("BALE KAMBANG", "HALIM PERDANA KUSUMAH", "JATI PULO", "KALI BARU", "KAMPUNG TENGAH", "KRAMAT JATI", "KERENDANG", "PAL MERIAM", "PINANG RANTI", "RAWA JATI"),
  Geospatial_BD=c("BALEKAMBANG", "HALIM PERDANA KUSUMA", "JATIPULO", "KALIBARU", "TENGAH", "KRAMATJATI", "KRENDANG", "PALMERIAM", "PINANGRANTI", "RAWAJATI")
  )

# with the dataframe as input, outputs a kable
library(knitr)
library(kableExtra)
kable(spelling, caption="Mismatched Records") %>%
  kable_material("hover", latex_options="scale_down")
Table 2: Mismatched Records
Aspatial_Cases Geospatial_BD
BALE KAMBANG BALEKAMBANG
HALIM PERDANA KUSUMAH HALIM PERDANA KUSUMA
JATI PULO JATIPULO
KALI BARU KALIBARU
KAMPUNG TENGAH TENGAH
KRAMAT JATI KRAMATJATI
KERENDANG KRENDANG
PAL MERIAM PALMERIAM
PINANG RANTI PINANGRANTI
RAWA JATI RAWAJATI

5.3 Correcting Mismatched Sub-District Records

Now that we know which sub-district records are mismatched, we need to rectify the mismatches by renaming them:

Show code
# where bd_jakarta is a mismatched value, replace with the correct value
bd_jakarta$Sub_District[bd_jakarta$Sub_District == 'BALEKAMBANG'] <- 'BALE KAMBANG'
bd_jakarta$Sub_District[bd_jakarta$Sub_District == 'HALIM PERDANA KUSUMA'] <- 'HALIM PERDANA KUSUMAH'
bd_jakarta$Sub_District[bd_jakarta$Sub_District == 'JATIPULO'] <- 'JATI PULO'
bd_jakarta$Sub_District[bd_jakarta$Sub_District == 'KALIBARU'] <- 'KALI BARU'
bd_jakarta$Sub_District[bd_jakarta$Sub_District == 'TENGAH'] <- 'KAMPUNG TENGAH'
bd_jakarta$Sub_District[bd_jakarta$Sub_District == 'KRAMATJATI'] <- 'KRAMAT JATI'
bd_jakarta$Sub_District[bd_jakarta$Sub_District == 'KRENDANG'] <- 'KERENDANG'
bd_jakarta$Sub_District[bd_jakarta$Sub_District == 'PALMERIAM'] <- 'PAL MERIAM'
bd_jakarta$Sub_District[bd_jakarta$Sub_District == 'PINANGRANTI'] <- 'PINANG RANTI'
bd_jakarta$Sub_District[bd_jakarta$Sub_District == 'RAWAJATI'] <- 'RAWA JATI'
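As an aside, the same corrections can be written more compactly with a named lookup vector and dplyr::recode() - a sketch equivalent to the block above:

# names are the old (geospatial) spellings, values are the corrected (aspatial) ones
spelling_fixes <- c(
  "BALEKAMBANG" = "BALE KAMBANG",
  "HALIM PERDANA KUSUMA" = "HALIM PERDANA KUSUMAH",
  "JATIPULO" = "JATI PULO",
  "KALIBARU" = "KALI BARU",
  "TENGAH" = "KAMPUNG TENGAH",
  "KRAMATJATI" = "KRAMAT JATI",
  "KRENDANG" = "KERENDANG",
  "PALMERIAM" = "PAL MERIAM",
  "PINANGRANTI" = "PINANG RANTI",
  "RAWAJATI" = "RAWA JATI"
)
# replaces matching values and leaves all other sub-district names untouched
# bd_jakarta$Sub_District <- dplyr::recode(bd_jakarta$Sub_District, !!!spelling_fixes)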

5.4 Joining + EDA

Now, we have a standardised common identifier among our geospatial and aspatial dataframes. Let’s join them once more:

Show code
# joins cases_jakarta to bd_jakarta based on Sub_District
combined_jakarta <- left_join(bd_jakarta, cases_jakarta,
                              by=c("Sub_District"="Sub_District")
                              )

Now, let’s once again visualise our updated combined_jakarta in terms of cases and deaths:

Show code
# maps the Cases and Deaths side-by-side
updated_cases = tm_shape(combined_jakarta)+
  tm_fill("Cases") +
  tm_borders(alpha = 0.5) +
  tm_layout(main.title="Updated Cases Count")

updated_deaths = tm_shape(combined_jakarta)+
  tm_fill("Deaths") +
  tm_borders(alpha = 0.5) +
  tm_layout(main.title="Updated Death Count")

tmap_arrange(updated_cases, updated_deaths)

5.5 Calculations

Before we move into EDA and thematic mapping, we need to calculate the cumulative confirmed cases rate (i.e. cases per 10,000 population) and the cumulative death rate by month, as per assignment requirements.

5.5.1 Cumulative Confirmed Cases Rate

The cumulative confirmed cases rate can be calculated by summing up the number of cases (grouped by sub-district and date), dividing by the total population, and multiplying by 10,000 - in other words, the number of cases for every 10,000 people. Afterwards, the table can easily be reshaped with pivot_wider() so that each month becomes its own column.

Show code
# grouping based on the sub-district and date
# the cumulative_case_rate is based on the sum of cases over the total population
cases_rate <- cases_jakarta %>%
  inner_join(bd_jakarta, by=c("Sub_District" = "Sub_District")) %>%
  group_by(Sub_District, Date) %>%
  dplyr::summarise(`cumulative_case_rate` = ((sum(Cases)/(Total_Population))*10000)) %>%
  
  #afterwards, pivots the table based on the Dates, using the cumulative case rate as the values
  ungroup() %>% pivot_wider(names_from = Date,
              values_from = cumulative_case_rate)

Our cases_rate dataframe now has one row per sub-district, with one column of cumulative case rates for each month.

5.5.2 Cumulative Death Rate

Assumption: Cumulative Death Rate refers to the total number of deaths, divided by population. For total number of deaths over confirmed cases, also known as the Case Fatality Ratio, please refer to section 5.5.3 below!

Show code
# grouping based on the sub-district and date
# the cumulative_death_rate is based on the sum of deaths over the total population
death_rate <- cases_jakarta %>%
  inner_join(bd_jakarta, by=c("Sub_District" = "Sub_District")) %>%
  group_by(Sub_District, Date) %>%
  dplyr::summarise(`cumulative_death_rate` = ((sum(Deaths)/(Total_Population))*10000)) %>%
  
  #afterwards, pivots the table based on the Dates, showing the cumulative death rate
  ungroup() %>% pivot_wider(names_from = Date,
              values_from = cumulative_death_rate)

5.5.3 Case Fatality Ratio

The Case Fatality Ratio, also known as the CFR, is the proportion of individuals diagnosed with a disease who die from that disease. Within this context, it refers to the proportion of people who contracted COVID-19 and passed away. The CFR is used as a measure of severity among detected cases.

The Case Fatality Ratio can be represented as such: \[ \text{Case Fatality Ratio (CFR, in \%)} = \frac{\text{Number of deaths from disease}}{\text{Number of confirmed cases of disease}} \times 100 \]
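As a quick sanity check using the province-wide totals for 31st March 2020 seen in section 4.6 (794 confirmed cases, 88 deaths): 88/794 × 100 ≈ 11.1%.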

Show code
# grouping based on the sub-district and date
# the case_fatality_ratio is based on the sum of deaths over confirmed cases
# note that the CFR is in % form, which is why we *100
CFR <- cases_jakarta %>%
  inner_join(bd_jakarta, by=c("Sub_District" = "Sub_District")) %>%
  group_by(Sub_District, Date) %>%
  dplyr::summarise(`case_fatality_ratio` = ((sum(Deaths)/sum(Cases))*100)) %>%
  
  #afterwards, pivots the table based on the Dates, showing the case fatality ratio
  ungroup() %>% pivot_wider(names_from = Date,
              values_from = case_fatality_ratio)

Note that for the CFR, there are some NaN values on account of the number of confirmed cases (our denominator) being zero. To accurately reflect the CFR, we will not remove those rows or transform the NaN values to 0: on a map, they should appear as ‘missing values’.
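To see why, recall that dividing zero by zero yields NaN in R - a minimal illustration of a sub-district-month with 0 deaths out of 0 confirmed cases:

(0/0) * 100
[1] NaN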

5.6 Converting dataframes to sf objects

Before we move on into the mapping, we should convert these dataframes into sf objects.

Show code
combined_jakarta <- st_as_sf(combined_jakarta)

# need to join our previous dataframes with the geospatial data to ensure that geometry column is present
cases_rate <- cases_rate%>% left_join(bd_jakarta, by=c("Sub_District"="Sub_District"))
cases_rate <- st_as_sf(cases_rate)

death_rate <- death_rate%>% left_join(bd_jakarta, by=c("Sub_District"="Sub_District"))
death_rate <- st_as_sf(death_rate)

CFR <- CFR%>% left_join(bd_jakarta, by=c("Sub_District"="Sub_District"))
CFR <- st_as_sf(CFR)

6.0 Mapping: Monthly Cumulative Cases Rate

6.1 Jenks Choropleth Map

I’m planning to use the jenks classification method. It ‘seeks to minimize the average deviation from the class mean while maximizing the deviation from the means of the other groups’ (source) and tends to identify real, natural classes within the data. However, jenks will not work as well if the data has a low variance, so let’s check the variance:

Show code
# intermediate months commented out for brevity: the variance increases steadily over time
var(cases_rate$`2020-03-31`)
[1] 10.14244
Show code
# var(cases_rate$`2020-04-30`)
# var(cases_rate$`2020-05-31`)
# var(cases_rate$`2020-06-30`)
# var(cases_rate$`2020-07-31`)
# var(cases_rate$`2020-08-31`)
# var(cases_rate$`2020-09-30`)
# var(cases_rate$`2020-10-31`)
# var(cases_rate$`2020-11-30`)
# var(cases_rate$`2020-12-31`)
# var(cases_rate$`2021-01-30`)
# var(cases_rate$`2021-02-28`)
# var(cases_rate$`2021-03-31`)
# var(cases_rate$`2021-04-30`)
# var(cases_rate$`2021-05-31`)
# var(cases_rate$`2021-06-30`)
var(cases_rate$`2021-07-31`)
[1] 105409.9

Since the variance increases over time and is substantial, we can continue with jenks classification. After some testing, 6 classes seems optimal: with too many classes it becomes hard for the human eye to differentiate between the gradients, while too few classes makes it hard for any differentiation to be seen.

Show code
library(tmap)

# using the jenks method, with 6 classes
tmap_mode("plot")
tm_shape(cases_rate)+
  tm_fill("2020-03-31", 
          n= 6,
          style = "jenks", 
          title = "Case Rate") +
  tm_layout(main.title = "Distribution of COVID-19 Case Rate in March 2020",
            main.title.position = "center",
            main.title.size = 1,
            legend.height = 0.5, 
            legend.width = 0.4,
            frame = TRUE) +
  tm_borders(alpha = 0.5)

We should plot it for all the months - so let’s have a helper function for plotting this!

Show code
# input: the dataframe and the variable name - in this case, the month 
# with style="jenks" for the jenks classification method
jenks_plot <- function(df, varname) {
  tm_shape(cases_rate) +
    tm_polygons() +
  tm_shape(df) +
    tm_fill(varname, 
          n= 6,
          style = "jenks", 
          title = "Case Rate") +
    tm_layout(main.title = varname,
          main.title.position = "center",
          main.title.size = 1.2,
          legend.height = 0.45, 
          legend.width = 0.35,
          frame = TRUE) +
    tm_borders(alpha = 0.5)
}

Now, let’s visualise the jenks plots for all months:

Show code
# split it up into multiple arranges to make it easier to see
library(tmap)
tmap_mode("plot")
tmap_arrange(jenks_plot(cases_rate, "2020-03-31"),
             jenks_plot(cases_rate, "2020-04-30"),
             jenks_plot(cases_rate, "2020-05-31"),
             jenks_plot(cases_rate, "2020-06-30"))
Show code
tmap_arrange(jenks_plot(cases_rate, "2020-07-31"),
             jenks_plot(cases_rate, "2020-08-31"),
             jenks_plot(cases_rate, "2020-09-30"),
             jenks_plot(cases_rate, "2020-10-31"))
Show code
tmap_arrange(jenks_plot(cases_rate, "2020-11-30"),
             jenks_plot(cases_rate, "2020-12-31"),
             jenks_plot(cases_rate, "2021-01-30"),
             jenks_plot(cases_rate, "2021-02-28"))
Show code
tmap_arrange(jenks_plot(cases_rate, "2021-03-31"),
             jenks_plot(cases_rate, "2021-04-30"),
             jenks_plot(cases_rate, "2021-05-31"),
             jenks_plot(cases_rate, "2021-06-30"))
Show code
tmap_arrange(jenks_plot(cases_rate, "2021-07-31"))

6.2 Observations from Jenks Choropleth Map

Do note that each map has its own relative case rate: the ranges gradually grow larger over time with the greater influx of cases. By comparing the change of case rates over the months, there are a number of observations we can make:

Let us check for the sub-districts with the highest cases rate at the various stages:

Show code
# to check for darkest sub-district in early stages
cases_rate$Village[which.max(cases_rate$`2020-03-31`)]
[1] "SENAYAN"
Show code
# to check for darkest sub-district in the later stages
cases_rate$Village[which.max(cases_rate$`2020-08-31`)]
[1] "GAMBIR"
Show code
# reorder based on decreasing values for Feb 2021, and assign to a new df
cases_rate_descending <- cases_rate[order(cases_rate$`2021-02-28`,decreasing = TRUE),]
cases_rate_descending_top5 <- cases_rate_descending[1:5,]
cases_rate_descending_top5[,"Sub_District"]
Simple feature collection with 5 features and 1 field
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -3630194 ymin: 679978.2 xmax: -3623456 ymax: 690083.2
Projected CRS: DGN95 / Indonesia TM-3 zone 54.1
# A tibble: 5 x 2
  Sub_District                                                geometry
  <chr>                                             <MULTIPOLYGON [m]>
1 GAMBIR         (((-3624851 689984.3, -3624853 689959.1, -3624852 68~
2 SENAYAN        (((-3626081 682271.9, -3626092 682263.2, -3626128 68~
3 KUNINGAN TIMUR (((-3623728 682537.6, -3623715 682486.2, -3623685 68~
4 GUNUNG         (((-3628388 681746.1, -3628389 681733.1, -3628387 68~
5 SETIA BUDI     (((-3624919 685319, -3624920 685274.2, -3624915 6852~

As we can see, the top 5 sub-districts with the highest cases rate (as of February 2021) are GAMBIR, SENAYAN, KUNINGAN TIMUR, GUNUNG and SETIA BUDI.

6.3 Spatio-Temporal Mapping with custom breakpoints

In the above section, we saw each map by their relative case rate - but to see the spatio-temporal progression of case rates, we should set a fixed range. In this case, customising our breakpoints would be ideal - just like above, we’ll be breaking into 6 classes.

Firstly, let’s find the maximum of the case_rates - since the highest case rate is logically in the latest month (July 2021), we’ll find the maximum case_rate from there:

Show code
# gets maximum value (case rate) of July 2021 from cases_rate df
max(cases_rate$`2021-07-31`)
[1] 3808.29

Now that we know our maximum number, let’s set our breakpoints:

Show code
# these breakpoints are chosen based on the jenks breaks ranges from the previous section
breakpoints = c(0, 4, 20, 49, 112, 550, 3808)

This time, let’s test it out on July 2021:

Show code
library(tmap)
tmap_mode("plot")
tm_shape(cases_rate)+
  tm_fill("2021-07-31", 
          breaks = breakpoints,
          title = "Case Rate") +
  tm_layout(main.title = "Distribution of COVID-19 Case Rate in July 2021 (Spatio-Temporal)",
            main.title.position = "center",
            main.title.size = 1,
            legend.height = 0.5, 
            legend.width = 0.4,
            frame = TRUE) +
  tm_borders(alpha = 0.5)

It might seem pretty skewed towards the ‘heavier’ end, but this is to be expected: after all, this is the end of our dataset, by which point COVID-19 has had over a year of presence in Jakarta. Once again, let’s create a helper function:

Show code
break_plot <- function(df, varname) {
  tm_shape(cases_rate) +
    tm_polygons() +
  tm_shape(df) +
    tm_fill(varname, 
          breaks= c(0, 4, 20, 49, 112, 550, 3808),
          title = "Case Rate") +
    tm_layout(main.title = varname,
          main.title.position = "center",
          main.title.size = 1.2,
          legend.height = 0.45, 
          legend.width = 0.35,
          frame = TRUE) +
    tm_borders(alpha = 0.5)
}

Now, let’s visualise the custom break plots for all months:

Show code
# split it up into multiple arranges to make it easier to see
library(tmap)
tmap_mode("plot")
tmap_arrange(break_plot(cases_rate, "2020-03-31"),
             break_plot(cases_rate, "2020-04-30"),
             break_plot(cases_rate, "2020-05-31"),
             break_plot(cases_rate, "2020-06-30"))
Show code
tmap_arrange(break_plot(cases_rate, "2020-07-31"),
             break_plot(cases_rate, "2020-08-31"),
             break_plot(cases_rate, "2020-09-30"),
             break_plot(cases_rate, "2020-10-31"))
Show code
tmap_arrange(break_plot(cases_rate, "2020-11-30"),
             break_plot(cases_rate, "2020-12-31"),
             break_plot(cases_rate, "2021-01-30"),
             break_plot(cases_rate, "2021-02-28"))
Show code
tmap_arrange(break_plot(cases_rate, "2021-03-31"),
             break_plot(cases_rate, "2021-04-30"),
             break_plot(cases_rate, "2021-05-31"),
             break_plot(cases_rate, "2021-06-30"))
Show code
tmap_arrange(break_plot(cases_rate, "2021-07-31"))

6.4 Observations from Spatio-Temporal Map

Made by exporting .pngs of individual plots into an online gifmaker
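As an aside, the GIF could also be produced entirely within R - a sketch assuming the magick package is installed (file paths here are illustrative):

# save each monthly plot as a .png (repeat for every month in the study period)
tmap_save(break_plot(cases_rate, "2020-03-31"), "plots/2020-03-31.png")
# then read the frames back in and stitch them into an animation
library(magick)
frames <- image_read(list.files("plots", pattern = "\\.png$", full.names = TRUE))
image_write(image_animate(frames, fps = 1), "case_rate_spatiotemporal.gif")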

Now, we can supplement our observations in 6.2 with a spatio-temporal dimension! Here are the observations made:

7.0 Mapping: Monthly Cumulative Death Rate

7.1 Jenks Choropleth Map

Like in our monthly cumulative cases in 6.1, we’ll be using the jenks classification method for our choropleth map, but we need to check the variance first:

Show code
var(death_rate$`2020-03-31`)
[1] 0.07458528
Show code
var(death_rate$`2020-04-30`)
[1] 0.3333565
Show code
var(death_rate$`2020-05-31`)
[1] 0.387508
Show code
var(death_rate$`2020-06-30`)
[1] 0.4573607
Show code
var(death_rate$`2020-07-31`)
[1] 0.5275186
Show code
var(death_rate$`2020-08-31`)
[1] 0.7535532
Show code
var(death_rate$`2020-09-30`)
[1] 0.990763
Show code
var(death_rate$`2020-10-31`)
[1] 1.156742
Show code
var(death_rate$`2020-11-30`)
[1] 1.313623
Show code
var(death_rate$`2020-12-31`)
[1] 1.787521
Show code
var(death_rate$`2021-01-30`)
[1] 3.001775
Show code
var(death_rate$`2021-02-28`)
[1] 4.046581
Show code
var(death_rate$`2021-03-31`)
[1] 5.269923
Show code
var(death_rate$`2021-04-30`)
[1] 5.513535
Show code
var(death_rate$`2021-05-31`)
[1] 6.851983
Show code
var(death_rate$`2021-06-30`)
[1] 7.674798
Show code
var(death_rate$`2021-07-31`)
[1] 13.12748

The low variance in the initial stages might seem concerning, but it increases over time to a significant level for the majority of the data, so we’ll continue with this approach. In addition, we’ll be using custom breakpoints in the next section for our spatio-temporal analysis, which will supplement our observations.

We’ll need to tweak our jenks_plot helper function a bit, changing the legend title and palette:

Show code
# input: the dataframe and the variable name - in this case, the month 
# with style="jenks" for the jenks classification method
# minor tweak to tm_fill - changed title and palette
jenks_plot <- function(df, varname) {
  tm_shape(cases_rate) +
    tm_polygons() +
  tm_shape(df) +
    tm_fill(varname, 
          n= 6,
          style = "jenks", 
          title = "Death Rate",
          palette="Blues") +
    tm_layout(main.title = varname,
          main.title.position = "center",
          main.title.size = 1.2,
          legend.height = 0.45, 
          legend.width = 0.35,
          frame = TRUE) +
    tm_borders(alpha = 0.5)
}

Now, let’s visualise the jenks plots for all months:

Show code
# split it up into multiple arranges to make it easier to see
library(tmap)
tmap_mode("plot")
tmap_arrange(jenks_plot(death_rate, "2020-03-31"),
             jenks_plot(death_rate, "2020-04-30"),
             jenks_plot(death_rate, "2020-05-31"),
             jenks_plot(death_rate, "2020-06-30"))
Show code
tmap_arrange(jenks_plot(death_rate, "2020-07-31"),
             jenks_plot(death_rate, "2020-08-31"),
             jenks_plot(death_rate, "2020-09-30"),
             jenks_plot(death_rate, "2020-10-31"))
Show code
tmap_arrange(jenks_plot(death_rate, "2020-11-30"),
             jenks_plot(death_rate, "2020-12-31"),
             jenks_plot(death_rate, "2021-01-30"),
             jenks_plot(death_rate, "2021-02-28"))
Show code
tmap_arrange(jenks_plot(death_rate, "2021-03-31"),
             jenks_plot(death_rate, "2021-04-30"),
             jenks_plot(death_rate, "2021-05-31"),
             jenks_plot(death_rate, "2021-06-30"))
Show code
tmap_arrange(jenks_plot(death_rate, "2021-07-31"))

7.2 Observations from Jenks Choropleth Map

Like before, do note that each map has its own relative death rate: the ranges gradually grow larger over time with the greater influx of deaths. By comparing the change of death rates over the months, there are a number of observations we can make:

Let us check for the sub-districts with the highest death rate at the various stages:

Show code
# to check for darkest sub-district in early stages
death_rate$Village[which.max(death_rate$`2020-03-31`)]
[1] "KARET SEMANGGI"
Show code
# to check for darkest sub-district in the later stages
death_rate$Village[which.max(death_rate$`2021-02-28`)]
[1] "GAMBIR"
Show code
# reorder based on decreasing values for June 2021, and assign to a new df
death_rate_descending <- death_rate[order(death_rate$`2021-06-30`,decreasing = TRUE),]
death_rate_descending_top5 <- death_rate_descending[1:5,]
death_rate_descending_top5[,"Sub_District"]
Simple feature collection with 5 features and 1 field
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -3630194 ymin: 679978.2 xmax: -3623648 ymax: 690083.2
Projected CRS: DGN95 / Indonesia TM-3 zone 54.1
# A tibble: 5 x 2
  Sub_District                                                geometry
  <chr>                                             <MULTIPOLYGON [m]>
1 GAMBIR         (((-3624851 689984.3, -3624853 689959.1, -3624852 68~
2 SELONG         (((-3626336 681381.3, -3626335 681357.8, -3626355 68~
3 GUNUNG         (((-3628388 681746.1, -3628389 681733.1, -3628387 68~
4 KARET KUNINGAN (((-3624392 683871, -3624371 683747.4, -3624220 6837~
5 SETIA BUDI     (((-3624919 685319, -3624920 685274.2, -3624915 6852~

As we can see, the top 5 sub-districts with the highest death rate (as of June 2021) are GAMBIR, SELONG, GUNUNG, KARET KUNINGAN and SETIA BUDI.

Interestingly, despite SENAYAN having one of the highest case rates, it does not rank as highly in the death rate. This could indicate two things: either (a) SENAYAN’s local government has pushed out and enforced successful measures involving the local community (or distribution of aid), which has helped COVID-19-affected patients recover, or (b) the case rate is not as tightly coupled to the death rate as we initially thought. We do not have enough information to vouch for either, but it would be worthwhile for local governments at the district or provincial level to survey and check on the measures in place in SENAYAN.

7.3 Monthly Cumulative Deaths - Spatio-Temporal Mapping

In the above section, we saw each map by their relative death rate - but to see the spatio-temporal progression of death rates, we should set a fixed range. Just like for our cumulative case rate in 6.3, we should customise our breakpoints and break into 6 classes.

Firstly, let’s find the maximum of the death_rate - with reference to the previous section’s graphs, the highest death rate is in the latest month (July 2021), so we’ll find the maximum death_rate from there:

Show code
# gets maximum value (death rate) of July 2021 from the death_rate df
max(death_rate$`2021-07-31`)
[1] 42.09845

Now that we know our maximum number, let’s set our breakpoints:

Show code
# these breakpoints are chosen based on the jenks breaks ranges from the previous section
breakpoints = c(0, 0.4, 1.3, 3.4, 6.5, 12, 42)
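
As an aside, rather than reading these ranges off the previous section’s legends, they could also be derived programmatically - a sketch assuming the classInt package, which tmap itself relies on for style = "jenks":

Show code
# a sketch (not used above): derive the 6-class jenks breaks for July 2021
# directly with classInt, then round them off as the breakpoints
library(classInt)
classIntervals(death_rate$`2021-07-31`, n = 6, style = "jenks")$brks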

Once again, let’s tweak our previous helper function:

Show code
# minor tweak to tm_fill - fixed breaks, new title and palette
break_plot <- function(df, varname) {
  tm_shape(cases_rate) +
    tm_polygons() +
  tm_shape(df) +
    tm_fill(varname, 
          breaks = breakpoints,   # the fixed breakpoints defined above
          title = "Death Rate",
          palette = "Blues") +
    tm_layout(main.title = varname,
          main.title.position = "center",
          main.title.size = 1.2,
          legend.height = 0.45, 
          legend.width = 0.35,
          frame = TRUE) +
    tm_borders(alpha = 0.5)
}

Now, let’s visualise the custom break plots for all months:

Show code
# split it up into multiple arranges to make it easier to see
library(tmap)
tmap_mode("plot")
tmap_arrange(break_plot(death_rate, "2020-03-31"),
             break_plot(death_rate, "2020-04-30"),
             break_plot(death_rate, "2020-05-31"),
             break_plot(death_rate, "2020-06-30"))
Show code
tmap_arrange(break_plot(death_rate, "2020-07-31"),
             break_plot(death_rate, "2020-08-31"),
             break_plot(death_rate, "2020-09-30"),
             break_plot(death_rate, "2020-10-31"))
Show code
tmap_arrange(break_plot(death_rate, "2020-11-30"),
             break_plot(death_rate, "2020-12-31"),
             break_plot(death_rate, "2021-01-30"),
             break_plot(death_rate, "2021-02-28"))
Show code
tmap_arrange(break_plot(death_rate, "2021-03-31"),
             break_plot(death_rate, "2021-04-30"),
             break_plot(death_rate, "2021-05-31"),
             break_plot(death_rate, "2021-06-30"))
Show code
tmap_arrange(break_plot(death_rate, "2021-07-31"))

7.4 Observations from Spatio-Temporal Map

Let’s supplement our observations in 7.2:

- Similar to the case rate in 6.4, the general trend of the spread seems to (a) move to neighbouring sub-districts and (b) ‘grow from the centre’, albeit more patchily and with less intensity than what we saw in the case rate spatio-temporal maps. The cross-border nature of COVID-19 naturally leads to more cases, which in turn leads to more deaths from COVID-19 complications, thus increasing the deaths-over-population rate.
- In contrast to the case rate maps, and related to the low variances of the initial months, there are paler areas all around for the first few months (March 2020~June 2020) before a marked uptick (and darkening of areas) is observed from July onwards. This is in line with real-world events: the life-threatening danger of COVID-19 was always present, but new and nastier variants introduced as time went on, alongside stretched healthcare resources, likely had a hand in the higher death rates.

8.0 Mapping: Case Fatality Ratio

Onto our final map! Like the previous two, we’ll be using the jenks classification method for our choropleth map - but this time, we won’t be doing a spatio-temporal map, mainly because the CFR is not cumulative and thus the map of one month does not ‘build upon’ the map of the previous.
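
For context, here is a hedged sketch of how such a per-month CFR could be derived. The deaths_monthly and cases_monthly dataframes below are hypothetical stand-ins for tables of new (non-cumulative) monthly counts per sub-district - illustrative names, not objects from earlier in this analysis:

Show code
# illustrative only: deaths_monthly and cases_monthly are hypothetical
# dataframes of *new* monthly counts per sub-district
cfr_sketch <- deaths_monthly
month_cols <- setdiff(names(deaths_monthly), "Sub_District")
for (m in month_cols) {
  # CFR (%) = deaths that month / cases that month * 100; guard against 0 cases
  cfr_sketch[[m]] <- ifelse(cases_monthly[[m]] > 0,
                            deaths_monthly[[m]] / cases_monthly[[m]] * 100,
                            0)
}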

8.1 Jenks Choropleth Map

We’ll need to tweak our jenks_plot helper function a bit to update the legend title and colour palette:

Show code
# input: the dataframe and the variable name - in this case, the month 
# with style="jenks" for the jenks classification method
# minor tweak to tm_fill - changed title and palette
jenks_plot <- function(df, varname) {
  tm_shape(cases_rate) +
    tm_polygons() +
  tm_shape(df) +
    tm_fill(varname, 
          n= 6,
          style = "jenks", 
          title = "CFR",
          palette="Reds") +
    tm_layout(main.title = varname,
          main.title.position = "center",
          main.title.size = 1.2,
          legend.height = 0.45, 
          legend.width = 0.35,
          frame = TRUE) +
    tm_borders(alpha = 0.5)
}

Now, let’s visualise the jenks plots for all months:

Show code
# split it up into multiple arranges to make it easier to see
library(tmap)
tmap_mode("plot")
tmap_arrange(jenks_plot(CFR, "2020-03-31"),
             jenks_plot(CFR, "2020-04-30"),
             jenks_plot(CFR, "2020-05-31"),
             jenks_plot(CFR, "2020-06-30"))
Show code
tmap_arrange(jenks_plot(CFR, "2020-07-31"),
             jenks_plot(CFR, "2020-08-31"),
             jenks_plot(CFR, "2020-09-30"),
             jenks_plot(CFR, "2020-10-31"))
Show code
tmap_arrange(jenks_plot(CFR, "2020-11-30"),
             jenks_plot(CFR, "2020-12-31"),
             jenks_plot(CFR, "2021-01-30"),
             jenks_plot(CFR, "2021-02-28"))
Show code
tmap_arrange(jenks_plot(CFR, "2021-03-31"),
             jenks_plot(CFR, "2021-04-30"),
             jenks_plot(CFR, "2021-05-31"),
             jenks_plot(CFR, "2021-06-30"))
Show code
tmap_arrange(jenks_plot(CFR, "2021-07-31"))

8.2 Observations from Jenks Choropleth Map

By comparing how the CFR changes over the months, there are a number of observations we can make:

9.0 Conclusions

As stated when we first started this analysis, our goal is to both visualise and understand the patterns shown in the data. As such, this section provides some background information on the spread of the cases and their corresponding death rate - and especially on why we see such a violent surge in both the case rate and the death rate.

As an article discussing Indonesia’s COVID-19 Response Strategies points out, Indonesia suffered from poor response times and inefficient strategies for mobilising health resources, an opinion that is reiterated in an article explaining the surge in Indonesia’s cases.

In addition, poor data-sharing between the upper levels of government (country-wide, provincial) and the local governments (district, sub-district) led to a stifling of local government actions: without data or authority, it was hard for them to act, which in turn fed into a lack of community participation. The earlier stages, when it was easiest to suppress the cases, were not taken advantage of, and the situation snowballed into something critical. This is aggravated by Indonesia’s data management, which has 3 major flaws:

Why This Matters

As the COVID-19 pandemic continues, there is an increasing need for transparency and efficiency in data-sharing, especially for a country like Indonesia with governmental bodies at multiple levels of subdivision. Increasingly, local governments find themselves needing to make information available to the public, and even the UN encourages governments around the world to embrace digital government.

This data is especially significant for the finer-grained levels of local government - our sub-districts. They are the closest to the people in terms of response time, capability of enforcing quarantines and, most importantly, distributing aid, and their efforts are oftentimes more effective and targeted than country-wide ones. Their sub-district data, in turn, feeds back to the higher-ups - the districts, the provinces, and finally the country-wide government - who can then adjust their regulations, send targeted aid to particular areas, and better address the situation. In other words, this data enables the different levels of government to work towards a common goal like a well-oiled machine.

Such data isn’t just limited to sharing up-to-date information: it also extends to visualising said data in an accurate and easy-to-understand format. In a situation where response time is key, it is critical that readers can understand the data from a few key visualisations and start planning how to take action. In addition, such data is transparent to citizens as well, giving them a gauge of the risk areas and keeping them up-to-date on the measures in place.

This is the significance of geospatial analytics and of this analysis: to ensure that data is accurate and understandable for all readers regardless of their background or experience, in the hopes of enabling action and keeping readers in the loop.

Future Impact/Suggestions

While the COVID-19 pandemic is still ongoing, there are particular events that will undoubtedly lead to spikes in cases, even with heavy government intervention. For example, a number of citizens circumvented the ban on mudik (the traditional homecoming) to visit their hometowns for the Hari Raya Aidilfitri holidays.

Additionally, if distinctions are made between the variants/strains of COVID-19 cases (Alpha to Delta), I believe we can create a more accurate analysis, with more potential information to be gleaned. This is in light of the discovery of the Delta variant of COVID-19: its fast spread between sub-districts implies that we should have a separate analysis tracking it, so as to best contain it and deliver aid to the affected sub-districts.

10.0 Acknowledgements

Thank you, Prof. Kam, for all the lecture notes and hands-on exercises - I took a great deal of reference from them, from data wrangling to mapping!

Resources Used

Here is the list of resources used in this analysis, as well as their links. Special thanks go to the Senior Sample Submissions that Prof. Kam recommended! ❤️