Background

Working with data from INCA or Rockan can be a pain! Not only are some formats strange (such as Booleans and dates), the formats sometimes also differ between INCA internally and the exported files. The incadata package aims to streamline the process of reading and using RCC data (from INCA and Rockan).

Example data

This vignette will use some example data ex_data found in the package:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(incadata)
## Loading required package: decoder
## 
## Attaching package: 'incadata'
## The following object is masked from 'package:dplyr':
## 
##     id
dim(ex_data)
## [1] 497 433

It’s a data set with many columns covering all types of synthetic INCA data (it is based on real data, but everything is randomized and scrambled so as not to reveal any details about real patients, doctors, hospitals et cetera).

Let’s choose a subset of columns for illustrative purposes:

x <-
  ex_data %>%
  dplyr::select(
    a_lkf,
    a_inrappdatum,
    a_inrappsjh,
    a_inrappklk,
    a_kompl,
    a_rappSjHemSj_Beskrivning
  )

Now, how are these variables stored?

dplyr::glimpse(x)
## Rows: 497
## Columns: 6
## $ a_lkf                     <chr> "018014", "143507", "149201", "228412", "14…
## $ a_inrappdatum             <chr> "1984-12-10", "1984-06-20", "1986-08-07", "…
## $ a_inrappsjh               <chr> "53804", "53346", "99617", "53334", "99300"…
## $ a_inrappklk               <chr> "292", "883", "952", "570", "331", "207", "…
## $ a_kompl                   <chr> "", "", "", "", "True", "True", "", "", "",…
## $ a_rappSjHemSj_Beskrivning <chr> "Ja", "Ja", "Nej", "Ja", "Nej", "Ja", "Nej"…

We can see that:

  • a_inrappdatum looks like a date but is treated as character.
  • a_lkf, a_inrappsjh and a_inrappklk look like numerics but are treated as characters.
  • a_kompl looks like a Boolean but is stored as character.
  • a_rappSjHemSj_Beskrivning looks like a categorical variable but is stored as character as well.

We now want to change these formats to get something more natural to work with.

Function as.incadata

as.incadata is one of the main functions of the package. It takes either a single vector or a data frame and converts it to a format more relevant for RCC data.

The output message is quite verbose. This is intended since it is probably a good idea to check that all columns are coerced to reasonable formats.

x2 <- as.incadata(x)
## The following variables have new formats: 
## * a_inrappdatum (character -> Date)
## * a_inrappsjh   (character -> integer)
## * a_kompl       (character -> logical)
## Warning: a_lkf -> a_lkf_lan_beskrivning: transformed to match the keyvalue:  Only the first 2 characters are used.
## Warning: a_lkf -> a_lkf_kommun_beskrivning: transformed to match the keyvalue:  Only the first 4 characters are used.
## Warning: a_lkf -> a_lkf_hemort_beskrivning: Some codes could not be translated (50 cells)
## New decoded columns added: 
## * a_lkf_lan_beskrivning
## * a_lkf_kommun_beskrivning
## * a_lkf_forsamling_beskrivning
## * a_lkf_hemort2_beskrivning
## * a_lkf_hemort_beskrivning
## rownames used as id!
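The warnings above state that only the first 2 or 4 characters of a_lkf are used for some of the decoded columns. A minimal base R sketch of that idea follows. The two lookup tables are hypothetical and contain only codes that are visible in this vignette; the real key values come from the decoder package, and its internals may well differ.

```r
# Illustration only: the lookup tables are hypothetical and hold just
# the codes visible in this vignette; decoder ships complete key values.
lkf <- c("018014", "143507", "149201")

lan_code    <- substr(lkf, 1, 2)  # first 2 characters: county (län)
kommun_code <- substr(lkf, 1, 4)  # first 4 characters: municipality (kommun)

lan_key    <- c("01" = "Stockholms län", "14" = "Västra Götalands län")
kommun_key <- c("0180" = "Stockholm", "1435" = "Tanum", "1492" = "Åmål")

# Look up the descriptions by code; unknown codes would give NA
data.frame(
  a_lkf                    = lkf,
  a_lkf_lan_beskrivning    = unname(lan_key[lan_code]),
  a_lkf_kommun_beskrivning = unname(kommun_key[kommun_code])
)
```

Note that the codes must stay character for this to work: as numbers, "018014" would lose its leading zero and the first two characters would no longer be the county code.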

Let’s have a closer look at the result:

dplyr::glimpse(x2)
## Rows: 497
## Columns: 12
## $ a_lkf                        <chr> "018014", "143507", "149201", "228412", …
## $ a_inrappdatum                <date> 1984-12-10, 1984-06-20, 1986-08-07, 198…
## $ a_inrappsjh                  <int> 53804, 53346, 99617, 53334, 99300, 11011…
## $ a_inrappklk                  <chr> "292", "883", "952", "570", "331", "207"…
## $ a_kompl                      <lgl> NA, NA, NA, NA, TRUE, TRUE, NA, NA, NA, …
## $ a_rappsjhemsj_beskrivning    <chr> "Ja", "Ja", "Nej", "Ja", "Nej", "Ja", "N…
## $ a_lkf_lan_beskrivning        <chr> "Stockholms län", "Västra Götalands län"…
## $ a_lkf_kommun_beskrivning     <chr> "Stockholm", "Tanum", "Åmål", "Örnskölds…
## $ a_lkf_forsamling_beskrivning <chr> "Högalid", "Fjällbacka", "Åmål", "Björna…
## $ a_lkf_hemort2_beskrivning    <chr> "Högalid", "Fjällbacka", "Åmål", "Björna…
## $ a_lkf_hemort_beskrivning     <chr> "Katarina", "Fjällbacka", "Åmål", "Björn…
## $ id                           <chr> "1", "2", "3", "4", "5", "6", "7", "8", …

Some things have happened:

  • All variable names are transformed to lower case since these are generally easier to work with (note especially a_rappSjHemSj_Beskrivning -> a_rappsjhemsj_beskrivning). If two (or more) variable names differ only with regard to case, this will be handled adequately.
  • a_inrappdatum is now a date! Recognizing dates, especially from Rockan but sometimes also from INCA, is covered in a vignette of its own.
  • Some numeric variables are now numeric but some are (still) treated as character. The problem with some INCA unit codes is that a leading zero might bear meaning. Treating such a variable as numeric would drop the zero and mess up the codes. as.incadata therefore only treats numbers without leading zeroes as numeric (it also distinguishes between integers and decimal numbers, and it translates the Swedish decimal comma to an English decimal point).
  • a_kompl is now Boolean. This happens regardless of whether we work in INCA (where Booleans are stored as 0/1) or locally (where the same values are transformed to “True” or blanks).
  • There is a new id column pointing to individual patients. This variable will be based on either personal identification number, patient id or a simple row number. The underlying variable has different names depending on the source (INCA/Rockan), so it is easier to always have an id column with the same name. Also, if a personal identification number is included in the data, that variable will be checked (by sweidnumbr), while the id column will not.
  • There are also some new columns with names like a_lkf_xxx_beskrivning. These all stem from the fact that a_lkf is a code variable recognized by the decoder package. It was only included as a coded variable in the original data. It has now been supplemented with descriptive names of different regions based on the LKF code.
  • There are no factor variables! Factor variables can sometimes be useful but often not. We chose to avoid them by default.
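The coercion rules described in the list above can be sketched in base R. The two helpers below are hypothetical and written only for illustration; they are not the implementation used by as.incadata, which may detect and handle more cases.

```r
# Hypothetical helpers for illustration only; they are NOT the
# implementation used by as.incadata.

# "True"/blank (exported files) or "1"/"0" (inside INCA) -> logical.
# Blanks become NA, mirroring the a_kompl column above.
to_logical_sketch <- function(x) {
  x <- trimws(as.character(x))
  out <- rep(NA, length(x))
  out[x %in% c("True", "1")] <- TRUE
  out[x %in% c("False", "0")] <- FALSE
  out
}

# Character -> integer or double, unless any value has a leading zero,
# in which case the vector is kept as character (it is probably a code).
to_number_sketch <- function(x) {
  x <- trimws(as.character(x))
  x[!is.na(x) & x == ""] <- NA_character_
  values <- x[!is.na(x)]
  # A leading zero followed by another digit ("018014") marks a code
  if (any(grepl("^-?0[0-9]", values))) return(x)
  # Swedish decimal comma -> decimal point
  values <- sub(",", ".", values, fixed = TRUE)
  # Anything that is not a plain number stays character
  if (!all(grepl("^-?[0-9]+(\\.[0-9]+)?$", values))) return(x)
  out <- as.numeric(sub(",", ".", x, fixed = TRUE))
  # Whole numbers only -> integer, otherwise keep double
  if (all(grepl("^-?[0-9]+$", values))) out <- as.integer(out)
  out
}

to_logical_sketch(c("", "True", ""))      # logical with NA for blanks
to_number_sketch(c("53804", "53346"))     # integer
to_number_sketch(c("018014", "143507"))   # stays character
to_number_sketch(c("1,5", "2"))           # double
```

The key design choice is that a single value with a leading zero is enough to keep the whole column as character, since a column of codes should never be partly converted.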

Function use_incadata

Another important function from the package is use_incadata. It could be thought of as read.incadata but it is constructed to work also on INCA (where the data is already available in a data frame named “df” and therefore not read from disk).

This function has three main advantages:

  1. It can (in contrast to read.csv2 or similar) be used both locally and in INCA so there is no need to have different scripts for development and production.
  2. It uses a cache mechanism to increase speed. If the data set is big, the use of as.incadata might be slow. use_incadata only performs this coercion once, and then uses the cache automatically. If the original data file is changed (a new export from INCA), the cache will be updated automatically after comparison of MD5 checksums. (The cache mechanism is intentionally ignored if calling the function from INCA, where the data should always be fresh.)
  3. Also, as noted above, the output from as.incadata is quite verbose (for good reason), but if the same data is used over and over again, it might not be meaningful to report these messages every time, so use_incadata does not.
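To make the cache idea in point 2 concrete, here is a minimal base R sketch of a checksum-based cache. The function name read_cached_sketch, the convert argument and the ".cache.rds" file name are all assumptions made for this illustration; they do not describe the internals or the interface of use_incadata.

```r
# Hypothetical sketch of an MD5-based cache; the function name, the
# cache file naming and the `convert` argument are assumptions and do
# not describe the internals of use_incadata.
read_cached_sketch <- function(file, convert = identity) {
  cache_file <- paste0(file, ".cache.rds")   # stored next to the original
  checksum   <- unname(tools::md5sum(file))

  if (file.exists(cache_file)) {
    cached <- readRDS(cache_file)
    # Reuse the cache only if the source file is unchanged
    if (identical(cached$checksum, checksum)) return(cached$data)
  }

  # Slow path: read and convert, then store the result with its checksum
  data <- convert(read.csv2(file, colClasses = "character"))
  saveRDS(list(checksum = checksum, data = data), cache_file)
  data
}

# Usage with a small temporary file
tmp <- tempfile(fileext = ".csv")
write.csv2(data.frame(code = c("018014", "143507")), tmp, row.names = FALSE)
first  <- read_cached_sketch(tmp)  # reads the file and writes the cache
second <- read_cached_sketch(tmp)  # served from the cache
```

Because the checksum is computed from the file content, a new export with different content invalidates the cache even if the file name stays the same.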

Example

Let’s use the same data as above. We save the data to disk as a csv2-file to simulate an exported INCA file.

# Save data as csv2 in temp file
fl <- tempfile("ex_data", fileext = ".csv2")
write.csv2(incadata::ex_data, fl, row.names = FALSE)

Let us now use the data for the “first time”. The process will be verbose (but we omit the output here just to save space). When working locally, the cache will be saved next to the original file (from where it can be copied or removed like a regular file). We time the process to compare the speed with a later attempt:

##    user  system elapsed 
##   4.184   0.165   4.467

Now, let’s assume that we for some reason have to restart the process all over again (and let’s time it again for the sake of comparison):

##    user  system elapsed 
##   0.175   0.007   0.189

Voilà! The data is already in a good format and the process was much faster than before!