The goal of this tutorial is to introduce the interactive history browser implemented in the experiment
package. It follows one of the examples accessible via experiment::simulate_london_meters
and is based on the London meter data.
The history browser keeps track of all expressions evaluated in an R session. It remembers all objects and plots, and allows the user to move back and forth through that recorded history.
In this short introduction we will perform a simplified data exploration exercise, similar to what a “real” data exploration might look like. To keep the big picture clear, we avoid poking around too much.
We start by loading a number of packages we will need for our analysis. The history tracker does not record commands that produce no new objects or plots, so it ignores this next block of code.
library(dplyr)
library(lubridate)
library(magrittr)
library(ggplot2)
Now it is time to load the experiment
package and turn on its tracing capability. experiment
will register a callback using addTaskCallback
and use that callback to keep a record of changes in the global environment of our R session1.
library(experiment)
tracking_on()
#> Warning: creating a store named "project-store" under
#> "/home/user/my-data-project"
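The callback mechanism itself is plain base R. Here is a minimal toy sketch of the idea (the name "toy-tracker" and the callback body are illustrative, not the actual experiment implementation):

```r
# addTaskCallback() registers a function that base R invokes after every
# successfully completed top-level expression. The callback receives the
# expression, its value, a success flag and a visibility flag.
addTaskCallback(function(expr, value, ok, visible) {
  # a real tracker would diff the global environment here to spot new objects
  message("evaluated: ", deparse(expr)[[1]])
  TRUE  # returning TRUE keeps the callback registered
}, name = "toy-tracker")

x <- 1 + 1                         # triggers the callback
removeTaskCallback("toy-tracker")  # deregister when done
```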
Calling tracking_on()
in a live R session changes the R prompt to [tracked] >
. In this vignette, to make it easier to copy the R code, the prompt remains hidden.
Another important thing to notice is the warning “creating a store named…”, which informs the user that all objects created in the current session will be stored in a newly created object store2. Thus, it is perfectly possible to perform the exercise described in this vignette over a period of multiple days, closing and reopening the R session to pick up the work where it was previously left off.
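For example, a later session started in the same working directory should pick up the existing store instead of creating a new one (a sketch; the exact reattachment behavior is an assumption based on the warning shown above):

```r
# Day two: a fresh R session in the same project directory.
library(experiment)
tracking_on()  # presumably reattaches to the existing "project-store"
               # rather than emitting the "creating a store" warning again
```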
Here is the first command that produces a (new) data object. It reads, transforms and filters a CSV file distributed with the experiment
package.
input <-
  system.file("extdata/block_62.csv", package = "experiment") %>%
  readr::read_csv(na = 'Null') %>%
  rename(meter = LCLid, timestamp = tstp, usage = `energy_kWh`) %>%
  filter(meter %in% c("MAC004929", "MAC000010", "MAC004391"),
         year(timestamp) == 2013)
Let’s look at the data. It turns out that the observations are recorded every 30 minutes.
head(input)
#> # A tibble: 6 x 3
#> meter timestamp usage
#> <chr> <dttm> <dbl>
#> 1 MAC000010 2013-01-01 00:00:00 0.509
#> 2 MAC000010 2013-01-01 00:30:00 0.453
#> 3 MAC000010 2013-01-01 01:00:00 0.500
#> 4 MAC000010 2013-01-01 01:30:00 0.621
#> 5 MAC000010 2013-01-01 02:00:00 0.197
#> 6 MAC000010 2013-01-01 02:30:00 0.176
Let’s aggregate them and continue with hourly readings.
input %<>%
  mutate(timestamp = floor_date(timestamp, 'hours')) %>%
  group_by(meter, timestamp) %>%
  summarise(usage = sum(usage))
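The %<>% operator used above is magrittr’s compound assignment pipe: it pipes a variable into a chain and assigns the result back to that same variable.

```r
library(magrittr)

x <- c(1, 4, 9)
x %<>% sqrt()  # equivalent to: x <- x %>% sqrt()
x
#> [1] 1 2 3
```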
We have three meters in the data set: MAC000010, MAC004391 and MAC004929. We will look at them one by one, starting with MAC004929.
input %<>% filter(meter == "MAC004929")
Let’s take a quick glimpse at the full data set before we look at aggregations.
with(input, plot(timestamp, usage, type = 'p', pch = '.'))
All right! That doesn’t reveal much. How about breaking the data set down by hour and day of week? Any patterns there? We start by aggregating the input
set into a temporary variable x
.
x <-
  input %>%
  mutate(hour = hour(timestamp),
         dow = wday(timestamp, label = TRUE)) %>%
  mutate_at(vars(hour, dow), funs(as.factor)) %>%
  group_by(hour, dow) %>%
  summarise(usage = mean(usage, na.rm = TRUE))
And now we can take a look at the by-hour plot:
with(x, plot(hour, usage))
And the hour-by-day-of-the-week breakdown:
ggplot(x) + geom_point(aes(x = hour, y = usage)) + facet_wrap(~dow)
So these are mean values. How about the distribution around the mean? We can visualize that with a boxplot. We start by overwriting the x
variable and then produce a new plot.
x <-
  input %>%
  mutate(hour = hour(timestamp),
         dow = wday(timestamp)) %>%
  mutate_at(vars(hour, dow), funs(as.factor))
ggplot(x) + geom_boxplot(aes(x = hour, y = usage)) + facet_wrap(~dow)
OK! Let’s look at a linear model for this data.
m <- lm(usage ~ hour:dow, x)
summary(m)
#>
#> Call:
#> lm(formula = usage ~ hour:dow, data = x)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1.04183 -0.19047 -0.03992 0.08349 3.09831
#>
#> Coefficients: (1 not defined because of singularities)
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.761096 0.050023 15.215 < 2e-16 ***
#> hour0:dow1 -0.124288 0.070744 -1.757 0.078973 .
#> hour1:dow1 -0.270596 0.070744 -3.825 0.000132 ***
#> hour2:dow1 -0.478827 0.070744 -6.768 1.39e-11 ***
...
#> hour22:dow7 -0.007462 0.070744 -0.105 0.916003
#> hour23:dow7 NA NA NA NA
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.3607 on 8592 degrees of freedom
#> Multiple R-squared: 0.3471, Adjusted R-squared: 0.3344
#> F-statistic: 27.35 on 167 and 8592 DF, p-value: < 2.2e-16
At this point we might decide we know enough. (We probably don’t yet, but for the sake of the presentation let’s assume we actually do. After all, this is an introduction to the history browser, not to time series analysis.)
So what does the history look like so far? We can open an interactive viewer by calling experiment::browserAddin()
. It is a htmlwidget
, so when you call it in an actual R session in RStudio, it opens in an interactive window overlaying the main RStudio window34. In RStudio you will also have extra buttons and interactions; more about this in the next section.
experiment::browserAddin()
Each node represents either an object introduced into the R session at some point in time, or a plot. Objects have their names displayed inside the node; plots are shown as thumbnails.
You can hover your mouse cursor over each node in the history to see the expression that produced the given object, along with its general characteristics, like the dimensions of a data.frame
or the AIC
value of a linear model.
You can also zoom in and out. When zooming out far enough, the view switches from showing all individual nodes to showing groups of nodes, grouped according to their creation time. Hovering the mouse over a group reveals the names of its nodes.
Let’s go back to the last step before narrowing down to just one meter. Clicking on the second input
node in the history window selects that node (notice the green highlight on the border of the node). In RStudio, at this point you would click the Done button, but since this is a static HTML vignette, that button is not available. The Done button, together with the whole window title bar, looks as below:
Thus, highlighting the node and clicking on the Done button brings back the state of the R session from when that object was created, which we will assume happens at this point of our vignette. We restore the state of the R session from the time when the second input
node was created.
Now we can try a different house.
input %<>% filter(meter == "MAC000010")
We aggregate the data with the same query as before and look at the boxplot. Anything interesting here?
x <-
  input %>%
  mutate(hour = hour(timestamp),
         dow = wday(timestamp)) %>%
  mutate_at(vars(hour, dow), funs(as.factor))
ggplot(x) + geom_boxplot(aes(x = hour, y = usage)) + facet_wrap(~dow)
The history looks different now, as there is a second branch reflecting the last three commands we have just issued.
experiment::browserAddin()
OK, so how about the third house in the data set? We restore the same point in time again, and repeat the same sequence of commands.
input %<>% filter(meter == "MAC004391")
x <-
  input %>%
  mutate(hour = hour(timestamp),
         dow = wday(timestamp)) %>%
  mutate_at(vars(hour, dow), funs(as.factor))
ggplot(x) + geom_boxplot(aes(x = hour, y = usage)) + facet_wrap(~dow)
As we can see, the history gets updated again to reflect the third branching on the third house in the data set.
experiment::browserAddin()
Our last step will be reducing the size of the history graph presented in the widget. We do it with the query_by()
function. Let’s start by finding all variables named input
.
h <- query_by(is_named('input'))
Looking at the history graph reveals that it is now much smaller.
plot(h)
How about finding only data frames?
h <- query_by(inherits('data.frame'))
plot(h)
And finally we ask to see only the plots.
h <- query_by(inherits('plot'))
plot(h)
In case you run into problems when rebuilding this vignette, here is what my current R session looks like:
library(devtools)
devtools::session_info()
#> setting value
#> version R version 3.4.3 (2017-11-30)
#> system x86_64, linux-gnu
#> ui X11
#> language en_US
#> collate en_US.UTF-8
#> tz America/Los_Angeles
#> date 2018-01-29
#>
#> package * version date source
#> assertthat 0.2.0 2017-04-11 CRAN (R 3.4.0)
#> backports 1.1.1 2017-09-25 CRAN (R 3.4.2)
#> base * 3.4.3 2017-12-01 local
#> bindr 0.1 2016-11-13 CRAN (R 3.4.2)
#> bindrcpp * 0.2 2017-06-17 CRAN (R 3.4.2)
#> broom 0.4.2 2017-02-13 CRAN (R 3.4.2)
#> clisymbols 1.2.0 2017-05-21 CRAN (R 3.4.2)
#> colorspace 1.3-2 2016-12-14 CRAN (R 3.4.1)
#> compiler 3.4.3 2017-12-01 local
#> crayon 1.3.4 2017-09-16 CRAN (R 3.4.2)
#> datasets * 3.4.3 2017-12-01 local
#> defer 0.3.0 2017-12-26 local
#> devtools * 1.13.4 2017-11-09 CRAN (R 3.4.2)
#> digest 0.6.12 2017-01-27 CRAN (R 3.4.0)
#> dplyr * 0.7.4 2017-09-28 CRAN (R 3.4.2)
#> evaluate * 0.10.1 2017-06-24 CRAN (R 3.4.2)
#> experiment * 0.1 2018-01-29 local
#> foreign 0.8-69 2017-06-21 CRAN (R 3.4.2)
#> ggplot2 * 2.2.1 2016-12-30 CRAN (R 3.4.1)
#> glue 1.1.1 2017-06-21 CRAN (R 3.4.2)
#> graphics * 3.4.3 2017-12-01 local
#> grDevices * 3.4.3 2017-12-01 local
#> grid 3.4.3 2017-12-01 local
#> gtable 0.2.0 2016-02-26 CRAN (R 3.4.1)
#> hms 0.3 2016-11-22 CRAN (R 3.4.0)
#> htmltools 0.3.6 2017-04-28 CRAN (R 3.4.0)
#> htmlwidgets 0.9 2017-07-10 cran (@0.9)
#> jsonlite 1.5 2017-06-01 CRAN (R 3.4.2)
#> knitr * 1.17 2017-08-10 CRAN (R 3.4.2)
#> labeling 0.3 2014-08-23 CRAN (R 3.4.1)
#> lattice 0.20-35 2017-03-25 CRAN (R 3.4.2)
#> lazyeval 0.2.0 2016-06-12 CRAN (R 3.4.0)
#> lubridate * 1.6.0 2016-09-13 CRAN (R 3.4.0)
#> magrittr * 1.5 2014-11-22 CRAN (R 3.4.0)
#> memoise 1.1.0 2017-04-21 CRAN (R 3.4.0)
#> methods * 3.4.3 2017-12-01 local
#> mnormt 1.5-5 2016-10-15 CRAN (R 3.4.2)
#> munsell 0.4.3 2016-02-13 CRAN (R 3.4.1)
#> nlme 3.1-131 2017-02-06 CRAN (R 3.4.2)
#> parallel 3.4.3 2017-12-01 local
#> pkgconfig 2.0.1 2017-03-21 CRAN (R 3.4.2)
#> plyr 1.8.4 2016-06-08 CRAN (R 3.4.0)
#> psych 1.7.8 2017-09-09 CRAN (R 3.4.2)
#> purrr 0.2.4 2017-10-18 CRAN (R 3.4.2)
#> R6 2.2.2 2017-06-17 CRAN (R 3.4.2)
#> Rcpp 0.12.13 2017-09-28 CRAN (R 3.4.2)
#> readr 1.1.1 2017-05-16 CRAN (R 3.4.0)
#> reshape2 1.4.2 2016-10-22 CRAN (R 3.4.1)
#> rlang 0.1.4 2017-11-05 cran (@0.1.4)
#> rmarkdown 1.6 2017-06-15 CRAN (R 3.4.2)
#> rprojroot 1.2 2017-01-16 CRAN (R 3.4.0)
#> rsvg 1.1 2017-03-21 CRAN (R 3.4.3)
#> scales 0.5.0 2017-08-24 CRAN (R 3.4.1)
#> stats * 3.4.3 2017-12-01 local
#> storage 0.1.0 2018-01-22 local
#> stringi 1.1.5 2017-04-07 CRAN (R 3.4.0)
#> stringr * 1.2.0 2017-02-18 CRAN (R 3.4.0)
#> testthat * 1.0.2.9000 2017-10-22 local
#> tibble 1.3.4 2017-08-22 CRAN (R 3.4.2)
#> tidyr 0.7.2 2017-10-16 CRAN (R 3.4.2)
#> tools 3.4.3 2017-12-01 local
#> utils * 3.4.3 2017-12-01 local
#> withr 2.0.0 2017-07-28 CRAN (R 3.4.2)
#> yaml 2.1.14 2016-11-12 CRAN (R 3.4.0)
Of course, that’s not how things work inside knitr. Thus, this vignette contains code as it is supposed to look, and you should be able to simply copy & paste it into an R session. The actual source code of the vignette will reveal much more than that.↩
An object store is a persistent, filesystem-based repository of R artifacts (data sets, functions, plots, etc.) produced while working with R.↩
The mechanics of that are not part of this vignette; however, in case this turns out to be helpful: in RStudio it is a gadget, created with shiny::runGadget
and displayed in a shiny::dialogViewer
.↩
Also, in RStudio, you can map this function to a keyboard shortcut. experiment
contains all the necessary configuration files. See RStudio Addins for more details.↩