R + Shiny + randomForest with base graphics and ggplot2
Some time ago I began work on a Shiny web app for nonparametric regression and classification (“supervised learning”, “machine learning”, insert other magical-thinking-inducing jargon and buzzwords of your choice), specifically showcasing the Random Forest recursive partitioning algorithm in R via the randomForest package. The app is still under development and needs some work, but it is far enough along to share. For the time being, the app has few input fields and only one dataset. At this stage in its development I have focused my efforts on coding the graphical outputs, using both base graphics and ggplot2.
Output graphics, tables and summaries include:
- Mean classification error for the different response classes
- Confusion matrix
- Class margins (defined by ‘majority vote’ method)
- Partial dependence plot
- Outlier plot
- 2D multidimensional scaling
- Multiple importance measures: Out of Bag, Gini, and importance matrix
- Error rates
- Variable use (node splits)
- Stepwise variable selection using replicated, nested 5-fold cross-validation
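Most of the outputs listed above map onto functions in the randomForest package. The following is a minimal sketch of those calls using base graphics only, not the app's actual code. The iris dataset, the seed, and the choice of `Petal.Width` and `versicolor` for the partial dependence plot are assumptions made for illustration.

```r
library(randomForest)

set.seed(42)  # forests are random; fix the seed so results repeat

# iris stands in for the app's dataset (an assumption for illustration)
rf <- randomForest(Species ~ ., data = iris,
                   ntree = 500,
                   importance = TRUE,  # needed for permutation (OOB) importance
                   proximity = TRUE)   # needed for outlier() and MDSplot()

# Tables and summaries
rf$confusion       # confusion matrix; last column is the per-class error
head(rf$err.rate)  # OOB and per-class error rates by number of trees
importance(rf)     # importance matrix, including accuracy and Gini decreases
varUsed(rf)        # how many times each predictor was used in node splits

# Base graphics versions of the plots
plot(rf)                                            # error rates vs. trees
varImpPlot(rf)                                      # importance measures
plot(margin(rf))                                    # class margins
partialPlot(rf, iris, "Petal.Width", "versicolor")  # partial dependence
plot(outlier(rf), type = "h",
     col = as.integer(iris$Species))                # outlier measure
MDSplot(rf, iris$Species, k = 2)                    # 2D MDS of proximities
```

The proximity matrix grows with the square of the number of rows, so `proximity = TRUE` is best left off for large datasets unless the outlier and MDS plots are needed.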
There is one major caveat I need to make clear. The ‘Number of Variables’ panel allows the user to perform replicated, nested 5-fold cross-validation on a sequentially reduced predictor set. This is time-intensive. Fortunately, on RStudio’s Spark server, where the app is hosted at the time of this posting, I seem to be able to access multiple processing cores. By writing my R code for parallelization with the base R parallel package, I cut the processing time down substantially. In general it takes about 30 seconds to run.
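As a rough illustration of that idea, the sketch below spreads replicates of the cross-validation across cores. It assumes `rfcv()` from randomForest is the cross-validation routine (this post does not say which function the app uses), and the iris data, replicate count, and core count are placeholders.

```r
library(randomForest)
library(parallel)

x <- iris[, 1:4]   # predictors (iris is an illustrative stand-in)
y <- iris$Species  # categorical response

n_rep <- 4
# mclapply() forks the R process, which is not available on Windows,
# so fall back to a single core there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# L'Ecuyer streams give each forked worker its own reproducible RNG stream
RNGkind("L'Ecuyer-CMRG")
set.seed(1)

# Each replicate runs 5-fold CV over a sequentially reduced predictor set
cv_runs <- mclapply(seq_len(n_rep),
                    function(i) rfcv(x, y, cv.fold = 5),
                    mc.cores = n_cores)

# Rows: number of predictors retained; columns: replicates
error_cv <- sapply(cv_runs, "[[", "error.cv")
n_var <- cv_runs[[1]]$n.var
mean_error <- rowMeans(error_cv)

plot(n_var, mean_error, type = "b", log = "x",
     xlab = "Number of variables", ylab = "Mean CV error")
```

Because the replicates are independent of one another, they parallelize with no coordination between workers, and the speedup is limited mainly by the number of cores the host makes available.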
However, at this time, the app is hosted under a free Shiny Server account with RStudio. One of the drawbacks is that when multiple users visit an app URL in their browsers, their interactions with the single existing R process are placed in a queue. They must take turns waiting for each other’s submissions to run. For the most part no one ever notices any lag, because the reactivity in Shiny apps is fairly instantaneous. But here is a process that takes noticeable time to execute. Good practice in this situation is to intentionally introduce a break in the reactivity. I wouldn’t want this running against a user’s explicit intent, for example if they clicked on ‘Number of Variables’ in the sidebar by accident, or if they saw what it was and decided not to run it. So I use an action button for these kinds of situations. Although changes to various inputs, or even just visiting the panel, could otherwise trigger a reactive flush, in this case all other inputs are isolated from reactivity. Nothing will refresh, and the replicated cross-validation will not execute, unless the user presses the action button.
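The action button pattern described above can be sketched as follows. This is a minimal, self-contained example rather than the app's code: `run_slow_cv()` is a hypothetical stand-in for the replicated cross-validation, and the input IDs `n_rep` and `run_cv` are made up for illustration.

```r
library(shiny)

# Hypothetical stand-in for the slow replicated cross-validation
run_slow_cv <- function(n_rep) {
  Sys.sleep(2)  # pretend to work
  paste("Finished", n_rep, "replicates at", format(Sys.time(), "%H:%M:%S"))
}

ui <- fluidPage(
  sidebarLayout(
    sidebarPanel(
      sliderInput("n_rep", "Replicates", min = 1, max = 10, value = 5),
      actionButton("run_cv", "Run cross-validation"),
      helpText("This takes a while. Nothing runs until you press the button.")
    ),
    mainPanel(textOutput("cv_result"))
  )
)

server <- function(input, output) {
  output$cv_result <- renderText({
    # Reading the button value makes it the only reactive dependency
    if (input$run_cv == 0) {
      return("Press the button to start.")
    }
    # isolate() reads n_rep without subscribing to it, so moving the
    # slider alone never triggers the slow computation
    isolate(run_slow_cv(input$n_rep))
  })
}

app <- shinyApp(ui, server)
# runApp(app)  # uncomment to launch in a browser
```

Moving the slider changes `input$n_rep` but does not invalidate the output; only a button press does, at which point the current slider value is read.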
Breaking the reactivity with an action button, and including some notification on the page describing what will occur, is an acceptable approach when considering one user. But say the user decides to run the cross-validation. They may not mind waiting 30 seconds for this process to produce results. They might even be content to wait for 30 minutes. However, if other users are simultaneously interacting with the app, or visit the app while this process is being executed for another user, they will go into the queue. From their perspective the app will be non-responsive. This is not acceptable, but it is a drawback of the free version of Shiny Server. Until I have the opportunity to upgrade to the commercial version (which I believe may have just been released around the time of this posting), it is something to be aware of if you are using the app. The commercial version has the option to configure an app to launch more than one R process, up to as many as one unique R process for every visitor.
Another thing to note is that, at this time, I am using a development version of Shiny (version 0.8.0.99), which includes the otherwise yet to be released navlistPanel feature. [Update: This feature is now available in the recently released Shiny version 0.9.0.]
Until I have time to generalize the app further, I have restricted it to categorical response variables, so it will only do classification, not regression. I need to code some regression-specific plots and add some code to display certain plots only when they make sense (and can plot without error). For example, choosing a numerical response variable would currently break any plots that work only with, and make sense for, categorical response data, such as the class margins plot. Presently, the response variable menu is populated only with the categorical variables from the full list of covariates, but numerical variables are still available as predictors.
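One way to build such a restricted response menu is to filter the columns by type. The helpers below are a hypothetical sketch (the names `categorical_vars` and `is_classification` are not from the app), again using iris as a stand-in dataset.

```r
# Names of columns that can serve as a classification response
categorical_vars <- function(df) {
  is_cat <- vapply(df,
                   function(col) is.factor(col) || is.character(col),
                   logical(1))
  names(df)[is_cat]
}

# TRUE when the chosen response calls for classification, which could
# guard classification-only plots such as the class margins plot
is_classification <- function(df, response) {
  col <- df[[response]]
  is.factor(col) || is.character(col)
}

response_choices <- categorical_vars(iris)
# Every other column, numerical or not, stays available as a predictor
predictor_choices <- setdiff(names(iris), response_choices[1])

response_choices
predictor_choices
is_classification(iris, "Species")
is_classification(iris, "Sepal.Length")
```

In a Shiny app, `response_choices` could be passed as the `choices` argument of a `selectInput()`, and a check like `is_classification()` could decide whether a given plot is rendered at all.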