
R Statistics Support

R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S. There are some important differences, but much code written for S runs unaltered under R.

R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, etc.) and graphical techniques, and is highly extensible. The S language is often the vehicle of choice for research in statistical methodology, and R provides an Open Source route to participation in that activity.

R is available as Free Software under the terms of the Free Software Foundation's GNU General Public License in source code form. It compiles and runs on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), Windows and MacOS.

In MIKE OPERATIONS, R for Windows is supported from the MIKE Workbench by introducing a tool that takes R scripts as arguments. Please refer to the MIKE Workbench Help file for information about the R tools.
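
As an informal illustration, the script below is the kind of stand-alone R script such a tool could be pointed at. The file name, column names and the statistics computed are assumptions made for this example; the actual input format is defined by the tool configuration.

    # Hypothetical stand-alone script; "timeseries.csv" and its columns
    # "observed"/"simulated" are assumptions for this example.
    ts_data  <- read.csv("timeseries.csv")
    residual <- ts_data$observed - ts_data$simulated   # forecast errors
    cat("Mean error:", mean(residual, na.rm = TRUE), "\n")
    cat("RMSE:      ", sqrt(mean(residual^2, na.rm = TRUE)), "\n")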

Configuration

Note that the R tools described in this chapter are supported using R libraries prior to R version 3.5. In MIKE OPERATIONS, the R-Statistics tools are supported through the configuration of stations. This is done by introducing a number of new time series types (object types); a representative measure from each type is sketched after the list below.

  • Skill Scores.
  • Confidence Intervals.
  • Goodness of Fit.
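
As an informal illustration of what the three types cover, the base-R sketch below computes one representative measure from each family. The input vectors are made up, and the choice of persistence as the reference forecast for the skill score is an assumption for the example.

    # Made-up station values for the example
    obs <- c(1.2, 1.5, 1.9, 2.4, 2.1, 1.8)   # observed
    sim <- c(1.1, 1.6, 2.0, 2.2, 2.3, 1.7)   # simulated (forecast)

    # Goodness of fit: Nash-Sutcliffe efficiency (1 = perfect fit)
    nse <- 1 - sum((obs - sim)^2) / sum((obs - mean(obs))^2)

    # Skill score: improvement over a naive persistence forecast
    persistence <- c(NA, head(obs, -1))       # previous observation
    skill <- 1 - mean((obs - sim)^2) /
                 mean((obs - persistence)^2, na.rm = TRUE)

    # Confidence interval: empirical 90% band from the error sample
    band <- quantile(obs - sim, probs = c(0.05, 0.95))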

When adding a new statistics time series (object), additional information must be specified on the time series. This is done on the Statistics tab in the details view of the time series.

  • Statistics Results
    For Skill Scores and Goodness of Fit calculations, analysis results are saved in spreadsheets in the MIKE Workbench.
    Specify the Results Spreadsheet by clicking the ellipsis button of the input control.
  • Error model
    For Confidence Intervals, the results of the analysis are saved as error models in the document manager of the MIKE Workbench.
    Select the error model from the Error model drop-down box.
  • Measure Selection
    For each statistics type, a list of available measures is presented. Select the measures to be displayed for each station in the check box list.
  • Reference Time Series
    Specify the reference time series used as base for the statistics.
  • Historical Simulation Path
    Path to a simulation used for calculating the last error. The last error can be taken into account when creating confidence intervals.
    The last error is calculated as the difference between the last observed value at a specified date and time and the corresponding historical simulation result (obs. value – sim. value); the lookup is sketched after this list.
    When calculating the last error, the observation time series is found as the first time series definition on the feature type with the type “Time series”.
  • Max. Search Window
    The maximum search window used when finding the “last error” when creating confidence intervals.
    The maximum search window is specified as a time span back from the Time of Forecast (the period before ToF).
    The time span is specified as days, hours, minutes and seconds (dd:hh:mm:ss).
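
A minimal base-R sketch of the last-error lookup described above, using made-up observation and simulation values; in practice MIKE OPERATIONS performs this lookup internally based on the station configuration.

    # Time of Forecast and the Max. Search Window (here one day)
    tof    <- as.POSIXct("2024-05-01 12:00:00", tz = "UTC")
    window <- as.difftime(1, units = "days")

    # Hypothetical observed and historical simulation values
    obs_times  <- as.POSIXct(c("2024-04-30 06:00:00",
                               "2024-05-01 06:00:00"), tz = "UTC")
    obs_values <- c(2.10, 2.35)
    sim_values <- c(2.00, 2.20)   # historical simulation at obs_times

    # Latest observation within the window before ToF
    in_window <- obs_times >= (tof - window) & obs_times <= tof
    if (any(in_window)) {
      i <- max(which(in_window))
      last_error <- obs_values[i] - sim_values[i]   # obs. value - sim. value
    } else {
      last_error <- NA   # no observation found within the search window
    }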

When the configuration has been completed, the results can be displayed by clicking a station in the map or in the table view of MIKE OPERATIONS.

Quality verification of data collection spreadsheet

Skill scores, goodness-of-fit measures and error models are sensitive to the quality of the observations, since the observations are regarded as the true value of what is modelled. Errors in collected observations are therefore reflected in the statistics and can render the results useless. The example below shows a scatterplot of collected simulated and observed discharge pairs at a 10-hour forecast lead time. Missing observation values have the default value of -999, and several observations are seen to have negative values.

If we look at the Nash-Sutcliffe index for the above dataset, the plot to the right below does not give us much information about the forecast quality as it changes with lead time. In the plot to the left, all values below zero have been removed, and the decay in forecast skill with lead time becomes visible.
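
A minimal base-R sketch of this screening step; the data frame is made up for illustration.

    # Drop the -999 missing-value codes and other negative observations
    # before computing the Nash-Sutcliffe index
    dat <- data.frame(
      observed  = c(3.2, -999, 2.8, -0.4, 3.9),
      simulated = c(3.0,  3.1, 2.9,  3.3, 3.7)
    )
    valid <- dat[dat$observed >= 0, ]
    nse <- 1 - sum((valid$observed - valid$simulated)^2) /
               sum((valid$observed - mean(valid$observed))^2)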

If the error model is trained on bad data, the observation errors are present in the calculated confidence bands. This is clear from the two plots below, where the top plot includes the complete dataset and the bottom plot shows the model trained on the dataset where negative observations have been removed. Proper quality verification of the observed data is therefore very important when the data are used in statistical applications.