Introduction
data.validator is a set of tools for creating reports
based on validation results from assertr.
It provides tools for creating user-friendly reports that you can send by email, store in a logs folder, or generate automatically with RStudio Connect.
Validate data
This is a basic example of how to validate data:
library(assertr)
library(dplyr)
validate(mtcars) %>%
  validate_cols(description = "vs and am values equal 0 or 2 only",
         in_set(c(0, 2)), vs, am) %>%
  validate_cols(description = "vs and am values should equal 3 or 4",
         skip_chain_opts = TRUE,
         error_fun = warning_append, in_set(c(3, 4)), gear, carb) %>%
  validate_rows(description = "Each row sum for am:vs columns is less or equal 1",
              rowSums, within_bounds(0, 1), vs:am) %>%
  validate_cols(description = "For wt and qsec we have: abs(col) < 2 * sd(col)",
         within_n_sds(2), wt, qsec) %>%
  validate_if(description = "Column drat has only positive values",
         drat > 0) %>%
  validate_if(description = "Column drat has only values larger than 3",
         drat > 3) %>%
  add_results(report)See the assertr
vignette for the full specification.
Present results with data.validator
The first step is to create a validator.
library(data.validator)
report <- data_validation_report()Next we have to add validation results to the validator.
library(assertr)
library(dplyr)
validate(mtcars) %>%
  validate_cols(description = "vs and am values equal 0 or 2 only",
         in_set(c(0, 2)), vs, am) %>%
  validate_cols(description = "vs and am values should equal 3 or 4",
         skip_chain_opts = TRUE, error_fun = warning_append,
         in_set(c(3, 4)), gear, carb) %>%
  validate_rows(description = "Each row sum for am:vs columns is less or equal 1",
              rowSums, within_bounds(0, 1), vs:am) %>%
  validate_cols(description = "For wt and qsec we have: abs(col) < 2 * sd(col)",
         within_n_sds(2), wt, qsec) %>%
  validate_if(description = "Column drat has only positive values",
         drat > 0) %>%
  validate_if(description = "Column drat has only values larger than 3",
         drat > 3) %>%
  add_results(report)Finally, we use one of the available methods to present the results.
Either print the summary:
print(report)
#> Validation summary: 
#>  Number of successful validations: 1
#>  Number of validations with warnings: 1
#>  Number of failed validations: 4
#> 
#> Advanced view: 
#> 
#> 
#> |table_name |description                                       |type    | total_violations|
#> |:----------|:-------------------------------------------------|:-------|----------------:|
#> |mtcars     |Column drat has only positive values              |success |               NA|
#> |mtcars     |Column drat has only values larger than 3         |error   |                4|
#> |mtcars     |Each row sum for am:vs columns is less or equal 1 |error   |                7|
#> |mtcars     |For wt and qsec we have: abs(col) < 2 * sd(col)   |error   |                4|
#> |mtcars     |vs and am values equal 0 or 2 only                |error   |               27|
#> |mtcars     |vs and am values should equal 3 or 4              |warning |               24|or save it as an HTML report.
save_report(report)Creating custom reports
Define a function that has a validation_results
parameter and returns an HTML object or HTML widget. The
validation_results parameter is assumed to be passed as a
results table extracted with get_results(validator).
Note The function can also store optional parameters that
should be passed to the save_report function while
generating a new report.
In this example we create a custom report that shows validation results of checking whether population across Polish counties fits within 3 standard deviations.
library(magrittr)
library(assertr)
library(data.validator)
report <- data_validation_report()
file <- system.file("extdata", "population.csv", package = "data.validator")
population <- read.csv(file, colClasses = c("character", "character", "character",
                                            "integer", "integer", "integer"))
validate(population) %>%
  validate_cols(within_n_sds(3), total) %>%
  add_results(report)
print(report)
#> Validation summary: 
#>  Number of successful validations: 0
#>  Number of validations with warnings: 0
#>  Number of failed validations: 1
#> 
#> Advanced view: 
#> 
#> 
#> |table_name |description |type  | total_violations|
#> |:----------|:-----------|:-----|----------------:|
#> |population |NA          |error |                6|We can also present the results on a Leaflet map.
render_leaflet_report <- function(validation_results, population_data, correct_col, violated_col) {
  file <- system.file("extdata", "counties.json", package = "data.validator")
  states <- rgdal::readOGR(file, GDAL1_integer64_policy = TRUE, verbose = FALSE)
  violated <- validation_results %>%
    tidyr::unnest(error_df, keep_empty = TRUE) %>%
    dplyr::pull(index)
  states@data <- dplyr::left_join(states@data, population_data,
                                  by = c("JPT_KOD_JE" = "county_ID"))
  states@data$color <- correct_col
  states@data$color[violated] <- violated_col
  states@data$label <- glue::glue("County: {states@data$county} <br>",
                                  "Population: {states@data$total}")
  htmltools::tagList(
    htmltools::h2("Counties not fitting within 3 standard deviations"),
    leaflet::leaflet(states) %>%
      leaflet::addPolygons(color = "#444444", weight = 1, smoothFactor = 0.5,
                           opacity = 0.5, fillOpacity = 0.5,
                           fillColor = states@data$color,
                           label = states@data$label %>% lapply(htmltools::HTML),
                           highlightOptions = leaflet::highlightOptions(color = "white",
                                                                        weight = 2,
                                                                        bringToFront = TRUE))
  )
}
save_report(
  report,
  ui_constructor = render_leaflet_report,
  population_data = population,
  correct_col = "#52cf0a",
  violated_col = "#bf0b4d"
)
Creating custom report templates
In order to generate R Markdown reports data.validator
uses a predefined report template like the one below.
---
title: Data validation report
output: html_document
params:
  generate_report_html: !expr function(...) {}
  extra_params: !expr list()
---
#### `r format(Sys.time(), "%Y-%m-%d %H:%M:%S")`
```{r generate_report, echo = FALSE}
params$generate_report_html(params$extra_params)
```
You can use the default template as a basis for creating your own template. In order to do this, first load the package in RStudio. Then select File → New File → R Markdown → From Template → Simple structure for HTML report summary.
Next modify the template by adding for example a custom title or
graphics. Leave the params section in the header unchanged,
as well as the generate_report content renderer chunk.
When calling the save_report function, make sure that to
specify the path to the custom template in the template
parameter.
Using the package in production
The package is successfully used by Appsilon in a production environment for protecting Shiny apps against being run on incorrect data.
The workflow is based on the steps below:
Running RStudio Connect Scheduler daily.
The scheduler sources the data from a PostgreSQL table and validates it based on predefined rules.
Based on validation results a new
data.validatorreport is created.- 
- When data validation rules are violated:
 
The data provider and the person responsible for data quality receive a report via email. Thanks to
assertrfunctionality, the report is easily understandable for both technical and non-technical persons.The data provider makes the required data fixes.
- When the data meets all validation rules:
 
- A specific trigger is sent in order to reload the data in the Shiny app.
 
 
More examples
For more options check the package documentation or examples.