Introduction

data.validator is a set of tools for creating reports based on validation results from assertr.

It provides tools for creating user-friendly reports that you can send by email, store in a logs folder, or generate automatically with RStudio Connect.

Validate data

This is a basic example of how to validate data:

library(assertr)
library(dplyr)
library(data.validator)

# Create the report object that collects the validation results
report <- data_validation_report()

validate(mtcars) %>%
  validate_cols(description = "vs and am values equal 0 or 2 only",
                in_set(c(0, 2)), vs, am) %>%
  validate_cols(description = "vs and am values should equal 3 or 4",
                skip_chain_opts = TRUE,
                error_fun = warning_append, in_set(c(3, 4)), gear, carb) %>%
  validate_rows(description = "Each row sum for am:vs columns is less or equal 1",
                rowSums, within_bounds(0, 1), vs:am) %>%
  validate_cols(description = "For wt and qsec we have: abs(col) < 2 * sd(col)",
                within_n_sds(2), wt, qsec) %>%
  validate_if(description = "Column drat has only positive values",
              drat > 0) %>%
  validate_if(description = "Column drat has only values larger than 3",
              drat > 3) %>%
  add_results(report)

See the assertr vignette for the full specification.

Present results with data.validator

The first step is to create a validator.

library(assertr)
library(dplyr)
library(data.validator)

report <- data_validation_report()

Next, we add validation results to the validator.

validate(mtcars) %>%
  validate_cols(description = "vs and am values equal 0 or 2 only",
                in_set(c(0, 2)), vs, am) %>%
  validate_cols(description = "vs and am values should equal 3 or 4",
                skip_chain_opts = TRUE, error_fun = warning_append,
                in_set(c(3, 4)), gear, carb) %>%
  validate_rows(description = "Each row sum for am:vs columns is less or equal 1",
                rowSums, within_bounds(0, 1), vs:am) %>%
  validate_cols(description = "For wt and qsec we have: abs(col) < 2 * sd(col)",
                within_n_sds(2), wt, qsec) %>%
  validate_if(description = "Column drat has only positive values",
              drat > 0) %>%
  validate_if(description = "Column drat has only values larger than 3",
              drat > 3) %>%
  add_results(report)

Finally, we use one of the available methods to present the results.

Either print the summary:

print(report)
#> Validation summary: 
#>  Number of successful validations: 1
#>  Number of validations with warnings: 1
#>  Number of failed validations: 4
#> 
#> Advanced view: 
#> 
#> 
#> |table_name |description                                       |type    | total_violations|
#> |:----------|:-------------------------------------------------|:-------|----------------:|
#> |mtcars     |Column drat has only positive values              |success |               NA|
#> |mtcars     |Column drat has only values larger than 3         |error   |                4|
#> |mtcars     |Each row sum for am:vs columns is less or equal 1 |error   |                7|
#> |mtcars     |For wt and qsec we have: abs(col) < 2 * sd(col)   |error   |                4|
#> |mtcars     |vs and am values equal 0 or 2 only                |error   |               27|
#> |mtcars     |vs and am values should equal 3 or 4              |warning |               24|

or save it as an HTML report.

save_report(report)
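As a minimal sketch of storing the outputs in a dedicated folder, the example below passes output_file and output_dir to save_report and writes a plain-text log with save_summary. The folder name, file names, and the single validation rule are assumptions chosen for illustration; check the package reference for the exact arguments available in your installed version.

```r
library(assertr)
library(magrittr)
library(data.validator)

report <- data_validation_report()

validate(mtcars) %>%
  validate_if(drat > 0, description = "Column drat has only positive values") %>%
  add_results(report)

# Hypothetical logs folder inside the session's temporary directory
out_dir <- file.path(tempdir(), "validation_logs")
dir.create(out_dir, showWarnings = FALSE)

# HTML report with an explicit file name and location
save_report(report, output_file = "mtcars_report.html", output_dir = out_dir)

# Plain-text summary, convenient for log folders
summary_file <- file.path(out_dir, "validation_log.txt")
save_summary(report, file_name = summary_file)
```

Rendering the HTML report relies on R Markdown, so pandoc has to be available on the machine that runs it.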

Creating custom reports

Define a function that has a validation_results parameter and returns an HTML object or HTML widget. The function receives the results table that get_results(validator) returns.

Note: The function can also take optional parameters. Pass their values to save_report when generating a new report.
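To see what such a function receives, it can help to inspect the results table directly. The sketch below is illustrative only: the validation rule is arbitrary, and the column names selected at the end (table_name, description, type) are assumed from the printed summary shown earlier, so they may differ between package versions.

```r
library(assertr)
library(magrittr)
library(data.validator)

report <- data_validation_report()

validate(mtcars) %>%
  validate_if(drat > 3, description = "Column drat has only values larger than 3") %>%
  add_results(report)

# The table that is handed to a custom report function
validation_results <- get_results(report)

# List the available columns, then look at a few of them
# (column names are an assumption based on the printed summary)
names(validation_results)
validation_results[, c("table_name", "description", "type")]
```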

The example below creates a custom report that shows the results of checking whether the population of each Polish county fits within 3 standard deviations.

library(magrittr)
library(assertr)
library(data.validator)

report <- data_validation_report()

file <- system.file("extdata", "population.csv", package = "data.validator")
population <- read.csv(file, colClasses = c("character", "character", "character",
                                            "integer", "integer", "integer"))

validate(population) %>%
  validate_cols(within_n_sds(3), total) %>%
  add_results(report)

print(report)
#> Validation summary: 
#>  Number of successful validations: 0
#>  Number of validations with warnings: 0
#>  Number of failed validations: 1
#> 
#> Advanced view: 
#> 
#> 
#> |table_name |description |type  | total_violations|
#> |:----------|:-----------|:-----|----------------:|
#> |population |NA          |error |                6|

We can also present the results on a Leaflet map.

# Custom report function: draws the counties on a Leaflet map and
# highlights the ones that violated the validation rule
render_leaflet_report <- function(validation_results, population_data, correct_col, violated_col) {
  # Load the county shapes shipped with the package
  file <- system.file("extdata", "counties.json", package = "data.validator")
  states <- rgdal::readOGR(file, GDAL1_integer64_policy = TRUE, verbose = FALSE)

  # Row indices of the observations that violated the rule
  violated <- validation_results %>%
    tidyr::unnest(error_df, keep_empty = TRUE) %>%
    dplyr::pull(index)

  # Attach population data, then set fill colors and hover labels
  states@data <- dplyr::left_join(states@data, population_data,
                                  by = c("JPT_KOD_JE" = "county_ID"))
  states@data$color <- correct_col
  states@data$color[violated] <- violated_col
  states@data$label <- glue::glue("County: {states@data$county} <br>",
                                  "Population: {states@data$total}")

  # Return an HTML object: a heading followed by the map widget
  htmltools::tagList(
    htmltools::h2("Counties not fitting within 3 standard deviations"),
    leaflet::leaflet(states) %>%
      leaflet::addPolygons(color = "#444444", weight = 1, smoothFactor = 0.5,
                           opacity = 0.5, fillOpacity = 0.5,
                           fillColor = states@data$color,
                           label = states@data$label %>% lapply(htmltools::HTML),
                           highlightOptions = leaflet::highlightOptions(color = "white",
                                                                        weight = 2,
                                                                        bringToFront = TRUE))
  )
}

save_report(
  report,
  ui_constructor = render_leaflet_report,
  population_data = population,
  correct_col = "#52cf0a",
  violated_col = "#bf0b4d"
)
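A custom report function does not need to be as elaborate as the Leaflet one. The sketch below is a deliberately small, hypothetical constructor, render_counts_report, that lists how many validations ended with each result type. It assumes the results table has a type column (as the printed summary suggests) and uses only htmltools. The hand-built example_results data frame stands in for real validation results so the function can be tried on its own.

```r
# Hypothetical minimal report function: counts validations per result type.
# `heading` is an optional parameter with a default value.
render_counts_report <- function(validation_results, heading = "Validations by result type") {
  counts <- table(validation_results$type)

  # One list item per result type, e.g. "error: 2"
  items <- lapply(names(counts), function(type) {
    htmltools::tags$li(paste0(type, ": ", counts[[type]]))
  })

  # Return an HTML object, as save_report expects from a report function
  htmltools::tagList(
    htmltools::h2(heading),
    htmltools::tags$ul(items)
  )
}

# Stand-in for real validation results (assumption: a `type` column is present)
example_results <- data.frame(type = c("error", "success", "error"))
html <- as.character(render_counts_report(example_results))
cat(html)
```

With a real report, the function would be passed in the same way as the Leaflet constructor, for example save_report(report, ui_constructor = render_counts_report, heading = "Daily checks").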

Creating custom report templates

To generate R Markdown reports, data.validator uses a predefined report template like the one below.

---
title: Data validation report
output: html_document
params:
  generate_report_html: !expr function(...) {}
  extra_params: !expr list()
---

#### `r format(Sys.time(), "%Y-%m-%d %H:%M:%S")`

```{r generate_report, echo = FALSE}
params$generate_report_html(params$extra_params)
```

You can use the default template as a basis for creating your own template. To do this, first load the package in RStudio. Then select File > New File > R Markdown > From Template > Simple structure for HTML report summary.

Next, modify the template by adding, for example, a custom title or graphics. Leave the params section in the header unchanged, as well as the generate_report content renderer chunk.

When calling the save_report function, make sure to specify the path to the custom template in the template parameter.
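As a rough sketch of that last step, the example below writes a template with a custom title to a temporary file and passes its path through the template parameter. The template text mirrors the default one shown above; the title, file names, and validation rule are assumptions made for illustration.

```r
library(assertr)
library(magrittr)
library(data.validator)

# Chunk delimiter built with strrep() so this example holds no literal code fence
fence <- strrep("`", 3)

# Same structure as the default template; only the title is changed.
# The params section and the generate_report chunk stay untouched.
template_lines <- c(
  "---",
  "title: Motor Trend data validation",
  "output: html_document",
  "params:",
  "  generate_report_html: !expr function(...) {}",
  "  extra_params: !expr list()",
  "---",
  "",
  "#### `r format(Sys.time(), \"%Y-%m-%d %H:%M:%S\")`",
  "",
  paste0(fence, "{r generate_report, echo = FALSE}"),
  "params$generate_report_html(params$extra_params)",
  fence
)

# Hypothetical location of the custom template
custom_template <- file.path(tempdir(), "custom_template.Rmd")
writeLines(template_lines, custom_template)

report <- data_validation_report()

validate(mtcars) %>%
  validate_if(drat > 0, description = "Column drat has only positive values") %>%
  add_results(report)

# Render the report with the custom template instead of the default one
save_report(report,
            output_file = "custom_report.html",
            output_dir = tempdir(),
            template = custom_template)
```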

Using the package in production

Appsilon uses the package in a production environment to protect Shiny apps from running on incorrect data.

The workflow is based on the steps below:

  1. RStudio Connect Scheduler runs daily.

  2. The scheduler sources the data from a PostgreSQL table and validates it based on predefined rules.

  3. Based on the validation results, a new data.validator report is created.

    1. When data validation rules are violated:
    • The data provider and the person responsible for data quality receive a report via email. Thanks to assertr functionality, the report is easy to understand for both technical and non-technical readers.

    • The data provider makes the required data fixes.

    2. When the data meets all validation rules:
    • A specific trigger is sent to reload the data in the Shiny app.
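The branching in step 3 can be sketched roughly as follows. This is not Appsilon's production code: notify_data_owners and trigger_app_reload are hypothetical placeholders for the email and reload integrations, mtcars stands in for the PostgreSQL data, and the check on the type column assumes that failed validations are labeled "error", as in the printed summary.

```r
library(assertr)
library(magrittr)
library(data.validator)

# Hypothetical stand-ins for the real integrations
notify_data_owners <- function(report_path) message("Would email report: ", report_path)
trigger_app_reload <- function() message("Would trigger a data reload in the Shiny app")

run_validation <- function(data, output_dir = tempdir()) {
  report <- data_validation_report()

  # Illustrative rule only; real rules depend on the dataset
  validate(data) %>%
    validate_if(drat > 0, description = "Column drat has only positive values") %>%
    add_results(report)

  # Assumption: failed validations have type "error" in the results table
  results <- get_results(report)
  rules_violated <- any(results$type == "error")

  if (rules_violated) {
    # Rules violated: build the report and send it to the responsible people
    save_report(report, output_file = "validation_report.html", output_dir = output_dir)
    notify_data_owners(file.path(output_dir, "validation_report.html"))
  } else {
    # All rules met: let the app reload the data
    trigger_app_reload()
  }

  invisible(!rules_violated)
}

data_ok <- run_validation(mtcars)
```

Returning a logical value keeps the function easy to call from a scheduled script, which can then decide what else to do with the outcome.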

More examples

For more options check the package documentation or examples.