---
title: "Writing User-Defined Summary Functions"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Writing User-Defined Summary Functions}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}

---

```{r, include = FALSE}
suppressPackageStartupMessages(require(tangram.pipe))
suppressPackageStartupMessages(require(knitr))
suppressPackageStartupMessages(require(kableExtra))
```

# Overview
The package `tangram.pipe` can be used to iteratively build a table which
allows each row to be uniquely customized. All the possible changes can be seen
in the package's main vignette, "Customizeable Table Building with Tangram Pipe".
One main package feature which is not discussed there is that a user may write
their own summary function for the table rows. By default, the package will use
default summary functions to calculate a 5-number summary, plus the mean and standard
deviation, for numeric data; column-wise proportions are generated for categorical
and binary rows. Currently, there are a total of five prewritten numeric summary
functions, as well as four prewritten functions for both categorical and binary data.
However, it is often the case that a user wants to have increased flexibility
and format `tangram.pipe` output tables in a different way than provided by 
the currently-available options. This document is intended to walk a user through 
how to write a custom-made summary function, as well as some suggested inputs and 
outputs to include for user-defined summary functions using `tangram.pipe`.

# Function Inputs
To see how the default functions for summarizing data work, let's take a look at
the function usage for summarizing numerical data, `num_default`.

```{r, eval = FALSE}
num_default(dt, rowlabel, missing, digits)
```

All prewritten summary functions for numerical and categorical data take on a generic
form such as `num_default(dt, ...)`, where only the argument `dt` is required. However,
in order for these summary functions to work correctly, a total of four arguments
are passed to each of the functions.

1. `dt`: The dataset to use for the function must be passed into the summary tool.
However, the full dataset cannot be implemented into the summary function. Based on how
the summary is used in the row functions, the first column of `dt` must contain
the row information for the table; the second column should include the table's
column information, if applicable. Be sure that `dt` is a dataframe object.

2. `rowlabel`: This is the label you want to use for the row in the table. It should
match the rowlabel you specify in the row-defining function.

3. `missing`: A binary TRUE/FALSE variable which tells the function whether or
not to account for missing data. It should match the designation for missing data
from the row function.

4. `digits`: The number of significant digits to use in the summary.

Each of the prewritten functions use the ellipsis (`...`) in place of the final three
arguments to provide flexibility in writing custom functions. To write your own function,
the bare minimum requirement is that you provide an argument for the dataframe object
to use in the summary. The remaining arguments `rowlabel`, `missing`, and `digits`
are highly recommended to use within your custom function because it is within the summary
that these values, which are specified in the row-initialization, are implemented.
If you do not include them in your summary function, your specifications for these
arguments in the row initialization will not be present in your table for the row
of interest. Therefore, it is highly recommended, but not required, for you to include
these inputs in your summaries; excluding them will not break the package.

When you write a custom function, be sure to include all arguments outside `dt`
within the ellipsis (`...`). This is because `tangram.pipe`'s for functions will
input values for `rowlabel`, `missing`, and `digits` as done in the prewritten
functions. To provide additional flexibility, using (`...`) as the second argument
following `dt` will allow for differing arguments to be used while preventing
the custom function from inadvertently breaking the package row functions. You can
call your inputs within the function body by inputting (`...`) into a list and 
calling the elements of (`...`).

Note that none of the summary function's arguments should have default values. Since
summary functions are called within the row functions from `tangram.pipe`, the function
will end up taking the values already entered into the row function, so be sure
not to use default values here.  Below is an example usage of a generic summary function,
`summary_generic` using the `iris` dataset. Here, we want `Sepal.Length` to be the
row variable and `Species` to be the columns of the main table. Suppose we want to
call the row variable "Sepal Length (cm)", account for missing data, and use 2 significant
digits. First, we show the format the data (`dt`) must be in to pass it to the function

```{r}
iris %>% 
  select(Sepal.Length, Species) %>%
  head() %>%
  kable(escape = F, align = 'c') %>%
  trimws() %>%
  kable_styling(c("striped", "bordered"))
```

Note that the row variable is on the left and the column variable is to the right.
If you wanted to avoid splitting by `Species`, you would only pass the `Sepal.Length` 
information into the summary function.

Now, we show the code input needed for our generic summary function.

```{r, eval = FALSE}
summary_generic(dt = iris %>% select(Sepal.Length, Species), 
                rowlabel = "Sepal Length (cm)", 
                missing = TRUE, 
                digits = 2)
```

# Body of the function

When writing your summary function, it is important to take note of a few aspects
the function should be sure to incorporate within its text. The first important check
to make is whether your data includes a column variable or not. The data will be
structured differently depending on whether or not your data has two columns or
only one, so be sure your function can handle both types of data. 

Second, you will likely want to label your variable using the name specified in 
`rowlabel`. It is in the summary that the rowlabel specified in the row-initializing 
function is added to the table, so if you neclect this step, the final table will 
not have the label the user specifies during row initialization. 

Next, the function needs to calculate the amount of missing data if `missing = TRUE`
in the function. As with the `rowlabel` function, missing data handling is specified
in the row initialization but calculated within the summary function, so be sure to
write the function in such a way that missing data can be handled if specified.

Finally, be sure that all summary statistics are rounded based on the `digits` argument.
The `round` and `sprintf` functions are common tools used to accomplish this so
the table output can have a polished look.

# Required output

At a minimum, the output of each function should be a dataframe object. Any other
object type will cause the row function to fail since the final table, as well as
any comparison tests, need dataframes to combine the results together into the 
finished product. The rightmost column should also be the "Overall" column which
contains the summary statistics for the dataset as a whole without accounting
for the table's column variable. This is because the row functions will eliminate
this column if the user sets `overall = FALSE` during row initialization.

So long as the above two requirements are met, the summary function will not break 
the preexisting functions of `tangram.pipe`. However, there are certain naming 
recommendations so that the table is formatted well. Ideally, the row name should be
in the first column, called "Variable". A column labelling the type of measurement,
called "Measurement", as well as naming the overall column "Overall", will keep
the column names consistent with the package's default summary functions. If you
decide to use different names, it is recommended that you keep the names consistent
with each new summary function that you use and that you do not mix in rows with
default summary functions; mixing in different naming conventions will cause information
that should be contained in one column being spread out over multiple columns.

Below are example outputs from the preexisting `num_default` and `cat_default`
summary functions. It is recommended that you include columns for the variable name,
the measure type, the column categories (if applicable), and the overall column,
being sure to keep naming conventions consistent. As of version 1.1.0, `tangram.pipe`
default summaries now also calculate the total number of instances, N.

```{r}
iris %>%
  select(Sepal.Length, Species) %>%
  num_default(rowlabel = "Sepal Length (cm)", missing = TRUE, digits = 2) %>%
  kable(escape = F, align = 'l') %>%
  trimws() %>%
  kable_styling(c("striped", "bordered"))

iris %>%
  mutate(Stem.Size = sample(c("Small", "Medium", "Medium", "Large"), size=150, replace=TRUE)) %>%
  select(Stem.Size, Species) %>%
  cat_default(rowlabel = "Stem Size", missing = TRUE, digits = 2) %>%
  kable(escape = F, align = 'l') %>%
  trimws() %>%
  kable_styling(c("striped", "bordered"))
```

# Special note regarding binary rows

Binary row summary functions differ slightly from numerical and categorical rows
because `tangram.pipe`'s prewritten summary functions include three additional arguments.

```{r, eval = FALSE}
binary_default(dt, reference, ref.label, rowlabel, compact, missing, digits)
```

For binary rows, it is recommended that you include the following arguments
as well when writing your own functions:

1. `reference`: This is the reference variable to include in the table. Since binary
data only includes two possible categories, the row function is written so that only
one option will be included in the table. The category you want in the table is the
value of `reference`.

2. `ref.label`: Depending on the label you choose for your binary variable, it
may not make sense to include the name of the reference group alongside the variable
label. This argument allows you to toggle the reference group label. Of the three
additional arguments, this is arguably the lowest-priority one to include in your
custom functions, so it is only recommended to incorporate this if you are interested
in toggling the reference label on and off in your table.

3. `compact`: Often, binary data in tables is written so that the variable name is
eliminated and only the reference group appears in the table. This compacts the row
information into a single for. This TRUE/FALSE variable decides if this is how you
want the data displayed in the table.

The above variables should be included in the body of the user-defined function
so that each is dealt with accordingly. As with numerical and categorical data,
you are not required to account for these arguments in the body of your function, 
but excluding them will result in the `reference`, `ref.label`, and `compact` arguments 
defined in `binary_row` to not be implemented in your table output object.

Since writing functions for binary data can be somewhat more complicated, remember
that any data used in a binary row can also be substituted into a categorical defined 
row instead.

# Additional note regarding categorical rows

As of `tangram.pipe` version 1.1.1 (April 2022), categorical rows can now be sorted
based on a column category label. While the default categorical summary functions,
as well as any custom functions, do not require any sorting argument, if you want
to sort your categorical row in your table, the following two arguments will allow
you to do so.

1. `ordering`: The method for ordering the row variable. It is recommended that
argument accepts values that will determine what type of sorting to do. The default
summary functions use `c("ascending", "descending")` as acceptable arguments, but
you may choose whatever types of sorting and allowable names as you wish.

2. `sortcol`. The category name to sort on. The default packages accept specific
names of column categories on which to do the sort.

As with binary summary functions, these extra arguments are not necessary in order 
for the package to work; they only need to be accounted for if you want to sort
your row variable. `cat_row` assumes `NULL` values for these variables by default.

# User-Defined Comparison Tests
A similar process can be used to write custom functions for comparison tests in `tangram.pipe`.
The user is encouraged to look up the help documentation to prewritten tests for 
their desired row to determine what arguments are necessary for a custom function
to include as input.