The package tangram.pipe can be used to iteratively build a table in
which each row can be customized individually. All the possible changes
can be seen in the package’s main vignette, “Customizeable Table
Building with Tangram Pipe”. One main package feature which is not
discussed there is that a user may write their own summary function for
the table rows. By default, the package calculates a five-number
summary, plus the mean and standard deviation, for numeric data;
column-wise proportions are generated for categorical and binary rows.
Currently, there are five prewritten numeric summary functions, as well
as four prewritten functions each for categorical and binary data.
However, a user will often want more flexibility and want to format
tangram.pipe output tables differently than the currently available
options allow. This document walks a user through how to write a custom
summary function and suggests inputs and outputs to include in
user-defined summary functions for tangram.pipe.
To see how the default functions for summarizing data work, let’s
take a look at the function usage for summarizing numerical data,
num_default.
All prewritten summary functions for numerical and categorical data
take a generic form such as num_default(dt, ...), where only the
argument dt is required. However, for these summary functions to work
correctly, a total of four arguments are passed to each of them:
dt: The dataset to summarize. The full dataset cannot be passed to the
summary function as is. Because of how the summary is used in the row
functions, the first column of dt must contain the row information for
the table; the second column should contain the table’s column
information, if applicable. Be sure that dt is a dataframe object.
rowlabel: This is the label you want to use for the row in the table.
It should match the rowlabel you specify in the row-defining function.
missing: A logical TRUE/FALSE value which tells the function whether or
not to account for missing data. It should match the designation for
missing data from the row function.
digits: The number of significant digits to use in the summary.
Each of the prewritten functions uses the ellipsis (...) in place of
the final three arguments to provide flexibility in writing custom
functions. To write your own function, the bare minimum requirement is
that you provide an argument for the dataframe object to use in the
summary. The remaining arguments rowlabel, missing, and digits are
specified in the row initialization, but it is within the summary
function that they take effect. If your summary function does not use
them, your specifications for these arguments in the row initialization
will not be reflected in your table for the row of interest. Including
these inputs in your summaries is therefore highly recommended, but not
required; excluding them will not break the package.
When you write a custom function, be sure to include all arguments
other than dt within the ellipsis (...). This is because tangram.pipe’s
row functions will supply values for rowlabel, missing, and digits, as
is done for the prewritten functions. Using (...) as the second
argument following dt allows differing arguments to be used while
preventing the custom function from inadvertently breaking the package
row functions. You can access your inputs within the function body by
collecting (...) into a list and referring to the elements of that list.
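The mechanism described above can be sketched in a few lines. The helper name unpack_args is hypothetical and not part of tangram.pipe; the sketch only assumes that the row function supplies rowlabel, missing, and digits as named arguments. Elements are extracted with [[ ]] rather than $ so that names are matched exactly.

```r
# Hypothetical helper (not part of tangram.pipe): only `dt` is a named
# argument; everything else the row function supplies arrives via `...`.
unpack_args <- function(dt, ...) {
  args <- list(...)                 # collect the extra arguments into a list
  list(
    rowlabel = args[["rowlabel"]],  # NULL if the caller did not supply it
    missing  = args[["missing"]],
    digits   = args[["digits"]]
  )
}

# An argument the function does not know about (`extra`) is simply ignored.
got <- unpack_args(iris["Sepal.Length"],
                   rowlabel = "Sepal Length (cm)", missing = TRUE,
                   digits = 2, extra = "ignored")
str(got)
```

Because unknown arguments are absorbed by the ellipsis, a function written this way should keep working even if the calling row function passes more inputs than the summary actually uses.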
Note that none of the summary function’s arguments should have
default values. Since summary functions are called within the row
functions from tangram.pipe, the function will end up taking the values
already entered into the row function, so be sure not to use default
values here. Below is an example usage of a generic summary function,
summary_generic, using the iris dataset. Here, we want Sepal.Length to
be the row variable and Species to be the columns of the main table.
Suppose we want to call the row variable “Sepal Length (cm)”, account
for missing data, and use 2 significant digits. First, we show the
format the data (dt) must be in to pass it to the function:
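library(dplyr)       # %>% and select()
library(knitr)       # kable()
library(kableExtra)  # kable_styling()

# Row variable first, column variable second; show the first six rows.
iris %>%
select(Sepal.Length, Species) %>%
head() %>%
kable(escape = F, align = 'c') %>%
trimws() %>%
kable_styling(c("striped", "bordered"))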
Sepal.Length | Species |
---|---|
5.1 | setosa |
4.9 | setosa |
4.7 | setosa |
4.6 | setosa |
5.0 | setosa |
5.4 | setosa |
Note that the row variable is on the left and the column variable is
to the right. If you wanted to avoid splitting by Species, you would
pass only the Sepal.Length information into the summary function.
Now, we show the code input needed for our generic summary function.
When writing your summary function, there are a few things it should be sure to handle. The first is to check whether your data includes a column variable. The data will be structured differently depending on whether it has two columns or only one, so be sure your function can handle both cases.
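A generic summary along these lines might look as follows. This is a minimal sketch rather than the package's own code: the body of summary_generic is hypothetical, it reports only a mean, it accepts missing without using it, and it treats digits as a number of decimal places. It does show the one-column versus two-column check and keeps the “Overall” column rightmost.

```r
# Sketch of a generic summary (hypothetical body; not tangram.pipe's code).
summary_generic <- function(dt, ...) {
  args <- list(...)
  dt <- as.data.frame(dt)
  x <- dt[[1]]                                   # row variable

  stats <- c(Overall = mean(x, na.rm = TRUE))
  if (ncol(dt) >= 2) {
    # A column variable is present: summarise within each of its categories.
    by_col <- vapply(split(x, dt[[2]]), mean, numeric(1), na.rm = TRUE)
    stats <- c(by_col, stats)                    # keep "Overall" rightmost
  }

  data.frame(
    Variable = args[["rowlabel"]],
    Measure  = "mean",
    t(round(stats, args[["digits"]])),           # one column per statistic
    check.names = FALSE,
    stringsAsFactors = FALSE
  )
}

dt <- iris[, c("Sepal.Length", "Species")]
two_col <- summary_generic(dt, rowlabel = "Sepal Length (cm)",
                           missing = TRUE, digits = 2)
one_col <- summary_generic(dt["Sepal.Length"], rowlabel = "Sepal Length (cm)",
                           missing = TRUE, digits = 2)
names(two_col)
names(one_col)
```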
Second, you will likely want to label your variable using the name
specified in rowlabel. It is in the summary that the rowlabel specified
in the row-initializing function is added to the table, so if you
neglect this step, the final table will not have the label the user
specifies during row initialization.
Next, the function needs to calculate the amount of missing data if
missing = TRUE. As with the rowlabel argument, missing data handling is
specified in the row initialization but carried out within the summary
function, so be sure to write the function in such a way that missing
data can be handled if specified.
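One way to build such a “Missing” row is sketched below. The helper name missing_row is hypothetical, and the sketch assumes that missing arrives through the ellipsis as TRUE or FALSE. It counts NA values in the row variable overall and, when a column variable is present, within each of its categories.

```r
# Hypothetical helper (not part of tangram.pipe): build a "Missing" row.
missing_row <- function(dt, ...) {
  args <- list(...)
  if (!isTRUE(args[["missing"]])) return(NULL)   # nothing to add

  x <- dt[[1]]
  n_missing <- c(Overall = sum(is.na(x)))
  if (ncol(dt) >= 2) {
    # Count NAs in the row variable within each column category.
    by_col <- vapply(split(x, dt[[2]]), function(v) sum(is.na(v)), integer(1))
    n_missing <- c(by_col, n_missing)            # keep "Overall" rightmost
  }
  data.frame(Measure = "Missing", t(n_missing), check.names = FALSE)
}

dt <- iris[, c("Sepal.Length", "Species")]
dt$Sepal.Length[150] <- NA    # introduce one missing value (a virginica row)
res <- missing_row(dt, missing = TRUE)
res
```

A row like this could then be appended with rbind to the rest of the summary, provided the column names line up.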
Finally, be sure that all summary statistics are rounded based on the
digits argument. The round and sprintf functions are common tools used
to accomplish this so the table output can have a polished look.
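The difference between the two tools is small but visible in a table. In the sketch below, fmt_stat is a hypothetical helper, and digits is assumed to mean decimal places, which is how the example tables in this document appear to be formatted. round returns a number, so trailing zeros are dropped when it is printed, while sprintf returns a string padded to a fixed width.

```r
# Hypothetical formatter: `digits` is treated as decimal places (an assumption).
fmt_stat <- function(x, digits) {
  sprintf(paste0("%.", digits, "f"), x)
}

rounded   <- round(4.3, 2)      # numeric; prints as 4.3
formatted <- fmt_stat(4.3, 2)   # character "4.30"
print(rounded)
print(formatted)
```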
At a minimum, the output of each function should be a dataframe
object. Any other object type will cause the row function to fail since
the final table, as well as any comparison tests, need dataframes to
combine the results together into the finished product. The rightmost
column should also be the “Overall” column, which contains the summary
statistics for the dataset as a whole without accounting for the table’s
column variable. This is because the row functions will eliminate this
column if the user sets overall = FALSE during row initialization.
So long as the above two requirements are met, the summary function
will not break the preexisting functions of tangram.pipe. However,
there are certain naming recommendations so that the table is formatted
well. Ideally, the row name should be in the first column, called
“Variable”. A column labelling the type of measurement, called
“Measure”, together with an overall column named “Overall”, will keep
the column names consistent with the package’s default summary
functions. If you decide to use different names, it is recommended that
you keep the names consistent across every new summary function that
you use and that you do not mix in rows with default summary functions;
mixing different naming conventions will cause information that should
be contained in one column to be spread out over multiple columns.
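The two requirements and the naming recommendations can be checked while developing a custom summary. The checker below is a hypothetical helper, not something tangram.pipe provides. It stops on the two hard requirements and only warns about the recommended names, taken here from the column headers that appear in the default output tables in this document.

```r
# Hypothetical checker (not part of tangram.pipe) for a custom summary's output.
check_summary_output <- function(out) {
  # Required: the output is a dataframe whose rightmost column is "Overall".
  stopifnot(is.data.frame(out))
  nms <- names(out)
  stopifnot(identical(tail(nms, 1), "Overall"))
  # Recommended naming only: warn instead of failing.
  if (!identical(nms[1:2], c("Variable", "Measure"))) {
    warning("first two columns are not named 'Variable' and 'Measure'")
  }
  invisible(out)
}

good <- data.frame(Variable = "Sepal Length (cm)", Measure = "mean",
                   Overall = 5.84)
checked <- check_summary_output(good)   # passes silently
```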
Below are example outputs from the preexisting num_default and
cat_default summary functions. It is recommended that you include
columns for the variable name, the measure type, the column categories
(if applicable), and the overall column, being sure to keep naming
conventions consistent. As of version 1.1.0, tangram.pipe default
summaries now also calculate the total number of instances, N.
iris %>%
select(Sepal.Length, Species) %>%
num_default(rowlabel = "Sepal Length (cm)", missing = TRUE, digits = 2) %>%
kable(escape = F, align = 'l') %>%
trimws() %>%
kable_styling(c("striped", "bordered"))
| Variable | Measure | setosa | versicolor | virginica | Overall |
|---|---|---|---|---|---|
| Sepal Length (cm) | min | 4.30 | 4.90 | 4.90 | 4.30 |
| | Q1 | 4.80 | 5.60 | 6.30 | 5.10 |
| | median | 5.00 | 5.90 | 6.50 | 5.80 |
| | Q3 | 5.20 | 6.30 | 6.95 | 6.40 |
| | max | 5.80 | 7.00 | 7.90 | 7.90 |
| | mean | 5.01 | 5.94 | 6.61 | 5.84 |
| | SD | 0.35 | 0.52 | 0.64 | 0.83 |
| | Missing | 0 | 0 | 1 | 1 |
iris %>%
mutate(Stem.Size = sample(c("Small", "Medium", "Medium", "Large"), size=150, replace=TRUE)) %>%
select(Stem.Size, Species) %>%
cat_default(rowlabel = "Stem Size", missing = TRUE, digits = 2) %>%
kable(escape = F, align = 'l') %>%
trimws() %>%
kable_styling(c("striped", "bordered"))
| Variable | Measure | setosa | versicolor | virginica | Overall |
|---|---|---|---|---|---|
| Stem Size | Col. Prop. (N) | | | | |
| | Large | 0.36 (18) | 0.26 (13) | 0.22 (11) | 0.28 (42) |
| | Medium | 0.44 (22) | 0.46 (23) | 0.59 (29) | 0.50 (74) |
| | Small | 0.20 (10) | 0.28 (14) | 0.18 (9) | 0.22 (33) |
Binary row summary functions differ slightly from numerical and
categorical rows because tangram.pipe’s prewritten summary functions
include three additional arguments. For binary rows, it is recommended
that you include the following arguments as well when writing your own
functions:
reference: This is the reference category to include in the table.
Since binary data only includes two possible categories, the row
function is written so that only one of them will be included in the
table. The category you want in the table is the value of reference.
ref.label: Depending on the label you choose for your binary variable,
it may not make sense to include the name of the reference group
alongside the variable label. This argument allows you to toggle the
reference group label. Of the three additional arguments, this is
arguably the lowest-priority one to include in your custom functions,
so it is only recommended to incorporate this if you are interested in
toggling the reference label on and off in your table.
compact: Often, binary data in tables is written so that the variable
name is eliminated and only the reference group appears in the table.
This compacts the row information into a single row. This TRUE/FALSE
value decides whether this is how you want the data displayed in the
table.
The above arguments should be handled in the body of the user-defined
function so that each is dealt with accordingly. As with numerical and
categorical data, you are not required to account for these arguments
in the body of your function, but if you exclude them, the reference,
ref.label, and compact arguments defined in binary_row will not be
applied to your table output object.
Since writing functions for binary data can be somewhat more complicated, remember that any data used in a binary row can instead be passed to a categorical row.
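To make the three binary arguments concrete, here is an illustrative sketch. It is not the package's implementation. The function name binary_sketch, the output layout, and the made-up data are hypothetical, and the sketch assumes that ref.label arrives as "on" or "off"; check the help page for binary_row for the values the package actually passes. For brevity it summarizes only the overall column and applies ref.label only in the compact layout.

```r
# Illustrative binary summary (hypothetical; not tangram.pipe's implementation).
# Assumption: ref.label is "on" or "off"; see binary_row's help for real values.
binary_sketch <- function(dt, ...) {
  args      <- list(...)
  rowlabel  <- args[["rowlabel"]]
  reference <- args[["reference"]]
  digits    <- args[["digits"]]
  show_ref  <- !identical(args[["ref.label"]], "off")
  compact   <- isTRUE(args[["compact"]])

  is_ref <- dt[[1]] == reference          # TRUE where the row equals the reference
  n      <- sum(is_ref, na.rm = TRUE)     # count in the reference category
  cell   <- sprintf(paste0("%.", digits, "f (%d)"), n / sum(!is.na(is_ref)), n)

  if (compact) {
    # Single row; the reference name is appended only when ref.label is on.
    label <- if (show_ref) paste0(rowlabel, ": ", reference) else rowlabel
    data.frame(Variable = label, Measure = "Col. Prop. (N)", Overall = cell,
               stringsAsFactors = FALSE)
  } else {
    # Header row for the variable, then one row for the reference category.
    data.frame(Variable = c(rowlabel, ""),
               Measure  = c("Col. Prop. (N)", as.character(reference)),
               Overall  = c("", cell),
               stringsAsFactors = FALSE)
  }
}

flags <- data.frame(smoker = c("yes", "no", "yes", "yes"))  # made-up data
one_row  <- binary_sketch(flags, rowlabel = "Smoker", reference = "yes",
                          ref.label = "on", compact = TRUE, digits = 2)
two_rows <- binary_sketch(flags, rowlabel = "Smoker", reference = "yes",
                          ref.label = "on", compact = FALSE, digits = 2)
```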
As of tangram.pipe version 1.1.1 (April 2022), categorical rows can
now be sorted based on a column category label. Neither the default
categorical summary functions nor any custom functions require a
sorting argument, but if you want to sort a categorical row in your
table, the following two arguments will allow you to do so.
ordering: The method for ordering the row variable. It is recommended
that this argument accept values that determine what type of sorting to
do. The default summary functions use c("ascending", "descending") as
acceptable values, but you may choose whatever types of sorting and
allowable names you wish.
sortcol: The category name to sort on. The default functions accept
specific names of column categories on which to do the sort.
As with binary summary functions, these extra arguments are not
necessary in order for the package to work; they only need to be
accounted for if you want to sort your row variable. cat_row assumes
NULL values for these arguments by default.
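A custom categorical summary could honor these two arguments roughly as follows. This is a sketch under stated assumptions: cat_sorted_sketch and the made-up data are hypothetical, the function reports raw counts rather than proportions, and it treats a NULL ordering or sortcol as "do not sort", mirroring the NULL defaults mentioned above.

```r
# Illustrative sorted categorical summary (hypothetical; counts only).
cat_sorted_sketch <- function(dt, ...) {
  args     <- list(...)
  ordering <- args[["ordering"]]
  sortcol  <- args[["sortcol"]]

  # Cross-tabulate row categories against column categories.
  counts <- as.data.frame.matrix(table(dt[[1]], dt[[2]]))
  counts$Overall <- rowSums(counts)              # keep "Overall" rightmost

  if (!is.null(ordering) && !is.null(sortcol)) {
    ord <- order(counts[[sortcol]],
                 decreasing = identical(ordering, "descending"))
    counts <- counts[ord, , drop = FALSE]
  }

  out <- data.frame(Variable = "", Measure = rownames(counts), counts,
                    check.names = FALSE, stringsAsFactors = FALSE)
  rownames(out) <- NULL
  out
}

sizes <- data.frame(                              # made-up data
  size  = c("Small", "Small", "Small", "Medium", "Large", "Large"),
  group = c("a", "a", "b", "a", "a", "b")
)
unsorted <- cat_sorted_sketch(sizes)
sorted   <- cat_sorted_sketch(sizes, ordering = "descending", sortcol = "a")
```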
A similar process can be used to write custom functions for
comparison tests in tangram.pipe. The user is encouraged to consult the
help documentation for the prewritten tests for their desired row type
to determine what arguments a custom function needs to include as
input.