Title: | To Inspect and Manipulate Data; and to Keep Track of This Process |
---|---|
Description: | Functions to work with data frames to prepare data for further analysis. The functions for imputation, encoding, partitioning, and other manipulation can produce log files to keep track of process. |
Authors: | Sherry Zhao |
Maintainer: | Sherry Zhao |
License: | MIT + file LICENSE |
Version: | 0.3.0 |
Built: | 2024-11-13 02:42:30 UTC |
Source: | https://github.com/sherrisherry/cleandata |
The return value of inspect_map can be used to create inputs for the following functions. Refer to the vignettes for examples.
encode_ordinal: Encode Ordinal Data Into Sequential Integers
encode_binary: Encode Binary Data Into 0 and 1
encode_onehot: Encode Categorical Data by One-hot Encoding
Encodes binary data into 0 and 1. Optionally records the result into a log file.
encode_binary(x, out.int=FALSE, full_print=TRUE, log = eval.parent(in_log_default))
x | The data frame |
out.int | Whether to convert encoded |
full_print | When set to |
log | Controls log files. To produce log files, assign it or the |
An encoded data frame.
x can only be a data frame. Don't pass a vector to it.
# refer to vignettes if you want to use log files
message('refer to vignettes if you want to use log files')

# building a data frame
A <- as.factor(c('x', 'y', 'x'))
B <- as.factor(c('y', 'x', 'y'))
C <- as.factor(c('i', 'j', 'i'))
df <- data.frame(A, B, C)

# encoding
df <- encode_binary(df)
print(df)
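The idea of binary encoding can be sketched in base R. This is an illustration only, not the cleandata implementation: `encode_binary_sketch` is a hypothetical helper, and the assumption that the first factor level maps to 0 is ours, not something the documentation states.

```r
# Hypothetical helper, base R only; not part of cleandata.
# Assumption: the first factor level becomes 0 and the second becomes 1.
encode_binary_sketch <- function(x) {
  stopifnot(is.data.frame(x))
  for (col in names(x)) {
    f <- as.factor(x[[col]])
    if (nlevels(f) > 2) stop("column '", col, "' has more than 2 levels")
    # as.integer() on a factor yields 1 and 2; subtract 1 to get 0 and 1
    x[[col]] <- as.integer(f) - 1L
  }
  x
}

df_bin <- data.frame(A = factor(c('x', 'y', 'x')),
                     B = factor(c('y', 'x', 'y')))
encoded_bin <- encode_binary_sketch(df_bin)
print(encoded_bin)
```

Passing a vector instead of a data frame stops at the first check, which mirrors the warning above.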
Encodes categorical data by One-hot encoding. Optionally records the result into a log file.
encode_onehot(x, colname.sep = '_', drop1st = FALSE, full_print=TRUE, log = eval.parent(in_log_default))
x | The data frame |
colname.sep | A character or string that acts as a divider in the column names of the encoding results. |
drop1st | Whether to drop the 1st level of every encoded column. The 1st level refers to the level that corresponds to 1 in a factor. |
full_print | When set to |
log | Controls log files. To produce log files, assign it or the |
An encoded data frame.
x can only be a data frame. Don't pass a vector to it.
# refer to vignettes if you want to use log files
message('refer to vignettes if you want to use log files')

# building a data frame
A <- as.factor(c('x', 'y', 'x'))
B <- as.factor(c('i', 'j', 'k'))
df <- data.frame(A, B)

# encoding
df0 <- encode_onehot(df)
df0 <- cbind(df, df0)
print(df0)

df0 <- encode_onehot(df, colname.sep = '-', drop1st = TRUE)
df0 <- cbind(df, df0)
rm(df)
print(df0)
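For readers who want to see what one-hot encoding with a name separator and a dropped first level amounts to, here is a minimal base-R sketch. `onehot_sketch` is a hypothetical helper; its column naming (column name, separator, level) is an assumption modelled on the `colname.sep` argument and may differ from what encode_onehot actually produces.

```r
# Hypothetical helper, base R only; not the cleandata implementation.
# Assumption: result columns are named <column><sep><level>.
onehot_sketch <- function(x, sep = '_', drop1st = FALSE) {
  stopifnot(is.data.frame(x))
  out <- list()
  for (col in names(x)) {
    f <- as.factor(x[[col]])
    lv <- levels(f)
    if (drop1st) lv <- lv[-1]  # drop the level stored as integer code 1
    for (l in lv) {
      # one indicator column per remaining level
      out[[paste(col, l, sep = sep)]] <- as.integer(f == l)
    }
  }
  # going through a matrix keeps names such as 'A-y' unchanged
  as.data.frame(do.call(cbind, out))
}

df_oh <- data.frame(A = factor(c('x', 'y', 'x')),
                    B = factor(c('i', 'j', 'k')))
enc_full <- onehot_sketch(df_oh)
enc_drop <- onehot_sketch(df_oh, sep = '-', drop1st = TRUE)
print(enc_full)
print(enc_drop)
```

Dropping the first level removes one redundant indicator per column, since that level is implied when all remaining indicators are 0.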
Encodes ordinal data into sequential integers by a given order. Optionally records the result into a log file.
encode_ordinal(x, order, none='', out.int=FALSE, full_print=TRUE, log = eval.parent(in_log_default))
x | The data frame |
order | A vector of the ordered labels from low to high. |
none | The 'none'-but-not-'NA' level, which is always encoded to 0. |
out.int | Whether to convert encoded |
full_print | When set to |
log | Controls log files. To produce log files, assign it or the |
An encoded data frame.
x can only be a data frame. Don't pass a vector to it.
# refer to vignettes if you want to use log files
message('refer to vignettes if you want to use log files')

# building a data frame
A <- as.factor(c('y', 'z', 'x', 'y', 'z'))
B <- as.factor(c('y', 'x', 'z', 'z', 'x'))
C <- as.factor(c('k', 'i', 'i', 'j', 'k'))
df <- data.frame(A, B, C)

# encoding
df[, 1:2] <- encode_ordinal(df[, 1:2], order = c('z', 'x', 'y'))
df[, 3] <- encode_ordinal(df[, 3, drop = FALSE], order = c('k', 'j', 'i'))
print(df)
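The mapping from ordered labels to integers can be sketched in base R as follows. `encode_ordinal_sketch` is a hypothetical helper, and it assumes the lowest label in `order` maps to 1; the documentation only states that the `none` level maps to 0.

```r
# Hypothetical helper, base R only; not the cleandata implementation.
# Assumption: labels map to their position in `order`, starting at 1.
encode_ordinal_sketch <- function(x, order, none = '') {
  stopifnot(is.data.frame(x))
  for (col in names(x)) {
    v <- as.character(x[[col]])
    code <- match(v, order)              # NA for labels missing from `order`
    code[!is.na(v) & v == none] <- 0L    # the 'none' level is encoded as 0
    x[[col]] <- code
  }
  x
}

df_ord <- data.frame(A = factor(c('y', 'z', 'x', 'y', 'z')),
                     B = factor(c('y', 'x', 'z', 'z', 'x')))
enc_ord <- encode_ordinal_sketch(df_ord, order = c('z', 'x', 'y'))
print(enc_ord)
```

Columns that need a different order, such as C in the example above, take a separate call with their own `order` vector.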
impute_mode: Impute NAs by the modes of their corresponding columns.
impute_median: Impute NAs by the medians of their corresponding columns.
impute_mean: Impute NAs by the means of their corresponding columns.
impute_mode(x, cols = colnames(x), idx = row.names(x), log = eval.parent(in_log_default))
impute_median(x, cols = colnames(x), idx = row.names(x), log = eval.parent(in_log_default))
impute_mean(x, cols = colnames(x), idx = row.names(x), log = eval.parent(in_log_default))
x | The data frame to be imputed. |
cols | The index of columns of |
idx | The index of rows of |
log | Controls log files. To produce log files, assign it or the |
An imputed data frame.
# refer to vignettes if you want to use log files
message('refer to vignettes if you want to use log files')

# building a data frame
A <- as.factor(c('y', 'x', 'x', 'y', 'z'))
B <- c(6, 3:6)
C <- 1:5
df <- data.frame(A, B, C)
df[3, 1] <- NA; df[2, 2] <- NA; df[5, 3] <- NA
print(df)

# imputation
df0 <- impute_mode(df, cols = 1:3)
print(df0)

df0 <- impute_mode(df, cols = 1:3, idx = 1:3)
print(df0)

df0 <- impute_median(df, cols = 2:3)
print(df0)

df0 <- impute_mean(df, cols = 2:3)
print(df0)
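Column-wise imputation by mode, median, or mean can be sketched in base R as below. `mode_value` and `impute_sketch` are hypothetical helpers, the row restriction of `idx` is left out, and ties for the mode are resolved by taking the first value encountered, which is an assumption rather than documented behaviour.

```r
# Hypothetical helpers, base R only; not the cleandata implementation.

# Most frequent non-NA value; ties go to the value seen first (assumption).
mode_value <- function(v) {
  v <- v[!is.na(v)]
  u <- unique(v)
  u[which.max(tabulate(match(v, u)))]
}

# Replace the NAs of each selected column with fun(column).
impute_sketch <- function(x, cols = colnames(x), fun) {
  stopifnot(is.data.frame(x))
  for (col in cols) {
    miss <- is.na(x[[col]])
    if (any(miss)) x[[col]][miss] <- fun(x[[col]])
  }
  x
}

df_na <- data.frame(A = factor(c('y', 'x', NA, 'y', 'z')),
                    B = c(6, NA, 4, 5, 6),
                    C = c(1, 2, 3, 4, NA))

imp_mode   <- impute_sketch(df_na, fun = mode_value)
imp_median <- impute_sketch(df_na, cols = c('B', 'C'),
                            fun = function(v) median(v, na.rm = TRUE))
imp_mean   <- impute_sketch(df_na, cols = c('B', 'C'),
                            fun = function(v) mean(v, na.rm = TRUE))
print(imp_mode)
print(imp_median)
print(imp_mean)
```

Mode imputation is the only one of the three that applies to factor columns, which is why the median and mean calls are restricted to the numeric columns.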
Provide a map for imputation and encoding.
inspect_map(x, common = 0, message = TRUE)
x | The data frame |
common | A non-negative number. If 2 factor columns share more than 'common' levels, they share the same scheme. 0 means that all the levels must be the same for 2 factor columns to share the same scheme. |
message | Whether to print the process. |
A list of factor_cols (list), factor_levels (list), num_cols (vector), char_cols (vector), ordered_cols (vector), and other_cols (vector).
factor_cols | A list, in which each member is a vector of the names of the factor columns that share the same scheme. The name of a vector is the same as its 1st member. Refer to the argument |
factor_levels | A list, in which each member is a scheme of the factor columns. The name of a scheme is the same as its corresponding vector in |
num_cols | A vector of the names of the numerical columns. |
char_cols | A vector of the names of the string columns. |
ordered_cols | A vector of the names of the ordered factor columns. |
other_cols | A vector of the names of the other columns. |
# building a data frame
A <- as.factor(c('x', 'y', 'z'))
B <- as.ordered(c('z', 'x', 'y'))
C <- as.factor(c('y', 'z', 'x'))
D <- as.factor(c('i', 'j', 'k'))
E <- 5:7
df <- data.frame(A, B, C, D, E)

# inspection
dmap <- inspect_map(df)
summary(dmap)
print(dmap)
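To illustrate what it means for factor columns to share a scheme, the base-R sketch below groups unordered factor columns whose level sets are identical, which corresponds to the `common = 0` case described above. `share_scheme_sketch` is a hypothetical helper; it does not reproduce the partial-overlap behaviour of other `common` values, and it is not how inspect_map is implemented.

```r
# Hypothetical helper, base R only; covers only the identical-levels case.
share_scheme_sketch <- function(x) {
  stopifnot(is.data.frame(x))
  plain <- vapply(x, function(v) is.factor(v) && !is.ordered(v), logical(1))
  fcols <- names(x)[plain]
  # key each column by its sorted level set so equal sets get equal keys
  keys <- unname(vapply(x[fcols],
                        function(v) paste(sort(levels(v)), collapse = '\r'),
                        character(1)))
  groups <- split(fcols, factor(keys, levels = unique(keys)))
  # name each group after its 1st member, as described for factor_cols
  names(groups) <- unname(vapply(groups, function(g) g[1], character(1)))
  groups
}

df_map <- data.frame(A = factor(c('x', 'y', 'z')),
                     B = as.ordered(c('z', 'x', 'y')),
                     C = factor(c('y', 'z', 'x')),
                     D = factor(c('i', 'j', 'k')),
                     E = 5:7)
schemes <- share_scheme_sketch(df_map)
print(schemes)
```

Here A and C land in one group because both have the levels x, y, z, while D forms a group of its own; B and E are skipped because they are not unordered factors.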
Returns the names and NA counts of the columns that have the most NAs, limited to the top # columns (refer to argument top).
inspect_na(x, top=ncol(x))
x | The data frame |
top | The value of #. |
A named vector.
# building a data frame
A <- as.factor(c('y', 'x', 'x', 'y', 'z'))
B <- c(6, 3:6)
C <- 1:5
df <- data.frame(A, B, C)
df[3, 1] <- NA; df[2, 2] <- NA; df[4, 2] <- NA; df[5, 3] <- NA
print(df)

# inspection
a <- inspect_na(df)
print(a)
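Counting NAs per column and keeping the top few takes only two base-R calls. The sketch below is a hypothetical stand-in, `inspect_na_sketch`, and it keeps columns with zero NAs in the result; whether inspect_na does the same is not stated here.

```r
# Hypothetical helper, base R only; not the cleandata implementation.
inspect_na_sketch <- function(x, top = ncol(x)) {
  stopifnot(is.data.frame(x))
  counts <- colSums(is.na(x))              # named vector: NAs per column
  head(sort(counts, decreasing = TRUE), top)
}

df_cnt <- data.frame(A = factor(c('y', 'x', NA, 'y', 'z')),
                     B = c(6, NA, 4, NA, 6),
                     C = c(1, 2, 3, 4, NA))
na_all  <- inspect_na_sketch(df_cnt)
na_top1 <- inspect_na_sketch(df_cnt, top = 1)
print(na_all)
print(na_top1)
```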
A simplified, and thus faster, version of inspect_map.
inspect_smap(x, message = TRUE)
x | The data frame |
message | Whether to print the process. |
A list of factor_cols (vector), num_cols (vector), char_cols (vector), ordered_cols (vector), and other_cols (vector).
factor_cols | A vector of the names of the factor columns. |
num_cols | A vector of the names of the numerical columns. |
char_cols | A vector of the names of the string columns. |
ordered_cols | A vector of the names of the ordered factor columns. |
other_cols | A vector of the names of the other columns. |
# building a data frame
A <- as.factor(c('x', 'y', 'z'))
B <- as.ordered(c('z', 'x', 'y'))
C <- as.factor(c('y', 'z', 'x'))
D <- as.factor(c('i', 'j', 'k'))
E <- 5:7
df <- data.frame(A, B, C, D, E)

# inspection
dmap <- inspect_smap(df)
summary(dmap)
print(dmap)
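The classification of columns by type can be sketched in base R as follows. `classify_cols_sketch` is a hypothetical helper that returns the five slots listed above; the order of the type checks (ordered factors before plain factors) is our assumption about how the categories are kept disjoint.

```r
# Hypothetical helper, base R only; not the cleandata implementation.
classify_cols_sketch <- function(x) {
  stopifnot(is.data.frame(x))
  kind <- vapply(x, function(v) {
    # ordered factors are also factors, so test for them first
    if (is.ordered(v)) 'ordered_cols'
    else if (is.factor(v)) 'factor_cols'
    else if (is.numeric(v)) 'num_cols'
    else if (is.character(v)) 'char_cols'
    else 'other_cols'
  }, character(1))
  slots <- c('factor_cols', 'num_cols', 'char_cols', 'ordered_cols', 'other_cols')
  out <- lapply(slots, function(s) names(x)[kind == s])
  names(out) <- slots
  out
}

df_cls <- data.frame(A = factor(c('x', 'y', 'z')),
                     B = as.ordered(c('z', 'x', 'y')),
                     C = factor(c('y', 'z', 'x')),
                     D = factor(c('i', 'j', 'k')),
                     E = 5:7)
smap <- classify_cols_sketch(df_cls)
print(smap)
```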
Designed to create a validation column. Optionally records the result into a log file.
partition_random(x, name = 'Partition', train, val = 10^ceiling(log10(train))-train, test = TRUE, seed = FALSE, log = eval.parent(in_log_default))
x | The data frame |
name | The name of the validation column. |
train | The proportion of the training set. |
val | The proportion of the validation set. If not given, a default value is calculated by assuming the sum of |
test | Whether to have test set. If |
seed | Whether to set a random seed. If you want a reproducible result, pass a number to |
log | Controls log files. To produce log files, assign it or the |
A partitioned column.
x can only be a data frame. Don't pass a vector to it.
# refer to vignettes if you want to use log files
message('refer to vignettes if you want to use log files')

# building a data frame
A <- 2:16
B <- letters[12:26]
df <- data.frame(A, B)

# partitioning
df0 <- partition_random(df, train = 7)
df0 <- cbind(df, df0)
print(df0)

df0 <- partition_random(df, train = 7, val = 2)
df0 <- cbind(df, df0)
print(df0)
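The default for `val` in the usage line, `10^ceiling(log10(train)) - train`, completes `train` to the next power of 10. The base-R sketch below evaluates that expression and then shows one simple way to draw a random train/validation split. `default_val` and `partition_sketch` are hypothetical names; the sketch ignores the test set and the log, and its rounding rule is an assumption.

```r
# The default expression from the usage line, wrapped for illustration.
default_val <- function(train) 10^ceiling(log10(train)) - train

val_for_7  <- default_val(7)    # 10 - 7
val_for_70 <- default_val(70)   # 100 - 70
print(c(val_for_7, val_for_70))

# Hypothetical two-way split; not the cleandata implementation.
partition_sketch <- function(n, train, val = default_val(train), seed = NULL) {
  if (!is.null(seed)) set.seed(seed)   # pass a number for reproducibility
  n_train <- round(n * train / (train + val))
  labels <- rep(c('train', 'val'), times = c(n_train, n - n_train))
  sample(labels)                        # random permutation of the labels
}

part <- partition_sketch(20, train = 7, seed = 1)
print(table(part))
```

With `train = 7` the default gives `val = 3`, so the two shares are read as 7 and 3 parts out of 10.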
Stacks part of a data frame and repeats the other columns to fit the result of stacking. Optionally records the result into a log file.
wh_dict(x, attr, value)
x | The data frame |
attr | The index of the column in |
value | The index of the column in |
A 2-column data frame, in which the Keys column stores the explanation of the values in x[, attr].
x can only be a data frame. Don't pass a vector to it.
# refer to vignettes if you want to use log files
message('refer to vignettes if you want to use log files')

# building a data frame
A <- c('i', 'j', 'i', 'k', 'j')
B <- as.factor(c('x', 'y', 'x', 'z', 'y'))
C <- 1:5
df <- data.frame(A, B, C)
print(df)

# encoding
dict <- wh_dict(df, attr = 'B', value = 'A')
print(dict)
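One plausible reading of the example above is that the dictionary holds the distinct pairs of the `attr` and `value` columns. The base-R sketch below implements only that reading. `dict_sketch` is a hypothetical helper, and it keeps the original column names rather than guessing at the names wh_dict assigns.

```r
# Hypothetical helper, base R only; one interpretation, not the
# cleandata implementation.
dict_sketch <- function(x, attr, value) {
  stopifnot(is.data.frame(x))
  pairs <- unique(x[, c(attr, value)])  # distinct (attr, value) combinations
  rownames(pairs) <- NULL
  pairs
}

df_wh <- data.frame(A = c('i', 'j', 'i', 'k', 'j'),
                    B = factor(c('x', 'y', 'x', 'z', 'y')),
                    C = 1:5)
dict_wh <- dict_sketch(df_wh, attr = 'B', value = 'A')
print(dict_wh)
```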