Title: | Learning to Rank Bagging Workflows with Metalearning |
---|---|
Description: | A framework for automated machine learning. Concretely, the focus is on the optimisation of bagging workflows. A bagging workflow is composed of three phases: (i) generation: which and how many predictive models to learn; (ii) pruning: after learning a set of models, the worst ones are cut off from the ensemble; and (iii) integration: how the models are combined for predicting a new observation. autoBagging optimises these processes by combining metalearning and a learning to rank approach to learn from metadata. It automatically ranks 63 bagging workflows by exploiting past performance and dataset characterization. A complete description of the method can be found in: Pinto, F., Cerqueira, V., Soares, C., Mendes-Moreira, J. (2017): "autoBagging: Learning to Rank Bagging Workflows with Metalearning" arXiv preprint arXiv:1706.09367. |
Authors: | Fabio Pinto [aut], Vitor Cerqueira [cre], Carlos Soares [ctb], Joao Mendes-Moreira [ctb] |
Maintainer: | Vitor Cerqueira <[email protected]> |
License: | GPL (>= 2) |
Version: | 0.1.0 |
Built: | 2025-03-10 02:33:34 UTC |
Source: | https://github.com/cran/autoBagging |
abmodel
abmodel(base_models, form, data, dynamic_selection)
base_models | a list of decision tree classifiers
form | formula
data | dataset used to train base_models
dynamic_selection | the dynamic selection/combination method used to aggregate predictions. If none, majority vote is used.
abmodel is an S4 class that contains the ensemble model. Besides the base learning algorithms (base_models), the abmodel class contains information about the dynamic selection method to apply to new data.
base_models | a list of decision tree classifiers
form | formula
data | dataset used to train base_models
dynamic_selection | the dynamic selection/combination method used to aggregate predictions. If none, majority vote is used.
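For illustration, a hedged sketch of building an abmodel by hand; the use of rpart trees and the "ola" selection string are assumptions for the example, not package requirements:

library(rpart)
# grow a few trees on bootstrap samples of iris as illustrative base models
base <- lapply(1:3, function(i) {
  boot <- iris[sample(nrow(iris), replace = TRUE), ]
  rpart(Species ~ ., boot)
})
# wrap them in the S4 ensemble class ("ola" also appears in the bagging examples)
model <- abmodel(base, Species ~ ., iris, "ola")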
autoBagging for the method that automatically predicts the best workflows.
Learning to Rank Bagging Workflows with Metalearning
Machine Learning (ML) has been successfully applied to a wide range of domains and applications. One of the techniques behind most of these successful applications is Ensemble Learning (EL), the field of ML that gave birth to methods such as Random Forests or Boosting. The complexity of applying these techniques, together with the market scarcity of ML experts, has created the need for systems that enable a fast and easy drop-in replacement for ML libraries. Automated machine learning (autoML) is the field of ML that attempts to answer these needs. Typically, these systems rely on optimization techniques such as Bayesian optimization to guide the search for the best model. Our approach differs from these systems by making use of the most recent advances in metalearning and a learning to rank approach to learn from metadata. We propose autoBagging, an autoML system that automatically ranks 63 bagging workflows by exploiting past performance and dataset characterization. Results on 140 classification datasets from the OpenML platform show that autoBagging can yield better performance than the Average Rank method and achieve results that are not statistically different from an ideal model that systematically selects the best workflow for each dataset.
autoBagging(form, data)
form | formula. Currently supporting only categorical target variables (classification tasks)
data | training dataset with a categorical target variable
The underlying model leverages the performance of the workflows on historical data. It ranks and recommends workflows for a given classification task. A bagging workflow comprises the following steps:
the number of trees to grow
the pruning of low-performing trees in the ensemble
the pruning cut point, a parameter of the previous step
the dynamic selection method used to aggregate predictions. If none is recommended, majority voting is used.
an abmodel class object
Pinto, F., Cerqueira, V., Soares, C., Mendes-Moreira, J.: "autoBagging: Learning to Rank Bagging Workflows with Metalearning" arXiv preprint arXiv:1706.09367 (2017).
bagging for the bagging pipeline with a specific workflow; baggedtrees for the bagging implementation; abmodel-class for the returned class object.
## Not run:
# splitting an example dataset into train/test:
train <- iris[1:(.7*nrow(iris)), ]
test <- iris[-c(1:(.7*nrow(iris))), ]
# then apply autoBagging to the train set, using the desired formula:
# autoBagging will compute metafeatures on the dataset
# and apply a pre-trained ranking model to recommend a workflow.
model <- autoBagging(Species ~ ., train)
# predictions are produced with the standard predict method
preds <- predict(model, test)
## End(Not run)
The standard resampling with replacement (bootstrap) is used as the sampling strategy.
baggedtrees(form, data, ntree = 100)
form | formula
data | training data
ntree | number of trees
ensemble <- baggedtrees(Species ~., iris, ntree = 50)
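For reference, a minimal sketch of the bootstrap sampling step this implies (illustrative only, assuming rpart as the base learner; not the package's internal code):

library(rpart)
n <- nrow(iris)
# each tree is fit on n rows drawn with replacement from the training data
trees <- lapply(1:50, function(i) {
  boot <- iris[sample(n, n, replace = TRUE), ]
  rpart(Species ~ ., boot)
})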
bagging method
bagging(form, data, ntrees, pruning, dselection, pruning_cp)
form | formula
data | training data
ntrees | number of trees
pruning | model pruning method. A character vector. Currently, the following methods are supported: "bb" (boosting-based pruning), "mdsq" (margin distance minimization) and "none"
dselection | dynamic selection of the available models. Currently, the following methods are supported: "ola" (overall local accuracy), "knora-e" (KNORA-Eliminate) and "none"
pruning_cp | the pruning cut point for the pruning method
baggedtrees
for the implementation of the bagging model.
# splitting an example dataset into train/test:
train <- iris[1:(.7*nrow(iris)), ]
test <- iris[-c(1:(.7*nrow(iris))), ]
form <- Species ~ .
# a user-defined bagging workflow
m <- bagging(form, train, ntrees = 5, pruning = "bb", pruning_cp = .5, dselection = "ola")
preds <- predict(m, test)
# a standard bagging workflow with 5 trees (5 trees for exemplification purposes):
m2 <- bagging(form, train, ntrees = 5, pruning = "none", dselection = "none")
preds2 <- predict(m2, test)
Boosting-based pruning of models
bb(form, preds, data, cutPoint)
form | formula
preds | predictions on training data
data | training data
cutPoint | ratio of the total number of models to cut off
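A minimal sketch of boosting-based ordering for pruning (illustrative; the package's internal procedure may differ in detail): instances are reweighted as in AdaBoost and, at each step, the model with the lowest weighted error on the current weights is appended to the ordering; the top fraction is kept.

# preds: n x m matrix of class predictions (one column per model)
# y: true labels; cutPoint: fraction of models to cut off
bb_sketch <- function(preds, y, cutPoint) {
  n <- length(y); m <- ncol(preds)
  w <- rep(1 / n, n)                       # instance weights
  ordering <- integer(0); remaining <- seq_len(m)
  for (s in seq_len(m)) {
    errs <- sapply(remaining, function(j) sum(w * (preds[, j] != y)))
    best <- remaining[which.min(errs)]
    ordering <- c(ordering, best); remaining <- setdiff(remaining, best)
    wrong <- preds[, best] != y            # upweight misclassified instances
    w[wrong] <- w[wrong] * 2; w <- w / sum(w)
  }
  ordering[seq_len(ceiling(m * (1 - cutPoint)))]  # keep the best-ranked models
}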
classmajority.landmarker
classmajority.landmarker(dataset, data.char)
dataset | train data for the landmarker
data.char | data characteristics
classmajority.landmarker.correlation
classmajority.landmarker.correlation(dataset, data.char)
dataset | train data for the landmarker
data.char | data characteristics
classmajority.landmarker.entropy
classmajority.landmarker.entropy(dataset, data.char)
dataset | train data for the landmarker
data.char | data characteristics
classmajority.landmarker.interinfo
classmajority.landmarker.interinfo(dataset, data.char)
dataset | train data for the landmarker
data.char | data characteristics
classmajority.landmarker.mutual.information
classmajority.landmarker.mutual.information(dataset, data.char)
dataset | train data for the landmarker
data.char | data characteristics
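The landmarker functions in this reference (classmajority, dstump, nb, lda) compute landmarking metafeatures: a fast, simple learner is fit to the data and its performance is recorded as a dataset characteristic. A minimal sketch of the idea with a depth-one rpart stump (illustrative only; the target name "Species" is an assumption of the example):

library(rpart)
stump_landmarker <- function(dataset, target = "Species") {
  form <- as.formula(paste(target, "~ ."))
  stump <- rpart(form, dataset, control = rpart.control(maxdepth = 1))
  preds <- predict(stump, dataset, type = "class")
  mean(preds == dataset[[target]])  # resubstitution accuracy as the metafeature
}
stump_landmarker(iris)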
Retrieve names of continuous attributes (not including the target)
ContAttrs(dataset)
dataset | structure describing the data set, according to the conventions of read_data.R
list of strings
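On a plain data frame, the underlying idea can be sketched as follows (illustrative; the package expects its own dataset structure):

# numeric attribute names, excluding the target "Species"
setdiff(names(iris)[sapply(iris, is.numeric)], "Species")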
dstump.landmarker_d1
dstump.landmarker_d1(dataset, data.char)
dataset | train data for the landmarker
data.char | data characteristics
dstump.landmarker_d1.correlation
dstump.landmarker_d1.correlation(dataset, data.char)
dataset | train data for the landmarker
data.char | data characteristics
dstump.landmarker_d1.entropy
dstump.landmarker_d1.entropy(dataset, data.char)
dataset | train data for the landmarker
data.char | data characteristics
dstump.landmarker_d1.interinfo
dstump.landmarker_d1.interinfo(dataset, data.char)
dataset | train data for the landmarker
data.char | data characteristics
dstump.landmarker_d1.mutual.information
dstump.landmarker_d1.mutual.information(dataset, data.char)
dataset | train data for the landmarker
data.char | data characteristics
dstump.landmarker_d2
dstump.landmarker_d2(dataset, data.char)
dataset | train data for the landmarker
data.char | data characteristics
dstump.landmarker_d2.correlation
dstump.landmarker_d2.correlation(dataset, data.char)
dataset | train data for the landmarker
data.char | data characteristics
dstump.landmarker_d2.entropy
dstump.landmarker_d2.entropy(dataset, data.char)
dataset | train data for the landmarker
data.char | data characteristics
dstump.landmarker_d2.interinfo
dstump.landmarker_d2.interinfo(dataset, data.char)
dataset | train data for the landmarker
data.char | data characteristics
dstump.landmarker_d2.mutual.information
dstump.landmarker_d2.mutual.information(dataset, data.char)
dataset | train data for the landmarker
data.char | data characteristics
dstump.landmarker_d3
dstump.landmarker_d3(dataset, data.char)
dataset | train data for the landmarker
data.char | data characteristics
dstump.landmarker_d3.correlation
dstump.landmarker_d3.correlation(dataset, data.char)
dataset | train data for the landmarker
data.char | data characteristics
dstump.landmarker_d3.entropy
dstump.landmarker_d3.entropy(dataset, data.char)
dataset | train data for the landmarker
data.char | data characteristics
dstump.landmarker_d3.interinfo
dstump.landmarker_d3.interinfo(dataset, data.char)
dataset | train data for the landmarker
data.char | data characteristics
dstump.landmarker_d3.mutual.information
dstump.landmarker_d3.mutual.information(dataset, data.char)
dataset | train data for the landmarker
data.char | data characteristics
Get the target variable from a formula
get_target(form)
form | formula
Retrieve the value of a previously computed measure
GetMeasure(inDCName, inDCSet, component.name = "value")
inDCName | name of the data characteristic
inDCSet | set of data characteristics already computed
component.name | name of the component (e.g. time or value) to retrieve; if NULL, retrieve all
Returns a simple or structured value. If the measure is not available, execution stops with an error.
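A hedged sketch of the lookup pattern this suggests, assuming inDCSet behaves like a named list whose elements carry value and time components (the actual internal structure may differ):

dc_set <- list(nb.landmarker = list(value = 0.93, time = 0.02))
dc_set[["nb.landmarker"]][["value"]]  # retrieve one component of a measure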
KNORA-Eliminate, a dynamic selection method
KNORA.E(form, mod, v.data, t.data, k = 5)
form | formula
mod | a list comprising the individual models
v.data | validation data
t.data | test data, with the instances to predict
k | the number of nearest neighbors. Defaults to 5.
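A minimal sketch of the KNORA-Eliminate idea (illustrative, not the package's implementation): for each test instance, find its k nearest neighbors in the validation data and keep only the models that classify all of those neighbors correctly; if none survive, the neighborhood is shrunk until some do.

# preds_val: n_val x m matrix of model predictions on validation data
# y_val: validation labels; X_val: numeric features; x_test: one test instance
knora_e_sketch <- function(preds_val, y_val, X_val, x_test, k = 5) {
  d <- sqrt(colSums((t(X_val) - x_test)^2))  # Euclidean distances
  repeat {
    nn <- order(d)[1:k]
    # models correct on every neighbor form the "oracle" set
    ok <- apply(preds_val[nn, , drop = FALSE] == y_val[nn], 2, all)
    if (any(ok) || k == 1) return(which(ok))
    k <- k - 1                               # shrink the neighborhood
  }
}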
lda.landmarker.correlation
## S3 method for class 'landmarker.correlation' lda(dataset, data.char)
dataset | train data for the landmarker
data.char | data characteristics
Majority voting
majority_voting(x)
x | predictions produced by a set of models
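A minimal sketch of majority voting over one instance's predictions (illustrative; ties here resolve to the first mode):

majority_vote <- function(x) names(which.max(table(x)))
majority_vote(c("setosa", "virginica", "setosa"))  # "setosa"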
Margin Distance Minimization
mdsq(form, preds, data, cutPoint)
form | formula
preds | predictions on training data
data | training data
cutPoint | ratio of the total number of models to cut off
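A minimal sketch of margin distance minimization pruning (illustrative, not the package's code): each model gets a signature vector with +1 for a correct prediction and -1 otherwise, and models are greedily added so that the average signature of the growing sub-ensemble moves toward a reference point with small positive coordinates.

mdsq_sketch <- function(preds, y, cutPoint, p = 0.075) {
  C <- ifelse(preds == y, 1, -1)            # n x m signature matrix
  m <- ncol(C); n_keep <- ceiling(m * (1 - cutPoint))
  o <- rep(p, nrow(C))                      # reference point
  kept <- integer(0); remaining <- seq_len(m)
  acc <- rep(0, nrow(C))                    # running sum of kept signatures
  for (s in seq_len(n_keep)) {
    d <- sapply(remaining, function(j) sum(((acc + C[, j]) / s - o)^2))
    best <- remaining[which.min(d)]
    kept <- c(kept, best); remaining <- setdiff(remaining, best)
    acc <- acc + C[, best]
  }
  kept
}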
nb.landmarker
nb.landmarker(dataset, data.char)
dataset | train data for the landmarker
data.char | data characteristics
nb.landmarker.correlation
nb.landmarker.correlation(dataset, data.char)
dataset | train data for the landmarker
data.char | data characteristics
nb.landmarker.entropy
nb.landmarker.entropy(dataset, data.char)
dataset | train data for the landmarker
data.char | data characteristics
nb.landmarker.interinfo
nb.landmarker.interinfo(dataset, data.char)
dataset | train data for the landmarker
data.char | data characteristics
nb.landmarker.mutual.information
nb.landmarker.mutual.information(dataset, data.char)
dataset | train data for the landmarker
data.char | data characteristics
Overall Local Accuracy (OLA), a dynamic selection method
OLA(form, mod, v.data, t.data, k = 5)
form | formula
mod | a list comprising the individual models
v.data | validation data
t.data | test data, with the instances to predict
k | the number of nearest neighbors. Defaults to 5.
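A minimal sketch of the OLA idea (illustrative): for each test instance, the single model with the highest accuracy on the instance's k nearest validation neighbors is selected to predict.

ola_sketch <- function(preds_val, y_val, X_val, x_test, k = 5) {
  d <- sqrt(colSums((t(X_val) - x_test)^2))  # Euclidean distances
  nn <- order(d)[1:k]
  # per-model accuracy in the local region
  local_acc <- colMeans(preds_val[nn, , drop = FALSE] == y_val[nn])
  which.max(local_acc)   # index of the selected model
}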
This is a predict method for predicting new data points using an abmodel class object, referring to an ensemble of bagged trees.
## S4 method for signature 'abmodel' predict(object, newdata)
object | an abmodel-class object
newdata | new data to predict using an abmodel object
Returns the predictions produced by an abmodel model.
abmodel-class for details about the bagging model.
Function to transform a data frame into a list complying with GSI requirements
ReadDF(dat)
dat | data frame
Returns a list containing components that describe the names (see ReadAttrsInfo) and the data (see ReadData) files.
This function builds on ReadAttrsInfo and ReadData.
Retrieve names of symbolic attributes (not including the target)
SymbAttrs(dataset)
dataset | structure describing the data set, according to the conventions of read_data.R
list of strings
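Analogously to the continuous case sketched earlier, on a plain data frame:

# factor attribute names, excluding the target "Species"
setdiff(names(iris)[sapply(iris, is.factor)], "Species")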
Metadata needed to run the autoBagging method.
sysdata
A list comprising the following information:
the average rank data regarding each bagging workflow
metadata on the bagging workflows
range data on each metafeature
names and values of each metafeature used to describe the datasets
the xgboost ranking metamodel
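A hedged way to inspect this object, assuming it is lazy-loaded with the package (component names are not guaranteed):

library(autoBagging)
str(sysdata, max.level = 1)  # top-level components of the metadata list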