Package 'autoBagging'

Title: Learning to Rank Bagging Workflows with Metalearning
Description: A framework for automated machine learning. Concretely, the focus is on the optimisation of bagging workflows. A bagging workflow is composed of three phases: (i) generation: deciding which and how many predictive models to learn; (ii) pruning: after learning a set of models, the worst ones are cut off from the ensemble; and (iii) integration: how the models are combined to predict a new observation. autoBagging optimises these processes by combining metalearning with a learning-to-rank approach to learn from metadata. It automatically ranks 63 bagging workflows by exploiting past performance and dataset characterization. A complete description of the method can be found in: Pinto, F., Cerqueira, V., Soares, C., Mendes-Moreira, J. (2017): "autoBagging: Learning to Rank Bagging Workflows with Metalearning" arXiv preprint arXiv:1706.09367.
Authors: Fabio Pinto [aut], Vitor Cerqueira [cre], Carlos Soares [ctb], Joao Mendes-Moreira [ctb]
Maintainer: Vitor Cerqueira <[email protected]>
License: GPL (>= 2)
Version: 0.1.0
Built: 2025-03-10 02:33:34 UTC
Source: https://github.com/cran/autoBagging

Help Index


abmodel

Description

Constructs an abmodel object: an ensemble of decision tree classifiers together with the dynamic selection method used to aggregate their predictions.

Usage

abmodel(base_models, form, data, dynamic_selection)

Arguments

base_models

a list of decision tree classifiers

form

formula

data

dataset used to train base_models

dynamic_selection

the dynamic selection/combination method to use to aggregate predictions. If none, majority vote is used.
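
Examples

A minimal construction sketch; the rpart base learners and the method name "none" (majority voting, mirroring bagging's dselection argument) are assumptions, not package requirements:

library(rpart)
# grow base models on bootstrap samples of the training data
base_models <- lapply(1:10, function(i) {
  boot <- iris[sample(nrow(iris), replace = TRUE), ]
  rpart(Species ~ ., boot)
})
model <- abmodel(base_models, Species ~ ., iris, "none")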


abmodel-class

Description

abmodel is an S4 class that contains the ensemble model. Besides the base learning models (base_models), the abmodel class contains information about the dynamic selection method to apply to new data.

Slots

base_models

a list of decision tree classifiers

form

formula

data

dataset used to train base_models

dynamic_selection

the dynamic selection/combination method to use to aggregate predictions. If none, majority vote is used.

See Also

autoBagging function for the automatic ranking and recommendation of the best workflows.


autoBagging

Description

Learning to Rank Bagging Workflows with Metalearning

Machine Learning (ML) has been successfully applied to a wide range of domains and applications. One of the techniques behind many of these successful applications is Ensemble Learning (EL), the field of ML that gave birth to methods such as Random Forests and Boosting. The complexity of applying these techniques, together with the scarcity of ML experts, has created the need for systems that enable a fast and easy drop-in replacement for ML libraries. Automated machine learning (autoML) is the field of ML that attempts to answer these needs. Typically, such systems rely on optimization techniques such as Bayesian optimization to guide the search for the best model. Our approach differs from these systems by making use of the most recent advances in metalearning and a learning-to-rank approach to learn from metadata. We propose autoBagging, an autoML system that automatically ranks 63 bagging workflows by exploiting past performance and dataset characterization. Results on 140 classification datasets from the OpenML platform show that autoBagging can yield better performance than the Average Rank method and achieve results that are not statistically different from an ideal model that systematically selects the best workflow for each dataset.

Usage

autoBagging(form, data)

Arguments

form

formula. Currently, only categorical target variables (classification tasks) are supported

data

training dataset with a categorical target variable

Details

The underlying model leverages the performance of the workflows on historical data. It ranks and recommends workflows for a given classification task. A bagging workflow comprises the following steps:

generation

the number of trees to grow

pruning

the pruning of low performing trees in the ensemble

pruning cut-point

a parameter of the previous step

dynamic selection

the dynamic selection method used to aggregate predictions. If none is recommended, majority voting is used.

Value

an abmodel class object

References

Pinto, F., Cerqueira, V., Soares, C., Mendes-Moreira, J.: "autoBagging: Learning to Rank Bagging Workflows with Metalearning" arXiv preprint arXiv:1706.09367 (2017).

See Also

bagging for the bagging pipeline with a specific workflow; baggedtrees for the bagging implementation; abmodel-class for the returning class object.

Examples

## Not run: 
# splitting an example dataset into train/test
# (iris is ordered by class, so shuffle before splitting):
set.seed(1)
idx <- sample(nrow(iris))
train <- iris[idx[1:(.7 * nrow(iris))], ]
test <- iris[idx[-(1:(.7 * nrow(iris)))], ]
# then apply autoBagging to the train, using the desired formula:
# autoBagging will compute metafeatures on the dataset
# and apply a pre-trained ranking model to recommend a workflow.
model <- autoBagging(Species ~., train)
# predictions are produced with the standard predict method
preds <- predict(model, test)

## End(Not run)

bagged trees models

Description

Trains an ensemble of bagged decision trees. The standard resampling with replacement (bootstrap) is used as the sampling strategy.

Usage

baggedtrees(form, data, ntree = 100)

Arguments

form

formula

data

training data

ntree

number of trees to grow

Examples

ensemble <- baggedtrees(Species ~., iris, ntree = 50)
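
A follow-up sketch of aggregating the ensemble by hand, assuming baggedtrees returns a plain list of rpart models; majority voting picks the most frequent label per observation:

# one column of predicted labels per tree
all_preds <- sapply(ensemble, function(m) as.character(predict(m, iris, type = "class")))
# majority label per observation
apply(all_preds, 1, function(v) names(which.max(table(v))))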

bagging method

Description

bagging method

Usage

bagging(form, data, ntrees, pruning, dselection, pruning_cp)

Arguments

form

formula

data

training data

ntrees

number of trees in the ensemble

pruning

model pruning method. A character vector. Currently, the following methods are supported:

mdsq

Margin-distance minimisation

bb

boosting based pruning

none

no pruning

dselection

dynamic selection of the available models. Currently, the following methods are supported:

ola

Overall Local Accuracy

knora-e

K-nearest-oracles-eliminate

none

no dynamic selection. Majority voting is used.

pruning_cp

The pruning cut-point for the chosen pruning method.
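
For example, with ntrees = 100 and pruning_cp = .5, the pruning step discards .5 * 100 = 50 of the grown trees before aggregation.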

See Also

baggedtrees for the implementation of the bagging model.

Examples

# splitting an example dataset into train/test
# (iris is ordered by class, so shuffle before splitting):
set.seed(1)
idx <- sample(nrow(iris))
train <- iris[idx[1:(.7 * nrow(iris))], ]
test <- iris[idx[-(1:(.7 * nrow(iris)))], ]
form <- Species ~.
# a user-defined bagging workflow
m <- bagging(form, train, ntrees = 5, pruning = "bb", pruning_cp = .5, dselection = "ola")
preds <- predict(m, test)
# a standard bagging workflow with 5 trees (5 trees for exemplification purposes):
m2 <- bagging(form, train, ntrees = 5, pruning = "none", dselection = "none")
preds2 <- predict(m2, test)

Boosting-based pruning of models

Description

Boosting-based pruning of models

Usage

bb(form, preds, data, cutPoint)

Arguments

form

formula

preds

predictions in training data

data

training data

cutPoint

ratio of the total number of models to cut off


classmajority.landmarker

Description

Landmarker based on the majority-class predictor.

Usage

classmajority.landmarker(dataset, data.char)

Arguments

dataset

train data for the landmarker

data.char

data characteristics computed so far


classmajority.landmarker.correlation

Description

classmajority.landmarker.correlation

Usage

classmajority.landmarker.correlation(dataset, data.char)

Arguments

dataset

train data for the landmarker

data.char

data characteristics computed so far


classmajority.landmarker.entropy

Description

classmajority.landmarker.entropy

Usage

classmajority.landmarker.entropy(dataset, data.char)

Arguments

dataset

train data for the landmarker

data.char

data characteristics computed so far


classmajority.landmarker.interinfo

Description

classmajority.landmarker.interinfo

Usage

classmajority.landmarker.interinfo(dataset, data.char)

Arguments

dataset

train data for the landmarker

data.char

data characteristics computed so far


classmajority.landmarker.mutual.information

Description

classmajority.landmarker.mutual.information

Usage

classmajority.landmarker.mutual.information(dataset, data.char)

Arguments

dataset

train data for the landmarker

data.char

data characteristics computed so far


Retrieve names of continuous attributes (not including the target)

Description

Retrieve names of continuous attributes (not including the target)

Usage

ContAttrs(dataset)

Arguments

dataset

structure describing the data set, according to read_data.R

Value

list of strings

See Also

read_data.R


dstump.landmarker_d1

Description

Landmarker based on a decision stump (variant d1).

Usage

dstump.landmarker_d1(dataset, data.char)

Arguments

dataset

train data for the landmarker

data.char

data characteristics computed so far


dstump.landmarker_d1.correlation

Description

dstump.landmarker_d1.correlation

Usage

dstump.landmarker_d1.correlation(dataset, data.char)

Arguments

dataset

train data for the landmarker

data.char

data characteristics computed so far


dstump.landmarker_d1.entropy

Description

dstump.landmarker_d1.entropy

Usage

dstump.landmarker_d1.entropy(dataset, data.char)

Arguments

dataset

train data for the landmarker

data.char

data characteristics computed so far


dstump.landmarker_d1.interinfo

Description

dstump.landmarker_d1.interinfo

Usage

dstump.landmarker_d1.interinfo(dataset, data.char)

Arguments

dataset

train data for the landmarker

data.char

data characteristics computed so far


dstump.landmarker_d1.mutual.information

Description

dstump.landmarker_d1.mutual.information

Usage

dstump.landmarker_d1.mutual.information(dataset, data.char)

Arguments

dataset

train data for the landmarker

data.char

data characteristics computed so far


dstump.landmarker_d2

Description

Landmarker based on a decision stump (variant d2).

Usage

dstump.landmarker_d2(dataset, data.char)

Arguments

dataset

train data for the landmarker

data.char

data characteristics computed so far


dstump.landmarker_d2.correlation

Description

dstump.landmarker_d2.correlation

Usage

dstump.landmarker_d2.correlation(dataset, data.char)

Arguments

dataset

train data for the landmarker

data.char

data characteristics computed so far


dstump.landmarker_d2.entropy

Description

dstump.landmarker_d2.entropy

Usage

dstump.landmarker_d2.entropy(dataset, data.char)

Arguments

dataset

train data for the landmarker

data.char

data characteristics computed so far


dstump.landmarker_d2.interinfo

Description

dstump.landmarker_d2.interinfo

Usage

dstump.landmarker_d2.interinfo(dataset, data.char)

Arguments

dataset

train data for the landmarker

data.char

data characteristics computed so far


dstump.landmarker_d2.mutual.information

Description

dstump.landmarker_d2.mutual.information

Usage

dstump.landmarker_d2.mutual.information(dataset, data.char)

Arguments

dataset

train data for the landmarker

data.char

data characteristics computed so far


dstump.landmarker_d3

Description

Landmarker based on a decision stump (variant d3).

Usage

dstump.landmarker_d3(dataset, data.char)

Arguments

dataset

train data for the landmarker

data.char

data characteristics computed so far


dstump.landmarker_d3.correlation

Description

dstump.landmarker_d3.correlation

Usage

dstump.landmarker_d3.correlation(dataset, data.char)

Arguments

dataset

train data for the landmarker

data.char

data characteristics computed so far


dstump.landmarker_d3.entropy

Description

dstump.landmarker_d3.entropy

Usage

dstump.landmarker_d3.entropy(dataset, data.char)

Arguments

dataset

train data for the landmarker

data.char

data characteristics computed so far


dstump.landmarker_d3.interinfo

Description

dstump.landmarker_d3.interinfo

Usage

dstump.landmarker_d3.interinfo(dataset, data.char)

Arguments

dataset

train data for the landmarker

data.char

data characteristics computed so far


dstump.landmarker_d3.mutual.information

Description

dstump.landmarker_d3.mutual.information

Usage

dstump.landmarker_d3.mutual.information(dataset, data.char)

Arguments

dataset

train data for the landmarker

data.char

data characteristics computed so far


get target variable

Description

get the target variable from a formula

Usage

get_target(form)

Arguments

form

formula
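
Examples

For example (the exact return value shown is an assumption based on the description):

get_target(Species ~ .)
# presumably returns "Species"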


Retrieve the value of a previously computed measure

Description

Retrieve the value of a previously computed measure

Usage

GetMeasure(inDCName, inDCSet, component.name = "value")

Arguments

inDCName

name of data characteristics

inDCSet

set of data characteristics already computed

component.name

name of component (e.g. time or value) to retrieve; if NULL retrieve all

Value

simple or structured value

Note

if the measure is not available, execution stops with an error
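
Examples

A minimal usage sketch, assuming inDCSet is a named list whose elements carry value and time components (a hypothetical layout, consistent with the component.name argument):

dcs <- list(nattrs = list(value = 4, time = 0.01))
GetMeasure("nattrs", dcs)           # value component: 4
GetMeasure("nattrs", dcs, "time")   # time component: 0.01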


K-Nearest-ORAcle-Eliminate

Description

A dynamic selection method

Usage

KNORA.E(form, mod, v.data, t.data, k = 5)

Arguments

form

formula

mod

a list comprising the individual models

v.data

validation data

t.data

test data, with the instances to predict

k

the number of nearest neighbors. Defaults to 5.
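
Examples

A call sketch, assuming the list returned by baggedtrees is usable as mod (iris is shuffled first because it is ordered by class):

set.seed(1)
shuffled <- iris[sample(nrow(iris)), ]
mod <- baggedtrees(Species ~ ., shuffled[1:100, ], ntree = 10)
preds <- KNORA.E(Species ~ ., mod, v.data = shuffled[101:125, ], t.data = shuffled[126:150, ], k = 5)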


lda.landmarker.correlation

Description

Landmarker based on linear discriminant analysis (correlation variant).

Usage

lda.landmarker.correlation(dataset, data.char)

Arguments

dataset

train data for the landmarker

data.char

data characteristics computed so far


majority voting

Description

majority voting

Usage

majority_voting(x)

Arguments

x

predictions produced by a set of models
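
The rule amounts to picking the most frequent label; a standalone illustration (the exact input layout expected by majority_voting is not documented here):

votes <- c("setosa", "versicolor", "setosa")
names(which.max(table(votes)))  # "setosa"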


Margin Distance Minimization

Description

Margin Distance Minimization

Usage

mdsq(form, preds, data, cutPoint)

Arguments

form

formula

preds

predictions in training data

data

training data

cutPoint

ratio of the total number of models to cut off


nb.landmarker

Description

Landmarker based on the naive Bayes classifier.

Usage

nb.landmarker(dataset, data.char)

Arguments

dataset

train data for the landmarker

data.char

data characteristics computed so far


nb.landmarker.correlation

Description

nb.landmarker.correlation

Usage

nb.landmarker.correlation(dataset, data.char)

Arguments

dataset

train data for the landmarker

data.char

data characteristics computed so far


nb.landmarker.entropy

Description

nb.landmarker.entropy

Usage

nb.landmarker.entropy(dataset, data.char)

Arguments

dataset

train data for the landmarker

data.char

data characteristics computed so far


nb.landmarker.interinfo

Description

nb.landmarker.interinfo

Usage

nb.landmarker.interinfo(dataset, data.char)

Arguments

dataset

train data for the landmarker

data.char

data characteristics computed so far


nb.landmarker.mutual.information

Description

nb.landmarker.mutual.information

Usage

nb.landmarker.mutual.information(dataset, data.char)

Arguments

dataset

train data for the landmarker

data.char

data characteristics computed so far


Overall Local Accuracy

Description

A dynamic selection method

Usage

OLA(form, mod, v.data, t.data, k = 5)

Arguments

form

formula

mod

a list comprising the individual models

v.data

validation data

t.data

test data, with the instances to predict

k

the number of nearest neighbors. Defaults to 5.
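
Examples

A call sketch analogous to the KNORA.E example above, under the same assumptions:

set.seed(1)
shuffled <- iris[sample(nrow(iris)), ]
mod <- baggedtrees(Species ~ ., shuffled[1:100, ], ntree = 10)
preds <- OLA(Species ~ ., mod, v.data = shuffled[101:125, ], t.data = shuffled[126:150, ], k = 5)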


Predicting on new data with an abmodel model

Description

This is a predict method for predicting new data points using an abmodel class object, referring to an ensemble of bagged trees

Usage

## S4 method for signature 'abmodel'
predict(object, newdata)

Arguments

object

An abmodel-class object.

newdata

New data to predict using an abmodel object

Value

predictions produced by an abmodel model.

See Also

abmodel-class for details about the bagging model.


Transform a data frame into a list meeting GSI requirements

Description

Transforms a data frame into a list meeting GSI requirements.

Usage

ReadDF(dat)

Arguments

dat

data frame

Value

a list containing components that describe the names (see ReadAttrsInfo) and the data (see ReadData) files

Note: this function is meant to build on ReadAttrsInfo and ReadData.


Retrieve names of symbolic attributes (not including the target)

Description

Retrieve names of symbolic attributes (not including the target)

Usage

SymbAttrs(dataset)

Arguments

dataset

structure describing the data set, according to read_data.R

Value

list of strings

See Also

read_data.R


sysdata

Description

Metadata needed to run the autoBagging method.

Usage

sysdata

Format

a list comprising the following information

avgRankMatrix

the average rank data regarding each bagging workflow

workflows

metadata on the bagging workflows

MaxMinMetafeatures

range data on each metafeature

metafeatures

names and values of each metafeature used to describe the datasets

metamodel

the xgboost ranking metamodel