Gradient Boosting Machine

An implementation of a gradient boosting machine.

This file is part of the Personal Programming Project (PPP) coursework in the Computational Materials Science (CMS) M.Sc. course at Technische Universität Bergakademie Freiberg.
It belongs to the project titled "Application of statistical learning to predict material properties".
For given data, the methods in this file apply the gradient boosting algorithm to tune the model obtained from local regression.

class gradientMachine.GradientBoostingMachine(max_comps, verbose, init_iterations, learning_rate=None, K_flag=True, LRSimple_flag=True)

Bases: object

clean_data()

Method to remove invalid values from the data arrays.
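
As a rough illustration of this kind of cleaning, the sketch below drops every sample that contains a non-finite number. It is not the code behind clean_data(): the function name drop_invalid_rows is hypothetical, and treating infinity as invalid in addition to NaN is an assumption.

```python
import math


def drop_invalid_rows(x_rows, y_values):
    """Keep only samples whose features and target are all finite (hypothetical helper)."""
    clean_x, clean_y = [], []
    for row, target in zip(x_rows, y_values):
        # math.isfinite rejects NaN as well as +/- infinity (assumed definition of "invalid")
        if all(math.isfinite(v) for v in row) and math.isfinite(target):
            clean_x.append(row)
            clean_y.append(target)
    return clean_x, clean_y


x = [[1.0, 2.0], [float("nan"), 3.0], [4.0, 5.0]]
y = [10.0, 20.0, float("inf")]
clean_x, clean_y = drop_invalid_rows(x, y)
```

Rows are removed from both containers together so that the independent and dependent variables stay aligned.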

eval_gbm(x_train, y_train, fold=None, full_dataset=None, log_iterations=False)

Method that performs the actual boosting computations.

Parameters:

x_train (array) – Array consisting of the independent variables.
y_train (array) – Array consisting of the available dependent variables.
fold (int) – Index of the fold being evaluated; given only when boosting is not performed on the full dataset.
full_dataset (bool) – Toggle for evaluation using the full dataset.

Returns:

y_i – Array consisting of the fitted values of the dependent variables.

Return type:

array
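
The general shape of a boosting loop for squared-error loss can be sketched as follows. This is not the implementation in this file: it assumes a single independent variable and swaps in a plain least-squares line as the base learner in place of the local regression model. The names fit_line and boost are hypothetical, and the iteration count and learning rate are passed explicitly instead of being read from the constructor arguments.

```python
def fit_line(xs, residuals):
    """Least-squares line through (xs, residuals); returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_r = sum(residuals) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0.0:
        # All x values are identical: fall back to a constant fit
        return 0.0, mean_r
    cov = sum((x - mean_x) * (r - mean_r) for x, r in zip(xs, residuals))
    slope = cov / var_x
    return slope, mean_r - slope * mean_x


def boost(xs, ys, iterations, learning_rate):
    """Return fitted values after the given number of boosting iterations."""
    # Start from the mean of the targets
    fitted = [sum(ys) / len(ys)] * len(ys)
    for _ in range(iterations):
        # For squared-error loss the negative gradient is the residual
        residuals = [y - f for y, f in zip(ys, fitted)]
        slope, intercept = fit_line(xs, residuals)
        # Take a damped step in the direction of the fitted residuals
        fitted = [f + learning_rate * (slope * x + intercept)
                  for f, x in zip(fitted, xs)]
    return fitted


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
fitted = boost(xs, ys, iterations=200, learning_rate=0.1)
```

A smaller learning rate shrinks each update, so more iterations are needed to reach the same fit.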

eval_relative_influence(write_to_disk=False)

Method to evaluate relative influence of parameters on the model.

2. Writes scores, relative scores, and corresponding parameters to a dict named inf_data

Parameters:

write_to_disk (bool) – Toggles writing evaluated information to disk. If true, calls write_inf_data_to_disk().

Returns:

A status indicator. 0 indicates successful execution.

Return type:

int
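
One way to turn raw scores into relative scores is sketched below. How this method actually scales its scores is not stated here, so normalizing them to sum to 100 is an assumption, as are the function name relative_scores, the parameter names, and the keys used inside the example dict.

```python
def relative_scores(scores):
    """Scale raw scores so that they sum to 100 (assumed convention)."""
    total = sum(scores.values())
    if total == 0:
        # Avoid dividing by zero when no parameter has any influence
        return {name: 0.0 for name in scores}
    return {name: 100.0 * score / total for name, score in scores.items()}


# Hypothetical raw scores for two parameters
raw = {"param_a": 3.0, "param_b": 1.0}
inf_data = {"scores": raw, "relative_scores": relative_scores(raw)}
```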

gradient_boosting_cv(folds=10, full_dataset=False, log_iterations=False)

Implementation of gradient boosting iterations within a cross-validation framework.

If verbose, prints the in-sample and out-of-sample mean squared error for each fold.
Sets various attributes that are helpful in post-processing.

Parameters:

x (list or array) – Independent variables that are used to estimate dependent variables.
y (list or array) – Dependent variables that are available from the data.
init_iterations (int) – Number of GBM iterations.
folds (int) – Number of folds into which the data is split.

Return type:

None
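
The cross-validation framework can be sketched as below. This is an illustration under stated assumptions, not this method's code: cross_validate, fit_mean and mse are hypothetical names, samples are assigned to folds by index modulo the fold count, and a mean predictor stands in for the boosted model.

```python
def mse(pred, actual):
    """Mean squared error between two equally long sequences."""
    return sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(actual)


def fit_mean(train_x, train_y):
    """Stand-in model: always predicts the mean of the training targets."""
    mean_y = sum(train_y) / len(train_y)
    return lambda x: mean_y


def cross_validate(xs, ys, folds, fit, verbose=False):
    """Return one (in-sample MSE, out-of-sample MSE) pair per fold."""
    results = []
    for k in range(folds):
        # Sample i belongs to fold i % folds; fold k is held out this round
        test_idx = [i for i in range(len(xs)) if i % folds == k]
        train_idx = [i for i in range(len(xs)) if i % folds != k]
        train_x = [xs[i] for i in train_idx]
        train_y = [ys[i] for i in train_idx]
        test_x = [xs[i] for i in test_idx]
        test_y = [ys[i] for i in test_idx]
        model = fit(train_x, train_y)
        in_mse = mse([model(x) for x in train_x], train_y)
        out_mse = mse([model(x) for x in test_x], test_y)
        if verbose:
            print(f"fold {k}: in-sample MSE {in_mse}, out-of-sample MSE {out_mse}")
        results.append((in_mse, out_mse))
    return results


results = cross_validate([0, 1, 2, 3, 4, 5], [2.0] * 6, folds=3, fit=fit_mean)
```

Comparing the two errors per fold shows how much worse the model does on data it has not seen.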

load_data(dep_file_stub, indep_file_stub)

Method to load data from the specified files. Calls clean_data() to strip unsupported values such as NaN from the data.

Parameters:

dep_file_stub (str) – String indicating prefix of the file containing dependent variables’ data.
indep_file_stub (str) – String indicating prefix of the file containing independent variables’ data.

Returns:

Status indicator. 0 indicates success, whereas -1 indicates failure.

Return type:

int

n_fold_split(data_tuple, folds)

Method to split the given data into the given number of folds.

Parameters:

data_tuple (tuple) – Tuple in the format (X, Y).
folds (int) – Number of folds.

Returns:

split_data – Dict consisting of split values of x and y with integers as keys.

Return type:

dict
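
A split of this kind might look like the sketch below. The function name split_into_folds is hypothetical. Using contiguous chunks, zero-based keys, and an (x_part, y_part) tuple as each dict value are all assumptions; the layout of the real return value is not specified beyond "integers as keys".

```python
def split_into_folds(data_tuple, folds):
    """Split (X, Y) into a dict of contiguous chunks keyed by fold number."""
    x, y = data_tuple
    # Spread the remainder over the first `extra` folds so sizes differ by at most one
    base, extra = divmod(len(x), folds)
    split_data = {}
    start = 0
    for k in range(folds):
        size = base + (1 if k < extra else 0)
        split_data[k] = (x[start:start + size], y[start:start + size])
        start += size
    return split_data


split_data = split_into_folds(([1, 2, 3, 4, 5], [10, 20, 30, 40, 50]), folds=2)
```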

predict(x, d_mat)

Method to predict new values of the dependent variables using the learned data matrix.

Parameters:

x (list or array) – Array consisting of independent variables to be used.
d_mat (list or array) – Learned data matrix.

Returns:

Predicted dependent variables.

Return type:

array

rmse(pred, actual)

Method to compute the root mean squared error (RMSE) between two arrays.

Parameters:

pred (list or array) – Container for predicted variables.
actual (list or array) – Container for values being compared.

Returns:

Computed RMSE.

Return type:

float
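
For reference, RMSE is the square root of the mean of the squared element-wise differences. The sketch below is a plain-Python version written for illustration, not the method's actual body; raising ValueError on a length mismatch is an assumption.

```python
import math


def rmse(pred, actual):
    """Root mean squared error between two equally long sequences."""
    if len(pred) != len(actual):
        raise ValueError("pred and actual must have the same length")
    squared = [(p - a) ** 2 for p, a in zip(pred, actual)]
    return math.sqrt(sum(squared) / len(squared))


# Errors of 1 and 7 give sqrt((1 + 49) / 2)
value = rmse([1.0, 7.0], [0.0, 0.0])
```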

squared_error(values, pred)

Method to compute squared error between two arrays.

Parameters:

pred (list or array) – Container for predicted variables.
values (list or array) – Container for values being compared.

Returns:

(se, sd) –

1. se is an array of squared errors, computed element-wise between the two containers.

2. sd is the standard deviation of this array of squared errors.

Return type:

tuple
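
The two returned quantities can be illustrated with the sketch below, which is not the method's actual body. Whether the real method uses the population or the sample standard deviation is not stated; the population form (statistics.pstdev) is assumed here.

```python
import statistics


def squared_error(values, pred):
    """Return (se, sd): element-wise squared errors and their standard deviation."""
    se = [(v - p) ** 2 for v, p in zip(values, pred)]
    # Population standard deviation of the squared errors (assumed convention)
    sd = statistics.pstdev(se)
    return se, sd


se, sd = squared_error([1.0, 3.0], [1.0, 1.0])
```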

write_inf_data_to_disk(ncomp)

Method to write influence data to disk. Called by eval_relative_influence() when write_to_disk is true.

Parameters:

ncomp (int) – Integer indicating number of components in the specimen.

Returns:

Status indicator. 0 suggests successful execution.

Return type:

int