Gradient Boosting Machine

Implementation of gradient boosting machine.

  • This file is part of the Personal Programming Project (PPP) coursework in the Computational Materials Science (CMS) M.Sc. course at Technische Universität Bergakademie Freiberg.

  • This file is a part of the project titled Application of statistical learning to predict material properties.

  • For the given data, the methods in this file apply the gradient boosting algorithm to tune the model obtained from local regression.

class gradientMachine.GradientBoostingMachine(max_comps, verbose, init_iterations, learning_rate=None, K_flag=True, LRSimple_flag=True)

Bases: object

clean_data()

Method to remove invalid values from data arrays.
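
As an illustration of this kind of cleaning step, here is a minimal sketch using only the standard library. The helper name drop_invalid_rows, the row-wise layout of the data, and the choice to treat NaN and infinite values as invalid are assumptions for this example, not details confirmed by the module.

```python
import math


def drop_invalid_rows(x, y):
    """Hypothetical helper: keep only samples whose values are all finite.

    x is assumed to be a sequence of rows (one row per sample) and
    y a sequence of targets of the same length.
    """
    clean_x, clean_y = [], []
    for row, target in zip(x, y):
        values = list(row) + [target]
        # A sample is kept only if every entry is a finite number.
        if all(isinstance(v, (int, float)) and math.isfinite(v) for v in values):
            clean_x.append(list(row))
            clean_y.append(target)
    return clean_x, clean_y
```

Rows are dropped as a whole so that the independent and dependent arrays stay aligned.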

eval_gbm(x_train, y_train, fold=None, full_dataset=None, log_iterations=False)

Method that performs the actual boosting computations.

Parameters:
  • x_train (array) – Array consisting of independent variables.

  • y_train (array) – Array consisting of available dependent variables.

  • fold (int) – Index of the fold being evaluated; given only when boosting is not being performed on the full dataset.

  • full_dataset (bool) – Toggle for evaluation using full dataset.

Returns:

y_i – Array consisting of fitted values of dependent variables.

Return type:

array
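
To make the idea of boosting iterations concrete, the following is a minimal sketch of gradient boosting for squared-error loss. It is not the module's implementation: the function name boost_squared_loss is hypothetical, the starting model is a constant mean instead of the local regression model the module uses, and the weak learner is a single-feature least-squares fit to the current residuals. Only the learning_rate and iteration-count ideas are taken from the class signature.

```python
def boost_squared_loss(x_train, y_train, n_iterations=200, learning_rate=0.1):
    """Hypothetical sketch: componentwise least-squares boosting.

    Returns (y_i, intercept, coefs), where y_i holds the fitted values.
    """
    n = len(y_train)
    if n == 0:
        raise ValueError("training data must not be empty")
    n_features = len(x_train[0])

    # Assumption: start from the mean; the module starts from local regression.
    intercept = sum(y_train) / n
    y_i = [intercept] * n
    coefs = [0.0] * n_features

    for _ in range(n_iterations):
        # For squared loss, the negative gradient is the residual.
        residuals = [y - f for y, f in zip(y_train, y_i)]
        best = None
        for j in range(n_features):
            col = [row[j] for row in x_train]
            denom = sum(v * v for v in col)
            if denom == 0:
                continue
            beta = sum(v * r for v, r in zip(col, residuals)) / denom
            sse = sum((r - beta * v) ** 2 for v, r in zip(col, residuals))
            if best is None or sse < best[0]:
                best = (sse, j, beta)
        if best is None:
            break
        _, j, beta = best
        # Shrink the update by the learning rate before adding it.
        coefs[j] += learning_rate * beta
        y_i = [f + learning_rate * beta * row[j] for f, row in zip(y_i, x_train)]
    return y_i, intercept, coefs
```

Each iteration fits the weak learner to the residuals of the current fit and adds a shrunken copy of it, so the in-sample squared error does not increase from one iteration to the next.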

eval_relative_influence(write_to_disk=False)

  1. Method to evaluate relative influence of parameters on the model.

  2. Writes scores, relative scores, and corresponding parameters to a dict named inf_data.

Parameters:

write_to_disk (bool) – Toggles writing evaluated information to disk. If true, calls write_inf_data_to_disk()

Returns:

A status indicator. 0 indicates successful execution.

Return type:

int

gradient_boosting_cv(folds=10, full_dataset=False, log_iterations=False)

Implementation of gradient boosting iterations within a cross-validation framework.

  • If verbose, prints the in-sample and out-of-sample mean squared error for each fold.

  • Sets various attributes that are helpful in post-processing.

Parameters:
  • x (list or array) – Independent variables that are used to estimate dependent variables.

  • y (list or array) – Dependent variables that are available from the data.

  • init_iterations (int) – Number of GBM iterations.

  • folds (int) – Number of folds into which the data is split.

Return type:

None
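
The cross-validation loop described above can be sketched as follows. This is an illustrative outline only: the function name cross_validate_sketch, the fit and predict callables, and the interleaved fold assignment are assumptions, and any model can be plugged in where the module would run its boosting iterations.

```python
def mse(pred, actual):
    """Mean squared error between two equal-length sequences."""
    return sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(actual)


def cross_validate_sketch(x, y, folds, fit, predict, verbose=False):
    """Hypothetical sketch: return (in-sample MSE, out-of-sample MSE) per fold.

    fit(x_train, y_train) returns a model; predict(model, x) returns predictions.
    """
    n = len(y)
    if not 2 <= folds <= n:
        raise ValueError("folds must be between 2 and the number of samples")
    results = []
    for k in range(folds):
        # Assumption: every folds-th sample, starting at k, is held out.
        held_out = set(range(k, n, folds))
        x_tr = [x[i] for i in range(n) if i not in held_out]
        y_tr = [y[i] for i in range(n) if i not in held_out]
        x_te = [x[i] for i in sorted(held_out)]
        y_te = [y[i] for i in sorted(held_out)]

        model = fit(x_tr, y_tr)
        in_mse = mse(predict(model, x_tr), y_tr)
        out_mse = mse(predict(model, x_te), y_te)
        if verbose:
            print(f"fold {k}: in-sample MSE={in_mse:.4f}, out-of-sample MSE={out_mse:.4f}")
        results.append((in_mse, out_mse))
    return results
```

A large gap between the two errors in a fold is the usual sign that the number of boosting iterations is too high for the data.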

load_data(dep_file_stub, indep_file_stub)

Method to load data from the specified files. Calls clean_data() to strip unsupported values such as NaN from the data.

Parameters:
  • dep_file_stub (str) – String indicating prefix of the file containing dependent variables’ data.

  • indep_file_stub (str) – String indicating prefix of the file containing independent variables’ data.

Returns:

Status indicator. 0 indicates success, whereas -1 indicates failure.

Return type:

int

n_fold_split(data_tuple, folds)

Method to split given data into given number of folds.

Parameters:
  • data_tuple (tuple) – Tuple in the format (X, Y).

  • folds (int) – Number of folds.

Returns:

split_data – Dict consisting of split values of x and y with integers as keys.

Return type:

dict
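
One way such a split could look is sketched below. The function name n_fold_split_sketch, the zero-based integer keys, the nested "x"/"y" layout of each entry, and the interleaved assignment of samples to folds are all assumptions for illustration; the documentation only states that the result is a dict with integer keys.

```python
def n_fold_split_sketch(data_tuple, folds):
    """Hypothetical sketch: split (X, Y) into `folds` roughly equal parts."""
    x, y = data_tuple
    if len(x) != len(y):
        raise ValueError("X and Y must have the same number of samples")
    if not 1 <= folds <= len(x):
        raise ValueError("folds must be between 1 and the number of samples")

    split_data = {}
    for k in range(folds):
        # Every folds-th sample, starting at index k, goes into fold k.
        split_data[k] = {"x": list(x[k::folds]), "y": list(y[k::folds])}
    return split_data
```

With 10 samples and 3 folds, this yields folds of sizes 4, 3, and 3, and every sample appears in exactly one fold.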

predict(x, d_mat)

Method to predict new values of dependent variables using the learned data matrix.

Parameters:
  • x (list or array) – Array consisting of independent variables to be used.

  • d_mat (list or array) – Learned data matrix.

Returns:

Predicted dependent variables.

Return type:

array

rmse(pred, actual)

Method to compute the root mean squared error (RMSE) between two arrays.

Parameters:
  • pred (list or array) – Container for predicted variables.

  • actual (list or array) – Container for values being compared.

Returns:

Computed RMSE.

Return type:

float
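
For reference, the RMSE of predictions p_i against values a_i over n samples is sqrt((1/n) * sum((p_i - a_i)^2)). A minimal standard-library sketch follows; the name rmse_sketch and the input checks are assumptions and not part of the module.

```python
import math


def rmse_sketch(pred, actual):
    """Hypothetical sketch: root mean squared error of two sequences."""
    if len(pred) != len(actual):
        raise ValueError("pred and actual must have the same length")
    if len(pred) == 0:
        raise ValueError("inputs must not be empty")
    mean_sq = sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(pred)
    return math.sqrt(mean_sq)
```

Because the error is squared before averaging, the measure is symmetric in its two arguments and penalizes large deviations more heavily than small ones.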

squared_error(values, pred)

Method to compute squared error between two arrays.

Parameters:
  • pred (list or array) – Container for predicted variables.

  • values (list or array) – Container for values being compared.

Returns:

(se, sd)

1. se is an array of squared errors computed element-wise between the two containers.

2. sd is the standard deviation of this array of squared errors.

Return type:

tuple
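
The (se, sd) pair can be sketched with the standard library as shown below. The name squared_error_sketch is hypothetical, and the use of the population standard deviation (statistics.pstdev) is an assumption; the documentation does not say whether the population or sample form is used.

```python
import statistics


def squared_error_sketch(values, pred):
    """Hypothetical sketch: element-wise squared errors and their spread."""
    if len(values) != len(pred):
        raise ValueError("values and pred must have the same length")
    # Element-wise squared difference between the two containers.
    se = [(v - p) ** 2 for v, p in zip(values, pred)]
    # Assumption: population standard deviation of the squared errors.
    sd = statistics.pstdev(se)
    return se, sd
```

Swapping statistics.pstdev for statistics.stdev would give the sample standard deviation instead, which requires at least two elements.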

write_inf_data_to_disk(ncomp)

Method to write influence data to disk, toggled by eval_relative_influence().

Parameters:

ncomp (int) – Integer indicating number of components in the specimen.

Returns:

Status indicator. 0 indicates successful execution.

Return type:

int