Gradient Boosting Machine
Implementation of gradient boosting machine.
This file is part of the Personal Programming Project (PPP) coursework in the Computational Materials Science (CMS) M.Sc. course at Technische Universität Bergakademie Freiberg.
It belongs to the project titled Application of statistical learning to predict material properties.
For given data, the methods in this file apply the gradient boosting algorithm to tune the model obtained from local regression.
- class gradientMachine.GradientBoostingMachine(max_comps, verbose, init_iterations, learning_rate=None, K_flag=True, LRSimple_flag=True)
Bases: object
- clean_data()
Method to remove invalid values from data arrays.
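What counts as an invalid value is not spelled out here beyond the NaN mentioned under load_data(). A minimal sketch of such a cleaning step, assuming NumPy arrays, one sample per row, a 1-D array of dependent values, and treating infinities as invalid too (the helper name clean_arrays is hypothetical, not part of this class):

```python
import numpy as np

def clean_arrays(x, y):
    """Hypothetical helper: drop samples where x or y holds NaN or infinity."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # View x as (samples, features) so the row mask also works for 1-D input.
    x2d = x.reshape(len(x), -1)
    valid = np.isfinite(x2d).all(axis=1) & np.isfinite(y)
    return x[valid], y[valid]

x_clean, y_clean = clean_arrays(
    [[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0]],
    [0.1, 0.2, np.inf],
)
```

Only the first sample survives in this example, because the second row of x contains NaN and the third value of y is infinite.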
- eval_gbm(x_train, y_train, fold=None, full_dataset=None, log_iterations=False)
Method for the actual computations of boosting.
- Parameters:
x_train (array) – Array of independent variables.
y_train (array) – Array of available dependent variables.
fold (int) – Index of the fold being evaluated; given only when boosting is not performed on the full dataset.
full_dataset (bool) – Toggle for evaluation using full dataset.
- Returns:
y_i – Array of fitted values of the dependent variables.
- Return type:
array
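The boosting computation itself is not shown on this page. As a rough illustration of the general idea only, the sketch below runs gradient boosting for squared-error loss, where each iteration fits a base learner to the current residuals and adds a damped copy of it to the fitted values. It swaps in a one-feature regression stump as the base learner, whereas the project tunes a local regression model; all function names and defaults here are assumptions.

```python
import numpy as np

def fit_stump(x, r):
    """Least-squares regression stump: (threshold, left_mean, right_mean)."""
    # Fallback when no split is possible: predict the mean everywhere.
    best_sse, best = np.inf, (np.inf, r.mean(), r.mean())
    for t in np.unique(x)[:-1]:  # skip the largest value so the right side is never empty
        left, right = r[x <= t], r[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_sse, best = sse, (t, left.mean(), right.mean())
    return best

def stump_predict(stump, x):
    t, left_mean, right_mean = stump
    return np.where(x <= t, left_mean, right_mean)

def boost(x, y, n_iter=50, learning_rate=0.1):
    """Hypothetical boosting loop for squared-error loss."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    fitted = np.full_like(y, y.mean())      # start from the constant model
    mse_history = [float(np.mean((y - fitted) ** 2))]
    for _ in range(n_iter):
        residual = y - fitted               # negative gradient of 0.5 * (y - f)^2
        stump = fit_stump(x, residual)
        fitted = fitted + learning_rate * stump_predict(stump, x)
        mse_history.append(float(np.mean((y - fitted) ** 2)))
    return fitted, mse_history

x = np.linspace(0.0, 2.0 * np.pi, 40)
y = np.sin(x)
fitted, mse_history = boost(x, y)
```

Because each stump is a least-squares fit to the residuals, the training error in mse_history cannot increase from one iteration to the next for a learning rate between 0 and 2.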
- eval_relative_influence(write_to_disk=False)
Method to evaluate relative influence of parameters on the model.
2. Writes scores, relative scores, and corresponding parameters to a dict named inf_data.
- Parameters:
write_to_disk (bool) – Toggles writing evaluated information to disk. If true, calls write_inf_data_to_disk().
- Returns:
A status indicator. 0 indicates successful execution.
- Return type:
int
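How the scores are computed and how inf_data is laid out are not documented here. Purely as an illustration of turning raw scores into relative scores, the sketch below assumes that relative scores are the raw scores rescaled to sum to 100; the function name and the dict keys are hypothetical.

```python
def relative_influence(scores, parameters):
    """Hypothetical: rescale raw scores so that they sum to 100."""
    total = sum(scores)
    relative = [100.0 * s / total if total else 0.0 for s in scores]
    # Assumed layout; the real inf_data dict may be organised differently.
    return {
        "parameters": list(parameters),
        "scores": list(scores),
        "relative_scores": relative,
    }

inf_data = relative_influence([2.0, 6.0, 2.0], ["a", "b", "c"])
```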
- gradient_boosting_cv(folds=10, full_dataset=False, log_iterations=False)
Implementation of gradient boosting iterations within a cross-validation framework.
If verbose, prints the in-sample and out-of-sample mean squared error for each fold.
Sets various attributes that are helpful in post-processing.
- Parameters:
x (list or array) – Independent variables that are used to estimate dependent variables.
y (list or array) – Dependent variables that are available from the data.
init_iterations (int) – Number of GBM iterations.
folds (int) – Number of folds into which the data is split.
- Return type:
None
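The cross-validation framework can be pictured as the loop below: hold out one fold, fit on the rest, and record the in-sample and out-of-sample mean squared error. This is a sketch under assumptions, not the class's code. A straight-line least-squares fit stands in for the boosted model, the shuffling and seed are choices made here, and cross_validate, fit_line, and predict_line are hypothetical names.

```python
import numpy as np

def cross_validate(x, y, fit, predict, folds=10, seed=0):
    """Hypothetical k-fold loop returning (fold, in-sample MSE, out-of-sample MSE)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    chunks = np.array_split(rng.permutation(len(y)), folds)
    results = []
    for k, test_idx in enumerate(chunks):
        train_idx = np.concatenate([c for j, c in enumerate(chunks) if j != k])
        model = fit(x[train_idx], y[train_idx])
        mse_in = float(np.mean((predict(model, x[train_idx]) - y[train_idx]) ** 2))
        mse_out = float(np.mean((predict(model, x[test_idx]) - y[test_idx]) ** 2))
        results.append((k, mse_in, mse_out))
    return results

# Placeholder model: a straight line fitted by least squares.
def fit_line(x, y):
    return np.polyfit(x, y, 1)

def predict_line(coef, x):
    return np.polyval(coef, x)

x = np.linspace(0.0, 1.0, 50)
y = 2.0 * x + 1.0
results = cross_validate(x, y, fit_line, predict_line, folds=10)
```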
- load_data(dep_file_stub, indep_file_stub)
Method to load data from specified files. Calls clean_data() to strip the data of unsupported values such as NaN.
- Parameters:
dep_file_stub (str) – String indicating prefix of the file containing dependent variables’ data.
indep_file_stub (str) – String indicating prefix of the file containing independent variables’ data.
- Returns:
Status indicator. 0 indicates success, whereas -1 indicates failure.
- Return type:
int
- n_fold_split(data_tuple, folds)
Method to split given data into given number of folds.
- Parameters:
data_tuple (tuple) – Tuple in format (X,Y)
folds (int) – Number of folds.
- Returns:
split_data – Dict consisting of split values of x and y with integers as keys.
- Return type:
dict
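A minimal sketch of such a split, assuming contiguous, nearly equal folds without shuffling and a dict whose values are (x, y) pairs; the actual method may shuffle or store the parts differently, and the name split_into_folds is hypothetical:

```python
import numpy as np

def split_into_folds(data_tuple, folds):
    """Hypothetical: split (X, Y) into `folds` parts keyed by 0..folds-1."""
    x, y = (np.asarray(a) for a in data_tuple)
    x_parts = np.array_split(x, folds)   # tolerates lengths not divisible by folds
    y_parts = np.array_split(y, folds)
    return {k: (x_parts[k], y_parts[k]) for k in range(folds)}

split_data = split_into_folds((np.arange(10).reshape(5, 2), np.arange(5)), folds=2)
```

With five samples and two folds, the first fold receives three samples and the second receives two.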
- predict(x, d_mat)
Method to predict new values of dependent variables using the learned data matrix.
- Parameters:
x (list or array) – Array consisting of independent variables to be used.
d_mat (list or array) – Learned data matrix.
- Returns:
Predicted dependent variables.
- Return type:
array
- rmse(pred, actual)
Method to compute the root mean squared error (RMSE) between two arrays.
- Parameters:
pred (list or array) – Container for predicted variables.
actual (list or array) – Container for values being compared.
- Returns:
Computed RMSE.
- Return type:
float
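For reference, RMSE is the square root of the mean of the squared element-wise differences. A standalone sketch with NumPy, written as a free function rather than the method itself:

```python
import numpy as np

def rmse(pred, actual):
    """Root mean squared error between two equally shaped containers."""
    pred = np.asarray(pred, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean((pred - actual) ** 2)))

value = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])  # sqrt(4 / 3)
```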
- squared_error(values, pred)
Method to compute squared error between two arrays.
- Parameters:
pred (list or array) – Container for predicted variables.
values (list or array) – Container for values being compared.
- Returns:
(se, sd) –
1. se returns an array of squared errors compared element-wise between the two containers.
2. sd returns the standard deviation of this array of squared errors.
- Return type:
tuple
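The returned pair can be sketched as follows. Whether the original uses the population or the sample standard deviation is not stated, so the population form (NumPy's default, ddof=0) is an assumption here:

```python
import numpy as np

def squared_error(values, pred):
    """Element-wise squared errors and their standard deviation."""
    values = np.asarray(values, dtype=float)
    pred = np.asarray(pred, dtype=float)
    se = (values - pred) ** 2
    sd = float(np.std(se))  # assumption: population standard deviation (ddof=0)
    return se, sd

se, sd = squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```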
- write_inf_data_to_disk(ncomp)
Method to write influence data to disk, invoked from eval_relative_influence() when write_to_disk is true.
- Parameters:
ncomp (int) – Integer indicating number of components in the specimen.
- Returns:
Status indicator. 0 indicates successful execution.
- Return type:
int