asid.automl_imbalanced package

Submodules

asid.automl_imbalanced.abb module

class asid.automl_imbalanced.abb.AutoBalanceBoost(num_iter=40, num_est=16)

Bases: object

The AutoBalanceBoost classifier is a tailored imbalanced learning framework with a built-in hyper-parameter tuning procedure.

Parameters
  • num_iter (int, default=40) – The number of boosting iterations.

  • num_est (int, default=16) – The number of estimators in the base ensemble.

ensemble_

The list of fitted ensembles that constitute the AutoBalanceBoost model.

Type

list

param_

The optimal values of AutoBalanceBoost hyper-parameters.

Type

dict

feature_importances() → ndarray

Calculates normalized feature importances.

Returns

feat_imp – The normalized feature importances.

Return type

array-like

fit(x: ndarray, y: ndarray)

Fits the AutoBalanceBoost model.

Parameters
  • x (array-like of shape (n_samples, n_features)) – Training sample.

  • y (array-like) – The target values.

Returns

self – Fitted estimator.

Return type

AutoBalanceBoost classifier

predict(x: ndarray) → ndarray

Predicts class labels.

Parameters

x (array-like of shape (n_samples, n_features)) – Test sample.

Returns

pred – The predicted class.

Return type

array-like

predict_proba(x: ndarray) → ndarray

Predicts class probabilities.

Parameters

x (array-like of shape (n_samples, n_features)) – Test sample.

Returns

pred_proba – The predicted class probabilities.

Return type

array-like of shape (n_samples, n_classes)

asid.automl_imbalanced.check_tools module

asid.automl_imbalanced.check_tools.check_abb_fitted(self)
asid.automl_imbalanced.check_tools.check_eval_metric_list(metric: str)
asid.automl_imbalanced.check_tools.check_ilc_fitted(self)
asid.automl_imbalanced.check_tools.check_num_type(x: Any, num_type: type, num_cl: str)
asid.automl_imbalanced.check_tools.check_x_y(x: Any, y=None)

asid.automl_imbalanced.ilc module

class asid.automl_imbalanced.ilc.ImbalancedLearningClassifier(split_num=5, hyperopt_time=0, eval_metric='f1_macro')

Bases: object

ImbalancedLearningClassifier searches for the optimal classifier among two groups of candidates: combinations of balancing procedures from the imbalanced-learn library (tuned with Hyperopt) with state-of-the-art ensemble classifiers, and the tailored AutoBalanceBoost classifier.

Parameters
  • split_num (int, default=5) – The number of splitting iterations used to obtain the out-of-fold score. If the number is a multiple of 5, StratifiedKFold with 5 splits is repeated with the required number of seeds; otherwise StratifiedShuffleSplit with split_num splits is used.

  • hyperopt_time (int, default=0) – The runtime setting (in seconds) for Hyperopt optimization. Hyperopt is used to find the optimal hyper-parameters for balancing procedures.

  • eval_metric ({"accuracy", "roc_auc", "log_loss", "f1_macro", "f1_micro", "f1_weighted"}, default="f1_macro") – Metric that is used to evaluate the model performance and to choose the best option.
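The split_num rule can be illustrated with a short sketch. The helper name describe_cv_scheme is hypothetical and is not part of asid; the sketch assumes that "a 5-fold" means "divisible by 5" and that the number of seeded repeats equals split_num // 5. It only mirrors the documented rule and does not reproduce the library code.

```python
from typing import Tuple


def describe_cv_scheme(split_num: int) -> Tuple[str, int, int]:
    """Return (splitter name, splits per run, number of seeded repeats).

    Hypothetical helper that mirrors the documented split_num rule.
    """
    if split_num <= 0:
        raise ValueError("split_num must be a positive integer")
    if split_num % 5 == 0:
        # Assumed reading: 5-fold CV repeated with split_num // 5 seeds.
        return ("StratifiedKFold", 5, split_num // 5)
    # Otherwise one shuffle-split run provides all split_num iterations.
    return ("StratifiedShuffleSplit", split_num, 1)


for value in (5, 10, 7):
    print(value, describe_cv_scheme(value))
```

Under these assumptions, split_num=10 corresponds to two seeded runs of 5-fold cross-validation, while split_num=7 corresponds to seven shuffle splits.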

classifer_

Optimal fitted classifier.

Type

instance

classifer_label_

Optimal classifier label.

Type

str

score_

Averaged out-of-fold value of eval_metric for the optimal classifier.

Type

float

scaler_

Fitted scaler that is applied prior to classifier estimation.

Type

instance

encoder_

Fitted label encoder.

Type

instance

classes_

Class labels.

Type

array-like

evaluated_models_scores_

Score series for the range of estimated classifiers.

Type

dict

evaluated_models_time_

Time data for the range of estimated classifiers.

Type

dict

conf_int_

95% confidence interval for the out-of-fold value of eval_metric for the optimal classifier.

Type

tuple
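The reference does not state how conf_int_ is computed. As an illustration only, the sketch below builds a 95% interval for the mean of out-of-fold scores with a normal approximation; this is a swapped-in textbook method, the function name normal_conf_int is hypothetical, and the scores are made-up demonstration values.

```python
from statistics import NormalDist, mean, stdev
from typing import Sequence, Tuple


def normal_conf_int(scores: Sequence[float], level: float = 0.95) -> Tuple[float, float]:
    """Normal-approximation confidence interval for the mean score.

    Assumption: the library may use a different method (e.g. bootstrap).
    """
    if len(scores) < 2:
        raise ValueError("at least two scores are needed")
    center = mean(scores)
    # Standard error of the mean from the sample standard deviation.
    std_err = stdev(scores) / len(scores) ** 0.5
    # Two-sided critical value, about 1.96 for level=0.95.
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return (center - z * std_err, center + z * std_err)


demo_scores = [0.80, 0.82, 0.78, 0.81, 0.79]  # made-up out-of-fold scores
low, high = normal_conf_int(demo_scores)
print(round(low, 4), round(high, 4))
```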

fit(x: ndarray, y: ndarray)

Fits the ImbalancedLearningClassifier model.

Parameters
  • x (array-like of shape (n_samples, n_features)) – Training sample.

  • y (array-like) – The target values.

Returns

self – Fitted estimator.

Return type

ImbalancedLearningClassifier instance

leaderboard() → dict

Calculates the leaderboard statistics.

Returns

ls – The leaderboard statistics: lists sorted by each of the following indicators: “Mean score”, “Mean rank”, “Share of experiments with the first place, %”, “Average difference with the leader, %”.

Return type

dict
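To make the four indicators concrete, here is a minimal sketch that derives them from per-split scores. It is not the library's implementation: the function name sketch_leaderboard, the input layout (classifier label mapped to a list of scores, one per split), the tie handling, and the definition of the difference with the leader as a percentage of the leader's score are all assumptions. It also assumes positive scores where higher is better, which does not hold for log_loss.

```python
from statistics import mean
from typing import Dict, List, Tuple

Ranking = List[Tuple[str, float]]


def sketch_leaderboard(scores: Dict[str, List[float]]) -> Dict[str, Ranking]:
    """Build sorted lists for four leaderboard indicators (illustrative only)."""
    labels = list(scores)
    n_splits = len(scores[labels[0]])
    ranks: Dict[str, List[int]] = {label: [] for label in labels}
    gaps: Dict[str, List[float]] = {label: [] for label in labels}
    for i in range(n_splits):
        split_scores = [scores[label][i] for label in labels]
        leader = max(split_scores)
        for label, value in zip(labels, split_scores):
            # Rank 1 is the best; tied classifiers share the better rank.
            ranks[label].append(1 + sum(other > value for other in split_scores))
            # Relative gap to the best score in this split, in percent.
            gaps[label].append(100.0 * (leader - value) / leader)

    def ordered(stat: Dict[str, float], descending: bool) -> Ranking:
        return sorted(stat.items(), key=lambda item: item[1], reverse=descending)

    return {
        "Mean score": ordered({k: mean(v) for k, v in scores.items()}, True),
        "Mean rank": ordered({k: mean(v) for k, v in ranks.items()}, False),
        "Share of experiments with the first place, %": ordered(
            {k: 100.0 * v.count(1) / n_splits for k, v in ranks.items()}, True
        ),
        "Average difference with the leader, %": ordered(
            {k: mean(v) for k, v in gaps.items()}, False
        ),
    }


# Made-up scores for two hypothetical classifier labels over three splits.
demo = {"abb": [0.8, 0.9, 0.8], "rf": [0.6, 0.95, 0.7]}
for indicator, ranking in sketch_leaderboard(demo).items():
    print(indicator, ranking)
```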

predict(x: ndarray) → ndarray

Predicts class labels.

Parameters

x (array-like of shape (n_samples, n_features)) – Test sample.

Returns

pred – The predicted class.

Return type

array-like

predict_proba(x) → ndarray

Predicts class probabilities.

Parameters

x (array-like of shape (n_samples, n_features)) – Test sample.

Returns

pred_proba – The predicted class probabilities.

Return type

array-like of shape (n_samples, n_classes)

asid.automl_imbalanced.tools_abb module

asid.automl_imbalanced.tools_abb.boosting_of_bagging_procedure(x_train: ndarray, y_train: ndarray, num_iter: int, num_mod: int) → Tuple[list, dict]

Fits an AutoBalanceBoost model.

Parameters
  • x_train (array-like of shape (n_samples, n_features)) – Training sample.

  • y_train (array-like) – The target values.

  • num_iter (int) – The number of boosting iterations.

  • num_mod (int) – The number of estimators in the base ensemble.

Returns

  • model_list (list) – Fitted base estimators in AutoBalanceBoost.

  • boosting_params (dict) – CV procedure data.

asid.automl_imbalanced.tools_abb.calc_fscore(x: ndarray, y: ndarray, model_list: list, classes_sorted_train: ndarray) → Tuple[float, ndarray]

Calculates the CV test score.

Parameters
  • x (array-like of shape (n_samples, n_features)) – Training sample.

  • y (array-like) – The target values.

  • model_list (list) – Fitted base estimators in AutoBalanceBoost.

  • classes_sorted_train (array-like) – Class labels.

Returns

  • fscore_val (float) – CV test score.

  • fscore_val_val (array-like) – CV test score for each class separately.

asid.automl_imbalanced.tools_abb.calc_share(series_a: ndarray, series_b: ndarray, sample_gen1: object, sample_gen2: object) → float

Calculates performance shares for different bagging share values.

Parameters
  • series_a (array-like) – Scores for bagging share value for a range of splitting iterations.

  • series_b (array-like) – Scores for bagging share value with the highest mean score for a range of splitting iterations.

  • sample_gen1 (instance) – Random sample generator.

  • sample_gen2 (instance) – Random sample generator.

Returns

share – Performance share for bagging share value.

Return type

float

asid.automl_imbalanced.tools_abb.choose_feat(x: ndarray, n: int, feat_gen: object, feat_imp: ndarray) → ndarray

Samples the zeroed features.

Parameters
  • x (array-like of shape (n_samples, n_features)) – Training sample.

  • n (int) – The number of features that are not zeroed.

  • feat_gen (instance) – Random sample generator.

  • feat_imp (array-like) – Normalized feature importances.

Returns

x – Training sample with zeroed features.

Return type

array-like of shape (n_samples, n_features)
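The idea of keeping n features and zeroing the rest can be sketched as follows. How choose_feat actually uses feat_imp is not documented here; the sketch assumes importances act as sampling weights for the features that are kept. The function name zero_unselected_features is hypothetical, and plain lists with random.Random stand in for the ndarray and generator arguments.

```python
import random
from typing import List, Sequence


def zero_unselected_features(
    rows: Sequence[Sequence[float]],
    n_keep: int,
    rng: random.Random,
    feat_imp: Sequence[float],
) -> List[List[float]]:
    """Keep n_keep importance-weighted features and zero the others (sketch)."""
    n_features = len(feat_imp)
    if not 0 <= n_keep <= n_features:
        raise ValueError("n_keep must lie between 0 and the number of features")
    candidates = list(range(n_features))
    weights = [float(w) for w in feat_imp]
    kept = set()
    for _ in range(n_keep):
        if sum(weights) > 0:
            # Weighted draw; the chosen feature is removed afterwards,
            # so sampling is without replacement.
            pick = rng.choices(range(len(candidates)), weights=weights)[0]
        else:
            # All remaining weights are zero: fall back to a uniform draw.
            pick = rng.randrange(len(candidates))
        kept.add(candidates.pop(pick))
        weights.pop(pick)
    # The same columns are zeroed in every row.
    return [[v if j in kept else 0.0 for j, v in enumerate(row)] for row in rows]


sample = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
print(zero_unselected_features(sample, 2, random.Random(0), [0.4, 0.3, 0.2, 0.1]))
```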

asid.automl_imbalanced.tools_abb.cv_balance_procedure(x: ndarray, y: ndarray, split_coef: float, classes_: ndarray) → dict

Chooses the optimal balancing strategy.

Parameters
  • x (array-like of shape (n_samples, n_features)) – Training sample.

  • y (array-like) – The target values.

  • split_coef (float) – Train sample share for base learner estimation.

  • classes_ (array-like) – Class labels.

Returns

bagging_ensemble_param – CV procedure data.

Return type

dict

asid.automl_imbalanced.tools_abb.cv_split_procedure(x: ndarray, y: ndarray, bagging_ensemble_param: dict) → dict

Chooses an optimal list of bagging shares.

Parameters
  • x (array-like of shape (n_samples, n_features)) – Training sample.

  • y (array-like) – The target values.

  • bagging_ensemble_param (dict) – CV procedure data.

Returns

bagging_ensemble_param – CV procedure data.

Return type

dict

asid.automl_imbalanced.tools_abb.first_ensemble_procedure(x: ndarray, y: ndarray, ts: list, num_mod: int, balanced: Union[bool, dict], num_feat: int, feat_gen: object, res_feat_imp: ndarray, classes_sorted_train: ndarray, ts_gen: object) → Tuple[list, list, ndarray]

Fits bagging at the first iteration.

Parameters
  • x (array-like of shape (n_samples, n_features)) – Training sample.

  • y (array-like) – The target values.

  • ts (list) – A range of train sample shares for base learner estimation.

  • num_mod (int) – The number of estimators in the base ensemble.

  • balanced (bool or dict) – Balancing strategy parameter.

  • num_feat (int) – The number of features that are not zeroed.

  • feat_gen (instance) – Random sample generator.

  • res_feat_imp (array-like) – Normalized feature importances.

  • classes_sorted_train (array-like) – The sorted unique class values.

  • ts_gen (instance) – Random sample generator.

Returns

  • pred_proba_list (list) – Class probabilities predicted by each base estimator in AutoBalanceBoost.

  • model_list (list) – Fitted base estimators in AutoBalanceBoost.

  • feat_imp_list_mean (array-like) – Normalized feature importances.

asid.automl_imbalanced.tools_abb.first_ensemble_procedure_with_cv_model(x: ndarray, first_model: list, classes_sorted_train: ndarray) → list

Calculates the prediction probabilities of the CV bagging.

Parameters
  • x (array-like of shape (n_samples, n_features)) – Training sample.

  • first_model (list) – Fitted base estimators in AutoBalanceBoost at the first iteration.

  • classes_sorted_train (array-like) – The sorted unique class values.

Returns

  • res_proba_mean (list) – Class probabilities predicted at the first iteration.

  • model_list (list) – Fitted base estimators in AutoBalanceBoost.

asid.automl_imbalanced.tools_abb.fit_ensemble(x: ndarray, y: ndarray, ts: Union[float, list], iter_lim: int, num_mod: int, balanced: Union[bool, dict], first_model: Optional[list], num_feat: int, feat_imp: ndarray, classes_: ndarray) → Tuple[list, ndarray]

Iteratively fits the resulting ensemble.

Parameters
  • x (array-like of shape (n_samples, n_features)) – Training sample.

  • y (array-like) – The target values.

  • ts (float or list) – A range of train sample shares for base learner estimation.

  • iter_lim (int) – The number of boosting iterations.

  • num_mod (int) – The number of estimators in the base ensemble.

  • balanced (bool or dict) – Balancing strategy parameter.

  • first_model (list or None) – Fitted base estimators in AutoBalanceBoost at the first iteration.

  • num_feat (int) – The number of features that are not zeroed.

  • feat_imp (array-like) – Normalized feature importances.

  • classes_ (array-like) – Class labels.

Returns

  • model_list (list) – Fitted base estimators in AutoBalanceBoost.

  • feat_imp_list_mean (array-like) – Normalized feature importances.

asid.automl_imbalanced.tools_abb.get_best_bc(split_range: ndarray, f_score_list: list, sample_gen1: object, sample_gen2: object) → Tuple[list, list]

Chooses a list of bagging shares with the best performance.

Parameters
  • split_range (array-like) – Bagging share values.

  • f_score_list (list) – CV scores for bagging share values.

  • sample_gen1 (instance) – Random sample generator.

  • sample_gen2 (instance) – Random sample generator.

Returns

  • split_arg (list) – List of optimal bagging share values.

  • ind_bc (list) – Indices of optimal bagging share values.

asid.automl_imbalanced.tools_abb.get_bootstrap_balanced_samples(x: ndarray, y: ndarray, balanced: Union[bool, dict], ts: list, sample_gen: object) → Tuple[ndarray, ndarray]

Performs the balancing procedure at the first iteration.

Parameters
  • x (array-like of shape (n_samples, n_features)) – Training sample.

  • y (array-like) – The target values.

  • ts (list) – A range of train sample shares for base learner estimation.

  • balanced (bool or dict) – Balancing strategy parameter.

  • sample_gen (instance) – Random sample generator.

Returns

  • x_sampled (array-like of shape (n_samples, n_features)) – Generated training sample.

  • y_sampled (array-like) – Generated target values.

asid.automl_imbalanced.tools_abb.get_feat_imp(model_list: list) → ndarray

Returns normalized feature importances.

Parameters

model_list (list) – Fitted base estimators in AutoBalanceBoost.

Returns

feat_imp_norm – Normalized feature importances.

Return type

array-like
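"Normalized" is not defined in this reference. A common convention, assumed here, is to average the per-estimator importances and rescale them to sum to 1. The sketch below takes raw importance vectors instead of fitted estimators; the name normalize_importances and the input layout are hypothetical.

```python
from typing import List, Sequence


def normalize_importances(per_model: Sequence[Sequence[float]]) -> List[float]:
    """Average importance vectors across models and rescale to sum to 1."""
    n_models = len(per_model)
    n_features = len(per_model[0])
    averaged = [
        sum(model[j] for model in per_model) / n_models for j in range(n_features)
    ]
    total = sum(averaged)
    if total == 0:
        # No informative feature: fall back to equal importances.
        return [1.0 / n_features] * n_features
    return [value / total for value in averaged]


# Two hypothetical estimators, three features.
print(normalize_importances([[2.0, 1.0, 1.0], [4.0, 0.0, 0.0]]))
```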

asid.automl_imbalanced.tools_abb.get_newds(pred_proba: ndarray, ts: list, x: ndarray, y: ndarray, num_mod: int, balanced: Union[bool, dict], num_feat: int, feat_gen: object, feat_imp: ndarray, ts_gen: object) → Tuple[list, list]

Samples train datasets for bagging during the boosting phase.

Parameters
  • pred_proba (array-like) – Class probabilities predicted by AutoBalanceBoost for the correct class.

  • ts (list) – A range of train sample shares for base learner estimation.

  • x (array-like of shape (n_samples, n_features)) – Training sample.

  • y (array-like) – The target values.

  • num_mod (int) – The number of estimators in the base ensemble.

  • balanced (bool or dict) – Balancing strategy parameter.

  • num_feat (int) – The number of features that are not zeroed.

  • feat_gen (instance) – Random sample generator.

  • feat_imp (array-like) – Normalized feature importances.

  • ts_gen (instance) – Random sample generator.

Returns

  • train_datasets (list) – Randomly generated train datasets for bagging.

  • class_prop (list) – Class shares for each train dataset.

asid.automl_imbalanced.tools_abb.get_pred(model_list: list, x_test: ndarray) → ndarray

Predicts class labels.

Parameters
  • model_list (list) – Fitted base estimators in AutoBalanceBoost.

  • x_test (array-like of shape (n_samples, n_features)) – Test sample.

Returns

pred_mean_hard – The predicted class.

Return type

array-like

asid.automl_imbalanced.tools_abb.get_pred_proba(model_list: list, x_test: ndarray) → ndarray

Predicts class probabilities.

Parameters
  • model_list (list) – Fitted base estimators in AutoBalanceBoost.

  • x_test (array-like of shape (n_samples, n_features)) – Test sample.

Returns

proba_mean_hard – The predicted class probabilities.

Return type

array-like of shape (n_samples, n_classes)
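How the base estimators' outputs are combined is not specified here. One plausible reading, assumed in the sketch below, is that class probabilities are averaged across base estimators and the label with the highest mean probability is returned. Both function names and the nested-list input layout (model, sample, class) are hypothetical.

```python
from typing import List, Sequence


def average_probabilities(
    per_model_proba: Sequence[Sequence[Sequence[float]]],
) -> List[List[float]]:
    """Average class probabilities over models; result is (n_samples, n_classes)."""
    n_models = len(per_model_proba)
    n_samples = len(per_model_proba[0])
    n_classes = len(per_model_proba[0][0])
    return [
        [sum(model[i][k] for model in per_model_proba) / n_models for k in range(n_classes)]
        for i in range(n_samples)
    ]


def labels_from_probabilities(mean_proba: Sequence[Sequence[float]], classes: Sequence) -> list:
    """Pick the class with the highest mean probability for each sample."""
    return [classes[max(range(len(row)), key=row.__getitem__)] for row in mean_proba]


# Two hypothetical base estimators, two samples, two classes.
model_a = [[0.75, 0.25], [0.5, 0.5]]
model_b = [[0.5, 0.5], [0.0, 1.0]]
mean_proba = average_probabilities([model_a, model_b])
print(mean_proba)
print(labels_from_probabilities(mean_proba, ["neg", "pos"]))
```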

asid.automl_imbalanced.tools_abb.num_feat_procedure(x: ndarray, y: ndarray, bagging_ensemble_param: dict) → Tuple[dict, list]

Chooses an optimal number of zeroed features.

Parameters
  • x (array-like of shape (n_samples, n_features)) – Training sample.

  • y (array-like) – The target values.

  • bagging_ensemble_param (dict) – CV procedure data.

Returns

  • bagging_ensemble_param (dict) – CV procedure data.

  • res_model (list) – Fitted base estimators in AutoBalanceBoost.

asid.automl_imbalanced.tools_abb.other_ensemble_procedure(x: ndarray, train_datasets: list, pred_proba_list: list, model_list: list, classes_sorted_train: ndarray) → Tuple[list, list]

Fits bagging during the boosting phase.

Parameters
  • x (array-like of shape (n_samples, n_features)) – Training sample.

  • train_datasets (list) – Randomly generated train datasets for bagging.

  • pred_proba_list (list) – Class probabilities predicted by each base estimator in AutoBalanceBoost.

  • model_list (list) – Fitted base estimators in AutoBalanceBoost.

  • classes_sorted_train (array-like) – The sorted unique class values.

Returns

  • pred_proba_list (list) – Class probabilities predicted by each base estimator in AutoBalanceBoost.

  • model_list (list) – Fitted base estimators in AutoBalanceBoost.

asid.automl_imbalanced.tools_ilc module

asid.automl_imbalanced.tools_ilc.abb_exp(x: ndarray, y: ndarray, skf: object, metric: str) → Tuple[list, list]

Evaluates AutoBalanceBoost performance on a partial range of splits.

Parameters
  • x (array-like of shape (n_samples, n_features)) – Training sample.

  • y (array-like) – The target values.

  • skf (instance) – Splitting strategy instance.

  • metric (str) – Metric that is used to evaluate the model performance.

Returns

  • score_list (list) – Model performance on a range of splits.

  • time_list (list) – Model fitting and prediction time on a range of splits.

asid.automl_imbalanced.tools_ilc.balance_exp(x: ndarray, y: ndarray, skf: object, bal_alg: str, alg: str, hyperopt_time: int, metric: str) → Tuple[list, list]

Evaluates model performance on a range of splits.

Parameters
  • x (array-like of shape (n_samples, n_features)) – Training sample.

  • y (array-like) – The target values.

  • skf (instance) – Splitting strategy instance.

  • bal_alg (str) – Sampling procedure label.

  • alg (str) – Ensemble classifier label.

  • hyperopt_time (int) – The runtime setting (in seconds) for Hyperopt optimization.

  • metric (str) – Metric that is used to evaluate the model performance.

Returns

  • score_list (list) – Model performance on a range of splits.

  • time_list (list) – Model fitting and prediction time on a range of splits.

asid.automl_imbalanced.tools_ilc.calc_leaderboard(self) → dict

Calculates the leaderboard statistics.

Returns

ls – The leaderboard statistics: lists sorted by each of the following indicators: “Mean score”, “Mean rank”, “Share of experiments with the first place, %”, “Average difference with the leader, %”.

Return type

dict

asid.automl_imbalanced.tools_ilc.calc_metric(y_test: ndarray, pred: ndarray, metric: str) → float

Calculates the evaluation metric.

Parameters
  • y_test (array-like) – Correct target values.

  • pred (array-like) – Predicted target values.

  • metric (str) – Metric that is used to evaluate the model performance.

Returns

score – Metric value.

Return type

float
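For readers unfamiliar with the default metric, the sketch below shows what a macro-averaged F1 score computes: one F1 value per class, averaged with equal weight so that minority classes count as much as the majority class. It is a plain-Python illustration with a hypothetical function name, not the code behind calc_metric.

```python
from typing import Sequence


def f1_macro(y_true: Sequence, y_pred: Sequence) -> float:
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


# Made-up imbalanced example: four samples of class 0, two of class 1.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
print(f1_macro(y_true, y_pred))
```

In this made-up example the per-class F1 values are 0.75 for class 0 and 0.5 for class 1, so the macro average is 0.625, lower than the plain accuracy of about 0.667.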

asid.automl_imbalanced.tools_ilc.calc_pipeline_acc(params: dict, x: ndarray, y: ndarray, bal_alg: str, alg: str, metric: str) → float

Evaluates the pipeline.

Parameters
  • params (dict) – Parameters generated by Hyperopt.

  • x (array-like of shape (n_samples, n_features)) – Training sample.

  • y (array-like) – The target values.

  • bal_alg (str) – Sampling procedure label.

  • alg (str) – Ensemble classifier label.

  • metric (str) – Metric that is used to evaluate the model performance.

Returns

score – Evaluation of the model performance.

Return type

float

asid.automl_imbalanced.tools_ilc.choose_and_fit_ilc(self, x: ndarray, y: ndarray) → Tuple[object, str, float, object, dict, dict, tuple]

Chooses the optimal classifier and fits the resulting estimator.

Parameters
  • x (array-like of shape (n_samples, n_features)) – Training sample.

  • y (array-like) – The target values.

Returns

  • classifer (instance) – Optimal fitted classifier.

  • option_label (str) – Optimal classifier label.

  • score (float) – Averaged out-of-fold value of eval_metric for the optimal classifier.

  • scaler (instance) – Fitted scaler that is applied prior to classifier estimation.

  • score_dict (dict) – Score series for the range of estimated classifiers.

  • time_dict (dict) – Time data for the range of estimated classifiers.

  • conf_int (tuple) – 95% confidence interval for the out-of-fold value of eval_metric for the optimal classifier.

asid.automl_imbalanced.tools_ilc.fit_alg(cv_type: str, x: ndarray, y: ndarray, bal_alg: Optional[str], alg: str, hyperopt_time: int, split_num: int, metric: str) → Tuple[list, list]

Evaluates model performance on a full range of splits.

Parameters
  • cv_type (str) – The chosen type of splitting iterations.

  • x (array-like of shape (n_samples, n_features)) – Training sample.

  • y (array-like) – The target values.

  • bal_alg (str or None) – Sampling procedure label.

  • alg (str) – Ensemble classifier label.

  • hyperopt_time (int) – The runtime setting (in seconds) for Hyperopt optimization.

  • split_num (int) – The number of splitting iterations.

  • metric (str) – Metric that is used to evaluate the model performance.

Returns

  • score_list (list) – Model performance on a range of splits.

  • time_list (list) – Model fitting and prediction time on a range of splits.

asid.automl_imbalanced.tools_ilc.fit_res_model(option_label: str, x: ndarray, y: ndarray, hyp_time: int, metric: str) → Tuple[object, object]

Fits the resulting estimator.

Parameters
  • option_label (str) – Classifier label.

  • x (array-like of shape (n_samples, n_features)) – Training sample.

  • y (array-like) – The target values.

  • hyp_time (int) – The runtime setting (in seconds) for Hyperopt optimization.

  • metric (str) – Metric that is used to evaluate the model performance.

Returns

  • model (instance) – Fitted estimator.

  • scaler (instance) – Fitted scaler.

asid.automl_imbalanced.tools_ilc.get_balance_params(x: ndarray, y: ndarray, bal_alg: str, alg: str, hyp_time: int, metric: str) → dict

Searches for the optimal hyper-parameters of the balancing procedure using Hyperopt.

Parameters
  • x (array-like of shape (n_samples, n_features)) – Training sample.

  • y (array-like) – The target values.

  • bal_alg (str) – Sampling procedure label.

  • alg (str) – Ensemble classifier label.

  • hyp_time (int) – The runtime setting (in seconds) for Hyperopt optimization.

  • metric (str) – Metric that is used to evaluate the model performance.

Returns

best – Optimal hyper-parameters of the balancing procedure chosen by Hyperopt.

Return type

dict

asid.automl_imbalanced.tools_ilc.get_cv_type(split_num: int) → str

Defines the type of splitting iterations.

Parameters

split_num (int) – The number of splitting iterations.

Returns

cv_type – The chosen type of splitting iterations.

Return type

str

asid.automl_imbalanced.tools_ilc.get_sampl_strat_for_case(ss: float, count_class: ndarray, balance_method: str) → Union[float, dict]

Calculates the sampling strategy parameter.

Parameters
  • ss (float) – Sampling strategy parameter generated by Hyperopt.

  • count_class (array-like) – The sorted unique class values with their counts.

  • balance_method (str) – Balancing procedure label.

Returns

ss_corr – The adjusted sampling strategy parameter.

Return type

float or dict

asid.automl_imbalanced.tools_ilc.scale_data(x_train: ndarray) → Tuple[ndarray, object]

Fits a scaler and applies it to the training sample.

Parameters

x_train (array-like of shape (n_samples, n_features)) – Training sample.

Returns

  • x_train_scaled (array-like of shape (n_samples, n_features)) – Scaled sample.

  • scaler (instance) – Fitted scaler.

Module contents