asid.automl_imbalanced package
Submodules
asid.automl_imbalanced.abb module
- class asid.automl_imbalanced.abb.AutoBalanceBoost(num_iter=40, num_est=16)
Bases: object
The AutoBalanceBoost classifier is a tailored imbalanced learning framework with a built-in hyper-parameter tuning procedure.
- Parameters
num_iter (int, default=40) – The number of boosting iterations.
num_est (int, default=16) – The number of estimators in the base ensemble.
- ensemble_
The list of fitted ensembles that constitute the AutoBalanceBoost model.
- Type
list
- param_
The optimal values of AutoBalanceBoost hyper-parameters.
- Type
dict
- feature_importances() -> ndarray
Calculates normalized feature importances.
- Returns
feat_imp – The normalized feature importances.
- Return type
array-like
- fit(x: ndarray, y: ndarray)
Fits the AutoBalanceBoost model.
- Parameters
x (array-like of shape (n_samples, n_features)) – Training sample.
y (array-like) – The target values.
- Returns
self – Fitted estimator.
- Return type
AutoBalanceBoost classifier
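A minimal usage sketch for the class documented above. It uses only the constructor arguments, fit, feature_importances and param_ listed in this section; the synthetic data and the helper name fit_abb_demo are illustrative assumptions, and NumPy and asid must be installed for the function to run.

```python
def fit_abb_demo(seed=0):
    """Fit AutoBalanceBoost on a small synthetic imbalanced sample (illustrative)."""
    # Third-party imports live inside the function so that defining the helper
    # does not require asid or NumPy to be importable.
    import numpy as np
    from asid.automl_imbalanced.abb import AutoBalanceBoost

    rng = np.random.default_rng(seed)
    x = rng.normal(size=(400, 6))
    # Roughly 10% positives, driven mostly by the first feature (made-up data).
    y = (x[:, 0] + 0.3 * rng.normal(size=400) > 1.3).astype(int)

    clf = AutoBalanceBoost(num_iter=40, num_est=16)  # documented defaults
    clf.fit(x, y)
    # Normalized importances and the tuned hyper-parameters, as documented.
    return clf.feature_importances(), clf.param_
```

Calling fit_abb_demo() returns the normalized feature importances together with the param_ dictionary chosen by the built-in tuning procedure.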
asid.automl_imbalanced.check_tools module
asid.automl_imbalanced.ilc module
- class asid.automl_imbalanced.ilc.ImbalancedLearningClassifier(split_num=5, hyperopt_time=0, eval_metric='f1_macro')
Bases: object
ImbalancedLearningClassifier finds an optimal classifier among combinations of balancing procedures from the imbalanced-learn library (with Hyperopt optimization) and state-of-the-art ensemble classifiers, as well as the tailored classifier AutoBalanceBoost.
- Parameters
split_num (int, default=5) – The number of splitting iterations for obtaining an out-of-fold score. If the number is a multiple of 5, StratifiedKFold with 5 splits is repeated with the required number of seeds; otherwise StratifiedShuffleSplit with split_num splits is used.
hyperopt_time (int, default=0) – The runtime setting (in seconds) for Hyperopt optimization. Hyperopt is used to find the optimal hyper-parameters for balancing procedures.
eval_metric ({"accuracy", "roc_auc", "log_loss", "f1_macro", "f1_micro", "f1_weighted"}, default="f1_macro") – Metric that is used to evaluate the model performance and to choose the best option.
- classifer_
Optimal fitted classifier.
- Type
instance
- classifer_label_
Optimal classifier label.
- Type
str
- score_
Averaged out-of-fold value of eval_metric for the optimal classifier.
- Type
float
- scaler_
Fitted scaler that is applied prior to classifier estimation.
- Type
instance
- encoder_
Fitted label encoder.
- Type
instance
- classes_
Class labels.
- Type
array-like
- evaluated_models_scores_
Score series for the range of estimated classifiers.
- Type
dict
- evaluated_models_time_
Time data for the range of estimated classifiers.
- Type
dict
- conf_int_
95% confidence interval for the out-of-fold value of eval_metric for the optimal classifier.
- Type
tuple
- fit(x: ndarray, y: ndarray)
Fits the ImbalancedLearningClassifier model.
- Parameters
x (array-like of shape (n_samples, n_features)) – Training sample.
y (array-like) – The target values.
- Returns
self – Fitted estimator.
- Return type
ImbalancedLearningClassifier instance
- leaderboard() -> dict
Calculates the leaderboard statistics.
- Returns
ls – The leaderboard statistics, which include lists sorted by the following indicators: “Mean score”, “Mean rank”, “Share of experiments with the first place, %”, “Average difference with the leader, %”.
- Return type
dict
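A minimal usage sketch for the class documented above. It relies only on the constructor arguments, fit, leaderboard and the fitted attributes listed in this section (note the documented spelling classifer_label_); the synthetic data and the helper name fit_ilc_demo are illustrative assumptions, and fitting may take a while because several classifier options are evaluated.

```python
def fit_ilc_demo(seed=0):
    """Run ImbalancedLearningClassifier on synthetic imbalanced data (illustrative)."""
    # Third-party imports live inside the function so that defining the helper
    # does not require asid or NumPy to be importable.
    import numpy as np
    from asid.automl_imbalanced.ilc import ImbalancedLearningClassifier

    rng = np.random.default_rng(seed)
    x = rng.normal(size=(400, 6))
    # Roughly 10% positives (made-up data).
    y = (x[:, 0] + 0.3 * rng.normal(size=400) > 1.3).astype(int)

    clf = ImbalancedLearningClassifier(
        split_num=5, hyperopt_time=0, eval_metric="f1_macro"  # documented defaults
    )
    clf.fit(x, y)
    return {
        "label": clf.classifer_label_,  # attribute name as spelled in this reference
        "score": clf.score_,
        "conf_int": clf.conf_int_,
        "leaderboard": clf.leaderboard(),
    }
```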
asid.automl_imbalanced.tools_abb module
- asid.automl_imbalanced.tools_abb.boosting_of_bagging_procedure(x_train: ndarray, y_train: ndarray, num_iter: int, num_mod: int) -> Tuple[list, dict]
Fits an AutoBalanceBoost model.
- Parameters
x_train (array-like of shape (n_samples, n_features)) – Training sample.
y_train (array-like) – The target values.
num_iter (int) – The number of boosting iterations.
num_mod (int) – The number of estimators in the base ensemble.
- Returns
model_list (list) – Fitted base estimators in AutoBalanceBoost.
boosting_params (dict) – CV procedure data.
- asid.automl_imbalanced.tools_abb.calc_fscore(x: ndarray, y: ndarray, model_list: list, classes_sorted_train: ndarray) -> Tuple[float, ndarray]
Calculates the CV test score.
- Parameters
x (array-like of shape (n_samples, n_features)) – Training sample.
y (array-like) – The target values.
model_list (list) – Fitted base estimators in AutoBalanceBoost.
classes_sorted_train (array-like) – Class labels.
- Returns
fscore_val (float) – CV test score.
fscore_val_val (array-like) – CV test score for each class separately.
Calculates performance shares for different bagging share values.
- Parameters
series_a (array-like) – Scores for bagging share value for a range of splitting iterations.
series_b (array-like) – Scores for bagging share value with the highest mean score for a range of splitting iterations.
sample_gen1 (instance) – Random sample generator.
sample_gen2 (instance) – Random sample generator.
- Returns
share – Performance share for bagging share value.
- Return type
float
- asid.automl_imbalanced.tools_abb.choose_feat(x: ndarray, n: int, feat_gen: object, feat_imp: ndarray) -> ndarray
Samples the zeroed features.
- Parameters
x (array-like of shape (n_samples, n_features)) – Training sample.
n (int) – The number of features that are not zeroed.
feat_gen (instance) – Random sample generator.
feat_imp (array-like) – Normalized feature importances.
- Returns
x – Training sample with zeroed features.
- Return type
array-like of shape (n_samples, n_features)
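The idea of zeroing all but n features can be sketched as follows. This is an illustration only: the assumption that features are kept with probability proportional to their importance, the use of plain Python lists instead of ndarrays, and the name zero_unselected_features are not taken from the library.

```python
import random


def zero_unselected_features(x, n, feat_gen, feat_imp):
    """Keep n features drawn without replacement, weighted by importance
    (an assumed rule), and set every other column to zero."""
    remaining = list(range(len(feat_imp)))
    # A tiny offset keeps the draw valid even if all importances are zero.
    weights = [feat_imp[j] + 1e-12 for j in remaining]
    keep = set()
    for _ in range(n):
        pos = feat_gen.choices(range(len(remaining)), weights=weights, k=1)[0]
        keep.add(remaining.pop(pos))
        weights.pop(pos)
    # Build new rows so the input sample is left untouched.
    return [[v if j in keep else 0.0 for j, v in enumerate(row)] for row in x]


sample = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
masked = zero_unselected_features(sample, 2, random.Random(0), [0.5, 0.3, 0.2])
```

The same column is zeroed in every row, so the returned sample keeps the shape (n_samples, n_features).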
- asid.automl_imbalanced.tools_abb.cv_balance_procedure(x: ndarray, y: ndarray, split_coef: float, classes_: ndarray) -> dict
Chooses the optimal balancing strategy.
- Parameters
x (array-like of shape (n_samples, n_features)) – Training sample.
y (array-like) – The target values.
split_coef (float) – Train sample share for base learner estimation.
classes_ (array-like) – Class labels.
- Returns
bagging_ensemble_param – CV procedure data.
- Return type
dict
- asid.automl_imbalanced.tools_abb.cv_split_procedure(x: ndarray, y: ndarray, bagging_ensemble_param: dict) -> dict
Chooses an optimal list of bagging shares.
- Parameters
x (array-like of shape (n_samples, n_features)) – Training sample.
y (array-like) – The target values.
bagging_ensemble_param (dict) – CV procedure data.
- Returns
bagging_ensemble_param – CV procedure data.
- Return type
dict
- asid.automl_imbalanced.tools_abb.first_ensemble_procedure(x: ndarray, y: ndarray, ts: list, num_mod: int, balanced: Union[bool, dict], num_feat: int, feat_gen: object, res_feat_imp: ndarray, classes_sorted_train: ndarray, ts_gen: object) -> Tuple[list, list, ndarray]
Fits bagging at the first iteration.
- Parameters
x (array-like of shape (n_samples, n_features)) – Training sample.
y (array-like) – The target values.
ts (list) – A range of train sample shares for base learner estimation.
num_mod (int) – The number of estimators in the base ensemble.
balanced (bool or dict) – Balancing strategy parameter.
num_feat (int) – The number of features that are not zeroed.
feat_gen (instance) – Random sample generator.
res_feat_imp (array-like) – Normalized feature importances.
classes_sorted_train (array-like) – The sorted unique class values.
ts_gen (instance) – Random sample generator.
- Returns
pred_proba_list (list) – Class probabilities predicted by each base estimator in AutoBalanceBoost.
model_list (list) – Fitted base estimators in AutoBalanceBoost.
feat_imp_list_mean (array-like) – Normalized feature importances.
- asid.automl_imbalanced.tools_abb.first_ensemble_procedure_with_cv_model(x: ndarray, first_model: list, classes_sorted_train: ndarray) -> list
Calculates the prediction probabilities of the CV bagging.
- Parameters
x (array-like of shape (n_samples, n_features)) – Training sample.
first_model (list) – Fitted base estimators in AutoBalanceBoost at the first iteration.
classes_sorted_train (array-like) – The sorted unique class values.
- Returns
res_proba_mean (list) – Class probabilities predicted at the first iteration.
model_list (list) – Fitted base estimators in AutoBalanceBoost.
- asid.automl_imbalanced.tools_abb.fit_ensemble(x: ndarray, y: ndarray, ts: Union[float, list], iter_lim: int, num_mod: int, balanced: Union[bool, dict], first_model: Optional[list], num_feat: int, feat_imp: ndarray, classes_: ndarray) -> Tuple[list, ndarray]
Iteratively fits the resulting ensemble.
- Parameters
x (array-like of shape (n_samples, n_features)) – Training sample.
y (array-like) – The target values.
ts (float or list) – A range of train sample shares for base learner estimation.
iter_lim (int) – The number of boosting iterations.
num_mod (int) – The number of estimators in the base ensemble.
balanced (bool or dict) – Balancing strategy parameter.
first_model (list or None) – Fitted base estimators in AutoBalanceBoost at the first iteration.
num_feat (int) – The number of features that are not zeroed.
feat_imp (array-like) – Normalized feature importances.
classes_ (array-like) – Class labels.
- Returns
model_list (list) – Fitted base estimators in AutoBalanceBoost.
feat_imp_list_mean (array-like) – Normalized feature importances.
- asid.automl_imbalanced.tools_abb.get_best_bc(split_range: ndarray, f_score_list: list, sample_gen1: object, sample_gen2: object) -> Tuple[list, list]
Chooses a list of bagging shares with the best performance.
- Parameters
split_range (array-like) – Bagging share values.
f_score_list (list) – CV scores for bagging share values.
sample_gen1 (instance) – Random sample generator.
sample_gen2 (instance) – Random sample generator.
- Returns
split_arg (list) – List of optimal bagging share values.
ind_bc (list) – Indices of optimal bagging share values.
- asid.automl_imbalanced.tools_abb.get_bootstrap_balanced_samples(x: ndarray, y: ndarray, balanced: Union[bool, dict], ts: list, sample_gen: object) -> Tuple[ndarray, ndarray]
Balancing procedure at the first iteration.
- Parameters
x (array-like of shape (n_samples, n_features)) – Training sample.
y (array-like) – The target values.
ts (list) – A range of train sample shares for base learner estimation.
balanced (bool or dict) – Balancing strategy parameter.
sample_gen (instance) – Random sample generator.
- Returns
x_sampled (array-like of shape (n_samples, n_features)) – Generated training sample.
y_sampled (array-like) – Generated target values.
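A balanced bootstrap can be sketched as drawing the same number of rows, with replacement, from every class. The sketch below is an assumption-laden illustration, not the library routine: it takes a single sample share instead of the ts list, treats balancing as equal class counts, and uses the hypothetical name balanced_bootstrap.

```python
import random
from collections import defaultdict


def balanced_bootstrap(x, y, share, sample_gen):
    """Draw share * n_samples rows in total, split evenly across classes,
    sampling with replacement inside each class (assumed balancing rule)."""
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    per_class = max(1, round(share * len(y) / len(by_class)))
    picked = []
    for label in sorted(by_class):
        picked.extend(sample_gen.choices(by_class[label], k=per_class))
    sample_gen.shuffle(picked)
    return [x[i] for i in picked], [y[i] for i in picked]


rows = [[i] for i in range(10)]
labels = [0] * 8 + [1] * 2
x_sampled, y_sampled = balanced_bootstrap(rows, labels, 0.8, random.Random(1))
```

With an 8:2 class ratio and a share of 0.8, each class contributes four rows, so the minority rows are repeated while the majority class is subsampled.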
- asid.automl_imbalanced.tools_abb.get_feat_imp(model_list: list) -> ndarray
Returns normalized feature importances.
- Parameters
model_list (list) – Fitted base estimators in AutoBalanceBoost.
- Returns
feat_imp_norm – Normalized feature importances.
- Return type
array-like
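One common way to obtain normalized importances from an ensemble is to average the per-estimator importance vectors and rescale them to sum to one. The sketch below assumes that convention and takes the importance vectors directly (how they are read from model_list is not shown in this reference); the name normalized_importances is hypothetical.

```python
def normalized_importances(importance_rows):
    """Average per-estimator importance vectors and rescale to sum to 1."""
    n_feat = len(importance_rows[0])
    mean = [
        sum(row[j] for row in importance_rows) / len(importance_rows)
        for j in range(n_feat)
    ]
    total = sum(mean)
    if total == 0:
        # No estimator used any feature: fall back to a uniform vector.
        return [1.0 / n_feat] * n_feat
    return [m / total for m in mean]


feat_imp_norm = normalized_importances([[2, 1, 1], [4, 3, 1]])
```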
- asid.automl_imbalanced.tools_abb.get_newds(pred_proba: ndarray, ts: list, x: ndarray, y: ndarray, num_mod: int, balanced: Union[bool, dict], num_feat: int, feat_gen: object, feat_imp: ndarray, ts_gen: object) -> Tuple[list, list]
Samples train datasets for bagging during the boosting phase.
- Parameters
pred_proba (array-like) – Class probabilities predicted by AutoBalanceBoost for the correct class.
ts (list) – A range of train sample shares for base learner estimation.
x (array-like of shape (n_samples, n_features)) – Training sample.
y (array-like) – The target values.
num_mod (int) – The number of estimators in the base ensemble.
balanced (bool or dict) – Balancing strategy parameter.
num_feat (int) – The number of features that are not zeroed.
feat_gen (instance) – Random sample generator.
feat_imp (array-like) – Normalized feature importances.
ts_gen (instance) – Random sample generator.
- Returns
train_datasets (list) – Randomly generated train datasets for bagging.
class_prop (list) – Class shares for each train dataset.
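Because pred_proba holds the probability assigned to the correct class, a boosting-style resampler can favor rows the current ensemble gets wrong. The sketch below assumes sampling weights of 1 - pred_proba, which is an illustrative choice rather than the documented formula; resample_hard_examples and its share argument are hypothetical.

```python
import random


def resample_hard_examples(pred_proba, share, sample_gen, eps=1e-6):
    """Return row indices drawn with replacement, weighted by 1 - p(correct class)."""
    # eps keeps every row drawable even when it is predicted perfectly.
    weights = [1.0 - p + eps for p in pred_proba]
    k = max(1, round(share * len(pred_proba)))
    return sample_gen.choices(range(len(pred_proba)), weights=weights, k=k)


# Rows 0-4 are predicted well, rows 5-9 poorly; the draw concentrates on 5-9.
indices = resample_hard_examples([0.99] * 5 + [0.01] * 5, 1.0, random.Random(0))
```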
- asid.automl_imbalanced.tools_abb.get_pred(model_list: list, x_test: ndarray) -> ndarray
Predicts class labels.
- Parameters
model_list (list) – Fitted base estimators in AutoBalanceBoost.
x_test (array-like of shape (n_samples, n_features)) – Test sample.
- Returns
pred_mean_hard – The predicted class.
- Return type
array-like
- asid.automl_imbalanced.tools_abb.get_pred_proba(model_list: list, x_test: ndarray) -> ndarray
Predicts class probabilities.
- Parameters
model_list (list) – Fitted base estimators in AutoBalanceBoost.
x_test (array-like of shape (n_samples, n_features)) – Test sample.
- Returns
proba_mean_hard – The predicted class probabilities.
- Return type
array-like of shape (n_samples, n_classes)
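Ensemble predictions of this kind are often produced by soft voting: the class probabilities of all base estimators are averaged, and the class with the highest mean probability becomes the label. The sketch below assumes that scheme and starts from already computed probability matrices; average_proba and hard_labels are hypothetical names.

```python
def average_proba(proba_per_model):
    """Average (n_samples, n_classes) probability matrices over the estimators."""
    n_models = len(proba_per_model)
    n_samples = len(proba_per_model[0])
    n_classes = len(proba_per_model[0][0])
    return [
        [sum(m[i][c] for m in proba_per_model) / n_models for c in range(n_classes)]
        for i in range(n_samples)
    ]


def hard_labels(mean_proba, classes):
    """Pick the class with the highest averaged probability for each row."""
    return [classes[max(range(len(row)), key=row.__getitem__)] for row in mean_proba]


proba = [
    [[0.9, 0.1], [0.4, 0.6]],  # estimator 1
    [[0.7, 0.3], [0.2, 0.8]],  # estimator 2
]
mean_proba = average_proba(proba)
labels = hard_labels(mean_proba, ["neg", "pos"])
```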
- asid.automl_imbalanced.tools_abb.num_feat_procedure(x: ndarray, y: ndarray, bagging_ensemble_param: dict) -> Tuple[dict, list]
Chooses an optimal number of zeroed features.
- Parameters
x (array-like of shape (n_samples, n_features)) – Training sample.
y (array-like) – The target values.
bagging_ensemble_param (dict) – CV procedure data.
- Returns
bagging_ensemble_param (dict) – CV procedure data.
res_model (list) – Fitted base estimators in AutoBalanceBoost.
- asid.automl_imbalanced.tools_abb.other_ensemble_procedure(x: ndarray, train_datasets: list, pred_proba_list: list, model_list: list, classes_sorted_train: ndarray) -> Tuple[list, list]
Fits bagging during the boosting phase.
- Parameters
x (array-like of shape (n_samples, n_features)) – Training sample.
train_datasets (list) – Randomly generated train datasets for bagging.
pred_proba_list (list) – Class probabilities predicted by each base estimator in AutoBalanceBoost.
model_list (list) – Fitted base estimators in AutoBalanceBoost.
classes_sorted_train (array-like) – The sorted unique class values.
- Returns
pred_proba_list (list) – Class probabilities predicted by each base estimator in AutoBalanceBoost.
model_list (list) – Fitted base estimators in AutoBalanceBoost.
asid.automl_imbalanced.tools_ilc module
- asid.automl_imbalanced.tools_ilc.abb_exp(x: ndarray, y: ndarray, skf: object, metric: str) -> Tuple[list, list]
Evaluates AutoBalanceBoost performance on a partial range of splits.
- Parameters
x (array-like of shape (n_samples, n_features)) – Training sample.
y (array-like) – The target values.
skf (instance) – Splitting strategy instance.
metric (str) – Metric that is used to evaluate the model performance.
- Returns
score_list (list) – Model performance on a range of splits.
time_list (list) – Model fitting and prediction time on a range of splits.
- asid.automl_imbalanced.tools_ilc.balance_exp(x: ndarray, y: ndarray, skf: object, bal_alg: str, alg: str, hyperopt_time: int, metric: str) -> Tuple[list, list]
Evaluates model performance on a range of splits.
- Parameters
x (array-like of shape (n_samples, n_features)) – Training sample.
y (array-like) – The target values.
skf (instance) – Splitting strategy instance.
bal_alg (str) – Sampling procedure label.
alg (str) – Ensemble classifier label.
hyperopt_time (int) – The runtime setting (in seconds) for Hyperopt optimization.
metric (str) – Metric that is used to evaluate the model performance.
- Returns
score_list (list) – Model performance on a range of splits.
time_list (list) – Model fitting and prediction time on a range of splits.
- asid.automl_imbalanced.tools_ilc.calc_leaderboard(self) -> dict
Calculates the leaderboard statistics.
- Returns
ls – The leaderboard statistics, which include lists sorted by the following indicators: “Mean score”, “Mean rank”, “Share of experiments with the first place, %”, “Average difference with the leader, %”.
- Return type
dict
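The four indicators can be illustrated on a dictionary that maps each classifier label to its per-split scores. The sketch below is an assumption-based reading of the indicator names (higher scores are better, rank 1 is best, the difference with the leader is relative to the best score on each split); it is not the library implementation, and leaderboard_sketch is a hypothetical name.

```python
def leaderboard_sketch(scores):
    """scores maps a classifier label to its per-split scores (higher is better)."""
    labels = list(scores)
    n_splits = len(scores[labels[0]])
    ranks = {label: [] for label in labels}
    wins = {label: 0 for label in labels}
    gaps = {label: [] for label in labels}
    for s in range(n_splits):
        column = {label: scores[label][s] for label in labels}
        best = max(column.values())
        for label, value in column.items():
            # Rank = 1 + number of strictly better options on this split.
            ranks[label].append(1 + sum(v > value for v in column.values()))
            wins[label] += int(value == best)
            gaps[label].append(100.0 * (best - value) / best if best else 0.0)

    def mean(values):
        return sum(values) / len(values)

    def ranked(stat, reverse):
        return sorted(stat.items(), key=lambda kv: kv[1], reverse=reverse)

    return {
        "Mean score": ranked({l: mean(scores[l]) for l in labels}, True),
        "Mean rank": ranked({l: mean(ranks[l]) for l in labels}, False),
        "Share of experiments with the first place, %": ranked(
            {l: 100.0 * wins[l] / n_splits for l in labels}, True
        ),
        "Average difference with the leader, %": ranked(
            {l: mean(gaps[l]) for l in labels}, False
        ),
    }


ls = leaderboard_sketch({"A": [0.8, 0.9, 0.85], "B": [0.7, 0.95, 0.8]})
```

Each value is a list of (label, statistic) pairs ordered from best to worst for that indicator.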
- asid.automl_imbalanced.tools_ilc.calc_metric(y_test: ndarray, pred: ndarray, metric: str) -> float
Calculates the evaluation metric.
- Parameters
y_test (array-like) – Correct target values.
pred (array-like) – Predicted target values.
metric (str) – Metric that is used to evaluate the model performance.
- Returns
score – Metric value.
- Return type
float
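For reference, the default metric "f1_macro" is the unweighted mean of the per-class F1 scores, so minority classes count as much as majority ones. The standard definition can be written in plain Python as below; the function name is illustrative, and the library itself may delegate to an existing metrics package.

```python
def f1_macro(y_true, y_pred):
    """Unweighted mean of per-class F1 = 2*TP / (2*TP + FP + FN)."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Class 0 has F1 = 0.8 and class 1 has F1 = 2/3, so the macro average is about 0.733.
score = f1_macro([0, 0, 0, 1], [0, 0, 1, 1])
```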
- asid.automl_imbalanced.tools_ilc.calc_pipeline_acc(params: dict, x: ndarray, y: ndarray, bal_alg: str, alg: str, metric: str) -> float
Evaluates the pipeline.
- Parameters
params (dict) – Parameters generated by Hyperopt.
x (array-like of shape (n_samples, n_features)) – Training sample.
y (array-like) – The target values.
bal_alg (str) – Sampling procedure label.
alg (str) – Ensemble classifier label.
metric (str) – Metric that is used to evaluate the model performance.
- Returns
score – Evaluation of the model performance.
- Return type
float
- asid.automl_imbalanced.tools_ilc.choose_and_fit_ilc(self, x: ndarray, y: ndarray) -> Tuple[object, str, float, object, dict, dict, tuple]
Chooses the optimal classifier and fits the resulting estimator.
- Parameters
x (array-like of shape (n_samples, n_features)) – Training sample.
y (array-like) – The target values.
- Returns
classifer (instance) – Optimal fitted classifier.
option_label (str) – Optimal classifier label.
score (float) – Averaged out-of-fold value of eval_metric for the optimal classifier.
scaler (instance) – Fitted scaler that is applied prior to classifier estimation.
score_dict (dict) – Score series for the range of estimated classifiers.
time_dict (dict) – Time data for the range of estimated classifiers.
conf_int (tuple) – 95% confidence interval for the out-of-fold value of eval_metric for the optimal classifier.
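This reference does not state how the 95% confidence interval is computed. One simple possibility, shown purely as an assumption, is a normal approximation around the mean of the out-of-fold scores; mean_conf_int_95 is a hypothetical helper.

```python
import math
import statistics


def mean_conf_int_95(scores):
    """Normal-approximation 95% interval for the mean of out-of-fold scores."""
    center = statistics.mean(scores)
    # 1.96 is the two-sided 95% quantile of the standard normal distribution.
    half_width = 1.96 * statistics.stdev(scores) / math.sqrt(len(scores))
    return (center - half_width, center + half_width)


conf_int = mean_conf_int_95([0.70, 0.72, 0.74, 0.76, 0.78])
```

With few splits, a Student t quantile or a bootstrap would give a wider and arguably more honest interval.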
- asid.automl_imbalanced.tools_ilc.fit_alg(cv_type: str, x: ndarray, y: ndarray, bal_alg: Optional[str], alg: str, hyperopt_time: int, split_num: int, metric: str) -> Tuple[list, list]
Evaluates model performance on a full range of splits.
- Parameters
cv_type (str) – The chosen type of splitting iterations.
x (array-like of shape (n_samples, n_features)) – Training sample.
y (array-like) – The target values.
bal_alg (str or None) – Sampling procedure label.
alg (str) – Ensemble classifier label.
hyperopt_time (int) – The runtime setting (in seconds) for Hyperopt optimization.
split_num (int) – The number of splitting iterations.
metric (str) – Metric that is used to evaluate the model performance.
- Returns
score_list (list) – Model performance on a range of splits.
time_list (list) – Model fitting and prediction time on a range of splits.
- asid.automl_imbalanced.tools_ilc.fit_res_model(option_label: str, x: ndarray, y: ndarray, hyp_time: int, metric: str) -> Tuple[object, object]
Fits the resulting estimator.
- Parameters
option_label (str) – Classifier label.
x (array-like of shape (n_samples, n_features)) – Training sample.
y (array-like) – The target values.
hyp_time (int) – The runtime setting (in seconds) for Hyperopt optimization.
metric (str) – Metric that is used to evaluate the model performance.
- Returns
model (instance) – Fitted estimator.
scaler (instance) – Fitted scaler.
- asid.automl_imbalanced.tools_ilc.get_balance_params(x: ndarray, y: ndarray, bal_alg: str, alg: str, hyp_time: int, metric: str) -> dict
Searches for optimal hyper-parameters for the balancing procedure using Hyperopt.
- Parameters
x (array-like of shape (n_samples, n_features)) – Training sample.
y (array-like) – The target values.
bal_alg (str) – Sampling procedure label.
alg (str) – Ensemble classifier label.
hyp_time (int) – The runtime setting (in seconds) for Hyperopt optimization.
metric (str) – Metric that is used to evaluate the model performance.
- Returns
best – Optimal hyper-parameters for the balancing procedure chosen by Hyperopt.
- Return type
dict
- asid.automl_imbalanced.tools_ilc.get_cv_type(split_num: int) -> str
Defines the type of splitting iterations.
- Parameters
split_num (int) – The number of splitting iterations.
- Returns
cv_type – The chosen type of splitting iterations.
- Return type
str
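The split_num rule described for ImbalancedLearningClassifier can be sketched as follows, reading "a 5-fold" as "a multiple of 5". The returned labels "kfold" and "split", the extra count in the tuple, and the name cv_plan are illustrative assumptions; the actual strings returned by get_cv_type are not given in this reference.

```python
def cv_plan(split_num):
    """Return (cv_type, count): repeated 5-fold if split_num is a multiple of 5,
    otherwise split_num stratified shuffle splits."""
    if split_num < 1:
        raise ValueError("split_num must be a positive integer")
    if split_num % 5 == 0:
        # StratifiedKFold with 5 splits, repeated with split_num // 5 seeds.
        return "kfold", split_num // 5
    # StratifiedShuffleSplit with split_num splits.
    return "split", split_num


plan_default = cv_plan(5)
plan_repeated = cv_plan(10)
plan_shuffle = cv_plan(7)
```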
- asid.automl_imbalanced.tools_ilc.get_sampl_strat_for_case(ss: float, count_class: ndarray, balance_method: str) -> Union[float, dict]
Calculates the sampling strategy parameter.
- Parameters
ss (float) – Sampling strategy parameter generated by Hyperopt.
count_class (array-like) – The sorted unique values with the number of counts.
balance_method (str) – Balancing procedure label.
- Returns
ss_corr – The adjusted sampling strategy parameter.
- Return type
float or dict
- asid.automl_imbalanced.tools_ilc.scale_data(x_train: ndarray) -> Tuple[ndarray, object]
Fits scaler and applies it to the train sample.
- Parameters
x_train (array-like of shape (n_samples, n_features)) – Training sample.
- Returns
x_train_scaled (array-like of shape (n_samples, n_features)) – Scaled sample.
scaler (instance) – Fitted scaler.
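The fit-then-apply pattern behind scale_data can be sketched with a small standardizing scaler. Which scaler the library actually uses is not stated here, so zero-mean, unit-variance scaling is an assumption, and StandardScalerSketch and scale_data_sketch are hypothetical names.

```python
import statistics


class StandardScalerSketch:
    """Column-wise standardization; constant columns are left unscaled."""

    def fit(self, rows):
        columns = list(zip(*rows))
        self.mean_ = [statistics.fmean(col) for col in columns]
        # A zero standard deviation is replaced by 1.0 to avoid division by zero.
        self.scale_ = [statistics.pstdev(col) or 1.0 for col in columns]
        return self

    def transform(self, rows):
        return [
            [(v - m) / s for v, m, s in zip(row, self.mean_, self.scale_)]
            for row in rows
        ]


def scale_data_sketch(x_train):
    """Fit the scaler on the train sample and return (scaled sample, fitted scaler)."""
    scaler = StandardScalerSketch().fit(x_train)
    return scaler.transform(x_train), scaler


x_train_scaled, scaler = scale_data_sketch([[1.0, 10.0], [3.0, 10.0]])
x_test_scaled = scaler.transform([[2.0, 11.0]])
```

Returning the fitted scaler matters because the same training statistics must be reused on any sample passed to the classifier later.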