asid.automl_small package

Submodules

asid.automl_small.dataset_similarity_metrics module

asid.automl_small.dataset_similarity_metrics.c2st_accuracy(data_orig: ndarray, sampled: ndarray) → Tuple[float, float]

Classifier Two-Sample Test: LOO Accuracy for 1-NN classifier.

Parameters
  • data_orig (array-like of shape (n_samples, n_features)) – Train sample.

  • sampled (array-like of shape (n_samples, n_features)) – Synthetic sample.

Returns

  • acc_r (float) – Accuracy for real samples.

  • acc_g (float) – Accuracy for generated samples.

References

Xu, Q. et al. (2018) “An empirical study on evaluation metrics of generative adversarial networks” arXiv preprint arXiv:1806.07755.
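
The leave-one-out 1-NN test described above can be sketched with NumPy alone. This is an illustration of the technique, not the package's implementation: the function name loo_1nn_accuracy, the Euclidean distance, and the toy data are assumptions.

```python
import numpy as np

def loo_1nn_accuracy(real, generated):
    """Leave-one-out 1-NN accuracy, reported separately for real and generated rows."""
    real = np.asarray(real, dtype=float)
    generated = np.asarray(generated, dtype=float)
    points = np.vstack([real, generated])
    labels = np.concatenate([np.ones(len(real)), np.zeros(len(generated))])
    # Pairwise squared Euclidean distances between all pooled points.
    diff = points[:, None, :] - points[None, :, :]
    dist = np.einsum("ijk,ijk->ij", diff, diff)
    # Leave-one-out: a point may not be its own nearest neighbour.
    np.fill_diagonal(dist, np.inf)
    predicted = labels[np.argmin(dist, axis=1)]
    correct = predicted == labels
    return float(correct[labels == 1].mean()), float(correct[labels == 0].mean())

rng = np.random.default_rng(0)
real = rng.normal(size=(200, 3))
close = rng.normal(size=(200, 3))          # same distribution as `real`
far = rng.normal(loc=5.0, size=(200, 3))   # clearly different distribution
acc_close = loo_1nn_accuracy(real, close)
acc_far = loo_1nn_accuracy(real, far)
print(acc_close, acc_far)
```

Accuracies near 0.5 suggest the two samples are hard to tell apart, while values near 1.0 suggest they are easily separated.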

asid.automl_small.dataset_similarity_metrics.c2st_roc_auc(df1: ndarray, df2: ndarray) → float

Classifier Two-Sample Test: ROC AUC for gradient boosting classifier.

Parameters
  • df1 (array-like of shape (n_samples, n_features)) – Train sample.

  • df2 (array-like of shape (n_samples, n_features)) – Synthetic sample.

Returns

roc_auc – ROC AUC value.

Return type

float

References

Friedman, J. H. (2003) “On Multivariate Goodness–of–Fit and Two–Sample Testing” Statistical Problems in Particle Physics, Astrophysics and Cosmology, PHYSTAT2003: 311-313.
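
A classifier two-sample test of this kind can be sketched with scikit-learn. The validation scheme used by the package is not documented here, so the 5-fold cross-validation, the function name c2st_auc, and the toy data below are assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

def c2st_auc(real, generated, seed=0):
    """ROC AUC of a gradient boosting classifier separating real from generated rows."""
    X = np.vstack([real, generated])
    y = np.concatenate([np.ones(len(real)), np.zeros(len(generated))])
    clf = GradientBoostingClassifier(random_state=seed)
    # Out-of-fold probabilities, so the AUC is not inflated by training-set fit.
    proba = cross_val_predict(clf, X, y, cv=5, method="predict_proba")[:, 1]
    return float(roc_auc_score(y, proba))

rng = np.random.default_rng(0)
real = rng.normal(size=(150, 3))
auc_same = c2st_auc(real, rng.normal(size=(150, 3)))
auc_shifted = c2st_auc(real, rng.normal(loc=3.0, size=(150, 3)))
print(auc_same, auc_shifted)
```

An AUC close to 0.5 means the classifier cannot distinguish the samples; an AUC close to 1.0 means the synthetic sample is easy to detect.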

asid.automl_small.dataset_similarity_metrics.calc_metrics(data: ndarray, sampled_data: ndarray, metric: str, test_data: Union[None, ndarray] = None) → Union[float, list]

Calculates dataset similarity metrics.

Parameters
  • data (array-like of shape (n_samples, n_features)) – Training sample.

  • sampled_data (array-like of shape (n_samples, n_features)) – Synthetic sample.

  • metric ({"zu", "c2st_acc", "roc_auc", "ks_test"}) – Metric that is used to choose the optimal generative model.

  • test_data (array-like of shape (n_samples, n_features)) – Test sample.

Returns

result – Metric value. For “ks_test”, a list [statistic, p-value] is returned.

Return type

float or list

asid.automl_small.dataset_similarity_metrics.ks_permutation(stat: list, df1: ndarray, df2: ndarray) → float

Kolmogorov-Smirnov permutation test applied to each marginal distribution.

Parameters
  • stat (list) – List of statistic values for marginal distributions.

  • df1 (array-like of shape (n_samples, n_features)) – Train sample.

  • df2 (array-like of shape (n_samples, n_features)) – Synthetic sample.

Returns

p_val – P-value obtained using permutation test.

Return type

float

asid.automl_small.dataset_similarity_metrics.ks_permutation_var(stat: float, series1: ndarray, series2: ndarray) → float

Kolmogorov-Smirnov permutation test for marginal distribution.

Parameters
  • stat (float) – Statistic value for marginal distribution.

  • series1 (array-like) – Train sample series.

  • series2 (array-like) – Synthetic sample series.

Returns

p_val – P-value.

Return type

float
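
A permutation test for a single marginal can be sketched as follows. The helper names, the number of permutations, and the add-one p-value correction are assumptions made for illustration; the package may differ in all three.

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a = np.sort(a)
    b = np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

def ks_permutation_p_value(stat, a, b, n_permutations=500, seed=0):
    """Share of random relabelings whose KS statistic is at least `stat`."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([a, b])
    exceed = 0
    for _ in range(n_permutations):
        shuffled = rng.permutation(pooled)
        if ks_statistic(shuffled[:len(a)], shuffled[len(a):]) >= stat:
            exceed += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (exceed + 1) / (n_permutations + 1)

rng = np.random.default_rng(1)
train_col = rng.normal(size=100)
synthetic_col = rng.normal(loc=2.0, size=100)
observed = ks_statistic(train_col, synthetic_col)
p_value = ks_permutation_p_value(observed, train_col, synthetic_col)
print(observed, p_value)
```

The p-value is small when few relabelings reproduce a gap as large as the observed one, which is evidence that the two series follow different distributions.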

asid.automl_small.dataset_similarity_metrics.ks_test(df1: ndarray, df2: ndarray) → Tuple[list, list]

Kolmogorov-Smirnov test applied to each marginal distribution.

Parameters
  • df1 (array-like of shape (n_samples, n_features)) – Train sample.

  • df2 (array-like of shape (n_samples, n_features)) – Synthetic sample.

Returns

  • p_val_list (list) – List of p-values for marginal distributions.

  • stat_list (list) – List of statistic values for marginal distributions.
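
Applying the test column by column can be sketched with scipy.stats.ks_2samp. The function name marginal_ks and the toy data are assumptions; only the return order (p-values first, then statistics) follows the description above.

```python
import numpy as np
from scipy.stats import ks_2samp

def marginal_ks(real, generated):
    """Run a two-sample KS test on each feature column independently."""
    p_values, statistics = [], []
    for j in range(real.shape[1]):
        result = ks_2samp(real[:, j], generated[:, j])
        p_values.append(float(result.pvalue))
        statistics.append(float(result.statistic))
    return p_values, statistics

rng = np.random.default_rng(0)
real = rng.normal(size=(200, 2))
generated = real.copy()
generated[:, 1] = rng.normal(loc=2.0, size=200)  # only the second feature differs
p_val_list, stat_list = marginal_ks(real, generated)
print(p_val_list, stat_list)
```

Because each feature is tested on its own, this check can miss differences that only show up in the joint distribution, such as changed correlations.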

asid.automl_small.dataset_similarity_metrics.zu_overfitting_statistic(df1: ndarray, df2: ndarray, df3: ndarray) → float

Zu overfitting statistic calculation.

Parameters
  • df1 (array-like of shape (n_samples, n_features)) – Test sample.

  • df2 (array-like of shape (n_samples, n_features)) – Synthetic sample.

  • df3 (array-like of shape (n_samples, n_features)) – Train sample.

Returns

zu_stat – Metric value.

Return type

float

References

Meehan C., Chaudhuri K., Dasgupta S. (2020) “A non-parametric test to detect data-copying in generative models” International Conference on Artificial Intelligence and Statistics.
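
The idea behind the statistic can be sketched as a standardized Mann-Whitney U comparison of nearest-training-neighbour distances. This is a simplified, global version written for illustration: the helper names and the Euclidean distance are assumptions, and whether the package also partitions the space into cells, as the referenced paper does, is not stated here.

```python
import numpy as np

def nearest_train_distance(points, train):
    """Distance from each point to its nearest neighbour in the training sample."""
    diff = points[:, None, :] - train[None, :, :]
    return np.sqrt(np.einsum("ijk,ijk->ij", diff, diff)).min(axis=1)

def zu_statistic(test, generated, train):
    """Standardized U statistic; strongly negative values hint at data copying."""
    d_test = nearest_train_distance(test, train)
    d_gen = nearest_train_distance(generated, train)
    n, m = len(d_test), len(d_gen)
    # U counts (generated, test) pairs where the generated point lies farther
    # from the training sample; ties contribute one half.
    greater = (d_gen[:, None] > d_test[None, :]).sum()
    ties = (d_gen[:, None] == d_test[None, :]).sum()
    u = greater + 0.5 * ties
    return float((u - m * n / 2) / np.sqrt(m * n * (m + n + 1) / 12))

rng = np.random.default_rng(0)
train = rng.normal(size=(200, 2))
test = rng.normal(size=(100, 2))
fresh = rng.normal(size=(100, 2))                           # honest new draws
copied = train[:100] + rng.normal(scale=1e-3, size=(100, 2))  # near-copies of training rows
zu_fresh = zu_statistic(test, fresh, train)
zu_copied = zu_statistic(test, copied, train)
print(zu_fresh, zu_copied)
```

The argument order mirrors the documented one (test, synthetic, train). Values near zero indicate that synthetic points are about as far from the training sample as held-out real points are.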

asid.automl_small.generative_model_estimation module

asid.automl_small.generative_model_estimation.calc_bayesian_gmm_acc(params: dict, data: ndarray) → float

Estimates the performance of a BayesianGaussianMixture model for the given hyper-parameter values.

Parameters
  • params (dict) – Hyper-parameter values.

  • data (array-like of shape (n_samples, n_features)) – Training sample.

Returns

score – Performance score.

Return type

float

asid.automl_small.generative_model_estimation.calc_gmm_acc(params: dict, data: ndarray) → float

Estimates the performance of a GaussianMixture model for the given hyper-parameter values.

Parameters
  • params (dict) – Hyper-parameter values.

  • data (array-like of shape (n_samples, n_features)) – Training sample.

Returns

score – Performance score.

Return type

float

asid.automl_small.generative_model_estimation.calc_kde_acc(params: dict, data: ndarray) → float

Estimates the performance of a KDE model (sklearn implementation) for the given hyper-parameter values.

Parameters
  • params (dict) – Hyper-parameter values.

  • data (array-like of shape (n_samples, n_features)) – Training sample.

Returns

score – Performance score.

Return type

float

asid.automl_small.generative_model_estimation.calc_sdv_acc(params: dict, data: ndarray, alg: str) → float

Estimates the performance of an SDV model for the given hyper-parameter values.

Parameters
  • params (dict) – Hyper-parameter values.

  • data (array-like of shape (n_samples, n_features)) – Training sample.

  • alg (str) – Algorithm label.

Returns

score – Performance score.

Return type

float

asid.automl_small.generative_model_estimation.fit_model(gen_algorithm: str, data: ndarray, hyp_time: int) → object

Fits generative model.

Parameters
  • gen_algorithm ({"sklearn_kde", "stats_kde_cv_ml", "stats_kde_cv_ls", "gmm", "bayesian_gmm", "ctgan", "copula", "copulagan", "tvae"}) – Generative algorithm label.

  • data (array-like of shape (n_samples, n_features)) – Training sample.

  • hyp_time (int) – The runtime setting (in seconds) for Hyperopt optimization.

Returns

model – Fitted generative model.

Return type

instance

asid.automl_small.generative_model_estimation.get_bayesian_gmm_model(data: ndarray, hyp_time: int) → object

Estimates Bayesian GMM model.

Parameters
  • data (array-like of shape (n_samples, n_features)) – Training sample.

  • hyp_time (int) – The runtime setting (in seconds) for Hyperopt optimization.

Returns

gmm – Fitted generative model.

Return type

instance

asid.automl_small.generative_model_estimation.get_copula_model(data: ndarray) → object

Estimates Copula model.

Parameters

data (array-like of shape (n_samples, n_features)) – Training sample.

Returns

model – Fitted generative model.

Return type

instance

asid.automl_small.generative_model_estimation.get_copulagan_model(data: ndarray, hyp_time: int) → object

Estimates CopulaGAN model.

Parameters
  • data (array-like of shape (n_samples, n_features)) – Training sample.

  • hyp_time (int) – The runtime setting (in seconds) for Hyperopt optimization.

Returns

model – Fitted generative model.

Return type

instance

asid.automl_small.generative_model_estimation.get_ctgan_model(data: ndarray, hyp_time: int) → object

Estimates CTGAN model.

Parameters
  • data (array-like of shape (n_samples, n_features)) – Training sample.

  • hyp_time (int) – The runtime setting (in seconds) for Hyperopt optimization.

Returns

model – Fitted generative model.

Return type

instance

asid.automl_small.generative_model_estimation.get_gmm_model(data: ndarray, hyp_time: int) → object

Estimates GMM model.

Parameters
  • data (array-like of shape (n_samples, n_features)) – Training sample.

  • hyp_time (int) – The runtime setting (in seconds) for Hyperopt optimization.

Returns

gmm – Fitted generative model.

Return type

instance

asid.automl_small.generative_model_estimation.get_tvae_model(data: ndarray, hyp_time: int) → object

Estimates TVAE model.

Parameters
  • data (array-like of shape (n_samples, n_features)) – Training sample.

  • hyp_time (int) – The runtime setting (in seconds) for Hyperopt optimization.

Returns

model – Fitted generative model.

Return type

instance

asid.automl_small.generative_model_estimation.sklearn_kde(data: ndarray, hyp_time: int) → object

Estimates KDE (sklearn implementation).

Parameters
  • data (array-like of shape (n_samples, n_features)) – Training sample.

  • hyp_time (int) – The runtime setting (in seconds) for Hyperopt optimization.

Returns

kde – Fitted generative model.

Return type

instance

asid.automl_small.generative_model_estimation.stats_kde(data: ndarray, method: str) → object

Estimates KDE (Statsmodels).

Parameters
  • data (array-like of shape (n_samples, n_features)) – Training sample.

  • method ({"cv_ml", "cv_ls"}) – CV type for bandwidth selection.

Returns

kde – Fitted generative model.

Return type

instance

asid.automl_small.generative_model_sampling module

asid.automl_small.generative_model_sampling.get_sampled_data(model: object, sample_len: int, seed_list: list, method: str, scaling: object) → list

Calls a sampling function.

Parameters
  • model (instance) – Fitted generative model.

  • sample_len (int) – Synthetic sample size.

  • seed_list (list) – The list of random seeds for each synthetic dataset.

  • method ({"sklearn_kde", "stats_kde_cv_ml", "stats_kde_cv_ls", "gmm", "bayesian_gmm", "ctgan", "copula", "copulagan", "tvae"}) – Generative algorithm label.

  • scaling (instance) – Fitted scaler that is applied prior to generative model estimation.

Returns

sampled_data_list – The list of synthetic datasets.

Return type

list

asid.automl_small.generative_model_sampling.gmm_sample_procedure(model: object, sample_len: int, scaling: object, num_samples: int) → list

Sampling from GMM model.

Parameters
  • model (instance) – Fitted generative model.

  • sample_len (int) – Synthetic sample size.

  • scaling (instance) – Fitted scaler that is applied prior to generative model estimation.

  • num_samples (int) – Required number of synthetic datasets.

Returns

sampled_data_list – The list of synthetic datasets.

Return type

list
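
Drawing several datasets from a fitted mixture and mapping them back to the original scale might look like the sketch below. It assumes a scikit-learn GaussianMixture and a StandardScaler, neither of which is confirmed by this page, and the function name sample_gmm_datasets is hypothetical.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

def sample_gmm_datasets(model, sample_len, scaling, num_samples):
    """Draw `num_samples` datasets and undo the scaling applied before fitting."""
    datasets = []
    for _ in range(num_samples):
        # GaussianMixture.sample returns (points, component labels).
        scaled, _component_ids = model.sample(sample_len)
        datasets.append(scaling.inverse_transform(scaled))
    return datasets

rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal(0.0, 1.0, size=(60, 2)),
    rng.normal(6.0, 1.0, size=(60, 2)),
])
scaler = StandardScaler().fit(data)
# A RandomState instance (rather than an int) lets successive sample() calls
# continue the same random stream and therefore return different draws.
gmm = GaussianMixture(n_components=2, random_state=np.random.RandomState(0))
gmm.fit(scaler.transform(data))
gmm_samples = sample_gmm_datasets(gmm, 40, scaler, num_samples=4)
print(len(gmm_samples), gmm_samples[0].shape)
```

GaussianMixture.sample takes no per-call seed, which may be why this procedure accepts a dataset count instead of a seed list.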

asid.automl_small.generative_model_sampling.sample_sdv_procedure(model: object, sample_len: int, seed_list: list, scaling: object) → list

Sampling from SDV library model.

Parameters
  • model (instance) – Fitted generative model.

  • sample_len (int) – Synthetic sample size.

  • seed_list (list) – The list of random seeds for each synthetic dataset.

  • scaling (instance) – Fitted scaler that is applied prior to generative model estimation.

Returns

sampled_data_list – The list of synthetic datasets.

Return type

list

asid.automl_small.generative_model_sampling.sample_stats(kde: object, size: int, seed: int) → ndarray

Base sampling procedure for the statsmodels KDE.

Parameters
  • kde (instance) – Fitted KDE model.

  • size (int) – Synthetic sample size.

  • seed (int) – Random seed.

Returns

sampled_data – Synthetic sample.

Return type

array-like of shape (n_samples, n_features)

asid.automl_small.generative_model_sampling.simple_sample_sklearn_procedure(model: object, sample_len: int, seed_list: list, scaling: object) → list

Sampling synthetic datasets from sklearn KDE.

Parameters
  • model (instance) – Fitted generative model.

  • sample_len (int) – Synthetic sample size.

  • seed_list (list) – The list of random seeds for each synthetic dataset.

  • scaling (instance) – Fitted scaler that is applied prior to generative model estimation.

Returns

sampled_data_list – The list of synthetic datasets.

Return type

list
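
One way to produce a seeded list of datasets from a scikit-learn KernelDensity model is sketched below. The function name sample_kde_datasets, the StandardScaler, and the bandwidth are assumptions chosen for the example, not the package's code.

```python
import numpy as np
from sklearn.neighbors import KernelDensity
from sklearn.preprocessing import StandardScaler

def sample_kde_datasets(model, sample_len, seed_list, scaling):
    """Draw one dataset per seed and map it back to the original feature scale."""
    datasets = []
    for seed in seed_list:
        scaled = model.sample(n_samples=sample_len, random_state=seed)
        datasets.append(scaling.inverse_transform(scaled))
    return datasets

rng = np.random.default_rng(0)
data = rng.normal(loc=10.0, scale=2.0, size=(100, 2))
scaler = StandardScaler().fit(data)
kde = KernelDensity(bandwidth=0.3).fit(scaler.transform(data))
kde_samples = sample_kde_datasets(kde, 50, [1, 2, 3], scaler)
print(len(kde_samples), kde_samples[0].shape)
```

Passing the same seed again reproduces the same dataset, which makes scores computed over the list repeatable.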

asid.automl_small.generative_model_sampling.simple_sample_stats_procedure(model: object, sample_len: int, seed_list: list, scaling: object) → list

Sampling synthetic datasets from the statsmodels KDE.

Parameters
  • model (instance) – Fitted generative model.

  • sample_len (int) – Synthetic sample size.

  • seed_list (list) – The list of random seeds for each synthetic dataset.

  • scaling (instance) – Fitted scaler that is applied prior to generative model estimation.

Returns

sampled_data_list – The list of synthetic datasets.

Return type

list

asid.automl_small.gm module

class asid.automl_small.gm.GenerativeModel(gen_model_type='optimize', similarity_metric='zu', num_syn_samples=100, hyperopt_time=0)

Bases: object

GenerativeModel is a tool for finding an appropriate generative model for small tabular data. It estimates how similar synthetic samples are to the training data, accounts for overfitting, and selects the optimal model.

Parameters
  • gen_model_type ({"optimize", "sklearn_kde", "stats_kde_cv_ml", "stats_kde_cv_ls", "gmm", "bayesian_gmm", "ctgan", "copula", "copulagan", "tvae"}, default="optimize") – With “optimize”, the optimal generative model is chosen with regard to overfitting. Any other value selects that specific type of generative model.

  • similarity_metric ({"zu", "c2st_acc"} or None, default="zu") – Metric that is used to choose the optimal generative model. “zu” metric refers to a Data-Copying Test from (C. Meehan et al., 2020). “c2st_acc” refers to a Classifier Two-Sample Test, that uses a 1-Nearest Neighbor classifier and computes the leave-one-out (LOO) accuracy separately for the real and generated samples (Q. Xu et al., 2018).

  • num_syn_samples (int, default=100) – The number of synthetic samples generated to evaluate the similarity_metric score.

  • hyperopt_time (int, default=0) – The runtime setting (in seconds) for Hyperopt optimization. Hyperopt is used to find the optimal hyper-parameters for generative models except for “stats_kde_cv_ml”, “stats_kde_cv_ls”, “copula” methods.

gen_model_

Fitted generative model.

Type

instance

gen_model_label_

Generative algorithm label.

Type

str

score_

Mean value of similarity_metric for the optimal generative model.

Type

float

scaler_

Fitted scaler that is applied prior to generative model estimation.

Type

instance

info_

Score and time data series for the range of estimated generative models.

Type

dict

References

Meehan C., Chaudhuri K., Dasgupta S. (2020) “A non-parametric test to detect data-copying in generative models” International Conference on Artificial Intelligence and Statistics.

Xu, Q., Huang, G., Yuan, Y., Guo, C., Sun, Y., Wu, F., & Weinberger, K. (2018) “An empirical study on evaluation metrics of generative adversarial networks” arXiv preprint arXiv:1806.07755.

fit(data: ndarray)

Fits GenerativeModel instance.

Parameters

data (array-like of shape (n_samples, n_features)) – Training sample.

Returns

self – Fitted generative model.

Return type

GenerativeModel instance

sample(sample_size: int, random_state=42) → ndarray

Generates synthetic sample from GenerativeModel.

Parameters
  • sample_size (int) – Required sample size.

  • random_state (int) – Random state.

Returns

sampled_data – Synthetic sample.

Return type

array-like of shape (n_samples, n_features)

score(train_data: ndarray, similarity_metric: str = 'zu', test_data: Union[None, ndarray] = None) → Union[float, dict]

Evaluates the similarity of GenerativeModel samples and train data with the specified similarity metric.

Parameters
  • train_data (array-like of shape (n_samples, n_features)) – Training sample.

  • test_data (array-like of shape (n_samples, n_features)) – Test sample for “zu” calculation.

  • similarity_metric ({"zu", "c2st_acc", "roc_auc", "ks_test"}, default="zu") – Metric that is used to choose the optimal generative model. “zu” metric refers to a Data-Copying Test from (C. Meehan et al., 2020). “c2st_acc” refers to a Classifier Two-Sample Test, that uses a 1-Nearest Neighbor classifier and computes the leave-one-out (LOO) accuracy separately for the real and generated samples (Q. Xu et al., 2018). “roc_auc” refers to ROC AUC for gradient boosting classifier (Lopez-Paz, D., & Oquab, M., 2017). “ks_test”: the marginal distributions of samples are compared using Kolmogorov-Smirnov test (Massey Jr, F. J., 1951).

Returns

res_score – Mean value of similarity_metric. For “ks_test”, a dictionary is returned with the statistic and the p-value obtained from the permutation test.

Return type

float or dict

References

Meehan C., Chaudhuri K., Dasgupta S. (2020) “A non-parametric test to detect data-copying in generative models” International Conference on Artificial Intelligence and Statistics.

Xu, Q., Huang, G., Yuan, Y., Guo, C., Sun, Y., Wu, F., & Weinberger, K. (2018) “An empirical study on evaluation metrics of generative adversarial networks” arXiv preprint arXiv:1806.07755.

Lopez-Paz, D., & Oquab, M. (2017) “Revisiting classifier two-sample tests” International Conference on Learning Representations.

Massey Jr, F. J. (1951) “The Kolmogorov-Smirnov test for goodness of fit” Journal of the American statistical Association, 46(253): 68-78.
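
A typical fit, sample, score workflow might look like the sketch below. It uses only the constructor and method signatures documented above. The wrapper function fit_sample_and_score and its arguments are hypothetical, and the code has not been run against a specific release of the package.

```python
import numpy as np

def fit_sample_and_score(train, test, sample_size=100):
    """Fit a GenerativeModel, draw a synthetic sample, and score it with "zu"."""
    # Imported lazily so this sketch can be loaded without the asid package installed.
    from asid.automl_small.gm import GenerativeModel

    model = GenerativeModel(
        gen_model_type="optimize",   # search over the supported generative models
        similarity_metric="zu",
        num_syn_samples=100,
        hyperopt_time=0,
    )
    model.fit(np.asarray(train))
    synthetic = model.sample(sample_size, random_state=42)
    zu = model.score(np.asarray(train), similarity_metric="zu", test_data=np.asarray(test))
    return model, synthetic, zu
```

Calling fit_sample_and_score(train, test) with two arrays of shape (n_samples, n_features) returns the fitted model, one synthetic sample, and the mean “zu” score. The selected algorithm can then be read from the gen_model_label_ attribute.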

asid.automl_small.tools module

asid.automl_small.tools.check_gen_model_list(metric: str)
asid.automl_small.tools.check_gm_fitted(self)
asid.automl_small.tools.check_num_type(x: Any, num_type: type, num_cl: str)
asid.automl_small.tools.check_sim_metric_list(metric: str, mtype: str)
asid.automl_small.tools.check_x_y(x: ndarray, y: Union[None, ndarray] = None)
asid.automl_small.tools.choose_and_fit_model(data: ndarray, similarity_metric: Optional[str], scaler: object, data_scaled: ndarray, num_syn_samples: int, hyp_time: int) → Tuple[object, str, float, dict]

Chooses an optimal generative model and fits GenerativeModel instance.

Parameters
  • data (array-like of shape (n_samples, n_features)) – Training sample.

  • similarity_metric ({"zu", "c2st_acc"} or None, default="zu") – Metric that is used to choose the optimal generative model.

  • scaler (instance) – Fitted scaler that is applied prior to generative model estimation.

  • data_scaled (array-like of shape (n_samples, n_features)) – Normalized training sample.

  • num_syn_samples (int) – The number of synthetic samples generated to evaluate the similarity_metric score.

  • hyp_time (int) – The runtime setting (in seconds) for Hyperopt optimization. Hyperopt is used to find the optimal hyper-parameters for generative models.

Returns

  • gen_model (instance) – Optimal fitted generative model.

  • best_alg_label (str) – Optimal generative algorithm label.

  • best_score (float) – Mean value of similarity_metric for the optimal generative model.

  • log_dict (dict) – Score and time data series for the range of estimated generative models.

Module contents