asid.automl_small package
Submodules
asid.automl_small.dataset_similarity_metrics module
- asid.automl_small.dataset_similarity_metrics.c2st_accuracy(data_orig: ndarray, sampled: ndarray) Tuple[float, float] [source]
Classifier Two-Sample Test: LOO Accuracy for 1-NN classifier.
- Parameters
data_orig (array-like of shape (n_samples, n_features)) – Train sample.
sampled (array-like of shape (n_samples, n_features)) – Synthetic sample.
- Returns
acc_r (float) – Accuracy for real samples.
acc_g (float) – Accuracy for generated samples.
References
Xu, Q. et al. (2018) “An empirical study on evaluation metrics of generative adversarial networks” arXiv preprint arXiv:1806.07755.
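The leave-one-out 1-NN accuracy described above can be sketched as follows. This is a minimal illustration built on scikit-learn, not the library's own code; the function name `loo_1nn_accuracy` is hypothetical, and the sketch assumes the pooled data contains no exact duplicate rows.

```python
from typing import Tuple

import numpy as np
from sklearn.neighbors import NearestNeighbors


def loo_1nn_accuracy(real: np.ndarray, generated: np.ndarray) -> Tuple[float, float]:
    """Leave-one-out 1-NN accuracy, reported separately for real and generated rows.

    Hypothetical helper for illustration; assumes no exact duplicate rows.
    """
    pooled = np.vstack([real, generated])
    labels = np.concatenate([np.ones(len(real)), np.zeros(len(generated))])
    # Ask for two neighbours: the closest one is the point itself (distance 0),
    # so the second one acts as its leave-one-out nearest neighbour.
    nn = NearestNeighbors(n_neighbors=2).fit(pooled)
    _, idx = nn.kneighbors(pooled)
    predicted = labels[idx[:, 1]]
    correct = predicted == labels
    acc_real = float(correct[labels == 1].mean())
    acc_generated = float(correct[labels == 0].mean())
    return acc_real, acc_generated
```

When the two samples are indistinguishable, both accuracies are expected to sit near 0.5; values close to 1 mean the classifier separates real from generated rows easily.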
- asid.automl_small.dataset_similarity_metrics.c2st_roc_auc(df1: ndarray, df2: ndarray) float [source]
Classifier Two-Sample Test: ROC AUC for gradient boosting classifier.
- Parameters
df1 (array-like of shape (n_samples, n_features)) – Train sample.
df2 (array-like of shape (n_samples, n_features)) – Synthetic sample.
- Returns
roc_auc – ROC AUC value.
- Return type
float
References
Friedman, J. H. (2003) “On Multivariate Goodness–of–Fit and Two–Sample Testing” Statistical Problems in Particle Physics, Astrophysics and Cosmology, PHYSTAT2003: 311-313.
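A classifier two-sample test of this kind can be sketched with scikit-learn as shown below. The function name `c2st_roc_auc_sketch`, the 5-fold cross-validation and the default classifier settings are assumptions made for illustration; the library's actual configuration may differ.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict


def c2st_roc_auc_sketch(real: np.ndarray, generated: np.ndarray, seed: int = 0) -> float:
    """ROC AUC of a gradient boosting classifier that separates real from generated rows."""
    pooled = np.vstack([real, generated])
    labels = np.concatenate([np.ones(len(real)), np.zeros(len(generated))])
    clf = GradientBoostingClassifier(random_state=seed)
    # Out-of-fold probabilities, so the AUC is not inflated by training-set fit.
    proba = cross_val_predict(clf, pooled, labels, cv=5, method="predict_proba")[:, 1]
    return float(roc_auc_score(labels, proba))
```

An AUC near 0.5 suggests the synthetic sample is hard to tell apart from the training sample, while an AUC near 1 suggests the two are easy to separate.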
- asid.automl_small.dataset_similarity_metrics.calc_metrics(data: ndarray, sampled_data: ndarray, metric: str, test_data: Union[None, ndarray] = None) Union[float, list] [source]
Calculates dataset similarity metrics.
- Parameters
data (array-like of shape (n_samples, n_features)) – Training sample.
sampled_data (array-like of shape (n_samples, n_features)) – Synthetic sample.
metric ({"zu", "c2st_acc", "roc_auc", "ks_test"}) – Metric that is used to choose the optimal generative model.
test_data (array-like of shape (n_samples, n_features)) – Test sample.
- Returns
result – Metric value. For “ks_test”, a list [statistic, p-value] is returned.
- Return type
float or list
- asid.automl_small.dataset_similarity_metrics.ks_permutation(stat: list, df1: ndarray, df2: ndarray) float [source]
Kolmogorov-Smirnov permutation test applied to each marginal distribution.
- Parameters
stat (list) – List of statistic values for marginal distributions.
df1 (array-like of shape (n_samples, n_features)) – Train sample.
df2 (array-like of shape (n_samples, n_features)) – Synthetic sample.
- Returns
p_val – P-value obtained using permutation test.
- Return type
float
- asid.automl_small.dataset_similarity_metrics.ks_permutation_var(stat: float, series1: ndarray, series2: ndarray) float [source]
Kolmogorov-Smirnov permutation test for marginal distribution.
- Parameters
stat (float) – Statistic value for marginal distribution.
series1 (array-like) – Train sample series.
series2 (array-like) – Synthetic sample series.
- Returns
p_val – P-value.
- Return type
float
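A permutation test for a single marginal distribution can be sketched as below. The function name `ks_permutation_pvalue`, the number of permutations and the add-one correction are assumptions for illustration and are not taken from the library.

```python
import numpy as np
from scipy.stats import ks_2samp


def ks_permutation_pvalue(stat, series1, series2, n_permutations=1000, seed=0):
    """Permutation p-value for an observed two-sample KS statistic `stat`."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([series1, series2])
    n1 = len(series1)
    exceed = 0
    for _ in range(n_permutations):
        # Shuffle the pooled values and split them back into two pseudo-samples.
        shuffled = rng.permutation(pooled)
        perm_stat = ks_2samp(shuffled[:n1], shuffled[n1:]).statistic
        if perm_stat >= stat:
            exceed += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (exceed + 1) / (n_permutations + 1)
```

The p-value is the share of random relabelings whose KS statistic is at least as large as the observed one.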
- asid.automl_small.dataset_similarity_metrics.ks_test(df1: ndarray, df2: ndarray) Tuple[list, list] [source]
Kolmogorov-Smirnov test applied to each marginal distribution.
- Parameters
df1 (array-like of shape (n_samples, n_features)) – Train sample.
df2 (array-like of shape (n_samples, n_features)) – Synthetic sample.
- Returns
p_val_list (list) – List of p-values for marginal distributions.
stat_list (list) – List of statistic values for marginal distributions.
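Applying the test column by column can be sketched with `scipy.stats.ks_2samp`. The function name `marginal_ks` is hypothetical; the return order mirrors the one documented above (p-values first, then statistics).

```python
import numpy as np
from scipy.stats import ks_2samp


def marginal_ks(df1: np.ndarray, df2: np.ndarray):
    """Two-sample KS test for every feature (column) of two 2-D arrays."""
    p_val_list, stat_list = [], []
    for j in range(df1.shape[1]):
        result = ks_2samp(df1[:, j], df2[:, j])
        p_val_list.append(float(result.pvalue))
        stat_list.append(float(result.statistic))
    return p_val_list, stat_list
```

Because each feature is tested on its own, this check is blind to differences that only show up in the joint distribution, such as changed correlations.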
- asid.automl_small.dataset_similarity_metrics.zu_overfitting_statistic(df1: ndarray, df2: ndarray, df3: ndarray) float [source]
Zu overfitting statistic calculation.
- Parameters
df1 (array-like of shape (n_samples, n_features)) – Test sample.
df2 (array-like of shape (n_samples, n_features)) – Synthetic sample.
df3 (array-like of shape (n_samples, n_features)) – Train sample.
- Returns
zu_stat – Metric value.
- Return type
float
References
Meehan C., Chaudhuri K., Dasgupta S. (2020) “A non-parametric test to detect data-copying in generative models” International Conference on Artificial Intelligence and Statistics.
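The idea behind the statistic can be sketched as a standardized Mann-Whitney U computed on nearest-neighbour distances to the training sample. This is a simplified, single-cell illustration of the test from Meehan et al. (2020); the function name `zu_sketch`, the Euclidean distance and the sign convention (negative values when generated rows sit closer to the training set than test rows do) are assumptions, not the library's code.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def zu_sketch(test: np.ndarray, generated: np.ndarray, train: np.ndarray) -> float:
    """Standardized U statistic comparing distances to the nearest training row."""
    nn = NearestNeighbors(n_neighbors=1).fit(train)
    d_test = nn.kneighbors(test)[0].ravel()
    d_gen = nn.kneighbors(generated)[0].ravel()
    n, m = len(d_test), len(d_gen)
    # U counts (generated, test) pairs in which the generated row lies farther
    # from the training set than the test row; ties count as one half.
    diff = d_gen[:, None] - d_test[None, :]
    u = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    mean_u = m * n / 2.0
    std_u = np.sqrt(m * n * (m + n + 1) / 12.0)
    return float((u - mean_u) / std_u)
```

Under this convention, a strongly negative value hints that the generator reproduces training rows too closely, and a value near zero indicates generated rows are about as far from the training set as unseen test rows.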
asid.automl_small.generative_model_estimation module
- asid.automl_small.generative_model_estimation.calc_bayesian_gmm_acc(params: dict, data: ndarray) float [source]
Estimates the performance of a BayesianGaussianMixture model for the given hyper-parameter values.
- Parameters
params (dict) – Hyper-parameter values.
data (array-like of shape (n_samples, n_features)) – Training sample.
- Returns
score – Performance score.
- Return type
float
- asid.automl_small.generative_model_estimation.calc_gmm_acc(params: dict, data: ndarray) float [source]
Estimates the performance of a GaussianMixture model for the given hyper-parameter values.
- Parameters
params (dict) – Hyper-parameter values.
data (array-like of shape (n_samples, n_features)) – Training sample.
- Returns
score – Performance score.
- Return type
float
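The documentation does not state how the performance score is computed. One plausible choice, shown here purely as an assumption, is the cross-validated average log-likelihood of a scikit-learn `GaussianMixture`; the function name `gmm_cv_log_likelihood` and the 5-fold split are hypothetical.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import KFold, cross_val_score


def gmm_cv_log_likelihood(params: dict, data: np.ndarray, seed: int = 0) -> float:
    """Mean held-out log-likelihood of a GaussianMixture built from `params`."""
    model = GaussianMixture(random_state=seed, **params)
    folds = KFold(n_splits=5, shuffle=True, random_state=seed)
    # With no `scoring` argument, cross_val_score calls GaussianMixture.score,
    # i.e. the average per-sample log-likelihood of the held-out rows.
    scores = cross_val_score(model, data, cv=folds)
    return float(np.mean(scores))
```

A function with this shape can serve as the objective of a hyper-parameter search: it maps a dictionary of candidate values to a single number, with higher meaning better.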
- asid.automl_small.generative_model_estimation.calc_kde_acc(params: dict, data: ndarray) float [source]
Estimates the performance of KDE (sklearn implementation) for the given hyper-parameter values.
- Parameters
params (dict) – Hyper-parameter values.
data (array-like of shape (n_samples, n_features)) – Training sample.
- Returns
score – Performance score.
- Return type
float
- asid.automl_small.generative_model_estimation.calc_sdv_acc(params: dict, data: ndarray, alg: str) float [source]
Estimates the performance of an SDV model for the given hyper-parameter values.
- Parameters
params (dict) – Hyper-parameter values.
data (array-like of shape (n_samples, n_features)) – Training sample.
alg (str) – Algorithm label.
- Returns
score – Performance score.
- Return type
float
- asid.automl_small.generative_model_estimation.fit_model(gen_algorithm: str, data: ndarray, hyp_time: int) object [source]
Fits generative model.
- Parameters
gen_algorithm ({"sklearn_kde", "stats_kde_cv_ml", "stats_kde_cv_ls", "gmm", "bayesian_gmm", "ctgan", "copula", "copulagan", "tvae"}) – Generative algorithm label.
data (array-like of shape (n_samples, n_features)) – Training sample.
hyp_time (int) – The runtime setting (in seconds) for Hyperopt optimization.
- Returns
model – Fitted generative model.
- Return type
instance
- asid.automl_small.generative_model_estimation.get_bayesian_gmm_model(data: ndarray, hyp_time: int) object [source]
Estimates Bayesian GMM model.
- Parameters
data (array-like of shape (n_samples, n_features)) – Training sample.
hyp_time (int) – The runtime setting (in seconds) for Hyperopt optimization.
- Returns
gmm – Fitted generative model.
- Return type
instance
- asid.automl_small.generative_model_estimation.get_copula_model(data: ndarray) object [source]
Estimates Copula model.
- Parameters
data (array-like of shape (n_samples, n_features)) – Training sample.
- Returns
model – Fitted generative model.
- Return type
instance
- asid.automl_small.generative_model_estimation.get_copulagan_model(data: ndarray, hyp_time: int) object [source]
Estimates CopulaGAN model.
- Parameters
data (array-like of shape (n_samples, n_features)) – Training sample.
hyp_time (int) – The runtime setting (in seconds) for Hyperopt optimization.
- Returns
model – Fitted generative model.
- Return type
instance
- asid.automl_small.generative_model_estimation.get_ctgan_model(data: ndarray, hyp_time: int) object [source]
Estimates CTGAN model.
- Parameters
data (array-like of shape (n_samples, n_features)) – Training sample.
hyp_time (int) – The runtime setting (in seconds) for Hyperopt optimization.
- Returns
model – Fitted generative model.
- Return type
instance
- asid.automl_small.generative_model_estimation.get_gmm_model(data: ndarray, hyp_time: int) object [source]
Estimates GMM model.
- Parameters
data (array-like of shape (n_samples, n_features)) – Training sample.
hyp_time (int) – The runtime setting (in seconds) for Hyperopt optimization.
- Returns
gmm – Fitted generative model.
- Return type
instance
- asid.automl_small.generative_model_estimation.get_tvae_model(data: ndarray, hyp_time: int) object [source]
Estimates TVAE model.
- Parameters
data (array-like of shape (n_samples, n_features)) – Training sample.
hyp_time (int) – The runtime setting (in seconds) for Hyperopt optimization.
- Returns
model – Fitted generative model.
- Return type
instance
- asid.automl_small.generative_model_estimation.sklearn_kde(data: ndarray, hyp_time: int) object [source]
Estimates KDE (sklearn implementation).
- Parameters
data (array-like of shape (n_samples, n_features)) – Training sample.
hyp_time (int) – The runtime setting (in seconds) for Hyperopt optimization.
- Returns
kde – Fitted generative model.
- Return type
instance
- asid.automl_small.generative_model_estimation.stats_kde(data: ndarray, method: str) object [source]
Estimates KDE (Statsmodels).
- Parameters
data (array-like of shape (n_samples, n_features)) – Training sample.
method ({"cv_ml", "cv_ls"}) – CV type for bandwidth selection.
- Returns
kde – Fitted generative model.
- Return type
instance
asid.automl_small.generative_model_sampling module
- asid.automl_small.generative_model_sampling.get_sampled_data(model: object, sample_len: int, seed_list: list, method: str, scaling: object) list [source]
Calls a sampling function.
- Parameters
model (instance) – Fitted generative model.
sample_len (int) – Synthetic sample size.
seed_list (list) – The list of random seeds for each synthetic dataset.
method ({"sklearn_kde", "stats_kde_cv_ml", "stats_kde_cv_ls", "gmm", "bayesian_gmm", "ctgan", "copula", "copulagan", "tvae"}) – Generative algorithm label.
scaling (instance) – Fitted scaler that is applied prior to generative model estimation.
- Returns
sampled_data_list – The list of synthetic datasets.
- Return type
list
- asid.automl_small.generative_model_sampling.gmm_sample_procedure(model: object, sample_len: int, scaling: object, num_samples: int) list [source]
Sampling from GMM model.
- Parameters
model (instance) – Fitted generative model.
sample_len (int) – Synthetic sample size.
scaling (instance) – Fitted scaler that is applied prior to generative model estimation.
num_samples (int) – Required number of synthetic datasets.
- Returns
sampled_data_list – The list of synthetic datasets.
- Return type
list
- asid.automl_small.generative_model_sampling.sample_sdv_procedure(model: object, sample_len: int, seed_list: list, scaling: object) list [source]
Sampling from SDV library model.
- Parameters
model (instance) – Fitted generative model.
sample_len (int) – Synthetic sample size.
seed_list (list) – The list of random seeds for each synthetic dataset.
scaling (instance) – Fitted scaler that is applied prior to generative model estimation.
- Returns
sampled_data_list – The list of synthetic datasets.
- Return type
list
- asid.automl_small.generative_model_sampling.sample_stats(kde: object, size: int, seed: int) ndarray [source]
Base sampling procedure from Statsmodels KDE.
- Parameters
kde (instance) – Fitted KDE model.
size (int) – Synthetic sample size.
seed (int) – Random seed.
- Returns
sampled_data – Synthetic sample.
- Return type
array-like of shape (n_samples, n_features)
- asid.automl_small.generative_model_sampling.simple_sample_sklearn_procedure(model: object, sample_len: int, seed_list: list, scaling: object) list [source]
Sampling synthetic datasets from sklearn KDE.
- Parameters
model (instance) – Fitted generative model.
sample_len (int) – Synthetic sample size.
seed_list (list) – The list of random seeds for each synthetic dataset.
scaling (instance) – Fitted scaler that is applied prior to generative model estimation.
- Returns
sampled_data_list – The list of synthetic datasets.
- Return type
list
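Drawing one synthetic dataset per seed and mapping it back to the original feature scale can be sketched as follows. The function name `sample_datasets` is hypothetical, and the sketch assumes a fitted scikit-learn `KernelDensity` (Gaussian kernel) and a fitted scaler that exposes `inverse_transform`, such as `StandardScaler`.

```python
import numpy as np
from sklearn.neighbors import KernelDensity
from sklearn.preprocessing import StandardScaler


def sample_datasets(model, sample_len, seed_list, scaling):
    """Return one synthetic dataset per seed, expressed in the original units."""
    sampled_data_list = []
    for seed in seed_list:
        # The model was fitted on scaled data, so draws come out in scaled units.
        draw = model.sample(n_samples=sample_len, random_state=seed)
        sampled_data_list.append(scaling.inverse_transform(draw))
    return sampled_data_list


# Usage on toy data: scale, fit the density, then sample three datasets.
toy = np.random.default_rng(0).normal(5.0, 2.0, size=(50, 3))
toy_scaler = StandardScaler().fit(toy)
toy_kde = KernelDensity(bandwidth=0.5).fit(toy_scaler.transform(toy))
toy_samples = sample_datasets(toy_kde, sample_len=20, seed_list=[1, 2, 3], scaling=toy_scaler)
```

Fixing a seed per dataset makes each synthetic sample reproducible while still giving independent draws across the list.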
- asid.automl_small.generative_model_sampling.simple_sample_stats_procedure(model: object, sample_len: int, seed_list: list, scaling: object) list [source]
Sampling synthetic datasets from Statsmodels KDE.
- Parameters
model (instance) – Fitted generative model.
sample_len (int) – Synthetic sample size.
seed_list (list) – The list of random seeds for each synthetic dataset.
scaling (instance) – Fitted scaler that is applied prior to generative model estimation.
- Returns
sampled_data_list – The list of synthetic datasets.
- Return type
list
asid.automl_small.gm module
- class asid.automl_small.gm.GenerativeModel(gen_model_type='optimize', similarity_metric='zu', num_syn_samples=100, hyperopt_time=0)[source]
Bases: object
GenerativeModel is a tool designed to find an appropriate generative model for small tabular data. It estimates the similarity of synthetic samples, accounts for overfitting and outputs the optimal option.
- Parameters
gen_model_type ({"optimize", "sklearn_kde", "stats_kde_cv_ml", "stats_kde_cv_ls", "gmm", "bayesian_gmm", "ctgan", "copula", "copulagan", "tvae"}, default="optimize") – Generative model to fit. With “optimize”, the optimal generative model is chosen automatically, taking overfitting into account; any other value selects that specific type of generative model.
similarity_metric ({"zu", "c2st_acc"} or None, default="zu") – Metric that is used to choose the optimal generative model. “zu” metric refers to a Data-Copying Test from (C. Meehan et al., 2020). “c2st_acc” refers to a Classifier Two-Sample Test, that uses a 1-Nearest Neighbor classifier and computes the leave-one-out (LOO) accuracy separately for the real and generated samples (Q. Xu et al., 2018).
num_syn_samples (int, default=100) – The number of synthetic samples generated to evaluate the similarity_metric score.
hyperopt_time (int, default=0) – The runtime setting (in seconds) for Hyperopt optimization. Hyperopt is used to find the optimal hyper-parameters for generative models except for “stats_kde_cv_ml”, “stats_kde_cv_ls”, “copula” methods.
- gen_model_
Fitted generative model.
- Type
instance
- gen_model_label_
Generative algorithm label.
- Type
str
- score_
Mean value of similarity_metric for the optimal generative model.
- Type
float
- scaler_
Fitted scaler that is applied prior to generative model estimation.
- Type
instance
- info_
Score and time data series for the range of estimated generative models.
- Type
dict
References
Meehan C., Chaudhuri K., Dasgupta S. (2020) “A non-parametric test to detect data-copying in generative models” International Conference on Artificial Intelligence and Statistics.
Xu, Q., Huang, G., Yuan, Y., Guo, C., Sun, Y., Wu, F., & Weinberger, K. (2018) “An empirical study on evaluation metrics of generative adversarial networks” arXiv preprint arXiv:1806.07755.
- fit(data: ndarray)[source]
Fits GenerativeModel instance.
- Parameters
data (array-like of shape (n_samples, n_features)) – Training sample.
- Returns
self – Fitted generative model.
- Return type
GenerativeModel instance
- sample(sample_size: int, random_state=42) ndarray [source]
Generates synthetic sample from GenerativeModel.
- Parameters
sample_size (int) – Required sample size.
random_state (int) – Random state.
- Returns
sampled_data – Synthetic sample.
- Return type
array-like of shape (n_samples, n_features)
- score(train_data: ndarray, similarity_metric: str = 'zu', test_data: Union[None, ndarray] = None) Union[float, dict] [source]
Evaluates the similarity of GenerativeModel samples and train data with the specified similarity metric.
- Parameters
train_data (array-like of shape (n_samples, n_features)) – Training sample.
test_data (array-like of shape (n_samples, n_features)) – Test sample for “zu” calculation.
similarity_metric ({"zu", "c2st_acc", "roc_auc", "ks_test"}, default="zu") – Metric that is used to choose the optimal generative model. “zu” metric refers to a Data-Copying Test from (C. Meehan et al., 2020). “c2st_acc” refers to a Classifier Two-Sample Test, that uses a 1-Nearest Neighbor classifier and computes the leave-one-out (LOO) accuracy separately for the real and generated samples (Q. Xu et al., 2018). “roc_auc” refers to ROC AUC for gradient boosting classifier (Lopez-Paz, D., & Oquab, M., 2017). “ks_test”: the marginal distributions of samples are compared using Kolmogorov-Smirnov test (Massey Jr, F. J., 1951).
- Returns
res_score – Mean value of similarity_metric. For “ks_test”, a dictionary is returned with the statistic and the p-value obtained from the permutation test.
- Return type
float or dict
References
Meehan C., Chaudhuri K., Dasgupta S. (2020) “A non-parametric test to detect data-copying in generative models” International Conference on Artificial Intelligence and Statistics.
Xu, Q., Huang, G., Yuan, Y., Guo, C., Sun, Y., Wu, F., & Weinberger, K. (2018) “An empirical study on evaluation metrics of generative adversarial networks” arXiv preprint arXiv:1806.07755.
Lopez-Paz, D., & Oquab, M. (2017) “Revisiting classifier two-sample tests” International Conference on Learning Representations.
Massey Jr, F. J. (1951) “The Kolmogorov-Smirnov test for goodness of fit” Journal of the American statistical Association, 46(253): 68-78.
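A typical workflow with the class, using only the constructor arguments and methods documented above, might look like the sketch below. The helper names `holdout_split` and `fit_and_sample` are hypothetical, the 20% hold-out fraction is an arbitrary choice, and `asid` is imported lazily inside the function because its availability is assumed rather than guaranteed.

```python
import numpy as np


def holdout_split(data: np.ndarray, test_fraction: float = 0.2, seed: int = 0):
    """Shuffle rows and split them into train and test parts (the "zu" metric uses a test sample)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(data))
    n_test = int(round(test_fraction * len(data)))
    return data[order[n_test:]], data[order[:n_test]]


def fit_and_sample(train: np.ndarray, test: np.ndarray, n_rows: int = 100):
    """Fit GenerativeModel, draw a synthetic sample and score it with "zu"."""
    # Lazy import: requires the asid package to be installed.
    from asid.automl_small.gm import GenerativeModel

    model = GenerativeModel(
        gen_model_type="optimize",
        similarity_metric="zu",
        num_syn_samples=100,
        hyperopt_time=0,
    )
    model.fit(train)
    synthetic = model.sample(n_rows, random_state=42)
    zu = model.score(train, similarity_metric="zu", test_data=test)
    return model.gen_model_label_, synthetic, zu
```

Calling `fit_and_sample(*holdout_split(data))` on a small tabular array would return the label of the selected algorithm, a synthetic sample and its "zu" score; with "optimize", fitting may take a while because several generative models are estimated.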
asid.automl_small.tools module
- asid.automl_small.tools.choose_and_fit_model(data: ndarray, similarity_metric: Optional[str], scaler: object, data_scaled: ndarray, num_syn_samples: int, hyp_time: int) Tuple[object, str, float, dict] [source]
Chooses an optimal generative model and fits GenerativeModel instance.
- Parameters
data (array-like of shape (n_samples, n_features)) – Training sample.
similarity_metric ({"zu", "c2st_acc"} or None, default="zu") – Metric that is used to choose the optimal generative model.
scaler (instance) – Fitted scaler that is applied prior to generative model estimation.
data_scaled (array-like of shape (n_samples, n_features)) – Normalized training sample.
num_syn_samples (int) – The number of synthetic samples generated to evaluate the similarity_metric score.
hyp_time (int) – The runtime setting (in seconds) for Hyperopt optimization. Hyperopt is used to find the optimal hyper-parameters for generative models.
- Returns
gen_model (instance) – Optimal fitted generative model.
best_alg_label (str) – Optimal generative algorithm label.
best_score (float) – Mean value of similarity_metric for the optimal generative model.
log_dict (dict) – Score and time data series for the range of estimated generative models.