autopandas package¶
Subpackages¶
- autopandas.generators package
- Submodules
- autopandas.generators.anm module
- autopandas.generators.artificial module
- autopandas.generators.autoencoder module
- autopandas.generators.copula module
- autopandas.generators.copycat module
- autopandas.generators.gmm module
- autopandas.generators.kde module
- autopandas.generators.sae module
- autopandas.generators.vae module
- Module contents
- autopandas.utils package
- Submodules
- autopandas.utils.automl module
- autopandas.utils.benchmark module
- autopandas.utils.datasets module
- autopandas.utils.encoding module
- autopandas.utils.imputation module
- autopandas.utils.metric module
- autopandas.utils.normalization module
- autopandas.utils.reduction module
- autopandas.utils.sdv module
- autopandas.utils.utilities module
- autopandas.utils.visualization module
- Module contents
Submodules¶
autopandas.autopandas module¶
- class autopandas.autopandas.AutoData(*args, indexes=None, **kwargs)[source]¶ Bases: pandas.core.frame.DataFrame
- __init__(*args, indexes=None, **kwargs)[source]¶ AutoData is a data structure extending the pandas DataFrame. The goal is to quickly get to grips with a dataset. An AutoData object represents a 2D data frame with:
- Examples in rows
- Features in columns
If needed, it automatically performs:
- numerical/categorical variable inference
- train/test split
- distance(data, method=None, **kwargs)[source]¶ Distance between two AutoData frames. TODO: there are methods to add (cf. utilities.py and metric.py). Usage example: ad1.distance(ad2, method='privacy')
Parameters: - data – Second distribution to compare with
- method – 'none' (nn_discrepancy), 'discriminant'
- encoding(method='label', key='categorical', target=None, split=True, **kwargs)[source]¶ Encode (categorical) variables.
Parameters: - method – 'none', 'label', 'one-hot', 'rare-one-hot', 'target', 'likelihood', 'count', 'probability', 'binarize'
- key – Subset of data to encode. Set to None for all columns. 'categorical' by default.
- target – Target column name (for target encoding).
- coeff – Coefficient defining rare values (for rare one-hot encoding). A category is rare if it occurs fewer than (average number of occurrences * coefficient) times.
- split – If False, process the whole frame without train/test split.
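Since AutoData extends pandas.DataFrame, the 'label' and 'one-hot' strategies correspond to standard pandas operations. A minimal sketch with plain pandas (the toy frame and column names are illustrative, not part of the AutoData API):

```python
import pandas as pd

# Hypothetical toy frame; 'city' stands in for a categorical column.
df = pd.DataFrame({"city": ["Paris", "Lyon", "Paris", "Nice"],
                   "age": [25, 32, 47, 19]})

# Label encoding: map each category to an integer code.
df["city_label"] = df["city"].astype("category").cat.codes

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(df["city"], prefix="city")
```

With categories sorted alphabetically (Lyon, Nice, Paris), the label codes above come out as 2, 0, 2, 1.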
- flush_index(key=None, compute_types=True)[source]¶ Delete non-existing columns from indexes. For a specific set, this removes indexes that no longer apply; for example, with key='X_train', the 'test' and 'y' indexes are deleted.
- generate(method=None)[source]¶ Fit a generator and generate data with default/kwargs parameters. TODO
Parameters: method – ANM, GAN, VAE, Copula, etc.
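Whatever the model, the listed generators all follow a fit-then-sample pattern. As a toy illustration only (a single multivariate Gaussian, not one of the package's actual generators):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
real = pd.DataFrame(rng.normal(size=(200, 2)), columns=["f1", "f2"])

# Fit: estimate the mean and covariance of the real data.
mu = real.mean().to_numpy()
cov = np.cov(real.to_numpy(), rowvar=False)

# Generate: sample new rows from the fitted distribution.
fake = pd.DataFrame(rng.multivariate_normal(mu, cov, size=100),
                    columns=real.columns)
```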
- get_data(key=None)[source]¶ Get data.
Parameters: key – Wanted subset of data ('train', 'categorical_header', 'y', etc.)
- get_task()[source]¶ Return 'regression' or 'classification' depending on the target type. TODO: multiclass?
- get_types()[source]¶ Compute variable types: numerical or categorical. This information is then stored as indexes under the keys 'numerical' and 'categorical'.
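One common way to infer such types in plain pandas is from the column dtypes; a sketch of the idea (illustrative, not the package's implementation):

```python
import pandas as pd

# Hypothetical toy frame with mixed column types.
df = pd.DataFrame({"age": [25, 32, 47],
                   "income": [1200.0, 3400.0, 2100.0],
                   "city": ["Paris", "Lyon", "Nice"]})

# Numeric dtypes are treated as numerical, everything else as categorical.
numerical = df.select_dtypes(include="number").columns.tolist()
categorical = df.select_dtypes(exclude="number").columns.tolist()
```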
- imputation(method='most', key=None, model=None, split=True, fit=True)[source]¶ Impute missing values.
Parameters: - method – None, 'remove', 'most', 'mean', 'median', 'infer'
- model – Predictive model (for the 'infer' method)
- fit – If True, fit the model (for the 'infer' method)
- split – If False, process the whole frame without train/test split.
Returns: Data with imputed values.
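The 'most' and 'mean' strategies have direct pandas equivalents; a sketch with a hypothetical toy frame (illustrative, not the package's implementation):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25.0, np.nan, 47.0, 25.0],
                   "city": ["Paris", "Lyon", None, "Paris"]})

# 'most': fill each column with its most frequent value.
most = df.fillna(df.mode().iloc[0])

# 'mean': fill numerical columns with their mean.
mean = df.copy()
mean["age"] = mean["age"].fillna(mean["age"].mean())
```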
- lda(key=None, verbose=False, **kwargs)[source]¶ Compute Linear Discriminant Analysis.
Parameters: - verbose – Display additional information during the run
- **kwargs – Additional parameters for LDA (see the sklearn documentation)
Returns: Transformed data
- normalization(method='standard', key='numerical', split=True, **kwargs)[source]¶ Normalize data.
Parameters: - method – 'standard', 'min-max', None, 'binarize'
- key – Subset of data to normalize. Set to None for all columns. 'numerical' by default.
- split – If False, process the whole frame without train/test split.
Returns: Normalized data.
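The 'standard' and 'min-max' methods correspond to the usual formulas, which can be written directly in pandas (illustrative sketch, not the package's implementation):

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0]})

# 'standard': zero mean, unit variance (pandas .std() is the sample std).
standard = (df - df.mean()) / df.std()

# 'min-max': rescale each column to [0, 1].
min_max = (df - df.min()) / (df.max() - df.min())
```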
- pairplot(key=None, max_features=12, force=False, save=None, **kwargs)[source]¶ Plot pairwise relationships between features.
Parameters: - key – Key for subset selection (e.g. 'X_train' or 'categorical')
- max_features – Maximum number of features to pairplot.
- force – If True, plot the graphs even if the number of features is greater than max_features.
- save – Path/filename to save the figure (if not None)
- pca(key=None, return_param=False, model=None, verbose=False, **kwargs)[source]¶ Compute Principal Component Analysis. Use kwargs for additional PCA parameters (cf. the sklearn documentation). Warning: partially redundant with the 'transform' method.
Parameters: - key – Indexes key to select data.
- return_param – If True, return a tuple (X, pca) to store the PCA parameters and apply them later.
- model – Use this argument to pass an already trained PCA model.
- verbose – Display additional information during the run.
Returns: Transformed data
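The docstring refers to sklearn's PCA; the underlying computation (center, then project onto the top right singular vectors) can be sketched in plain NumPy (illustrative, not the package's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))

# Center the data, then project onto the top-k principal directions.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
X_reduced = Xc @ Vt[:k].T
```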
- plot(key=None, ad=None, c=None, save=None, **kwargs)[source]¶ Plot an AutoData frame:
- Distribution plot for 1D data
- Scatter plot for 2D data
- Heatmap for >2D data
For scatter plots, the coloration defaults to the class if possible, or can be defined with the c parameter.
Parameters: - key – Key for subset selection (e.g. 'X_train' or 'categorical')
- ad – AutoData frame to plot in superposition
- c – Sequence of color specifications of length n (e.g. data.get_data('y'))
- save – Path/filename to save the figure (if not None)
- processing()[source]¶ Do basic processing in order to quickly make the dataset usable. To use personalized parameters, prefer the dedicated processing methods (encoding, normalization); this method is kept simple on purpose. Steps:
- Missing values imputation.
- Label encoding for categorical variables.
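The two steps above can be sketched in plain pandas (toy frame and column names are illustrative, not the package's code):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25.0, np.nan, 47.0],
                   "city": ["Paris", None, "Paris"]})

# Step 1: missing value imputation with the most frequent value.
df = df.fillna(df.mode().iloc[0])

# Step 2: label encoding for categorical variables.
for col in df.select_dtypes(exclude="number").columns:
    df[col] = df[col].astype("category").cat.codes
```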
- reduction(method='pca', key=None, verbose=False, **kwargs)[source]¶ Dimensionality reduction.
Parameters: method – 'pca', 'lda', 'tsne', 'hashing', 'factor', 'random'
- score(model=None, metric=None, method='baseline', test=None, fit=True, verbose=False)[source]¶ Benchmark, a.k.a. utility. Return the metric score of a model (by default, a baseline) trained and tested on the data. If a test set is defined via the 'test' parameter, the model is trained on this frame and tested on 'test'.
Parameters: - model – Model to fit and test on the data.
- metric – Scoring function.
- method – 'baseline' or 'auto'. Only used if model is None.
- fit – If True, fit the model.
- test – Optional DataFrame to use as the test set.
- verbose – If True, print model information, the classification report and the metric function.
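The docstring does not specify which baseline model is used; purely as an illustration of the scoring idea, a majority-class baseline evaluated with accuracy on hypothetical toy labels:

```python
import pandas as pd

# Toy binary labels (hypothetical data, for illustration only).
y_train = pd.Series([0, 0, 1, 0, 1, 0])
y_test = pd.Series([0, 1, 0, 0])

# Majority-class baseline: always predict the most frequent training label.
majority = y_train.mode().iloc[0]
accuracy = (y_test == majority).mean()
```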
- score_error_bars(n=10, model=None, metric=None, method='baseline', test=None, fit=True, verbose=False)[source]¶ Call the score method several times with the same parameters and return the mean and variance.
- set_class(y=None)[source]¶ Procedure that defines one or several column(s) representing the class / target / y.
Parameters: y – str or list of str giving the column name(s) to use as class(es). If y is None, the target is re-initialized (no class).
- set_indexes(key, value)[source]¶ Set an entry in the index. For example: data.set_indexes('y', 'income')
- to_automl(path='.', name='autodata')[source]¶ Write files in AutoML format. The AutoML format is well suited to creating a Codalab competition.
Parameters: - path – Where to save the dataset
- name – Name of the dataset, used in the filenames
- train_test_split(test_size=0.3, shuffle=True, valid=False, valid_size=0.1, train_size=None)[source]¶ Procedure doing the train/test split and storing it into self.indexes.
Parameters: - test_size – Proportion of examples in the test set.
- shuffle – Whether to shuffle the examples or not.
- valid – Whether to do a train/valid/test split or not (not implemented yet).
- valid_size – Proportion of examples in the validation set (not implemented yet).
- train_size – Alias defining test_size as (1 - train_size).
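A sketch of what such a split amounts to, storing row indices in a plain dict as a stand-in for self.indexes (illustrative, not the package's code):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": range(10)})

rng = np.random.default_rng(0)
idx = rng.permutation(len(df))      # shuffle example indices
n_test = int(len(df) * 0.3)         # test_size=0.3

# Store the split as index lists rather than copying the data.
indexes = {"test": idx[:n_test].tolist(),
           "train": idx[n_test:].tolist()}

train = df.iloc[indexes["train"]]
test = df.iloc[indexes["test"]]
```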
- transform(model, key=None, return_param=False, **kwargs)[source]¶ General method to apply a transformation to the data, e.g. normalization, dimensionality reduction, etc. You can use kwargs for additional parameters of the transformer.
Parameters: - model – The transformer. It must have 'fit' and 'transform' methods. If it is an object instead of a class, the model is considered to be an already fitted transformer.
- return_param – If True, return the fitted transformer.
- autopandas.autopandas.compare_marginals(ad1, ad2, **kwargs)[source]¶ Alias for the marginals comparison plot.
- autopandas.autopandas.from_X_y(X, y)[source]¶ Create an AutoData frame from an X and a y (class) DataFrame.
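Since AutoData extends pandas.DataFrame, this operation is essentially a column-wise concatenation; a sketch with plain pandas (the column names are illustrative):

```python
import pandas as pd

X = pd.DataFrame({"f1": [1, 2], "f2": [3, 4]})
y = pd.DataFrame({"target": [0, 1]})

# Concatenate features and target column-wise.
data = pd.concat([X, y], axis=1)
```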
Module contents¶
Process, visualize and use data easily.