autopandas package

Submodules

autopandas.autopandas module

class autopandas.autopandas.AutoData(*args, indexes=None, **kwargs)[source]

Bases: pandas.core.frame.DataFrame

__init__(*args, indexes=None, **kwargs)[source]

AutoData is a data structure extending Pandas DataFrame. The goal is to quickly get to grips with a dataset. An AutoData object represents a 2D data frame with:

  • Examples in rows
  • Features in columns

If needed, it automatically performs:

  • numerical/categorical variable inference
  • train/test split
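A minimal usage sketch (the DataFrame content is illustrative):

  >>> import pandas as pd
  >>> import autopandas
  >>> df = pd.DataFrame({'age': [25, 32, 47, 51], 'city': ['Paris', 'Lyon', 'Paris', 'Nice']})
  >>> data = autopandas.AutoData(df)   # type inference and train/test split run automatically
  >>> data.get_data('categorical')     # categorical subset, via the get_data keys documented below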

categorical_ratio()[source]

Alias for symbolic_ratio method.

class_deviation()[source]
compare_marginals(ad, **kwargs)[source]
copy()[source]

Re-defines copy to keep indexes from one copy to another.

correlation(**kwargs)[source]
descriptors()[source]

All descriptors in one call.

distance(data, method=None, **kwargs)[source]

Distance between two AutoData frames. TODO: there are methods to add (cf. utilities.py and metric.py).

Usage example: ad1.distance(ad2, method='privacy')

Parameters:
  • data – Second distribution to compare with
  • method – ‘none’ (nn_discrepancy), ‘discriminant’
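A usage sketch, assuming ad1 and ad2 are AutoData frames sharing the same columns:

  >>> d = ad1.distance(ad2)                         # default: nn_discrepancy
  >>> d = ad1.distance(ad2, method='discriminant')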
encode(**kwargs)[source]

Alias for encoding method.

encoding(method='label', key='categorical', target=None, split=True, **kwargs)[source]

Encode (categorical) variables.

Parameters:
  • method – ‘none’, ‘label’, ‘one-hot’, ‘rare-one-hot’, ‘target’, ‘likelihood’, ‘count’, ‘probability’, ‘binarize’
  • key – Subset of data to encode. Set to None for all columns. 'categorical' by default.
  • target – Target column name (target encoding).
  • coeff – Coefficient defining rare values (rare one-hot encoding). A category is considered rare when it occurs fewer times than (average number of occurrences * coefficient).
  • split – If False, do the processing on the whole frame without train/test split.
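A usage sketch; the target column name 'income' is illustrative, and encoding is assumed to return the encoded frame:

  >>> data = data.encoding(method='label')                       # label-encode categorical columns
  >>> data = data.encoding(method='one-hot', key='categorical')
  >>> data = data.encoding(method='target', target='income')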
flush_index(key=None, compute_types=True)[source]

Delete non-existing columns from indexes.

For example, with key='X_train', the 'test' and 'y' indexes are deleted.

generate(method=None)[source]

Fit a generator and generate data with default/kwargs parameters. TODO

Parameters:method – ANM, GAN, VAE, Copula, etc.
get_data(key=None)[source]

Get data.

Parameters:key – wanted subset of data (‘train’, ‘categorical_header’, ‘y’, etc.)
get_index(key=None)[source]

Return rows, columns.

get_task()[source]

Return 'regression' or 'classification' depending on the target type. TODO: multiclass?

get_types()[source]

Compute variable types: numerical or categorical. This information is then stored as indexes with the keys 'numerical' and 'categorical'.

has_class()[source]

Return True if 'y' is defined and corresponds to one or more columns.

has_split()[source]

Return True if the 'train' and 'test' indexes are defined.

heatmap(**kwargs)[source]
hierarchical_clustering(**kwargs)[source]
imputation(method='most', key=None, model=None, split=True, fit=True)[source]

Impute missing values.

Parameters:
  • method – None, ‘remove’, ‘most’, ‘mean’, ‘median’, ‘infer’
  • model – Predictive model (in case of ‘infer’ method)
  • fit – If True, fit the model (in case of ‘infer’ method)
  • split – If False, do the processing on the whole frame without train/test split.
Returns:

Data with imputed values.

Return type:

AutoData
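A usage sketch; the 'infer' variant assumes a scikit-learn style estimator with fit/predict methods is accepted as the predictive model:

  >>> data = data.imputation(method='median')
  >>> from sklearn.tree import DecisionTreeRegressor
  >>> data = data.imputation(method='infer', model=DecisionTreeRegressor())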

impute(**kwargs)[source]

Alias for imputation method.

input_size()[source]

Number of features in X.

lda(key=None, verbose=False, **kwargs)[source]

Compute Linear Discriminant Analysis.

Parameters:
  • verbose – Display additional information during run
  • **kwargs

    Additional parameters for LDA (see sklearn doc)

Returns:

Transformed data

Return type:

AutoData

missing_ratio()[source]

Ratio of missing values.

normalization(method='standard', key='numerical', split=True, **kwargs)[source]

Normalize data.

Parameters:
  • method – ‘standard’, ‘min-max’, None, ‘binarize’
  • key – Subset of data to normalize. Set to None for all columns. 'numerical' by default.
  • split – If False, do the processing on the whole frame without train/test split.
Returns:

Normalized data.

Return type:

AutoData
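A usage sketch of the documented variants:

  >>> data = data.normalization(method='standard')               # z-score the numerical columns
  >>> data = data.normalization(method='min-max', key=None)      # min-max scale all columns
  >>> data = data.normalization(method='standard', split=False)  # ignore the train/test split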

normalize(**kwargs)[source]

Alias for normalization method.

output_size()[source]

Number of unique classes.

pairplot(key=None, max_features=12, force=False, save=None, **kwargs)[source]

Plot pairwise relationships between features.

Parameters:
  • key – Key for subset selection (e.g. ‘X_train’ or ‘categorical’)
  • max_features – Max number of features to pairplot.
  • force – If True, plot the graphs even if the number of features is greater than max_features.
  • save – Path/filename to save figure (if not None)
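A usage sketch; the filename is illustrative:

  >>> data.pairplot(key='X_train', max_features=8)
  >>> data.pairplot(force=True, save='pairplot.png')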
pca(key=None, return_param=False, model=None, verbose=False, **kwargs)[source]

Compute Principal Component Analysis. Use kwargs for additional PCA parameters (cf. sklearn doc). Warning: this method is redundant with the 'transform' method.

Parameters:
  • key – Indexes key to select data.
  • return_param – If True, returns a tuple (X, pca) to store PCA parameters and apply them later.
  • model – Use this argument to pass a trained PCA model.
  • verbose – Display additional information during run.
Return type:

AutoData

Returns:

Transformed data
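A usage sketch of the return_param/model mechanism; other_data stands for a second AutoData frame with compatible columns, and n_components is passed through to sklearn:

  >>> reduced = data.pca(key='X_train', n_components=2)
  >>> reduced, model = data.pca(return_param=True)   # keep the fitted PCA
  >>> other_reduced = other_data.pca(model=model)    # reuse it on another frame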

plot(key=None, ad=None, c=None, save=None, **kwargs)[source]

Plot AutoData frame:

  • Distribution plot for 1D data
  • Scatter plot for 2D data
  • Heatmap for >2D data

For scatter plots, points are colored by class by default when possible, or the coloring can be defined with the c parameter.

Parameters:
  • key – Key for subset selection (e.g. ‘X_train’ or ‘categorical’)
  • ad – AutoData frame to plot in superposition
  • c – Sequence of color specifications of length n (e.g. data.get_data('y'))
  • save – Path/filename to save figure (if not None)
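A usage sketch; other_data and the filename are illustrative:

  >>> data.plot()                                     # distribution, scatter or heatmap depending on dimensionality
  >>> data.plot(key='X_train', c=data.get_data('y'))  # color points by class
  >>> data.plot(ad=other_data, save='plot.png')       # superpose a second AutoData frame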
plot_pca(key)[source]
process()[source]

Alias for processing method.

processing()[source]

Perform basic processing so the dataset can be used quickly. For personalized parameters, it is recommended to use the dedicated processing methods (encoding, normalization) instead. This method is kept simple on purpose; a usage sketch follows the list. Steps:

  1. Missing values imputation.
  2. Label encoding for categorical variables.
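A sketch of the shortcut and a rough equivalent with explicit calls (assuming each method returns the processed frame):

  >>> data = data.processing()
  >>> # roughly equivalent:
  >>> data = data.imputation(method='most').encoding(method='label')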
ratio(key=None)[source]

Dataset ratio: number of features divided by number of examples.

reduction(method='pca', key=None, verbose=False, **kwargs)[source]

Dimensionality reduction.

Parameters:method – ‘pca’, ‘lda’, ‘tsne’, ‘hashing’, ‘factor’, ‘random’
score(model=None, metric=None, method='baseline', test=None, fit=True, verbose=False)[source]

Benchmark, a.k.a. utility. This method returns the score of a baseline model on the dataset.

Returns the metric score of a model trained and tested on the data. If a test set is defined ('test' parameter), the model is trained on the data and tested on 'test'.

Parameters:
  • model – Model to fit and test on data.
  • metric – scoring function.
  • method – ‘baseline’ or ‘auto’. Useful only if model is None.
  • fit – If True, fit the model.
  • test – Test is an optional DataFrame to use as the test set.
  • verbose – If True, prints model information, classification report and metric function.
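A usage sketch, assuming a scikit-learn style model and scoring function are accepted:

  >>> data.score()  # default baseline model and metric
  >>> from sklearn.linear_model import LogisticRegression
  >>> from sklearn.metrics import accuracy_score
  >>> data.score(model=LogisticRegression(), metric=accuracy_score)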
score_error_bars(n=10, model=None, metric=None, method='baseline', test=None, fit=True, verbose=False)[source]

Call the score method n times with the same parameters and return the mean and variance of the scores.

set_class(y=None)[source]

Procedure that defines one or several columns representing the class / target / y.

Parameters:y – str or list of str representing column names to use as class(es). If y is None then the target is re-initialized (no class).
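A usage sketch; the column names are illustrative:

  >>> data.set_class('income')       # one target column
  >>> data.set_class(['y1', 'y2'])   # several target columns
  >>> data.set_class(None)           # reset: no class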
set_indexes(key, value)[source]

Set an entry in the index.

For example: data.set_indexes('y', 'income')

sparsity()[source]

Compute the proportion of zeros.

symbolic_ratio()[source]

Ratio of symbolic attributes.

task()[source]

Alias for get_task method. Return ‘regression’ or ‘classification’.

to_automl(path='.', name='autodata')[source]

Write files in AutoML format. The AutoML format is ideal for creating a Codalab competition.

Parameters:
  • path – where to save the dataset
  • name – name of the dataset to put into filenames
to_csv(*args, **kwargs)[source]

Write data into a CSV file. Index is not written by default.

train_test_split(test_size=0.3, shuffle=True, valid=False, valid_size=0.1, train_size=None)[source]

Procedure performing the train/test split and storing it in self.indexes.

Parameters:
  • test_size – proportion of examples in test set.
  • shuffle – whether to shuffle examples or not.
  • valid – whether to do a train/valid/test split or not (not implemented yet).
  • valid_size – proportion of examples in the validation set (not implemented yet).
  • train_size – alias that defines test_size as (1 - train_size).
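A usage sketch; 'X_train' is assumed to be a valid get_data key, as in the pairplot example:

  >>> data.train_test_split(test_size=0.2, shuffle=True)
  >>> X_train = data.get_data('X_train')
  >>> test = data.get_data('test')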
transform(model, key=None, return_param=False, **kwargs)[source]

General method to apply a transformation on data. It can be normalization, dimensionality reduction, etc. You can use kwargs for additional parameters of the transformer.

Parameters:
  • model – The transformer. It must have ‘fit’ and ‘transform’ methods. If it is an object instead of a class, model is considered to be an already fitted transformer.
  • return_param – If True, returns the fitted transformer.
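A usage sketch with a scikit-learn transformer, which provides the required fit/transform pair. By analogy with pca, return_param is assumed to yield a (data, transformer) pair; other_data is an illustrative second frame:

  >>> from sklearn.preprocessing import MinMaxScaler
  >>> transformed, scaler = data.transform(MinMaxScaler, return_param=True)  # class: fitted internally
  >>> other = other_data.transform(scaler)  # instance: treated as already fitted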
tsne(key=None, verbose=False, **kwargs)[source]

Compute t-SNE.

Parameters:
  • verbose – Display additional information during run
  • **kwargs

    Additional parameters for t-SNE (see sklearn doc)

Returns:

Transformed data

Return type:

AutoData

autopandas.autopandas.compare_marginals(ad1, ad2, **kwargs)[source]

Alias for marginals comparison plot.

autopandas.autopandas.distance(ad1, ad2, method=None)[source]

Alias for distance between AutoData.

autopandas.autopandas.from_X_y(X, y)[source]

Create an AutoData frame from X and y (class) DataFrames.

autopandas.autopandas.from_train_test(train, test)[source]

Create an AutoData frame from train and test DataFrames.
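Usage sketches for both constructors, where X, y, train and test are plain pandas DataFrames as stated above:

  >>> import autopandas
  >>> ad = autopandas.from_X_y(X, y)
  >>> ad = autopandas.from_train_test(train, test)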

autopandas.autopandas.plot(ad1, ad2, **kwargs)[source]

Alias for double plot.

autopandas.autopandas.read_csv(*args, **kwargs)[source]

Read data from a CSV file. The default behaviour is to infer the separator. It then creates an AutoData object, so numerical/categorical inference and the train/test split are done automatically.
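A usage sketch; the path is illustrative:

  >>> import autopandas
  >>> data = autopandas.read_csv('dataset.csv')  # separator inferred; types and split handled
  >>> data.missing_ratio()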

Module contents

Process, visualize and use data easily.