autopandas.utils package

Submodules

autopandas.utils.automl module

autopandas.utils.automl.from_automl(path)[source]

Read files in AutoML format. TODO

autopandas.utils.automl.read_automl(path)[source]

Alias for from_automl. Read files in AutoML format.

autopandas.utils.automl.to_automl(data, path='.', name='autodata')[source]

Write files in AutoML format. The AutoML format is well suited to creating a Codalab competition.

Parameters:
  • data – AutoData frame to format.
  • path – where to save the dataset
  • name – name of the dataset to put into filenames

autopandas.utils.benchmark module

autopandas.utils.benchmark.score(data, model=None, metric=None, method='baseline', fit=True, test=None, average='weighted', verbose=False)[source]

Benchmark, a.k.a. Utility.

Return the metric score of a model trained and tested on data. If a test set is defined (‘test’ parameter), the model is trained on ‘data’ and tested on ‘test’.

Parameters:
  • model – Model to fit and test on data.
  • metric – scoring function.
  • method – ‘baseline’ or ‘auto’. Useful only if model is None.
  • fit – If True, fit the model.
  • test – Test is an optional DataFrame to use as the test set.
  • average – Method for averaging the multi-class One-vs-One metrics scheme.
  • verbose – If True, prints model information, classification report and metric function.
Return type:

float

Returns:

Metric score of the model trained and tested on data.

autopandas.utils.benchmark.score_error_bars(data, n=10, model=None, metric=None, method='baseline', fit=True, test=None, verbose=False)[source]

Run the score method n times to compute error bars. The parameters are the same as for the score method.

TODO: optimize computation. TODO: cross-validation.

Returns:

Mean and variance.
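The idea of repeating a score to obtain error bars can be sketched in plain Python. This is only an illustration, not the autopandas implementation: the helper name error_bars_sketch, the score_fn callable, and the choice of population variance are all assumptions made for the example.

```python
import statistics


def error_bars_sketch(score_fn, n=10):
    """Hypothetical helper: call score_fn n times, return (mean, variance)."""
    scores = [score_fn() for _ in range(n)]
    # Assumption: population variance; the library may use another estimator.
    return statistics.fmean(scores), statistics.pvariance(scores)


# Stand-in for repeated calls to a scoring routine: three fixed scores.
fake_scores = iter([0.8, 0.9, 1.0])
mean, variance = error_bars_sketch(lambda: next(fake_scores), n=3)
print(round(mean, 3), round(variance, 4))
```

In practice score_fn would wrap one train/test run, so the spread reflects the randomness of the model and of the data split.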

autopandas.utils.datasets module

autopandas.utils.encoding module

autopandas.utils.encoding.count(data, column, mapping=None, probability=False, return_param=False)[source]

Performs frequency encoding.

Categories are replaced by their number of occurrences. Planned: option to return a probability instead of a count (normalization).

Parameters:
  • data – Data
  • column – Column to encode
  • mapping – Dictionary {category : value}
  • probability – If True, return probability instead of frequency
  • return_param – If True, the mapping is returned
Returns:

Encoded data

Return type:

pd.DataFrame
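A minimal sketch of frequency encoding with a reusable mapping, written on plain Python lists rather than with autopandas itself. The helper name count_encode and the handling of unseen categories (encoded as 0) are assumptions made for the example, not documented library behavior.

```python
from collections import Counter


def count_encode(values, mapping=None, probability=False):
    """Hypothetical helper: replace each category by its count (or share)."""
    if mapping is None:
        counts = Counter(values)
        total = len(values)
        # Optionally normalize the counts so that they sum to 1.
        mapping = {k: (v / total if probability else v) for k, v in counts.items()}
    # Assumption: categories missing from the mapping are encoded as 0.
    encoded = [mapping.get(v, 0) for v in values]
    return encoded, mapping


train = ["a", "b", "a", "c", "a"]
encoded, mapping = count_encode(train)
print(encoded)  # [3, 1, 3, 1, 3]

# Reuse the mapping learned on the training values for new values.
new_encoded, _ = count_encode(["b", "a", "z"], mapping=mapping)
print(new_encoded)  # [1, 3, 0]
```

Returning the mapping alongside the encoded values mirrors the role of return_param: the same encoding can then be applied to a test set.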

autopandas.utils.encoding.frequency(columns, probability=False)[source]

Frequency encoding: Pandas series to frequency/probability distribution.

Warning: takes only column(s), not a DataFrame.

If there are several series, the outputs will have the same format.

Example

C1: [‘b’, ‘a’, ‘a’, ‘b’, ‘b’] C2: [‘b’, ‘b’, ‘b’, ‘c’, ‘b’]

f1: {‘a’: 2, ‘b’: 3, ‘c’: 0} f2: {‘a’: 0, ‘b’: 4, ‘c’: 1}

Output: [[2, 3, 0], [0, 4, 1]] (with probability = False)

Parameters:
  • probability – True for probabilities, False for frequencies.
Returns:

Frequency/probability distribution.

Return type:

list

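The example for frequency can be reproduced with a short pure-Python sketch. The function frequency_sketch is a hypothetical stand-in, and ordering the shared categories alphabetically is an assumption chosen because it matches the documented output.

```python
from collections import Counter


def frequency_sketch(columns, probability=False):
    """Hypothetical helper: align several columns on shared categories."""
    # Sorted union of categories so that every output has the same layout.
    categories = sorted(set().union(*columns))
    output = []
    for column in columns:
        counts = Counter(column)
        row = [counts[c] for c in categories]  # Counter gives 0 when absent
        if probability:
            total = sum(row)
            row = [x / total for x in row]
        output.append(row)
    return output


c1 = ["b", "a", "a", "b", "b"]
c2 = ["b", "b", "b", "c", "b"]
print(frequency_sketch([c1, c2]))  # [[2, 3, 0], [0, 4, 1]]
```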
autopandas.utils.encoding.label(data, column)[source]

Performs label encoding.

Example

Color: [‘blue’, ‘green’, ‘blue’, ‘pink’] is encoded by Color: [1, 2, 1, 3]

Parameters:
  • data – Data
  • column – Column to encode
Returns:

Encoded data

Return type:

pd.DataFrame
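The Color example can be sketched as follows. Assigning codes in order of first appearance, starting at 1, is an assumption that happens to match the example; the library may number categories differently. The name label_encode is hypothetical.

```python
def label_encode(values):
    """Hypothetical helper: map each category to an integer code."""
    codes = {}
    encoded = []
    for value in values:
        if value not in codes:
            # Assumption: codes start at 1 and follow first appearance.
            codes[value] = len(codes) + 1
        encoded.append(codes[value])
    return encoded, codes


colors = ["blue", "green", "blue", "pink"]
encoded_colors, color_codes = label_encode(colors)
print(encoded_colors)  # [1, 2, 1, 3]
```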

autopandas.utils.encoding.likelihood(data, column, mapping=None, return_param=False)[source]

Performs likelihood encoding.

Parameters:
  • data – Data
  • column – Column to encode
  • mapping – Dictionary {category : value}
  • return_param – If True, the mapping is returned
Returns:

Encoded data

Return type:

pd.DataFrame

autopandas.utils.encoding.none(data, column)[source]

Remove column from data.

autopandas.utils.encoding.normalize(l, normalization='probability')[source]

Return a normalized list.

Parameters:
  • normalization – ‘probability’: values between 0 and 1 that sum to 1, or ‘min-max’: the minimum becomes 0 and the maximum becomes 1.

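The two normalization modes can be illustrated with a small sketch. The function normalize_sketch is a hypothetical re-implementation on plain lists; it assumes a non-zero sum for ‘probability’ and at least two distinct values for ‘min-max’.

```python
def normalize_sketch(values, normalization="probability"):
    """Hypothetical helper: normalize a list of numbers."""
    if normalization == "probability":
        total = sum(values)  # assumed non-zero
        return [v / total for v in values]
    if normalization == "min-max":
        low, high = min(values), max(values)  # assumed low != high
        return [(v - low) / (high - low) for v in values]
    raise ValueError(f"unknown normalization: {normalization}")


print(normalize_sketch([1, 3, 4]))  # [0.125, 0.375, 0.5]
scaled = normalize_sketch([1, 3, 4], normalization="min-max")
```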
autopandas.utils.encoding.one_hot(data, column, rare=False, coeff=0.1)[source]

Performs one-hot encoding.

Example

Color: [‘black’, ‘white’, ‘white’] is encoded by Black: [1, 0, 0] White: [0, 1, 1]

Parameters:
  • data – Data
  • column – Column to encode
  • rare – If True, rare categories are merged into one
  • coeff – Coefficient defining rare values. A category is rare if it occurs fewer than (average number of occurrences * coefficient) times.
Returns:

Encoded data

Return type:

pd.DataFrame
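A sketch of one-hot encoding with the rare-category rule described for coeff. It works on plain lists; the function name one_hot_sketch, the label "rare" for the merged category, and the alphabetical column order are assumptions for illustration.

```python
from collections import Counter


def one_hot_sketch(values, rare=False, coeff=0.1, rare_label="rare"):
    """Hypothetical helper: return {category: 0/1 indicator list}."""
    counts = Counter(values)
    if rare and counts:
        # Threshold: average number of occurrences per category times coeff.
        threshold = coeff * len(values) / len(counts)
        values = [v if counts[v] >= threshold else rare_label for v in values]
    categories = sorted(set(values))
    return {c: [1 if v == c else 0 for v in values] for c in categories}


print(one_hot_sketch(["black", "white", "white"]))
# {'black': [1, 0, 0], 'white': [0, 1, 1]}

# 12 values over 3 categories: average 4, threshold 2 with coeff=0.5,
# so "c" (1 occurrence) is merged into the rare column.
merged = one_hot_sketch(["a"] * 6 + ["b"] * 5 + ["c"], rare=True, coeff=0.5)
print(sorted(merged))  # ['a', 'b', 'rare']
```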

autopandas.utils.encoding.target(data, column, target, mapping=None, return_param=False)[source]

Performs target encoding.

Parameters:
  • data – Data
  • column – Column to encode
  • target – Target column name
  • mapping – Dictionary {category : value}
  • return_param – If True, the mapping is returned
Returns:

Encoded data

Return type:

pd.DataFrame

autopandas.utils.imputation module

autopandas.utils.imputation.infer(data, column, model=None, return_param=False, fit=True)[source]

Replace missing values in the column with the predictions of a model.

Parameters:
  • data – AutoData data
  • column – Column to impute
  • model – Predictive model to train and use for imputation
  • return_param – If True, returns a tuple (data, model) to store fitted model and apply it later
  • fit – If True, fit the model
Returns:

Imputed data

Return type:

autopandas.AutoData

autopandas.utils.imputation.mean(data, column)[source]

Replace missing values by the mean of the column.

Parameters:
  • data – AutoData data
  • column – Column to impute
Returns:

Imputed data

Return type:

autopandas.AutoData

autopandas.utils.imputation.median(data, column)[source]

Replace missing values by the median of the column.

Parameters:
  • data – AutoData data
  • column – Column to impute
Returns:

Imputed data

Return type:

autopandas.AutoData

autopandas.utils.imputation.most(data, column)[source]

Replace missing values by the most frequent value of the column.

Parameters:
  • data – AutoData data
  • column – Column to impute
Returns:

Imputed data

Return type:

autopandas.AutoData
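The mean, median, and most-frequent strategies can be compared on a small example. This sketch uses plain lists with None as the missing-value marker, which is an assumption; autopandas operates on AutoData frames, and the helper impute_sketch is hypothetical.

```python
import statistics
from collections import Counter


def impute_sketch(values, strategy="mean"):
    """Hypothetical helper: fill None entries of a list."""
    present = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.mean(present)
    elif strategy == "median":
        fill = statistics.median(present)
    elif strategy == "most":
        fill = Counter(present).most_common(1)[0][0]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


column = [1, None, 2, 2, None, 7]
by_mean = impute_sketch(column, "mean")      # missing entries become 3
by_median = impute_sketch(column, "median")  # missing entries become 2
by_most = impute_sketch(column, "most")      # missing entries become 2
```

The mean is pulled upward by the outlier 7, while the median and the most frequent value are not.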

autopandas.utils.imputation.remove(data, columns)[source]

Simply remove columns containing missing values.

Parameters:
  • data – AutoData data
  • columns – Column(s) to impute
Returns:

Imputed data

Return type:

autopandas.AutoData

autopandas.utils.metric module

autopandas.utils.metric.acc_stat(solution, prediction)[source]

Return accuracy statistics TN, FP, TP, FN. Assumes that solution and prediction are binary 0/1 vectors.

autopandas.utils.metric.bac_metric(solution, prediction)[source]

Compute the balanced accuracy for binary classification.
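Balanced accuracy for binary labels is commonly defined as the average of sensitivity and specificity. The sketch below follows that definition on plain lists; the function names are hypothetical, and it assumes both classes appear in solution. The library's own handling of edge cases may differ.

```python
def acc_stat_sketch(solution, prediction):
    """Hypothetical helper: return (TN, FP, TP, FN) for 0/1 vectors."""
    pairs = list(zip(solution, prediction))
    tn = sum(1 for s, p in pairs if s == 0 and p == 0)
    fp = sum(1 for s, p in pairs if s == 0 and p == 1)
    tp = sum(1 for s, p in pairs if s == 1 and p == 1)
    fn = sum(1 for s, p in pairs if s == 1 and p == 0)
    return tn, fp, tp, fn


def bac_sketch(solution, prediction):
    """Average of sensitivity and specificity (both classes assumed present)."""
    tn, fp, tp, fn = acc_stat_sketch(solution, prediction)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return 0.5 * (sensitivity + specificity)


solution = [1, 1, 1, 1, 0, 0]
prediction = [1, 1, 1, 0, 0, 1]
print(bac_sketch(solution, prediction))  # 0.625
```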

autopandas.utils.metric.discriminant(data1, data2, model=None, metric=None, name1='Dataset 1', name2='Dataset 2', same_size=False, verbose=False)[source]

Return the scores of a classifier trained to differentiate data1 and data2.

If the distributions are similar and the model cannot distinguish them, the score will be ~ 0.5 (depending on the metric, of course).

Parameters:
  • model – The classifier. It has to have fit(X,y) and score(X,y) methods. Logistic regression by default.
  • metric – The scoring metric. Accuracy by default.
  • same_size – If True, normalize datasets to same size before computation.
Returns:

Classification report (precision, recall, f1-score).

Return type:

str

autopandas.utils.metric.distance(x, y, axis=None, norm='euclidean', method=None)[source]

Compute the distance between x and y (data points).

Default behaviour: flatten multi-dimensional arrays.

Parameters:
  • x – Array-like, first point
  • y – Array-like, second point
  • axis – Axis of x along which to compute the vector norms.
  • norm – ‘l0’, ‘manhattan’, ‘euclidean’, ‘minimum’ or ‘maximum’
  • method – Alias for norm parameter.
Returns:

Distance value

Return type:

float
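One plausible reading of the norm options is sketched below on flattened points. The interpretations are assumptions, not confirmed library behavior: ‘l0’ counts the coordinates that differ, and ‘minimum’ and ‘maximum’ take the smallest and largest absolute coordinate difference. The name distance_sketch is hypothetical.

```python
import math


def distance_sketch(x, y, norm="euclidean"):
    """Hypothetical helper: distance between two flat sequences."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if norm == "l0":
        return float(sum(1 for d in diffs if d != 0))
    if norm == "manhattan":
        return float(sum(diffs))
    if norm == "euclidean":
        return math.sqrt(sum(d * d for d in diffs))
    if norm == "minimum":
        return float(min(diffs))
    if norm == "maximum":
        return float(max(diffs))
    raise ValueError(f"unknown norm: {norm}")


x, y = [0, 0, 0], [3, 4, 0]
print(distance_sketch(x, y))               # 5.0
print(distance_sketch(x, y, "manhattan"))  # 7.0
```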

autopandas.utils.metric.distance_matrix(data1, data2, distance_func=None)[source]

Compute the matrix of distances between each pair of points (m_ij is the distance between points i and j). TODO: parallelization.

Parameters:
  • data1 – Distribution
  • data2 – Distribution
  • distance_func – Distance metric function to use to compare data points. Euclidean distance by default.
autopandas.utils.metric.nn_discrepancy(X1, X2)[source]

Use the 1-nearest-neighbor method to determine the discrepancy between X1 and X2. If X1 and X2 are very different, they are easy to classify, so the balanced accuracy (bac) is > 0.5. If they are similar, bac ~ 0.5.

autopandas.utils.metric.nnaa(data_s, data_t, distance_func=None, detailed_results=False)[source]

Compute nearest neighbors adversarial accuracy between data_s and data_t. This is the proportion of points in data_s for which the nearest neighbor is in data_s (and not in data_t). It can also be seen as the binary classification score of a 1NN trying to tell if a point is from data_s or data_t, in a leave-one-out setting. If data_s and data_t follow the same distribution, the score should be near 0.5:

  • nnaa > 0.5 -> underfitting
  • nnaa ~ 0.5 -> good
  • nnaa < 0.5 -> overfitting

From “Privacy Preserving Synthetic Health Data” by Andrew Yale et al.

Parameters:
  • data_s – 2D distribution (s for “source”).
  • data_t – 2D distribution (t for “target”).
  • distance_func – Distance metric function to use to compare data points. Euclidean distance by default.
  • detailed_results – If True, return score but also score for TS and ST (the 2 components of the score).
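A brute-force sketch of the score, assuming the two-component form suggested by detailed_results: each component is the share of points whose nearest neighbor in their own set (leave-one-out) is closer than their nearest neighbor in the other set, and the final score is the average of the two. The function name nnaa_sketch is hypothetical, Euclidean distance is used, and each set is assumed to hold at least two points.

```python
import math


def nnaa_sketch(data_s, data_t):
    """Hypothetical helper: return (score, component_s, component_t)."""

    def component(own, other):
        hits = 0
        for i, point in enumerate(own):
            # Leave-one-out: exclude the point itself from its own set.
            rest = own[:i] + own[i + 1:]
            d_own = min(math.dist(point, q) for q in rest)
            d_other = min(math.dist(point, q) for q in other)
            if d_other > d_own:
                hits += 1
        return hits / len(own)

    score_s = component(data_s, data_t)
    score_t = component(data_t, data_s)
    return 0.5 * (score_s + score_t), score_s, score_t


source = [(0, 0), (0, 1), (1, 0)]
target = [(10, 10), (10, 11), (11, 10)]
print(nnaa_sketch(source, target))  # (1.0, 1.0, 1.0): trivially separable
print(nnaa_sketch(source, source))  # (0.0, 0.0, 0.0): exact copies
```

The two extreme cases show both ends of the scale: well-separated sets score 1, and a set compared with an exact copy of itself scores 0.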

autopandas.utils.normalization module

autopandas.utils.normalization.min_max(data, column, mini=None, maxi=None, return_param=False)[source]

Performs min-max normalization.

Parameters:
  • data – Data
  • column – Column to normalize
  • mini – Minimum, computed if not specified
  • maxi – Maximum, computed if not specified
  • return_param – If True, the minimum and maximum are returned
Returns:

Normalized data

Return type:

autopandas.AutoData

autopandas.utils.normalization.standard(data, column, mean=None, std=None, return_param=False)[source]

Performs standard normalization.

Parameters:
  • data – Data
  • column – Column to normalize
  • mean – Mean, computed if not specified
  • std – Standard deviation, computed if not specified
  • return_param – if True, mean and std are returned
Returns:

Normalized data

Return type:

autopandas.AutoData
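Both normalizations accept precomputed parameters so that statistics fitted on one dataset can be reapplied to another. The sketch below shows that pattern on plain lists; the helper names are hypothetical, the use of the population standard deviation is an assumption, and constant columns (zero range or zero deviation) are not handled.

```python
import statistics


def min_max_sketch(values, mini=None, maxi=None):
    """Hypothetical helper: scale values, return (scaled, (mini, maxi))."""
    mini = min(values) if mini is None else mini
    maxi = max(values) if maxi is None else maxi
    return [(v - mini) / (maxi - mini) for v in values], (mini, maxi)


def standard_sketch(values, mean=None, std=None):
    """Hypothetical helper: center and scale, return (scaled, (mean, std))."""
    mean = statistics.fmean(values) if mean is None else mean
    # Assumption: population standard deviation.
    std = statistics.pstdev(values) if std is None else std
    return [(v - mean) / std for v in values], (mean, std)


train = [2.0, 4.0, 6.0]
scaled_train, (mini, maxi) = min_max_sketch(train)
print(scaled_train)  # [0.0, 0.5, 1.0]

# Apply the training parameters to new data; values may fall outside [0, 1].
scaled_new, _ = min_max_sketch([3.0, 8.0], mini=mini, maxi=maxi)
print(scaled_new)  # [0.25, 1.5]

standardized, (mean, std) = standard_sketch(train)
```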

autopandas.utils.reduction module

autopandas.utils.reduction.feature_hashing(data, key=None, n_features=10, **kwargs)[source]

Feature hashing.

Parameters:
  • n_features – Desired number of features after feature hashing.

autopandas.utils.reduction.lda(data, key=None, verbose=False, **kwargs)[source]

Compute Linear Discriminant Analysis. Use kwargs for additional LDA parameters (cf. sklearn doc)

Parameters:
  • key – Indexes key to select data.
  • verbose – Display additional information during run.
Returns:

Transformed data

Return type:

AutoData

autopandas.utils.reduction.pca(data, key=None, return_param=False, verbose=False, model=None, **kwargs)[source]

Compute Principal Components Analysis. Use kwargs for additional PCA parameters (cf. sklearn doc).

Parameters:
  • key – Indexes key to select data.
  • return_param – If True, returns a tuple (X, pca) to store PCA parameters and apply them later.
  • model – Use this argument to pass a trained PCA model.
  • verbose – Display additional information during run.
Return type:

autopandas.AutoData

Returns:

Transformed data

autopandas.utils.reduction.tsne(data, key=None, verbose=False, **kwargs)[source]

Compute T-SNE. Use kwargs for additional T-SNE parameters (cf. sklearn doc)

Parameters:
  • key – Indexes key to select data.
  • verbose – Display additional information during run.
Returns:

Transformed data

Return type:

AutoData

autopandas.utils.sdv module

autopandas.utils.sdv.categorical(column)[source]

Convert a categorical column to continuous.

autopandas.utils.sdv.decode(new_data, data, limits, min_max)[source]

Decode the data from SDV format.

Parameters:
  • new_data – Data in SDV format
  • data – Original data
  • limits – Limits returned by sdv.encode
  • min_max – Min-max returned by sdv.encode
autopandas.utils.sdv.encode(data)[source]

Encode the data into SDV format.

autopandas.utils.sdv.numeric(column)[source]

Normalize a numerical column.

autopandas.utils.sdv.undo_categorical(column, lim)[source]

Undo the categorical conversion: convert a continuous column back to categorical, using the limits lim.

autopandas.utils.sdv.undo_numeric(column, min_column, max_column)[source]

Undo the normalization of a numerical column, using min_column and max_column.

autopandas.utils.utilities module

autopandas.utils.visualization module

autopandas.utils.visualization.compare_marginals(data1, data2, key=None, method='all', target=None, save=None, name1='dataset 1', name2='dataset2')[source]

Plot the metric (e.g. mean) for each variable from data1 and data2. If the distributions are similar, the points will follow the y=x line. The available metrics are the mean, the standard deviation, and the correlation with the target. data1 and data2 must have the same number of features.

Parameters:
  • method – ‘mean’, ‘std’, ‘corr’, ‘all’
  • target – Column name for the target for correlation method
  • save – Path to save the figure (doesn’t save if ‘save’ is None).
autopandas.utils.visualization.correlation(data, key=None, save=None, **kwargs)[source]

Plot correlation matrix.

Parameters:
  • key – Key for subset selection (e.g. ‘X_train’ or ‘categorical’)
  • save – Path/filename to save figure (if not None)
autopandas.utils.visualization.heatmap(data, key=None, save=None, **kwargs)[source]

Plot data heatmap.

Parameters:
  • key – Key for subset selection (e.g. ‘X_train’ or ‘categorical’)
  • save – Path/filename to save figure (if not None)
autopandas.utils.visualization.hierarchical_clustering(X, row_method='average', column_method='single', row_metric='euclidean', column_metric='euclidean', color_gradient='coolwarm')[source]

Show heatmap hierarchical clustering of X. The code is based in large part on the prototype methods at:

  • http://old.nabble.com/How-to-plot-heatmap-with-matplotlib–td32534593.html
  • http://stackoverflow.com/questions/7664826/how-to-get-flat-clustering-corresponding-to-color-clusters-in-the-dendrogram-cre

X is an (m by n) np.ndarray, m observations, n genes.

autopandas.utils.visualization.pairplot(data, key=None, max_features=12, force=False, save=None, **kwargs)[source]

Plot pairwise relationships between features.

Parameters:
  • key – Key for subset selection (e.g. ‘X_train’ or ‘categorical’)
  • max_features – Max number of features to pairplot.
  • force – If True, plot the graphs even if the number of features is greater than max_features.
  • save – Path/filename to save figure (if not None)
autopandas.utils.visualization.plot(data, key=None, ad=None, c=None, save=None, names=None, cmap='viridis', **kwargs)[source]

Plot AutoData frame:

  • Distribution plot for 1D data
  • Scatter plot for 2D data
  • Heatmap for >2D data

For the scatter plot, points are colored by class by default when possible; the colors can also be set with the c parameter.

Parameters:
  • key – Key for subset selection (e.g. ‘X_train’ or ‘categorical’)
  • ad – AutoData frame to plot in superposition
  • c – Sequence of color specifications of length n (e.g. data.get_data(‘y’))
  • save – Path/filename to save figure (if not None)

Module contents