autopandas.utils package¶

Submodules¶

autopandas.utils.automl module¶

autopandas.utils.automl.from_automl(path)[source]¶: Read files in AutoML format. TODO

autopandas.utils.automl.read_automl(path)[source]¶: Alias for from_automl. Read files in AutoML format.

autopandas.utils.automl.to_automl(data, path='.', name='autodata')[source]¶

Write files in AutoML format. AutoML format is well suited to creating a Codalab competition.

Parameters:
* data – AutoData frame to format.
* path – Where to save the dataset.
* name – Name of the dataset to put into filenames.

autopandas.utils.benchmark module¶

autopandas.utils.benchmark.score(data, model=None, metric=None, method='baseline', fit=True, test=None, average='weighted', verbose=False)[source]¶

Benchmark, a.k.a. Utility.

Return the metric score of a model trained and tested on data. If a test set is defined (‘test’ parameter), the model is trained on ‘data’ and tested on ‘test’.

Parameters:
* data – Data on which the model is trained and tested.
* model – Model to fit and test on data.
* metric – Scoring function.
* method – ‘baseline’ or ‘auto’. Useful only if model is None.
* fit – If True, fit the model.
* test – Optional DataFrame to use as the test set.
* average – Method for averaging the multi-class One-vs-One metrics scheme.
* verbose – If True, prints model information, classification report and metric function.
Return type:	float
Returns:	Metric score of the model trained and tested on data.

autopandas.utils.benchmark.score_error_bars(data, n=10, model=None, metric=None, method='baseline', fit=True, test=None, verbose=False)[source]¶: Run the score method n times to compute error bars. The parameters are the same as for the score method. Returns the mean and variance of the scores. TODO: optimize computation. TODO: cross-validation.
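
The repeat-and-aggregate idea behind error bars can be sketched in a few lines. This is a standalone illustration, not the library's code: `error_bars` and `score_fn` are hypothetical names, and the use of the sample variance (n - 1 denominator) is an assumption, since the text above does not say which variance is returned.

```python
import statistics

def error_bars(score_fn, n=10):
    """Call score_fn n times and return (mean, variance) of the scores."""
    scores = [score_fn() for _ in range(n)]
    # Sample variance (n - 1 denominator); requires n >= 2.
    return statistics.mean(scores), statistics.variance(scores)

# Hypothetical scores standing in for repeated benchmark runs.
fake_scores = iter([0.8, 0.9, 1.0])
mean, var = error_bars(lambda: next(fake_scores), n=3)
print(round(mean, 3), round(var, 3))  # 0.9 0.01
```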

autopandas.utils.datasets module¶

autopandas.utils.encoding module¶

autopandas.utils.encoding.count(data, column, mapping=None, probability=False, return_param=False)[source]¶

Performs frequency encoding.

Categories are replaced by their number of occurrences. Soon: possibility of returning probabilities instead of counts (normalization).

Parameters:
* data – Data
* column – Column to encode
* mapping – Dictionary {category : value}
* probability – If True, return probability instead of frequency
* return_param – If True, the mapping is returned
Returns:	Encoded data
Return type:	pd.DataFrame
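
As a rough illustration of count encoding with a reusable mapping, here is a minimal standalone sketch on a plain list. `count_encode` is a hypothetical helper, not the autopandas function, and it always returns the mapping rather than gating it behind `return_param`.

```python
from collections import Counter

def count_encode(values, mapping=None, probability=False):
    """Replace each category by its count (or its relative frequency)."""
    if mapping is None:
        counts = Counter(values)
        total = len(values)
        mapping = {k: (c / total if probability else c)
                   for k, c in counts.items()}
    # Categories absent from the mapping raise KeyError in this sketch.
    return [mapping[v] for v in values], mapping

encoded, mapping = count_encode(["a", "b", "a", "a"])
print(encoded)  # [3, 1, 3, 3]

# Reuse the fitted mapping on new data, e.g. a test set.
print(count_encode(["b", "a"], mapping=mapping)[0])  # [1, 3]
```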

autopandas.utils.encoding.frequency(columns, probability=False)[source]¶

Warning: takes only column(s), not a DataFrame.

Frequency encoding: converts pandas Series to a frequency/probability distribution.

If there are several series, the outputs will have the same format.

Example

C1: [‘b’, ‘a’, ‘a’, ‘b’, ‘b’]
C2: [‘b’, ‘b’, ‘b’, ‘c’, ‘b’]

f1: {‘a’: 2, ‘b’: 3, ‘c’: 0}
f2: {‘a’: 0, ‘b’: 4, ‘c’: 1}

Output: [[2, 3, 0], [0, 4, 1]] (with probability = False)

Parameters:
* columns – Pandas series (one or several) to encode.
* probability – True for probabilities, False for frequencies.
Returns:	Frequency/probability distribution.
Return type:	list
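
The C1/C2 example above can be reproduced with a short standalone sketch. It assumes the shared category order is the sorted union of all values, which matches the example but is not confirmed for the library; `frequency_table` is a hypothetical name.

```python
from collections import Counter

def frequency_table(columns, probability=False):
    """Count each category per column over a shared, sorted category list."""
    categories = sorted(set().union(*columns))
    result = []
    for col in columns:
        counts = Counter(col)  # missing categories count as 0
        row = [counts[c] for c in categories]
        if probability:
            total = sum(row)
            row = [v / total if total else 0.0 for v in row]
        result.append(row)
    return result

c1 = ["b", "a", "a", "b", "b"]
c2 = ["b", "b", "b", "c", "b"]
print(frequency_table([c1, c2]))  # [[2, 3, 0], [0, 4, 1]]
```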

autopandas.utils.encoding.label(data, column)[source]¶

Performs label encoding.

Example

Color: [‘blue’, ‘green’, ‘blue’, ‘pink’] is encoded as Color: [1, 2, 1, 3]

Parameters:
* data – Data
* column – Column to encode
Returns:	Encoded data
Return type:	pd.DataFrame
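
A minimal sketch of label encoding that reproduces the Color example. Numbering categories from 1 in order of first appearance is an assumption made to match that example; the library may order or number categories differently. `label_encode` is a hypothetical name.

```python
def label_encode(values):
    """Map each category to an integer, numbered from 1 by first appearance."""
    mapping = {}
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping) + 1
    return [mapping[v] for v in values], mapping

codes, mapping = label_encode(["blue", "green", "blue", "pink"])
print(codes)  # [1, 2, 1, 3]
```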

autopandas.utils.encoding.likelihood(data, column, mapping=None, return_param=False)[source]¶

Performs likelihood encoding.

Parameters:
* data – Data
* column – Column to encode
* mapping – Dictionary {category : value}
* return_param – If True, the mapping is returned
Returns:	Encoded data
Return type:	pd.DataFrame

autopandas.utils.encoding.none(data, column)[source]¶: Remove column from data.

autopandas.utils.encoding.normalize(l, normalization='probability')[source]¶

Return a normalized list.

Parameters:
* l – List to normalize.
* normalization – ‘probability’: values between 0 and 1 that sum to 1; or ‘min-max’: the minimum becomes 0 and the maximum becomes 1.
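
The two normalization modes can be illustrated with a small standalone sketch. `normalize_list` is a hypothetical name, and this version does not guard against a zero sum or a constant list, which would divide by zero.

```python
def normalize_list(values, normalization="probability"):
    """Normalize a list so it sums to 1, or so it spans [0, 1]."""
    if normalization == "probability":
        total = sum(values)
        return [v / total for v in values]
    if normalization == "min-max":
        lo, hi = min(values), max(values)
        return [(v - lo) / (hi - lo) for v in values]
    raise ValueError(f"Unknown normalization: {normalization}")

print(normalize_list([2, 3, 5]))             # [0.2, 0.3, 0.5]
print(normalize_list([2, 4, 6], "min-max"))  # [0.0, 0.5, 1.0]
```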

autopandas.utils.encoding.one_hot(data, column, rare=False, coeff=0.1)[source]¶

Performs one-hot encoding.

Example

Color: [‘black’, ‘white’, ‘white’] is encoded as Black: [1, 0, 0] White: [0, 1, 1]

Parameters:
* data – Data
* column – Column to encode
* rare – If True, rare categories are merged into one
* coeff – Coefficient defining rare values. A rare category occurs fewer times than (average number of occurrences * coefficient).
Returns:	Encoded data
Return type:	pd.DataFrame
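
The rare-category rule above can be sketched as follows, on a plain list. This is an illustration under assumptions: the merged category is labeled "rare" here (a hypothetical label), and the average number of occurrences is taken as the number of rows divided by the number of distinct categories.

```python
from collections import Counter

def one_hot(values, rare=False, coeff=0.1):
    """One-hot encode a list, optionally merging rare categories first."""
    counts = Counter(values)
    if rare and counts:
        # Threshold = average occurrences per category * coeff.
        threshold = coeff * len(values) / len(counts)
        values = [v if counts[v] >= threshold else "rare" for v in values]
    categories = sorted(set(values))
    return {c: [1 if v == c else 0 for v in values] for c in categories}

print(one_hot(["black", "white", "white"]))
# {'black': [1, 0, 0], 'white': [0, 1, 1]}
```

With `rare=True` and a larger `coeff`, categories whose count falls below the threshold end up sharing a single indicator column.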

autopandas.utils.encoding.target(data, column, target, mapping=None, return_param=False)[source]¶

Performs target encoding.

Parameters:
* data – Data
* column – Column to encode
* target – Target column name
* mapping – Dictionary {category : value}
* return_param – If True, the mapping is returned
Returns:	Encoded data
Return type:	pd.DataFrame
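
A minimal sketch of target encoding, assuming it means replacing each category with the mean of the target over the rows in that category. That is the common definition, but the statistic the library actually uses is not stated here. `target_encode` is a hypothetical name operating on plain lists.

```python
from collections import defaultdict

def target_encode(categories, targets, mapping=None):
    """Replace each category by the mean target value observed for it."""
    if mapping is None:
        sums, counts = defaultdict(float), defaultdict(int)
        for c, t in zip(categories, targets):
            sums[c] += t
            counts[c] += 1
        mapping = {c: sums[c] / counts[c] for c in sums}
    return [mapping[c] for c in categories], mapping

encoded, mapping = target_encode(["a", "b", "a", "b"], [1, 0, 0, 0])
print(encoded)  # [0.5, 0.0, 0.5, 0.0]
```

Returning the mapping makes it possible to fit on training data and apply the same values to a test set.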

autopandas.utils.imputation module¶

autopandas.utils.imputation.infer(data, column, model=None, return_param=False, fit=True)[source]¶

Replace missing values with values predicted by a model.

Parameters:
* data – AutoData data
* column – Column to impute
* model – Predictive model to train and use for imputation
* return_param – If True, returns a tuple (data, model) to store fitted model and apply it later
* fit – If True, fit the model
Returns:	Imputed data
Return type:	autopandas.AutoData

autopandas.utils.imputation.mean(data, column)[source]¶

Replace missing values by the mean of the column.

Parameters:
* data – AutoData data
* column – Column to impute
Returns:	Imputed data
Return type:	autopandas.AutoData

autopandas.utils.imputation.median(data, column)[source]¶

Replace missing values by the median of the column.

Parameters:
* data – AutoData data
* column – Column to impute
Returns:	Imputed data
Return type:	autopandas.AutoData

autopandas.utils.imputation.most(data, column)[source]¶

Replace missing values by the most frequent value of the column.

Parameters:
* data – AutoData data
* column – Column to impute
Returns:	Imputed data
Return type:	autopandas.AutoData
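
The mean, median and most-frequent strategies differ only in the fill value. Here is a standalone sketch on a plain list, assuming missing values are represented by `None` (pandas would typically use NaN). `impute` and its `strategy` argument are hypothetical names.

```python
import statistics

def impute(values, strategy="mean"):
    """Fill None entries with the mean, median or mode of observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "most":
        fill = statistics.mode(observed)  # also works for non-numeric data
    else:
        raise ValueError(f"Unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]

print(impute([1, None, 3]))                 # [1, 2, 3]
print(impute(["x", None, "x", "y"], "most"))  # ['x', 'x', 'x', 'y']
```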

autopandas.utils.imputation.remove(data, columns)[source]¶

Remove columns containing missing values.

Parameters:
* data – AutoData data
* columns – Column(s) to remove
Returns:	Imputed data
Return type:	autopandas.AutoData

autopandas.utils.metric module¶

autopandas.utils.metric.acc_stat(solution, prediction)[source]¶: Return accuracy statistics TN, FP, TP, FN. Assumes that solution and prediction are binary 0/1 vectors.

autopandas.utils.metric.bac_metric(solution, prediction)[source]¶: Compute the balanced accuracy for binary classification.
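
For reference, balanced accuracy for binary labels is the mean of sensitivity and specificity, both of which follow from the TN, FP, TP, FN counts. The sketch below uses that standard definition on plain lists; `confusion_counts` and `balanced_accuracy` are hypothetical names, and the library may treat edge cases (such as a class with no samples) differently.

```python
def confusion_counts(solution, prediction):
    """Return (TN, FP, TP, FN) for binary 0/1 vectors."""
    pairs = list(zip(solution, prediction))
    tn = sum(1 for s, p in pairs if s == 0 and p == 0)
    fp = sum(1 for s, p in pairs if s == 0 and p == 1)
    tp = sum(1 for s, p in pairs if s == 1 and p == 1)
    fn = sum(1 for s, p in pairs if s == 1 and p == 0)
    return tn, fp, tp, fn

def balanced_accuracy(solution, prediction):
    """Mean of sensitivity and specificity; both classes must be present."""
    tn, fp, tp, fn = confusion_counts(solution, prediction)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return 0.5 * (sensitivity + specificity)

y_true = [1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1]
print(confusion_counts(y_true, y_pred))   # (3, 1, 1, 1)
print(balanced_accuracy(y_true, y_pred))  # 0.625
```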

autopandas.utils.metric.discriminant(data1, data2, model=None, metric=None, name1='Dataset 1', name2='Dataset 2', same_size=False, verbose=False)[source]¶

Return the scores of a classifier trained to differentiate data1 and data2.

If the distributions are similar and the model cannot tell them apart, the score will be ~0.5 (depending on the metric, of course).

Parameters:
* model – The classifier. It has to have fit(X,y) and score(X,y) methods. Logistic regression by default.
* metric – The scoring metric. Accuracy by default.
* same_size – If True, normalize datasets to same size before computation.
Returns:	Classification report (precision, recall, f1-score).
Return type:	str

autopandas.utils.metric.distance(x, y, axis=None, norm='euclidean', method=None)[source]¶

Compute the distance between x and y (data points).

Default behaviour: flatten multi-dimensional arrays.

Parameters:
* x – Array-like, first point
* y – Array-like, second point
* axis – Axis of x along which to compute the vector norms.
* norm – ‘l0’, ‘manhattan’, ‘euclidean’, ‘minimum’ or ‘maximum’
* method – Alias for norm parameter.
Returns:	Distance value
Return type:	float
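
To make the norm names concrete, here is a standalone sketch on flat sequences. Two readings are assumptions: ‘l0’ is taken as the number of coordinates that differ, and ‘minimum’/‘maximum’ as the smallest/largest absolute coordinate difference. `point_distance` is a hypothetical name, and the `axis` parameter is not covered.

```python
import math

def point_distance(x, y, norm="euclidean"):
    """Distance between two equal-length sequences under the given norm."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if norm == "l0":
        return float(sum(d != 0 for d in diffs))
    if norm == "manhattan":
        return float(sum(diffs))
    if norm == "euclidean":
        return math.sqrt(sum(d * d for d in diffs))
    if norm == "minimum":
        return float(min(diffs))
    if norm == "maximum":
        return float(max(diffs))
    raise ValueError(f"Unknown norm: {norm}")

print(point_distance([0, 0], [3, 4]))               # 5.0
print(point_distance([0, 0], [3, 4], "manhattan"))  # 7.0
```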

autopandas.utils.metric.distance_matrix(data1, data2, distance_func=None)[source]¶

Compute the matrix of distances between each pair of points (m_ij is the distance between points i and j). TODO: parallelization.

Parameters:
* data1 – Distribution
* data2 – Distribution
* distance_func – Distance metric function to use to compare data points. Euclidean distance by default.

autopandas.utils.metric.nn_discrepancy(X1, X2)[source]¶: Use the 1-nearest-neighbor method to determine the discrepancy between X1 and X2. If X1 and X2 are very different, it is easy to classify them, so bac > 0.5. Otherwise, if they are similar, bac ~ 0.5.

autopandas.utils.metric.nnaa(data_s, data_t, distance_func=None, detailed_results=False)[source]¶

Compute the nearest neighbors adversarial accuracy between data_s and data_t. This is the proportion of points in data_s for which the nearest neighbor is in data_s (and not in data_t). It can also be seen as the binary classification score of a 1NN trying to tell whether a point is from data_s or data_t, in a leave-one-out setting. If data_s and data_t follow the same distribution, the score should be near 0.5:
* nnaa > 0.5 -> underfitting
* nnaa ~ 0.5 -> the desired outcome
* nnaa < 0.5 -> overfitting

From “Privacy Preserving Synthetic Health Data” by Andrew Yale et al.

Parameters:
* data_s – 2D distribution (s for “source”).
* data_t – 2D distribution (t for “target”).
* distance_func – Distance metric function to use to compare data points. Euclidean distance by default.
* detailed_results – If True, return score but also score for TS and ST (the 2 components of the score).
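
The leave-one-out 1NN idea can be sketched directly. This standalone version follows the two-component form from the cited paper, averaging the share of points whose nearest same-set neighbor is closer than their nearest other-set neighbor, in both directions. It uses brute-force Euclidean distance, needs at least two points per set, and `nn_adversarial_accuracy` is a hypothetical name rather than the library function.

```python
import math

def _nearest(point, others):
    """Distance from point to its nearest neighbor among others."""
    return min(math.dist(point, o) for o in others)

def nn_adversarial_accuracy(data_s, data_t):
    def component(a, b):
        hits = 0
        for i, p in enumerate(a):
            same = _nearest(p, a[:i] + a[i + 1:])  # leave-one-out
            other = _nearest(p, b)
            hits += other > same  # nearest neighbor is in the same set
        return hits / len(a)
    return 0.5 * (component(data_s, data_t) + component(data_t, data_s))

s = [(0, 0), (0, 1), (1, 0)]
t = [(10, 10), (10, 11), (11, 10)]
print(nn_adversarial_accuracy(s, t))  # 1.0: the sets are trivially separable
print(nn_adversarial_accuracy(s, s))  # 0.0: an exact copy of the source
```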

autopandas.utils.normalization module¶

autopandas.utils.normalization.min_max(data, column, mini=None, maxi=None, return_param=False)[source]¶

Performs min-max normalization.

Parameters:
* data – Data
* column – Column to normalize
* mini – Minimum, computed if not specified
* maxi – Maximum, computed if not specified
* return_param – If True, mini and maxi are returned
Returns:	Normalized data
Return type:	autopandas.AutoData

autopandas.utils.normalization.standard(data, column, mean=None, std=None, return_param=False)[source]¶

Performs standard normalization.

Parameters:
* data – Data
* column – Column to normalize
* mean – Mean, computed if not specified
* std – Standard deviation, computed if not specified
* return_param – If True, mean and std are returned
Returns:	Normalized data
Return type:	autopandas.AutoData
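
Both normalizations accept precomputed parameters so that statistics fitted on one dataset can be applied to another. A standalone sketch of that pattern on plain lists follows; `min_max_scale` and `standard_scale` are hypothetical names, they always return the parameters, and the use of the population standard deviation is an assumption.

```python
import statistics

def min_max_scale(values, mini=None, maxi=None):
    """Scale to [0, 1]; compute mini/maxi from values if not given."""
    mini = min(values) if mini is None else mini
    maxi = max(values) if maxi is None else maxi
    return [(v - mini) / (maxi - mini) for v in values], (mini, maxi)

def standard_scale(values, mean=None, std=None):
    """Center and scale; compute mean/std from values if not given."""
    mean = statistics.mean(values) if mean is None else mean
    std = statistics.pstdev(values) if std is None else std
    return [(v - mean) / std for v in values], (mean, std)

train_scaled, (mini, maxi) = min_max_scale([0, 5, 10])
print(train_scaled)  # [0.0, 0.5, 1.0]

# Apply the training parameters to new data; values may fall outside [0, 1].
print(min_max_scale([20], mini, maxi)[0])  # [2.0]
```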

autopandas.utils.reduction module¶

autopandas.utils.reduction.feature_hashing(data, key=None, n_features=10, **kwargs)[source]¶

Feature hashing.

Parameters:	n_features – Wanted number of features after feature hashing.

autopandas.utils.reduction.lda(data, key=None, verbose=False, **kwargs)[source]¶

Compute Linear Discriminant Analysis. Use kwargs for additional LDA parameters (cf. sklearn doc)

Parameters:
* key – Indexes key to select data.
* verbose – Display additional information during run.
Returns:	Transformed data
Return type:	AutoData

autopandas.utils.reduction.pca(data, key=None, return_param=False, verbose=False, model=None, **kwargs)[source]¶

Compute Principal Components Analysis. Use kwargs for additional PCA parameters (cf. sklearn doc).

Parameters:
* key – Indexes key to select data.
* return_param – If True, returns a tuple (X, pca) to store PCA parameters and apply them later.
* model – Use this argument to pass a trained PCA model.
* verbose – Display additional information during run.
Return type:	autopandas.AutoData
Returns:	Transformed data

autopandas.utils.reduction.tsne(data, key=None, verbose=False, **kwargs)[source]¶

Compute T-SNE. Use kwargs for additional T-SNE parameters (cf. sklearn doc)

Parameters:
* key – Indexes key to select data.
* verbose – Display additional information during run.
Returns:	Transformed data
Return type:	AutoData

autopandas.utils.sdv module¶

autopandas.utils.sdv.categorical(column)[source]¶: Convert a categorical column to continuous.

autopandas.utils.sdv.decode(new_data, data, limits, min_max)[source]¶

Decode the data from SDV format.

Parameters:
* new_data – Data in SDV format
* data – Original data
* limits – Limits returned by sdv.encode
* min_max – Min-max returned by sdv.encode

autopandas.utils.sdv.encode(data)[source]¶: Encode the data into SDV format.

autopandas.utils.sdv.numeric(column)[source]¶: Normalize a numerical column.

autopandas.utils.sdv.undo_categorical(column, lim)[source]¶: Convert a continuous column back to categorical, using the limits lim.

autopandas.utils.sdv.undo_numeric(column, min_column, max_column)[source]¶: Undo the normalization of a numerical column, using min_column and max_column.

autopandas.utils.utilities module¶

autopandas.utils.visualization module¶

autopandas.utils.visualization.compare_marginals(data1, data2, key=None, method='all', target=None, save=None, name1='dataset 1', name2='dataset2')[source]¶

Plot the metric (e.g. mean) for each variable from data1 and data2. If the distributions are similar, the points will follow the y=x line. The metric can be the mean, the standard deviation or the correlation with the target. data1 and data2 must have the same number of features.

Parameters:
* method – ‘mean’, ‘std’, ‘corr’ or ‘all’
* target – Column name for the target for correlation method
* save – Path to save the figure (doesn’t save if ‘save’ is None).

autopandas.utils.visualization.correlation(data, key=None, save=None, **kwargs)[source]¶

Plot correlation matrix.

Parameters:
* key – Key for subset selection (e.g. ‘X_train’ or ‘categorical’)
* save – Path/filename to save figure (if not None)

autopandas.utils.visualization.heatmap(data, key=None, save=None, **kwargs)[source]¶

Plot data heatmap.

Parameters:
* key – Key for subset selection (e.g. ‘X_train’ or ‘categorical’)
* save – Path/filename to save figure (if not None)

autopandas.utils.visualization.hierarchical_clustering(X, row_method='average', column_method='single', row_metric='euclidean', column_metric='euclidean', color_gradient='coolwarm')[source]¶

Show heatmap hierarchical clustering of X. The code is based in large part on the prototype methods:
http://old.nabble.com/How-to-plot-heatmap-with-matplotlib–td32534593.html
http://stackoverflow.com/questions/7664826/how-to-get-flat-clustering-corresponding-to-color-clusters-in-the-dendrogram-cre

X is an (m by n) np.ndarray, m observations, n genes.

autopandas.utils.visualization.pairplot(data, key=None, max_features=12, force=False, save=None, **kwargs)[source]¶

Plot pairwise relationships between features.

Parameters:
* key – Key for subset selection (e.g. ‘X_train’ or ‘categorical’)
* max_features – Max number of features to pairplot.
* force – If True, plot the graphs even if the number of features is greater than max_features.
* save – Path/filename to save figure (if not None)

autopandas.utils.visualization.plot(data, key=None, ad=None, c=None, save=None, names=None, cmap='viridis', **kwargs)[source]¶

Plot AutoData frame.
* Distribution plot for 1D data
* Scatter plot for 2D data
* Heatmap for >2D data

For scatter plot, coloration is by default the class if possible, or can be defined with c parameter.

Parameters:
* key – Key for subset selection (e.g. ‘X_train’ or ‘categorical’)
* ad – AutoData frame to plot in superposition
* c – Sequence of color specifications of length n (e.g. data.get_data(‘y’))
* save – Path/filename to save figure (if not None)