autopandas.utils package¶
Submodules¶
autopandas.utils.automl module¶
autopandas.utils.benchmark module¶
-
autopandas.utils.benchmark.score(data, model=None, metric=None, method='baseline', fit=True, test=None, average='weighted', verbose=False)[source]¶ Benchmark, a.k.a. Utility.
Return the metric score of a model trained and tested on data. If a test set is defined (‘test’ parameter), the model is trained on ‘data’ and tested on ‘test’.
Parameters: - model – Model to fit and test on data.
- metric – scoring function.
- method – ‘baseline’ or ‘auto’. Useful only if model is None.
- fit – If True, fit the model.
- test – Test is an optional DataFrame to use as the test set.
- average – Method for averaging the multi-class One-vs-One metrics scheme.
- verbose – If True, prints model information, classification report and metric function.
Return type: float
Returns: Metric score of the model trained and tested on data.
-
autopandas.utils.benchmark.score_error_bars(data, n=10, model=None, metric=None, method='baseline', fit=True, test=None, verbose=False)[source]¶ Run the score method several times to compute error bars. The parameters are the same as for the score method. TODO: optimize computation. TODO: cross-validation.
Returns: Mean and variance.
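The repeat-and-aggregate idea behind error bars can be sketched in plain Python. This is a minimal illustration, not the autopandas implementation: `repeat_score` and the noisy scorer are hypothetical names, and the choice of population variance is an assumption.

```python
import random
import statistics


def repeat_score(score_fn, n=10):
    """Call score_fn n times; return (mean, variance) of the scores."""
    scores = [score_fn() for _ in range(n)]
    # Population variance is an assumption; the library may use another estimator.
    return statistics.mean(scores), statistics.pvariance(scores)


# Hypothetical noisy scorer standing in for one train/test run.
rng = random.Random(0)
mean, variance = repeat_score(lambda: 0.8 + rng.uniform(-0.05, 0.05), n=10)
print(f"score = {mean:.3f} +/- {variance ** 0.5:.3f}")
```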
autopandas.utils.datasets module¶
autopandas.utils.encoding module¶
-
autopandas.utils.encoding.count(data, column, mapping=None, probability=False, return_param=False)[source]¶ Performs frequency encoding.
Categories are replaced by their number of occurrences. Soon: possibility of using probabilities instead of counts (normalization).
Parameters: - data – Data
- column – Column to encode
- mapping – Dictionary {category : value}
- probability – If True, return probability instead of frequency
- return_param – If True, the mapping is returned
Returns: Encoded data
Return type: pd.DataFrame
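Frequency encoding with a reusable mapping might look like the sketch below. It works on a plain list rather than a DataFrame, `count_encode` is a hypothetical name, and mapping unseen categories to 0 is an assumption rather than documented behaviour.

```python
from collections import Counter


def count_encode(values, mapping=None, probability=False):
    """Replace each category by its count (or relative frequency)."""
    if mapping is None:
        counts = Counter(values)
        total = len(values)
        mapping = {k: (c / total if probability else c) for k, c in counts.items()}
    # Assumption: categories missing from the mapping are encoded as 0.
    return [mapping.get(v, 0) for v in values], mapping


encoded, mapping = count_encode(["b", "a", "a", "b", "b"])
print(encoded)  # [3, 2, 2, 3, 3]

# Reuse the fitted mapping on new data, as return_param=True allows.
new_encoded, _ = count_encode(["a", "b", "z"], mapping=mapping)
print(new_encoded)  # [2, 3, 0]
```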
-
autopandas.utils.encoding.frequency(columns, probability=False)[source]¶ Frequency encoding: convert pandas Series to frequency/probability distributions.
Warning: takes only column(s), not a DataFrame.
If there are several series, the outputs will have the same format.
Example
C1: [‘b’, ‘a’, ‘a’, ‘b’, ‘b’] C2: [‘b’, ‘b’, ‘b’, ‘c’, ‘b’]
f1: {‘a’: 2, ‘b’: 3, ‘c’: 0} f2: {‘a’: 0, ‘b’: 4, ‘c’: 1}
Output: [[2, 3, 0], [0, 4, 1]] (with probability = False)
Parameters: - probability – True for probabilities, False for frequencies.
Returns: Frequency/probability distribution.
Return type: list
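The example above can be reproduced with a short sketch that counts each column over the union of all categories, so every output row has the same layout. `aligned_frequencies` is a hypothetical name, and sorting the categories alphabetically is an assumption.

```python
from collections import Counter


def aligned_frequencies(columns, probability=False):
    """Count every column over the shared, sorted set of categories."""
    categories = sorted(set().union(*columns))
    result = []
    for col in columns:
        counts = Counter(col)
        row = [counts[c] for c in categories]  # missing categories count as 0
        if probability:
            row = [x / len(col) for x in row]
        result.append(row)
    return result


c1 = ["b", "a", "a", "b", "b"]
c2 = ["b", "b", "b", "c", "b"]
print(aligned_frequencies([c1, c2]))  # [[2, 3, 0], [0, 4, 1]]
```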
-
autopandas.utils.encoding.label(data, column)[source]¶ Performs label encoding.
Example
Color: [‘blue’, ‘green’, ‘blue’, ‘pink’] is encoded by Color: [1, 2, 1, 3]
Parameters: - data – Data
- column – Column to encode
Returns: Encoded data
Return type: pd.DataFrame
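A minimal sketch of label encoding on a plain list follows. Numbering categories from 1 in order of first appearance is an assumption chosen to match the example above; the library may assign codes differently. `label_encode` is a hypothetical name.

```python
def label_encode(values):
    """Map each distinct category to an integer code."""
    codes = {}
    for v in values:
        # Assumption: codes start at 1 and follow first-appearance order.
        codes.setdefault(v, len(codes) + 1)
    return [codes[v] for v in values]


print(label_encode(["blue", "green", "blue", "pink"]))  # [1, 2, 1, 3]
```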
-
autopandas.utils.encoding.likelihood(data, column, mapping=None, return_param=False)[source]¶ Performs likelihood encoding.
Parameters: - data – Data
- column – Column to encode
- mapping – Dictionary {category : value}
- return_param – If True, the mapping is returned
Returns: Encoded data
Return type: pd.DataFrame
-
autopandas.utils.encoding.normalize(l, normalization='probability')[source]¶ Return a normalized list
Parameters: normalization – ‘probability’: values between 0 and 1 that sum to 1, or ‘min-max’: the minimum becomes 0 and the maximum becomes 1.
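The two normalization modes can be sketched as follows. This is an illustration of the arithmetic only; `normalize_list` is a hypothetical name, and the handling of invalid mode names is an assumption.

```python
def normalize_list(values, normalization="probability"):
    """Normalize a list of numbers with one of two schemes."""
    if normalization == "probability":
        total = sum(values)
        return [v / total for v in values]  # results sum to 1
    if normalization == "min-max":
        lo, hi = min(values), max(values)
        return [(v - lo) / (hi - lo) for v in values]  # min -> 0, max -> 1
    raise ValueError(f"unknown normalization: {normalization!r}")


print(normalize_list([2, 3, 5]))             # probability mode
print(normalize_list([2, 3, 5], "min-max"))  # min-max mode
```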
-
autopandas.utils.encoding.one_hot(data, column, rare=False, coeff=0.1)[source]¶ Performs one-hot encoding.
Example
Color: [‘black’, ‘white’, ‘white’] is encoded by Black: [1, 0, 0] White: [0, 1, 1]
Parameters: - data – Data
- column – Column to encode
- rare – If True, rare categories are merged into one
- coeff – Coefficient defining rare values. A rare category occurs fewer than (average number of occurrences * coefficient) times.
Returns: Encoded data
Return type: pd.DataFrame
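The rare-category rule described above can be sketched on a plain list. This is an illustration under assumptions: `one_hot_sketch` and the merged column name "rare" are hypothetical, and the output is a dict of columns rather than a DataFrame.

```python
from collections import Counter


def one_hot_sketch(values, rare=False, coeff=0.1):
    """One-hot encode a list; optionally merge rare categories first."""
    counts = Counter(values)
    if rare:
        # Threshold: average number of occurrences per category * coeff.
        threshold = coeff * sum(counts.values()) / len(counts)
        values = [v if counts[v] >= threshold else "rare" for v in values]
    categories = sorted(set(values))
    return {c: [int(v == c) for v in values] for c in categories}


print(one_hot_sketch(["black", "white", "white"]))
# {'black': [1, 0, 0], 'white': [0, 1, 1]}
```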
-
autopandas.utils.encoding.target(data, column, target, mapping=None, return_param=False)[source]¶ Performs target encoding.
Parameters: - data – Data
- column – Column to encode
- target – Target column name
- mapping – Dictionary {category : value}
- return_param – If True, the mapping is returned
Returns: Encoded data
Return type: pd.DataFrame
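A common form of target encoding replaces each category by the mean of the target over that category. The sketch below assumes that definition, which may differ from what the library computes; `target_encode` is a hypothetical name operating on plain lists.

```python
from collections import defaultdict


def target_encode(categories, targets, mapping=None):
    """Replace each category by the mean target value of that category."""
    if mapping is None:
        sums, counts = defaultdict(float), defaultdict(int)
        for c, t in zip(categories, targets):
            sums[c] += t
            counts[c] += 1
        mapping = {c: sums[c] / counts[c] for c in sums}
    return [mapping[c] for c in categories], mapping


encoded, mapping = target_encode(["a", "a", "b", "b", "b"], [1, 0, 1, 1, 0])
print(mapping)

# Reuse the mapping fitted on training data to encode new rows.
test_encoded, _ = target_encode(["b", "a"], None, mapping=mapping)
```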
autopandas.utils.imputation module¶
-
autopandas.utils.imputation.infer(data, column, model=None, return_param=False, fit=True)[source]¶ Replace missing values by values predicted by a model.
Parameters: - data – AutoData data
- column – Column to impute
- model – Predictive model to train and use for imputation
- return_param – If True, returns a tuple (data, model) to store fitted model and apply it later
- fit – If True, fit the model
Returns: Imputed data
Return type:
-
autopandas.utils.imputation.mean(data, column)[source]¶ Replace missing values by the mean of the column.
Parameters: - data – AutoData data
- column – Column to impute
Returns: Imputed data
Return type:
-
autopandas.utils.imputation.median(data, column)[source]¶ Replace missing values by the median of the column.
Parameters: - data – AutoData data
- column – Column to impute
Returns: Imputed data
Return type:
-
autopandas.utils.imputation.most(data, column)[source]¶ Replace missing values by the most frequent value of the column.
Parameters: - data – AutoData data
- column – Column to impute
Returns: Imputed data
Return type:
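The mean, median and most-frequent strategies differ only in the statistic used as the fill value. The sketch below shows this on a plain list, assuming missing values are represented as None (pandas would use NaN); `impute` is a hypothetical name.

```python
import statistics

# Each strategy maps to the statistic computed on the observed values.
STRATEGIES = {
    "mean": statistics.mean,
    "median": statistics.median,
    "most": statistics.mode,
}


def impute(values, strategy="mean"):
    """Fill None entries with a statistic of the non-missing entries."""
    observed = [v for v in values if v is not None]
    fill = STRATEGIES[strategy](observed)
    return [fill if v is None else v for v in values]


print(impute([1, None, 3, 8], "mean"))      # [1, 4, 3, 8]
print(impute([1, None, 3, 8], "median"))    # [1, 3, 3, 8]
print(impute(["x", None, "y", "x"], "most"))  # ['x', 'x', 'y', 'x']
```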
autopandas.utils.metric module¶
-
autopandas.utils.metric.acc_stat(solution, prediction)[source]¶ Return accuracy statistics TN, FP, TP, FN. Assumes that solution and prediction are binary 0/1 vectors.
-
autopandas.utils.metric.bac_metric(solution, prediction)[source]¶ Compute the balanced accuracy for binary classification.
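Balanced accuracy for binary labels is the mean of sensitivity and specificity, both derived from the four confusion counts. The sketch below illustrates that standard definition; `confusion_counts` and `balanced_accuracy` are hypothetical names, not the library's code, and both classes are assumed to be present in the solution.

```python
def confusion_counts(solution, prediction):
    """Return (TN, FP, TP, FN) for binary 0/1 vectors."""
    pairs = list(zip(solution, prediction))
    tn = sum(s == 0 and p == 0 for s, p in pairs)
    fp = sum(s == 0 and p == 1 for s, p in pairs)
    tp = sum(s == 1 and p == 1 for s, p in pairs)
    fn = sum(s == 1 and p == 0 for s, p in pairs)
    return tn, fp, tp, fn


def balanced_accuracy(solution, prediction):
    """Mean of the true positive rate and the true negative rate."""
    tn, fp, tp, fn = confusion_counts(solution, prediction)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


print(confusion_counts([1, 1, 1, 0], [1, 1, 0, 0]))  # (1, 0, 2, 1)
print(balanced_accuracy([1, 1, 1, 0], [1, 1, 0, 0]))
```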
-
autopandas.utils.metric.discriminant(data1, data2, model=None, metric=None, name1='Dataset 1', name2='Dataset 2', same_size=False, verbose=False)[source]¶ Return the scores of a classifier trained to differentiate data1 and data2.
If the distributions are similar and the model can’t distinguish then the score will be ~ 0.5 (depending on the metric of course).
Parameters: - model – The classifier. It has to have fit(X,y) and score(X,y) methods. Logistic regression by default.
- metric – The scoring metric. Accuracy by default.
- same_size – If True, normalize datasets to same size before computation.
Returns: Classification report (precision, recall, f1-score).
Return type: str
-
autopandas.utils.metric.distance(x, y, axis=None, norm='euclidean', method=None)[source]¶ Compute the distance between x and y (data points).
Default behaviour: flatten multi-dimensional arrays.
Parameters: - x – Array-like, first point
- y – Array-like, second point
- axis – Axis of x along which to compute the vector norms.
- norm – ‘l0’, ‘manhattan’, ‘euclidean’, ‘minimum’ or ‘maximum’
- method – Alias for norm parameter.
Returns: Distance value
Return type: float
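The norm names listed above can be illustrated on flat sequences. The interpretations here are assumptions, in particular that ‘l0’ counts non-zero coordinate differences and that ‘minimum’ and ‘maximum’ take the smallest and largest absolute difference; `point_distance` is a hypothetical name.

```python
import math


def point_distance(x, y, norm="euclidean"):
    """Distance between two equal-length flat sequences."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if norm == "l0":
        return sum(d != 0 for d in diffs)  # number of differing coordinates
    if norm == "manhattan":
        return sum(diffs)
    if norm == "euclidean":
        return math.sqrt(sum(d * d for d in diffs))
    if norm == "minimum":
        return min(diffs)
    if norm == "maximum":
        return max(diffs)
    raise ValueError(f"unknown norm: {norm!r}")


for name in ("l0", "manhattan", "euclidean", "minimum", "maximum"):
    print(name, point_distance([0, 0], [3, 4], norm=name))
```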
-
autopandas.utils.metric.distance_matrix(data1, data2, distance_func=None)[source]¶ Compute the matrix of distances between each pair of points (m_ij is the distance between i and j). TODO: parallelization.
Parameters: - data1 – Distribution
- data2 – Distribution
- distance_func – Distance metric function to use to compare data points. Euclidean distance by default.
-
autopandas.utils.metric.nn_discrepancy(X1, X2)[source]¶ Use 1 nearest neighbor method to determine discrepancy between X1 and X2. If X1 and X2 are very different, it is easy to classify them thus bac > 0.5. Otherwise, if they are similar, bac ~ 0.5.
-
autopandas.utils.metric.nnaa(data_s, data_t, distance_func=None, detailed_results=False)[source]¶ Compute nearest neighbors adversarial accuracy between data_s and data_t. This is the proportion of points in data_s for which the nearest neighbor is in data_s (and not in data_t). It can also be seen as the binary classification score of a 1NN trying to tell if a point is from data_s or data_t, in a leave-one-out setting. If data_s and data_t follow the same distribution, the score should be near 0.5:
* nnaa > 0.5 -> underfitting
* nnaa ~ 0.5 -> good match
* nnaa < 0.5 -> overfitting
From “Privacy Preserving Synthetic Health Data” by Andrew Yale et al.
Parameters: - data_s – 2D distribution (s for “source”).
- data_t – 2D distribution (t for “target”).
- distance_func – Distance metric function to use to compare data points. Euclidean distance by default.
- detailed_results – If True, return the score along with the scores for TS and ST (the 2 components of the score).
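The leave-one-out 1NN view of the score can be sketched in pure Python. This follows the description above with Euclidean distance and averages the two directions (S against T and T against S); it is an illustration, not the library's implementation, and `nnaa_sketch` is a hypothetical name. It requires Python 3.8+ for `math.dist`.

```python
import math


def _nearest(point, others):
    """Distance from point to its nearest neighbour among others."""
    return min(math.dist(point, o) for o in others)


def _own_set_rate(a, b):
    """Fraction of points in a whose nearest neighbour lies in a, not in b."""
    wins = 0
    for i, p in enumerate(a):
        rest = a[:i] + a[i + 1:]  # leave the point itself out
        wins += _nearest(p, b) > _nearest(p, rest)
    return wins / len(a)


def nnaa_sketch(data_s, data_t):
    """Average of the two directional rates (the ST and TS components)."""
    return 0.5 * (_own_set_rate(data_s, data_t) + _own_set_rate(data_t, data_s))


source = [(0, 0), (0, 1), (1, 0)]
far = [(10, 10), (10, 11), (11, 10)]
print(nnaa_sketch(source, far))           # well separated sets -> 1.0
print(nnaa_sketch(source, list(source)))  # exact copy -> 0.0
```

A score of 1.0 means the two sets are trivially distinguishable, while 0.0 means every point has an exact twin in the other set, which is the overfitting (copying) end of the scale.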
autopandas.utils.normalization module¶
-
autopandas.utils.normalization.min_max(data, column, mini=None, maxi=None, return_param=False)[source]¶ Performs min-max normalization.
Parameters: - data – Data
- column – Column to normalize
- mini – Minimum, computed if not specified
- maxi – Maximum, computed if not specified
- return_param – If True, mini and maxi are returned
Returns: Normalized data
Return type:
-
autopandas.utils.normalization.standard(data, column, mean=None, std=None, return_param=False)[source]¶ Performs standard normalization.
Parameters: - data – Data
- column – Column to normalize
- mean – Mean, computed if not specified
- std – Standard deviation, computed if not specified
- return_param – if True, mean and std are returned
Returns: Normalized data
Return type:
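Returning the fitted parameters makes it possible to apply the same transformation to new data, for instance scaling a test set with the statistics of the training set. The sketch below shows this pattern for both normalizations on plain lists; the function names are hypothetical, and using the population standard deviation is an assumption.

```python
import statistics


def min_max_sketch(values, mini=None, maxi=None):
    """Scale values to [0, 1]; compute the bounds if they are not given."""
    mini = min(values) if mini is None else mini
    maxi = max(values) if maxi is None else maxi
    return [(v - mini) / (maxi - mini) for v in values], (mini, maxi)


def standard_sketch(values, mean=None, std=None):
    """Center and scale values; compute mean/std if they are not given."""
    mean = statistics.mean(values) if mean is None else mean
    # Assumption: population standard deviation (ddof=0).
    std = statistics.pstdev(values) if std is None else std
    return [(v - mean) / std for v in values], (mean, std)


train_scaled, (mini, maxi) = min_max_sketch([2, 4, 6])
test_scaled, _ = min_max_sketch([8], mini=mini, maxi=maxi)  # reuse train bounds
print(train_scaled, test_scaled)  # [0.0, 0.5, 1.0] [1.5]

standardized, (mean, std) = standard_sketch([2, 4, 6])
print(standardized)
```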
autopandas.utils.reduction module¶
-
autopandas.utils.reduction.feature_hashing(data, key=None, n_features=10, **kwargs)[source]¶ Feature hashing.
Parameters: n_features – Wanted number of features after feature hashing.
-
autopandas.utils.reduction.lda(data, key=None, verbose=False, **kwargs)[source]¶ Compute Linear Discriminant Analysis. Use kwargs for additional LDA parameters (cf. sklearn doc)
Parameters: - key – Indexes key to select data.
- verbose – Display additional information during run.
Returns: Transformed data
Return type:
-
autopandas.utils.reduction.pca(data, key=None, return_param=False, verbose=False, model=None, **kwargs)[source]¶ Compute Principal Components Analysis. Use kwargs for additional PCA parameters (cf. sklearn doc).
Parameters: - key – Indexes key to select data.
- return_param – If True, returns a tuple (X, pca) to store PCA parameters and apply them later.
- model – Use this argument to pass a trained PCA model.
- verbose – Display additional information during run.
Returns: Transformed data
Return type:
autopandas.utils.sdv module¶
-
autopandas.utils.sdv.decode(new_data, data, limits, min_max)[source]¶ Decode the data from SDV format.
Parameters: - new_data – Data in SDV format
- data – Original data
- limits – Limits returned by sdv.encode
- min_max – Min-max returned by sdv.encode
autopandas.utils.utilities module¶
autopandas.utils.visualization module¶
-
autopandas.utils.visualization.compare_marginals(data1, data2, key=None, method='all', target=None, save=None, name1='dataset 1', name2='dataset2')[source]¶ Plot the metric (e.g. mean) for each variable from data1 and data2. If the distributions are similar, the points will follow the y=x line. The metric can be the mean, the standard deviation or the correlation with the target. data1 and data2 must have the same number of features.
Parameters: - method – ‘mean’, ‘std’, ‘corr’, ‘all’
- target – Column name for the target for correlation method
- save – Path to save the figure (doesn’t save if ‘save’ is None).
-
autopandas.utils.visualization.correlation(data, key=None, save=None, **kwargs)[source]¶ Plot correlation matrix.
Parameters: - key – Key for subset selection (e.g. ‘X_train’ or ‘categorical’)
- save – Path/filename to save figure (if not None)
-
autopandas.utils.visualization.heatmap(data, key=None, save=None, **kwargs)[source]¶ Plot data heatmap.
Parameters: - key – Key for subset selection (e.g. ‘X_train’ or ‘categorical’)
- save – Path/filename to save figure (if not None)
-
autopandas.utils.visualization.hierarchical_clustering(X, row_method='average', column_method='single', row_metric='euclidean', column_metric='euclidean', color_gradient='coolwarm')[source]¶ Show heatmap hierarchical clustering of X. The code is based in large part on the prototype methods: http://old.nabble.com/How-to-plot-heatmap-with-matplotlib–td32534593.html http://stackoverflow.com/questions/7664826/how-to-get-flat-clustering-corresponding-to-color-clusters-in-the-dendrogram-cre
X is an (m by n) np.ndarray, m observations, n genes.
-
autopandas.utils.visualization.pairplot(data, key=None, max_features=12, force=False, save=None, **kwargs)[source]¶ Plot pairwise relationships between features.
Parameters: - key – Key for subset selection (e.g. ‘X_train’ or ‘categorical’)
- max_features – Max number of features to pairplot.
- force – If True, plot the graphs even if the number of features is greater than max_features.
- save – Path/filename to save figure (if not None)
-
autopandas.utils.visualization.plot(data, key=None, ad=None, c=None, save=None, names=None, cmap='viridis', **kwargs)[source]¶ Plot AutoData frame.
* Distribution plot for 1D data
* Scatter plot for 2D data
* Heatmap for >2D data
For scatter plot, coloration is by default the class if possible, or can be defined with c parameter.
Parameters: - key – Key for subset selection (e.g. ‘X_train’ or ‘categorical’)
- ad – AutoData frame to plot in superposition
- c – Sequence of color specifications of length n (e.g. data.get_data(‘y’))
- save – Path/filename to save figure (if not None)