API Reference

saiph

saiph.fit(df: DataFrame, nf: int | None = None, col_weights: Dict[str, int | float] | None = None, sparse: bool = False) → Model

Fit a PCA, MCA or FAMD model on data, choosing which method to use based on the data.

Datetimes must be stored as numbers of seconds since epoch.

Parameters:

df – Data to project.
nf – Number of components to keep. default: None, which uses all columns.
col_weights – Weight assigned to each variable in the projection (more weight = more importance in the axes). default: np.ones(df.shape[1])

Returns:

The model for transforming new data.

Return type:

model
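The datetime requirement stated for fit can be met with a small conversion step before fitting. The following is a minimal sketch using only the standard library; the helper name `to_epoch_seconds` is hypothetical (not part of saiph), and treating naive datetimes as UTC is an assumption.

```python
from datetime import datetime, timezone


def to_epoch_seconds(value: datetime) -> float:
    """Convert a datetime to seconds since the Unix epoch (hypothetical helper)."""
    # Assumption: naive datetimes are interpreted as UTC.
    if value.tzinfo is None:
        value = value.replace(tzinfo=timezone.utc)
    return value.timestamp()


# Hypothetical column of datetimes, converted before being placed in the DataFrame.
signup_dates = [datetime(2021, 1, 1), datetime(2021, 1, 2, 12, 0)]
signup_seconds = [to_epoch_seconds(d) for d in signup_dates]
```

The resulting numeric column can then be treated like any other continuous variable.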

saiph.fit_transform(df: DataFrame, nf: int | None = None, col_weights: Dict[str, int | float] | None = None) → Tuple[DataFrame, Model]

Fit a PCA, MCA or FAMD model on data, choosing which method to use based on the data, and return the transformed data.

Datetimes must be stored as numbers of seconds since epoch.

Parameters:

df – Data to project.
nf – Number of components to keep. default: None, which uses all columns.
col_weights – Weight assigned to each variable in the projection (more weight = more importance in the axes). default: np.ones(df.shape[1])

Returns:

coord – The transformed data.
model – The model for transforming new data.

saiph.inverse_transform(coord: DataFrame, model: Model, *, use_approximate_inverse: bool = False, use_max_modalities: bool = True, seed: int | None = None) → DataFrame

Return original format dataframe from coordinates.

Parameters:

coord – coord of individuals to reverse transform
model – model used for projection
use_approximate_inverse – the matrix is not invertible when n_individuals < n_dimensions; set to True to compute an approximate (biased) inverse instead. default: False
use_max_modalities – for each variable, it assigns to the individual the modality with the highest proportion (True) or a random modality weighted by their proportion (False). default: True
seed – seed to fix randomness if use_max_modalities = False. default: None

Returns:

Coordinates transformed into the original space. Retains shape, encoding and structure.

Return type:

inverse
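The two behaviours selected by use_max_modalities can be illustrated on a single categorical variable. This is a sketch of the idea only, not saiph's implementation: `pick_modality` and the `color_props` data are hypothetical.

```python
import random
from typing import Dict, Optional


def pick_modality(
    proportions: Dict[str, float],
    use_max_modalities: bool = True,
    seed: Optional[int] = None,
) -> str:
    """Pick one modality for an individual (hypothetical helper, not saiph code)."""
    if use_max_modalities:
        # Deterministic: take the modality with the highest proportion.
        return max(proportions, key=proportions.get)
    # Random: draw one modality with probability proportional to its weight.
    rng = random.Random(seed)
    names = list(proportions)
    return rng.choices(names, weights=[proportions[n] for n in names], k=1)[0]


color_props = {"red": 0.2, "green": 0.7, "blue": 0.1}
best = pick_modality(color_props)
# Fixing the seed makes the random draw reproducible.
draw_a = pick_modality(color_props, use_max_modalities=False, seed=42)
draw_b = pick_modality(color_props, use_max_modalities=False, seed=42)
```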

saiph.stats(model: Model, df: DataFrame, explode: bool = False) → Model

Compute the contributions and cos2.

Parameters:

model – Model computed by fit.
df – original dataframe
explode – whether to split the contributions of each modality (True) or sum them as the contribution of the whole variable (False). Only valid for categorical variables.

Returns:

model populated with contributions and cos2.

Return type:

model
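The effect of explode can be pictured with a toy table of contributions. The sketch below is illustrative only: the data, the `contributions` helper and the `variable=modality` labels are assumptions, not saiph's output format.

```python
from typing import Dict

# Hypothetical contributions (in %) of each modality to one axis.
modality_contrib: Dict[str, Dict[str, float]] = {
    "color": {"red": 10.0, "green": 25.0},
    "size": {"small": 40.0, "large": 25.0},
}


def contributions(per_modality: Dict[str, Dict[str, float]], explode: bool = False) -> Dict[str, float]:
    """Return one entry per modality (explode=True) or per variable (explode=False)."""
    if explode:
        # Keep each modality separate; the "variable=modality" label is an assumption.
        return {
            f"{var}={mod}": value
            for var, mods in per_modality.items()
            for mod, value in mods.items()
        }
    # Sum the modalities to get the contribution of the whole variable.
    return {var: sum(mods.values()) for var, mods in per_modality.items()}


summed = contributions(modality_contrib)
exploded = contributions(modality_contrib, explode=True)
```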

saiph.transform(df: DataFrame, model: Model, *, sparse: bool = False) → DataFrame

Scale and project into the fitted numerical space.

Parameters:

df – DataFrame to transform.
model – Model computed by fit.

Returns:

Coordinates of the dataframe in the fitted space.

Return type:

coord

saiph.models

class saiph.models.Model(dummy_categorical: List[str], original_dtypes: pandas.core.series.Series, original_categorical: List[str], original_continuous: List[str], nf: int, column_weights: numpy.ndarray[Any, numpy.dtype[numpy.float64]], row_weights: numpy.ndarray[Any, numpy.dtype[numpy.float64]], explained_var: numpy.ndarray[Any, numpy.dtype[numpy.float64]], explained_var_ratio: numpy.ndarray[Any, numpy.dtype[numpy.float64]], variable_coord: pandas.core.frame.DataFrame, V: numpy.ndarray[Any, numpy.dtype[numpy.float64]], modalities_types: Dict[str, str], U: numpy.ndarray[Any, numpy.dtype[numpy.float64]], s: numpy.ndarray[Any, numpy.dtype[numpy.float64]] | None = None, mean: pandas.core.series.Series | None = None, std: pandas.core.series.Series | None = None, prop: pandas.core.series.Series | None = None, _modalities: numpy.ndarray[Any, numpy.dtype[numpy.bytes_]] | None = None, D_c: numpy.ndarray[Any, numpy.dtype[numpy.float64]] | None = None, type: str | None = None, is_fitted: bool = False, correlations: pandas.core.frame.DataFrame | None = None, contributions: pandas.core.frame.DataFrame | None = None, cos2: pandas.core.frame.DataFrame | None = None, dummies_col_prop: numpy.ndarray[Any, numpy.dtype[numpy.float64]] | None = None)

Bases: object

D_c: ndarray[Any, dtype[float64]] | None = None

U: ndarray[Any, dtype[float64]]

V: ndarray[Any, dtype[float64]]

column_weights: ndarray[Any, dtype[float64]]

contributions: DataFrame | None = None

correlations: DataFrame | None = None

cos2: DataFrame | None = None

dummies_col_prop: ndarray[Any, dtype[float64]] | None = None

dummy_categorical: List[str]

explained_var: ndarray[Any, dtype[float64]]

explained_var_ratio: ndarray[Any, dtype[float64]]

is_fitted: bool = False

mean: Series | None = None

modalities_types: Dict[str, str]

nf: int

original_categorical: List[str]

original_continuous: List[str]

original_dtypes: Series

prop: Series | None = None

row_weights: ndarray[Any, dtype[float64]]

s: ndarray[Any, dtype[float64]] | None = None

std: Series | None = None

type: str | None = None

variable_coord: DataFrame

saiph.famd

FAMD projection module.

saiph.reduction.famd.center(df: DataFrame, quanti: List[str], quali: List[str]) → Tuple[DataFrame, ndarray[Any, dtype[float64]], ndarray[Any, dtype[float64]], ndarray[Any, dtype[Any]], ndarray[Any, dtype[Any]]]

Center data, scale it, compute modalities and proportions of each categorical.

Used as internal function during fit.

NB: saiph.reduction.famd.scaler is better suited when a Model is already fitted.

Parameters:

df – DataFrame to center.
quanti – Indices of continuous variables.
quali – Indices of categorical variables.

Returns:

df_scale – The scaled DataFrame.
mean – Mean of the input dataframe.
std – Standard deviation of the input dataframe.
prop – Proportion of each categorical.
_modalities – Modalities for the MCA.
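The quantities returned by center can be sketched on toy data with the standard library. This shows the general idea only; the variable names are hypothetical, and the use of the population standard deviation is an assumption (the exact scaling saiph applies, in particular to categorical indicators, is not described here).

```python
from collections import Counter
from statistics import fmean, pstdev

# Hypothetical toy data: one continuous and one categorical variable.
ages = [20.0, 30.0, 40.0, 50.0]
colors = ["red", "green", "red", "red"]

# Continuous part: center on the mean, then divide by the standard deviation.
# Assumption: population standard deviation.
age_mean = fmean(ages)
age_std = pstdev(ages)
scaled_ages = [(a - age_mean) / age_std for a in ages]

# Categorical part: list the modalities and the proportion of each one.
counts = Counter(colors)
modalities = sorted(counts)
proportions = {m: counts[m] / len(colors) for m in modalities}
```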

saiph.reduction.famd.compute_categorical_cos2(model: Model, df: DataFrame, min_nf: int) → DataFrame

Compute the cos2 statistic for categorical variables.

Parameters:

model – model
df – dataframe
min_nf – number of degrees of freedom

Returns:

dataframe of categorical cos2

saiph.reduction.famd.compute_continuous_cos2(model: Model, scaled_df: DataFrame, min_nf: int, s: ndarray[Any, dtype[float64]], U: ndarray[Any, dtype[float64]]) → DataFrame

saiph.reduction.famd.fit(df: DataFrame, nf: int | None = None, col_weights: ndarray[Any, dtype[float64]] | None = None, center: Callable[[DataFrame, List[str], List[str]], Tuple[DataFrame, ndarray[Any, dtype[float64]], ndarray[Any, dtype[float64]], ndarray[Any, dtype[Any]], ndarray[Any, dtype[Any]]]] = <function center>, seed: int | None = None) → Model

Fit a FAMD model on data.

Parameters:

df – Data to project.
nf – Number of components to keep. default: min(df.shape)
col_weights – Weight assigned to each variable in the projection (more weight = more importance in the axes). default: np.ones(df.shape[1])

Returns:

The model for transforming new data.

Return type:

model

saiph.reduction.famd.fit_transform(df: DataFrame, nf: int | None = None, col_weights: ndarray[Any, dtype[float64]] | None = None, seed: int | None = None) → Tuple[DataFrame, Model]

Fit a FAMD model on data and return transformed data.

Parameters:

df – Data to project.
nf – Number of components to keep. default: min(df.shape)
col_weights – Weight assigned to each variable in the projection (more weight = more importance in the axes). default: np.ones(df.shape[1])

Returns:

coord – The transformed data.
model – The model for transforming new data.

saiph.reduction.famd.get_variable_contributions(model: Model, df: DataFrame, explode: bool = False) → Tuple[DataFrame, DataFrame]

Compute the contributions of the df variables within the fitted space.

Parameters:

model – Model computed by fit.
df – dataframe to compute contributions from
explode – whether to split the contributions of each modality (True) or sum them as the contribution of the whole variable (False)

Returns:

tuple of contributions and cos2.

saiph.reduction.famd.scaler(model: Model, df: DataFrame) → DataFrame

Scale data using mean, std, modalities and proportions of each categorical from model.

Parameters:

model – Model computed by fit.
df – DataFrame to scale.

Returns:

The scaled DataFrame.

Return type:

df_scaled

saiph.reduction.famd.stats(model: Model, df: DataFrame, explode: bool = False) → Model

Compute contributions and cos2.

Parameters:

model – Model computed by fit.
df – dataframe to compute statistics from
explode – whether to split the contributions of each modality (True) or sum them as the contribution of the whole variable (False)

Returns:

model populated with contribution and cos2.

Return type:

model

saiph.reduction.famd.transform(df: DataFrame, model: Model, *, scaler: Callable[[Model, DataFrame], DataFrame] = <function scaler>) → DataFrame

Scale and project into the fitted numerical space.

Parameters:

df – DataFrame to transform.
model – Model computed by fit.

Returns:

Coordinates of the dataframe in the fitted space.

Return type:

coord

saiph.mca

MCA projection module.

saiph.reduction.mca.center(df: DataFrame) → Tuple[DataFrame, ndarray[Any, dtype[Any]], ndarray[Any, dtype[Any]], ndarray[Any, dtype[Any]]]

Center data and compute modalities.

Used as internal function during fit.

NB: saiph.reduction.mca.scaler is better suited when a Model is already fitted.

Parameters:

df – DataFrame to center.

Returns:

df_centered – The centered DataFrame.
_modalities – Modalities for the MCA.
row_sum – Sums row by row.
column_sum – Sums column by column.

saiph.reduction.mca.fit(df: DataFrame, nf: int | None = None, col_weights: ndarray[Any, dtype[float64]] | None = None, seed: int | None = None) → Model

Fit a MCA model on data.

Parameters:

df – Data to project.
nf – Number of components to keep. default: min(df.shape)
col_weights – Weight assigned to each variable in the projection (more weight = more importance in the axes). default: np.ones(df.shape[1])

Returns:

The model for transforming new data.

Return type:

model

saiph.reduction.mca.fit_transform(df: DataFrame, nf: int | None = None, col_weights: ndarray[Any, dtype[float64]] | None = None, seed: int | None = None) → Tuple[DataFrame, Model]

Fit a MCA model on data and return transformed data.

Parameters:

df – Data to project.
nf – Number of components to keep. default: min(df.shape)
col_weights – Weight assigned to each variable in the projection (more weight = more importance in the axes). default: np.ones(df.shape[1])

Returns:

coord – The transformed data.
model – The model for transforming new data.

saiph.reduction.mca.get_variable_contributions(model: Model, df: DataFrame, explode: bool = False) → DataFrame

Compute the contributions of the df variables within the fitted space.

Parameters:

model – Model computed by fit.
df – dataframe to compute contributions from
explode – whether to split the contributions of each modality (True) or sum them as the contribution of the whole variable (False)

Returns:

contributions

saiph.reduction.mca.scaler(model: Model, df: DataFrame) → DataFrame

Scale data using modalities from model.

Parameters:

model – Model computed by fit.
df – DataFrame to scale.

Returns:

The scaled DataFrame.

Return type:

df_scaled

saiph.reduction.mca.stats(model: Model, df: DataFrame, explode: bool = False) → Model

Compute the contributions.

Parameters:

model – Model computed by fit.
df – dataframe to compute contributions from in the original space
explode – whether to split the contributions of each modality (True) or sum them as the contribution of the whole variable (False)

Returns:

model.

saiph.reduction.mca.transform(df: DataFrame, model: Model) → DataFrame

Scale and project into the fitted numerical space.

Parameters:

df – DataFrame to transform.
model – Model computed by fit.

Returns:

Coordinates of the dataframe in the fitted space.

Return type:

coord

saiph.pca

PCA projection module.

saiph.reduction.pca.center(df: DataFrame) → Tuple[DataFrame, Series, Series]

Center data and standardize it. Compute mean and std values.

Used as internal function during fit.

NB: saiph.reduction.pca.scaler is better suited when a Model is already fitted.

Parameters:

df – DataFrame to center.

Returns:

df – The centered DataFrame.
mean – Mean of the input dataframe.
std – Standard deviation of the input dataframe.

saiph.reduction.pca.fit(df: DataFrame, nf: int | None = None, col_weights: ndarray[Any, dtype[float64]] | None = None, seed: int | None = None) → Model

Fit a PCA model on data.

Parameters:

df – Data to project.
nf – Number of components to keep. default: min(df.shape)
col_weights – Weight assigned to each variable in the projection (more weight = more importance in the axes). default: np.ones(df.shape[1])

Returns:

The model for transforming new data.

Return type:

model

saiph.reduction.pca.fit_transform(df: DataFrame, nf: int | None = None, col_weights: ndarray[Any, dtype[float64]] | None = None, seed: int | None = None) → Tuple[DataFrame, Model]

Fit a PCA model on data and return transformed data.

Parameters:

df – Data to project.
nf – Number of components to keep. default: min(df.shape)
col_weights – Weight assigned to each variable in the projection (more weight = more importance in the axes). default: np.ones(df.shape[1])

Returns:

coord – The transformed data.
model – The model for transforming new data.

saiph.reduction.pca.scaler(model: Model, df: DataFrame) → DataFrame

Scale data using mean and std from model.

Parameters:

model – Model computed by fit.
df – DataFrame to scale.

Returns:

The scaled DataFrame.

Return type:

df
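Scaling new data with statistics stored at fit time, rather than recomputing them, can be sketched as follows. The dictionaries stand in for the mean and std kept on a fitted model; `scale_row` and the column names are hypothetical.

```python
from typing import Dict

# Hypothetical statistics, as they might be stored on a fitted model.
fitted_mean = {"height": 170.0, "weight": 65.0}
fitted_std = {"height": 10.0, "weight": 5.0}


def scale_row(row: Dict[str, float], mean: Dict[str, float], std: Dict[str, float]) -> Dict[str, float]:
    """Standardize one row using the fit-time mean and std, not the row's own."""
    return {name: (value - mean[name]) / std[name] for name, value in row.items()}


new_row = {"height": 180.0, "weight": 60.0}
scaled = scale_row(new_row, fitted_mean, fitted_std)
```

Reusing the fit-time statistics keeps new individuals in the same space as the data the model was fitted on.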

saiph.reduction.pca.transform(df: DataFrame, model: Model) → DataFrame

Scale and project into the fitted numerical space.

Parameters:

df – DataFrame to transform.
model – Model computed by fit.

Returns:

Coordinates of the dataframe in the fitted space.

Return type:

coord

saiph.svd

saiph.reduction.utils.svd.get_direct_randomized_svd(A: ndarray[Any, dtype[float64]], l_retained_dimensions: int, q: int = 2, seed: int | None = None) → Tuple[ndarray[Any, dtype[float64]], ndarray[Any, dtype[float64]], ndarray[Any, dtype[float64]]]

Compute a fixed-rank SVD approximation using random methods.

The computation of the randomized SVD is generally faster than a regular SVD when we retain a smaller number of dimensions than the dimension of the matrix.

From https://arxiv.org/abs/0909.4061, algorithm 5.1 page 29 (Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. Halko, Nathan and Martinsson, Per-Gunnar and Tropp, Joel A.)

Parameters:

A – input matrix, shape (m, n)
l_retained_dimensions – target number of retained dimensions, l < min(m, n)
q – exponent of the power method. The higher this exponent, the more precise the SVD will be, but the more complex it is to compute. default: 2
seed – random seed. default: None

Returns:

U – unitary matrix having left singular vectors as columns, shape (m, l)
S – vector of the singular values, shape (l,)
Vt – unitary matrix having right singular vectors as rows, shape (l, n)
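The direct SVD stage of the cited paper (algorithm 5.1) can be sketched with NumPy. This is an illustration under stated assumptions, not saiph's code: `direct_svd_sketch` is a hypothetical name, and the basis Q is built here with a plain Gaussian range finder without power iterations.

```python
import numpy as np


def direct_svd_sketch(A: np.ndarray, l: int, seed=None):
    """Rank-l SVD approximation of A (hypothetical helper, not saiph code)."""
    rng = np.random.default_rng(seed)
    # Range finder (assumption: no power iterations): orthonormal basis Q of A @ omega.
    omega = rng.standard_normal((A.shape[1], l))
    Q, _ = np.linalg.qr(A @ omega)
    # Direct SVD: project A on the basis, decompose the small matrix, map back.
    B = Q.T @ A
    U_tilde, S, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_tilde
    return U, S, Vt


# Toy matrix of exact rank 3, so a rank-3 approximation recovers it.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 3)) @ rng.standard_normal((3, 20))
U, S, Vt = direct_svd_sketch(A, l=3, seed=1)
```

The expensive decomposition is applied to B, which has only l rows, which is where the speed-up over a full SVD comes from.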

saiph.reduction.utils.svd.get_randomized_subspace_iteration(A: ndarray[Any, dtype[float64]], l_retained_dimensions: int, *, q: int = 2, seed: int | None = None) → ndarray[Any, dtype[float64]]

Generate a subspace for more efficient SVD computation using random methods.

From https://arxiv.org/abs/0909.4061, algorithm 4.4 page 27 (Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. Halko, Nathan and Martinsson, Per-Gunnar and Tropp, Joel A.)

Parameters:

A – input matrix, shape (m, n)
l_retained_dimensions – target number of retained dimensions, l < min(m, n)
q – exponent of the power method. The higher this exponent, the more precise the SVD will be, but the more complex it is to compute. default: 2
seed – random seed. default: None

Returns:

Q

Return type:

matrix whose range approximates the range of A, shape (m, l)
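Randomized subspace iteration as described in the cited paper (algorithm 4.4) can be sketched with NumPy. This is a sketch under assumptions rather than the library's implementation; `subspace_iteration_sketch` is a hypothetical name.

```python
import numpy as np


def subspace_iteration_sketch(A: np.ndarray, l: int, q: int = 2, seed=None) -> np.ndarray:
    """Return Q with orthonormal columns whose range approximates the range of A."""
    rng = np.random.default_rng(seed)
    # Start from a random Gaussian test matrix and orthonormalize A @ omega.
    omega = rng.standard_normal((A.shape[1], l))
    Q, _ = np.linalg.qr(A @ omega)
    # q power iterations, re-orthonormalizing after each product for stability.
    for _ in range(q):
        Q_tilde, _ = np.linalg.qr(A.T @ Q)
        Q, _ = np.linalg.qr(A @ Q_tilde)
    return Q


# Toy matrix of exact rank 3: its range is captured by l = 3 columns.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 3)) @ rng.standard_normal((3, 20))
Q = subspace_iteration_sketch(A, l=3, q=2, seed=1)
```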

saiph.reduction.utils.svd.get_svd(df: DataFrame, nf: int | None = None, *, svd_flip: bool = True, seed: int | None = None) → Tuple[ndarray[Any, dtype[float64]], ndarray[Any, dtype[float64]], ndarray[Any, dtype[float64]]]

Compute Singular Value Decomposition.

Parameters:

df – matrix to decompose, shape (m, n)
nf – target number of dimensions to retain (number of singular values). Keeps the nf highest singular values and the nf associated singular vectors. default: None
svd_flip – whether to use svd_flip on U and V. default: True
seed – random seed. default: None

Returns:

U – unitary matrix having left singular vectors as columns, shape (m, l)
S – vector of the singular values, shape (l,)
Vt – unitary matrix having right singular vectors as rows, shape (l, n)
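Keeping the nf highest singular values and fixing the sign of the singular vectors can be sketched with NumPy. The helper `truncated_svd_sketch` is hypothetical, and the sign convention shown (largest absolute entry of each column of U made positive) is an assumption about what a sign flip typically does, not a statement of saiph's behaviour.

```python
import numpy as np


def truncated_svd_sketch(X: np.ndarray, nf=None, flip: bool = True):
    """Exact SVD truncated to nf components (hypothetical helper, not saiph code)."""
    # numpy returns singular values sorted in descending order.
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    if nf is not None:
        U, S, Vt = U[:, :nf], S[:nf], Vt[:nf, :]
    if flip:
        # Assumed convention: make the largest-magnitude entry of each U column positive.
        idx = np.argmax(np.abs(U), axis=0)
        signs = np.sign(U[idx, np.arange(U.shape[1])])
        U = U * signs
        Vt = Vt * signs[:, np.newaxis]
    return U, S, Vt


rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))
U, S, Vt = truncated_svd_sketch(X, nf=2)
```

Flipping a column of U together with the matching row of Vt leaves the product unchanged, so the decomposition stays valid while the output becomes deterministic in sign.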

saiph.visualization

Visualization functions.

saiph.visualization.plot_circle(model: Model, dimensions: List[int] | None = None, min_cor: float = 0.1, max_var: int = 7) → None

Plot correlation circle.

Parameters:

model – The model for transforming new data.
dimensions – Dimensions to display on each axis
min_cor – Minimum correlation threshold to display arrow. default: 0.1
max_var – Number of variables to display (in descending order). default: 7

saiph.visualization.plot_explained_var(model: Model, max_dims: int = 10, cumulative: bool = False) → None

Plot explained variance per dimension.

Parameters:

model – Model computed by fit.
max_dims – Maximum number of dimensions to plot
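The values behind such a plot can be sketched from a list of explained variances. The cumulative flag appears in the signature but is not described above, so its meaning here (a running total of the ratios) is an assumption, as are the helper name `variance_to_plot` and the normalization by the sum of the variances.

```python
from itertools import accumulate
from typing import List


def variance_to_plot(explained_var: List[float], max_dims: int = 10, cumulative: bool = False) -> List[float]:
    """Return the per-dimension (or running) share of variance for the first max_dims axes."""
    total = sum(explained_var)
    ratios = [v / total for v in explained_var]
    if cumulative:
        # Assumption: cumulative=True means a running total of the ratios.
        ratios = list(accumulate(ratios))
    return ratios[:max_dims]


# Hypothetical explained variances for four dimensions.
explained_var = [4.0, 3.0, 2.0, 1.0]
per_dim = variance_to_plot(explained_var, max_dims=3)
running = variance_to_plot(explained_var, max_dims=3, cumulative=True)
```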

saiph.visualization.plot_projections(model: Model, data: DataFrame, dim: Tuple[int, int] = (0, 1)) → None

Plot projections in reduced space for input data.

Parameters:

model – Model computed by fit.
data – Data to plot in the reduced space
dim – Axes to use for the 2D plot (default (0,1))

saiph.visualization.plot_var_contribution(values: ndarray[Any, dtype[float64]], names: ndarray[Any, dtype[bytes_]], title: str = 'Variables contributions') → None

Plot the variable contributions for a given dimension.