API Reference
saiph
- saiph.fit(df: DataFrame, nf: int | None = None, col_weights: Dict[str, int | float] | None = None, sparse: bool = False) Model
Fit a PCA, MCA or FAMD model on data, inferring which of the three has to be used.
Datetimes must be stored as numbers of seconds since epoch.
- Parameters:
df – Data to project.
nf – Number of components to keep. default: None, which uses all columns.
col_weights – Weight assigned to each variable in the projection (more weight = more importance in the axes). default: np.ones(df.shape[1])
- Returns:
The model for transforming new data.
- Return type:
model
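The datetime requirement above can be met before calling fit. The following is a minimal sketch, assuming pandas is available and the datetime columns are timezone-naive; the helper name datetimes_to_epoch_seconds and the example columns are hypothetical and not part of saiph.

```python
import pandas as pd


def datetimes_to_epoch_seconds(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df whose datetime columns hold seconds since epoch."""
    out = df.copy()
    epoch = pd.Timestamp("1970-01-01")
    for col in out.columns:
        if pd.api.types.is_datetime64_any_dtype(out[col]):
            # Dividing a timedelta by one second yields a float number of seconds.
            out[col] = (out[col] - epoch) / pd.Timedelta(seconds=1)
    return out


raw = pd.DataFrame(
    {
        "signup": pd.to_datetime(["1970-01-01 00:00:10", "1970-01-02 00:00:00"]),
        "age": [31, 47],
    }
)
prepared = datetimes_to_epoch_seconds(raw)
# prepared can then be passed to saiph.fit(prepared, nf=...) if saiph is installed.
```

Non-datetime columns are left untouched, so the helper can be applied to a whole frame.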
- saiph.fit_transform(df: DataFrame, nf: int | None = None, col_weights: Dict[str, int | float] | None = None) Tuple[DataFrame, Model]
Fit a PCA, MCA or FAMD model on data, inferring which of the three has to be used.
Datetimes must be stored as numbers of seconds since epoch.
- Parameters:
df – Data to project.
nf – Number of components to keep. default: None, which uses all columns.
col_weights – Weight assigned to each variable in the projection (more weight = more importance in the axes). default: np.ones(df.shape[1])
- Returns:
The transformed data. model: The model for transforming new data.
- Return type:
coord
- saiph.inverse_transform(coord: DataFrame, model: Model, *, use_approximate_inverse: bool = False, use_max_modalities: bool = True, seed: int | None = None) DataFrame
Return original format dataframe from coordinates.
- Parameters:
coord – coord of individuals to reverse transform
model – model used for projection
use_approximate_inverse – the matrix is not invertible when n_individuals < n_dimensions; an approximation with bias can be done by setting this to True. default: False
use_max_modalities – for each variable, it assigns to the individual the modality with the highest proportion (True) or a random modality weighted by their proportion (False). default: True
seed – seed to fix randomness if use_max_modalities = False. default: None
- Returns:
Coordinates transformed into original space. Retains shape, encoding and structure.
- Return type:
inverse
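The two behaviours of use_max_modalities can be illustrated on a single categorical variable. This is a sketch under assumptions, not saiph's implementation: pick_modality and its arguments are hypothetical, and the proportions are taken as already computed for one individual.

```python
import numpy as np


def pick_modality(modalities, proportions, use_max_modalities=True, seed=None):
    """Choose one modality from per-modality proportions."""
    proportions = np.asarray(proportions, dtype=float)
    if use_max_modalities:
        # Deterministic: keep the modality with the highest proportion.
        return modalities[int(np.argmax(proportions))]
    # Random: draw one modality, weighted by the normalized proportions.
    rng = np.random.default_rng(seed)
    return rng.choice(modalities, p=proportions / proportions.sum())


colors = ["red", "blue", "green"]
weights = [0.2, 0.5, 0.3]
most_likely = pick_modality(colors, weights)
sampled = pick_modality(colors, weights, use_max_modalities=False, seed=42)
```

Fixing seed makes the random draw reproducible, which is the role the seed parameter plays when use_max_modalities is False.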
- saiph.stats(model: Model, df: DataFrame, explode: bool = False) Model
Compute the contributions and cos2.
- Parameters:
model – Model computed by fit.
df – original dataframe
explode – whether to split the contributions of each modality (True) or sum them as the contribution of the whole variable (False). Only valid for categorical variables.
- Returns:
model populated with contribution.
- Return type:
model
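The four top-level functions can be chained as follows. This is a hedged usage sketch that relies only on the signatures listed above; the example frame and the helper names make_example_frame and run_pipeline are hypothetical, and saiph is imported lazily so the block still loads where saiph is not installed.

```python
import pandas as pd


def make_example_frame() -> pd.DataFrame:
    """Build a small mixed-type frame (continuous and categorical columns)."""
    return pd.DataFrame(
        {
            "age": [23, 35, 41, 52, 29, 60],
            "income": [21.0, 38.5, 45.0, 61.2, 30.1, 58.9],
            "city": ["Paris", "Lyon", "Paris", "Nice", "Lyon", "Paris"],
        }
    )


def run_pipeline(df: pd.DataFrame, nf: int = 2):
    """Fit, project, compute statistics, then map coordinates back."""
    import saiph  # requires saiph to be installed

    coord, model = saiph.fit_transform(df, nf=nf)
    model = saiph.stats(model, df)  # populates contributions and cos2
    restored = saiph.inverse_transform(coord, model)
    return coord, model, restored


example = make_example_frame()
# coord, model, restored = run_pipeline(example)
```

With mixed column types such as these, the model selected is presumably FAMD; model.type can be inspected to check.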
saiph.models
- class saiph.models.Model(dummy_categorical: List[str], original_dtypes: pandas.core.series.Series, original_categorical: List[str], original_continuous: List[str], nf: int, column_weights: numpy.ndarray[Any, numpy.dtype[numpy.float64]], row_weights: numpy.ndarray[Any, numpy.dtype[numpy.float64]], explained_var: numpy.ndarray[Any, numpy.dtype[numpy.float64]], explained_var_ratio: numpy.ndarray[Any, numpy.dtype[numpy.float64]], variable_coord: pandas.core.frame.DataFrame, V: numpy.ndarray[Any, numpy.dtype[numpy.float64]], modalities_types: Dict[str, str], U: numpy.ndarray[Any, numpy.dtype[numpy.float64]], s: numpy.ndarray[Any, numpy.dtype[numpy.float64]] | None = None, mean: pandas.core.series.Series | None = None, std: pandas.core.series.Series | None = None, prop: pandas.core.series.Series | None = None, _modalities: numpy.ndarray[Any, numpy.dtype[numpy.bytes_]] | None = None, D_c: numpy.ndarray[Any, numpy.dtype[numpy.float64]] | None = None, type: str | None = None, is_fitted: bool = False, correlations: pandas.core.frame.DataFrame | None = None, contributions: pandas.core.frame.DataFrame | None = None, cos2: pandas.core.frame.DataFrame | None = None, dummies_col_prop: numpy.ndarray[Any, numpy.dtype[numpy.float64]] | None = None)
Bases: object
- D_c: ndarray[Any, dtype[float64]] | None = None
- U: ndarray[Any, dtype[float64]]
- V: ndarray[Any, dtype[float64]]
- column_weights: ndarray[Any, dtype[float64]]
- contributions: DataFrame | None = None
- correlations: DataFrame | None = None
- cos2: DataFrame | None = None
- dummies_col_prop: ndarray[Any, dtype[float64]] | None = None
- dummy_categorical: List[str]
- explained_var: ndarray[Any, dtype[float64]]
- explained_var_ratio: ndarray[Any, dtype[float64]]
- is_fitted: bool = False
- mean: Series | None = None
- modalities_types: Dict[str, str]
- nf: int
- original_categorical: List[str]
- original_continuous: List[str]
- original_dtypes: Series
- prop: Series | None = None
- row_weights: ndarray[Any, dtype[float64]]
- s: ndarray[Any, dtype[float64]] | None = None
- std: Series | None = None
- type: str | None = None
- variable_coord: DataFrame
saiph.famd
FAMD projection module.
- saiph.reduction.famd.center(df: DataFrame, quanti: List[str], quali: List[str]) Tuple[DataFrame, ndarray[Any, dtype[float64]], ndarray[Any, dtype[float64]], ndarray[Any, dtype[Any]], ndarray[Any, dtype[Any]]]
Center data, scale it, compute modalities and proportions of each categorical.
Used as internal function during fit.
NB: saiph.reduction.famd.scaler is better suited when a Model is already fitted.
- Parameters:
df – DataFrame to center.
quanti – Indices of continuous variables.
quali – Indices of categorical variables.
- Returns:
The scaled DataFrame. mean: Mean of the input dataframe. std: Standard deviation of the input dataframe. prop: Proportion of each categorical. _modalities: Modalities for the MCA.
- Return type:
df_scale
- saiph.reduction.famd.compute_categorical_cos2(model: Model, df: DataFrame, min_nf: int) DataFrame
Compute the cos2 statistic for categorical variables.
- Parameters:
model – model
df – dataframe
min_nf – number of degrees of freedom
- Return type:
dataframe of categorical cos2
- saiph.reduction.famd.compute_continuous_cos2(model: Model, scaled_df: DataFrame, min_nf: int, s: ndarray[Any, dtype[float64]], U: ndarray[Any, dtype[float64]]) DataFrame
- saiph.reduction.famd.fit(df: DataFrame, nf: int | None = None, col_weights: ndarray[Any, dtype[float64]] | None = None, center: Callable[[DataFrame, List[str], List[str]], Tuple[DataFrame, ndarray[Any, dtype[float64]], ndarray[Any, dtype[float64]], ndarray[Any, dtype[Any]], ndarray[Any, dtype[Any]]]] = <function center>, seed: int | None = None) Model
Fit a FAMD model on data.
- Parameters:
df – Data to project.
nf – Number of components to keep. default: min(df.shape)
col_weights – Weight assigned to each variable in the projection (more weight = more importance in the axes). default: np.ones(df.shape[1])
- Returns:
The model for transforming new data.
- Return type:
model
- saiph.reduction.famd.fit_transform(df: DataFrame, nf: int | None = None, col_weights: ndarray[Any, dtype[float64]] | None = None, seed: int | None = None) Tuple[DataFrame, Model]
Fit a FAMD model on data and return transformed data.
- Parameters:
df – Data to project.
nf – Number of components to keep. default: min(df.shape)
col_weights – Weight assigned to each variable in the projection (more weight = more importance in the axes). default: np.ones(df.shape[1])
- Returns:
The transformed data. model: The model for transforming new data.
- Return type:
coord
- saiph.reduction.famd.get_variable_contributions(model: Model, df: DataFrame, explode: bool = False) Tuple[DataFrame, DataFrame]
Compute the contributions of the df variables within the fitted space.
- Parameters:
model – Model computed by fit.
df – dataframe to compute contributions from
explode – whether to split the contributions of each modality (True) or sum them as the contribution of the whole variable (False)
- Returns:
tuple of contributions and cos2.
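The effect of explode can be pictured as the difference between a per-modality table and the same table summed per variable. The sketch below uses made-up numbers and hypothetical row and column labels ("color_red", "Dim. 1"); it does not reproduce how saiph names or computes contributions.

```python
import pandas as pd

# Hypothetical per-modality contributions (what explode=True would expose).
exploded = pd.DataFrame(
    {"Dim. 1": [10.0, 30.0, 60.0], "Dim. 2": [45.0, 5.0, 50.0]},
    index=["color_red", "color_blue", "age"],
)

# Map each dummy column back to the variable it came from.
modality_to_variable = {"color_red": "color", "color_blue": "color", "age": "age"}

# Summing per variable gives the explode=False view.
summed = exploded.groupby(modality_to_variable).sum()
```

Continuous variables have a single row in both views, which is why explode only matters for categorical variables.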
- saiph.reduction.famd.scaler(model: Model, df: DataFrame) DataFrame
Scale data using mean, std, modalities and proportions of each categorical from model.
- Parameters:
model – Model computed by fit.
df – DataFrame to scale.
- Returns:
The scaled DataFrame.
- Return type:
df_scaled
- saiph.reduction.famd.stats(model: Model, df: DataFrame, explode: bool = False) Model
Compute contributions and cos2.
- Parameters:
model – Model computed by fit.
df – dataframe to compute statistics from
explode – whether to split the contributions of each modality (True) or sum them as the contribution of the whole variable (False)
- Returns:
model populated with contribution and cos2.
- Return type:
model
- saiph.reduction.famd.transform(df: DataFrame, model: Model, *, scaler: Callable[[Model, DataFrame], DataFrame] = <function scaler>) DataFrame
Scale and project into the fitted numerical space.
- Parameters:
df – DataFrame to transform.
model – Model computed by fit.
- Returns:
Coordinates of the dataframe in the fitted space.
- Return type:
coord
saiph.mca
MCA projection module.
- saiph.reduction.mca.center(df: DataFrame) Tuple[DataFrame, ndarray[Any, dtype[Any]], ndarray[Any, dtype[Any]], ndarray[Any, dtype[Any]]]
Center data and compute modalities.
Used as internal function during fit.
NB: saiph.reduction.mca.scaler is better suited when a Model is already fitted.
- Parameters:
df – DataFrame to center.
- Returns:
The centered DataFrame. _modalities: Modalities for the MCA. row_sum: Sums line by line. column_sum: Sums column by column.
- Return type:
df_centered
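The quantities named above (modalities, row sums, column sums) can be sketched on a one-hot encoded table. This only illustrates the terms, assuming pandas.get_dummies as the encoder; the actual centering applied by saiph is not reproduced here.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "color": ["red", "blue", "red", "green"],
        "size": ["S", "L", "S", "S"],
    }
)

# One column per modality, 1.0 where the individual has that modality.
dummies = pd.get_dummies(df).astype(float)
modalities = list(dummies.columns)

row_sum = dummies.sum(axis=1)     # one value per individual
column_sum = dummies.sum(axis=0)  # one value per modality
proportions = column_sum / len(df)
```

Each row sums to the number of categorical variables, since an individual has exactly one modality per variable.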
- saiph.reduction.mca.fit(df: DataFrame, nf: int | None = None, col_weights: ndarray[Any, dtype[float64]] | None = None, seed: int | None = None) Model
Fit a MCA model on data.
- Parameters:
df – Data to project.
nf – Number of components to keep. default: min(df.shape)
col_weights – Weight assigned to each variable in the projection (more weight = more importance in the axes). default: np.ones(df.shape[1])
- Returns:
The model for transforming new data.
- Return type:
model
- saiph.reduction.mca.fit_transform(df: DataFrame, nf: int | None = None, col_weights: ndarray[Any, dtype[float64]] | None = None, seed: int | None = None) Tuple[DataFrame, Model]
Fit a MCA model on data and return transformed data.
- Parameters:
df – Data to project.
nf – Number of components to keep. default: min(df.shape)
col_weights – Weight assigned to each variable in the projection (more weight = more importance in the axes). default: np.ones(df.shape[1])
- Returns:
The transformed data. model: The model for transforming new data.
- Return type:
coord
- saiph.reduction.mca.get_variable_contributions(model: Model, df: DataFrame, explode: bool = False) DataFrame
Compute the contributions of the df variables within the fitted space.
- Parameters:
model – Model computed by fit.
df – dataframe to compute contributions from
explode – whether to split the contributions of each modality (True) or sum them as the contribution of the whole variable (False)
- Returns:
contributions
- saiph.reduction.mca.scaler(model: Model, df: DataFrame) DataFrame
Scale data using modalities from model.
- Parameters:
model – Model computed by fit.
df – DataFrame to scale.
- Returns:
The scaled DataFrame.
- Return type:
df_scaled
- saiph.reduction.mca.stats(model: Model, df: DataFrame, explode: bool = False) Model
Compute the contributions.
- Parameters:
model – Model computed by fit.
df – dataframe to compute contributions from in the original space
explode – whether to split the contributions of each modality (True) or sum them as the contribution of the whole variable (False)
- Returns:
model.
saiph.pca
PCA projection module.
- saiph.reduction.pca.center(df: DataFrame) Tuple[DataFrame, Series, Series]
Center data and standardize it. Compute mean and std values.
Used as internal function during fit.
NB: saiph.reduction.pca.scaler is better suited when a Model is already fitted.
- Parameters:
df – DataFrame to center.
- Returns:
The centered DataFrame. mean: Mean of the input dataframe. std: Standard deviation of the input dataframe.
- Return type:
df
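The split between center (used during fit) and scaler (used once a Model is fitted) can be sketched with NumPy. This is an illustration under assumptions: the function bodies are not saiph's, and the population standard deviation (ddof=0) is an arbitrary choice made here.

```python
import numpy as np


def center(X):
    """Standardize X and return the statistics needed to reuse the scaling."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)  # ddof=0; an assumption of this sketch
    return (X - mean) / std, mean, std


def scaler(X_new, mean, std):
    """Scale new data with statistics computed at fit time."""
    return (X_new - mean) / std


train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
centered, mean, std = center(train)

# New rows must reuse the training mean and std, not their own.
new_scaled = scaler(np.array([[2.0, 30.0]]), mean, std)
```

Recomputing the statistics on new data would place it in a different space than the one the model was fitted in, which is why the stored mean and std are reused.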
- saiph.reduction.pca.fit(df: DataFrame, nf: int | None = None, col_weights: ndarray[Any, dtype[float64]] | None = None, seed: int | None = None) Model
Fit a PCA model on data.
- Parameters:
df – Data to project.
nf – Number of components to keep. default: min(df.shape)
col_weights – Weight assigned to each variable in the projection (more weight = more importance in the axes). default: np.ones(df.shape[1])
- Returns:
The model for transforming new data.
- Return type:
model
- saiph.reduction.pca.fit_transform(df: DataFrame, nf: int | None = None, col_weights: ndarray[Any, dtype[float64]] | None = None, seed: int | None = None) Tuple[DataFrame, Model]
Fit a PCA model on data and return transformed data.
- Parameters:
df – Data to project.
nf – Number of components to keep. default: min(df.shape)
col_weights – Weight assigned to each variable in the projection (more weight = more importance in the axes). default: np.ones(df.shape[1])
- Returns:
The transformed data. model: The model for transforming new data.
- Return type:
coord
saiph.svd
- saiph.reduction.utils.svd.get_direct_randomized_svd(A: ndarray[Any, dtype[float64]], l_retained_dimensions: int, q: int = 2, seed: int | None = None) Tuple[ndarray[Any, dtype[float64]], ndarray[Any, dtype[float64]], ndarray[Any, dtype[float64]]]
Compute a fixed-rank SVD approximation using random methods.
The computation of the randomized SVD is generally faster than a regular SVD when we retain a smaller number of dimensions than the dimension of the matrix.
From https://arxiv.org/abs/0909.4061, algorithm 5.1 page 29 (Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. Halko, Nathan and Martinsson, Per-Gunnar and Tropp, Joel A.)
- Parameters:
A – input matrix, shape (m, n)
l_retained_dimensions – target number of retained dimensions, l < min(m, n)
q – exponent of the power method. The higher this exponent, the more precise the SVD will be, but the more complex it is to compute. default: 2
seed – random seed. default: None
- Returns:
U (unitary matrix having left singular vectors as columns, shape (m,l))
S (vector of the singular values, shape (l,))
Vt (unitary matrix having right singular vectors as rows, shape (l,n))
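The direct SVD step from the cited paper can be sketched as follows: given an orthonormal basis Q whose range approximates the range of A, decompose the small matrix B = QᵀA and map the left singular vectors back through Q. This is a NumPy illustration, not saiph's code; direct_svd is a hypothetical name, and Q is built here with a single random projection for brevity.

```python
import numpy as np


def direct_svd(A, Q):
    """Approximate SVD of A from an orthonormal basis Q of its range."""
    B = Q.T @ A  # small (l, n) matrix
    U_tilde, S, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_tilde  # lift left singular vectors back to m dimensions
    return U, S, Vt


rng = np.random.default_rng(0)
# Rank-3 matrix of shape (50, 20), so 5 retained dimensions capture it fully.
A = rng.standard_normal((50, 3)) @ rng.standard_normal((3, 20))

# Orthonormal basis of a random projection of A, shape (50, 5).
Q, _ = np.linalg.qr(A @ rng.standard_normal((20, 5)))
U, S, Vt = direct_svd(A, Q)
```

Only the (l, n) matrix B is decomposed, which is where the speed-up over a full SVD comes from when l is much smaller than min(m, n).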
- saiph.reduction.utils.svd.get_randomized_subspace_iteration(A: ndarray[Any, dtype[float64]], l_retained_dimensions: int, *, q: int = 2, seed: int | None = None) ndarray[Any, dtype[float64]]
Generate a subspace for more efficient SVD computation using random methods.
From https://arxiv.org/abs/0909.4061, algorithm 4.4 page 27 (Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. Halko, Nathan and Martinsson, Per-Gunnar and Tropp, Joel A.)
- Parameters:
A – input matrix, shape (m, n)
l_retained_dimensions – target number of retained dimensions, l < min(m, n)
q – exponent of the power method. The higher this exponent, the more precise the SVD will be, but the more complex it is to compute. default: 2
seed – random seed. default: None
- Returns:
Q (matrix whose range approximates the range of A, shape (m, l))
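Randomized subspace iteration, as described in the cited paper, alternates multiplications by A and its transpose with a QR re-orthonormalization after each product. The following NumPy sketch follows that outline; it is not saiph's implementation, and the function name is hypothetical.

```python
import numpy as np


def randomized_subspace_iteration(A, l, *, q=2, seed=None):
    """Return an orthonormal (m, l) basis approximating the range of A."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    omega = rng.standard_normal((n, l))  # Gaussian test matrix
    Q, _ = np.linalg.qr(A @ omega)
    for _ in range(q):
        # Re-orthonormalize after each product to limit round-off error.
        Q_tilde, _ = np.linalg.qr(A.T @ Q)
        Q, _ = np.linalg.qr(A @ Q_tilde)
    return Q


rng = np.random.default_rng(0)
A = rng.standard_normal((40, 3)) @ rng.standard_normal((3, 15))  # rank 3
Q = randomized_subspace_iteration(A, 3, q=2, seed=1)
```

Each extra iteration sharpens the basis for matrices whose singular values decay slowly, at the cost of two more products with A.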
- saiph.reduction.utils.svd.get_svd(df: DataFrame, nf: int | None = None, *, svd_flip: bool = True, seed: int | None = None) Tuple[ndarray[Any, dtype[float64]], ndarray[Any, dtype[float64]], ndarray[Any, dtype[float64]]]
Compute Singular Value Decomposition.
- Parameters:
df – matrix to decompose, shape (m, n)
nf – target number of dimensions to retain (number of singular values). It keeps the nf largest singular values and the nf associated singular vectors. default: None
svd_flip – whether to use svd_flip on U and V or not. default: True
seed – random seed. default: None
- Returns:
U (unitary matrix having left singular vectors as columns, shape (m,l))
S (vector of the singular values, shape (l,))
Vt (unitary matrix having right singular vectors as rows, shape (l,n))
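Singular vectors are only defined up to a sign, so a flip step is commonly used to make the output deterministic. The sketch below truncates to nf components and applies one such convention (the largest-magnitude entry of each column of U is made positive). Which convention saiph uses is not stated above, so this choice is an assumption, and svd_with_flip is a hypothetical name.

```python
import numpy as np


def svd_with_flip(X, nf=None, svd_flip=True):
    """Truncated SVD with an optional deterministic sign convention."""
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    if nf is not None:
        # Singular values come sorted in descending order: keep the nf largest.
        U, S, Vt = U[:, :nf], S[:nf], Vt[:nf, :]
    if svd_flip:
        cols = np.arange(U.shape[1])
        signs = np.sign(U[np.argmax(np.abs(U), axis=0), cols])
        signs[signs == 0] = 1.0
        # Flipping a column of U and the matching row of Vt keeps the product.
        U = U * signs
        Vt = Vt * signs[:, np.newaxis]
    return U, S, Vt


X = np.random.default_rng(0).standard_normal((6, 4))
U, S, Vt = svd_with_flip(X)
U2, S2, Vt2 = svd_with_flip(X, nf=2)
```

Because each sign is applied to both a column of U and the matching row of Vt, the reconstruction is unchanged by the flip.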
saiph.visualization
Visualization functions.
- saiph.visualization.plot_circle(model: Model, dimensions: List[int] | None = None, min_cor: float = 0.1, max_var: int = 7) None
Plot correlation circle.
- Parameters:
model – The model for transforming new data.
dimensions – Dimensions to plot on each axis
min_cor – Minimum correlation threshold to display arrow. default: 0.1
max_var – Number of variables to display (in descending order). default: 7
- saiph.visualization.plot_explained_var(model: Model, max_dims: int = 10, cumulative: bool = False) None
Plot explained variance per dimension.
- Parameters:
model – Model computed by fit.
max_dims – Maximum number of dimensions to plot
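The values behind such a plot can be sketched from singular values: the variance carried by each dimension is proportional to the squared singular value, and the cumulative option corresponds to a running sum of the ratios. The singular values below are made up, and the exact normalization saiph applies to explained_var is not stated above; the ratio does not depend on it.

```python
import numpy as np

s = np.array([3.0, 2.0, 1.0])  # hypothetical singular values
max_dims = 2

explained_var = s**2  # proportional to the variance per dimension
explained_var_ratio = explained_var / explained_var.sum()
cumulative_ratio = np.cumsum(explained_var_ratio)

# Values that a plot limited to max_dims dimensions would display.
per_dimension = explained_var_ratio[:max_dims]
running_total = cumulative_ratio[:max_dims]
```

The cumulative view makes it easier to read how many dimensions are needed to reach a given share of the total variance.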
- saiph.visualization.plot_projections(model: Model, data: DataFrame, dim: Tuple[int, int] = (0, 1)) None
Plot projections in reduced space for input data.
- Parameters:
model – Model computed by fit.
data – Data to plot in the reduced space
dim – Axes to use for the 2D plot (default (0,1))
- saiph.visualization.plot_var_contribution(values: ndarray[Any, dtype[float64]], names: ndarray[Any, dtype[bytes_]], title: str = 'Variables contributions') None
Plot the variable contributions for a given dimension.