topologic.embedding.clustering package

topologic.embedding.clustering.dbscan(embedding: numpy.ndarray, eps: float = 0.5, min_samples: int = 5, metric: str = 'minkowski', metric_params: dict = None, algorithm: str = 'auto', leaf_size: int = 30, p: float = 2, sample_weight: array.array = None, n_jobs: int = None) → numpy.ndarray

Perform DBSCAN clustering from a vector array or distance matrix.

Parameters

embedding (numpy.ndarray) – An n x d array of vectors representing n labels in a d-dimensional space
eps (Optional[float]) – The maximum distance between two samples for them to be considered as in the same neighborhood.
min_samples (Optional[int]) – The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself.
metric (Union[str, Callable[[float, float], float]]) –
The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by sklearn.metrics.pairwise_distances() for its metric parameter. If metric is “precomputed”, embedding is assumed to be a distance matrix and must be square. It may be a sparse matrix, in which case only “nonzero” elements may be considered neighbors for DBSCAN.

If metric is a callable function, it is called on each pair of instances (rows) and the resulting value is recorded. The callable should take two rows of embedding as input and return a value indicating the distance between them.
metric_params (Optional[dict]) – Additional keyword arguments for the metric function.
algorithm (Optional[str]) – The algorithm to be used by the NearestNeighbors module to compute pointwise distances and find nearest neighbors. One of ‘auto’, ‘ball_tree’, ‘kd_tree’, or ‘brute’. Default ‘auto’
leaf_size (Optional[int]) – Leaf size passed to BallTree or cKDTree. This can affect the speed of tree construction and queries, as well as the memory required to store the tree. The optimal value depends on the nature of the problem. Default 30
p (Optional[float]) – The power of the Minkowski metric to be used to calculate distance between points. Default 2.0
sample_weight (Optional[Array[int]]) – Weight of each sample, such that a sample with a weight of at least min_samples is by itself a core sample; a sample with negative weight may inhibit its eps-neighbor from being core. Note that weights are absolute, and default to 1.
n_jobs (Optional[int]) – The number of parallel jobs to run for neighbors search. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.

Returns

The cluster labels for each vector in the given embedding. The vector at index n in the embedding will have the label at index n in this returned array. Noisy samples are given the value -1.

Return type

numpy.ndarray
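The parameter list above closely mirrors scikit-learn's DBSCAN estimator. Assuming (this page does not confirm it) that the function forwards its arguments to that estimator, the following minimal sketch calls sklearn.cluster.DBSCAN directly to show how eps, min_samples, and the -1 noise label behave. The toy data is made up for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight groups of five points each, plus one far-away outlier (toy data).
group_a = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.05, 0.05]])
group_b = group_a + 5.0
outlier = np.array([[20.0, 20.0]])
embedding = np.vstack([group_a, group_b, outlier])

# eps=0.5 and min_samples=5 are the defaults listed above.
# Every point in a group has all five group members (itself included)
# within eps, so each one counts as a core point.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(embedding)

# labels[i] belongs to embedding[i]; the outlier has no dense
# neighborhood, so it is reported as noise (-1).
print(labels)
```

If the assumption holds, the equivalent call with this package would be dbscan(embedding, eps=0.5, min_samples=5), using the signature documented above.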

topologic.embedding.clustering.gaussian_mixture_model(embedding: numpy.ndarray, num_clusters: int = 1, seed: int = None) → numpy.ndarray

Performs Gaussian mixture model clustering on the embedding.

Parameters

embedding (numpy.ndarray) – An n x d feature matrix; it is assumed that the d features are ordered
num_clusters (int) – The number of clusters to fit. Default 1
seed (Optional[int]) – The seed for numpy random, default None

Returns

The cluster labels for each vector in the given embedding. The vector at index n in the embedding will have the label at index n in this returned array.

Return type

numpy.ndarray
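As a rough illustration of what num_clusters and seed control, the sketch below fits a two-component mixture with scikit-learn's GaussianMixture. That this function is built on that estimator, and that num_clusters and seed map to n_components and random_state, are assumptions rather than something this page states. The data and the fit_labels helper are hypothetical.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two well-separated blobs of 20 points each (toy data).
rng = np.random.default_rng(0)
embedding = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(20, 2)),
    rng.normal(loc=5.0, scale=0.1, size=(20, 2)),
])


def fit_labels(seed):
    """Hypothetical helper: fit a 2-component mixture and return labels."""
    model = GaussianMixture(n_components=2, random_state=seed)
    return model.fit_predict(embedding)


# The same seed gives the same labels on repeated runs.
first = fit_labels(seed=42)
second = fit_labels(seed=42)
print(np.array_equal(first, second))
```

Which blob ends up with label 0 and which with label 1 is arbitrary, so compare labelings by their grouping rather than by the label values.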

topologic.embedding.clustering.kmeans(embedding: numpy.ndarray, n_clusters: int = 1, init: Union[str, numpy.ndarray] = 'k-means++', n_init: int = 10, max_iter: int = 300, tolerance: float = 0.0001, precompute_distances='auto', verbose: int = 0, random_state: int = None, copy_x: bool = True, n_jobs: int = None, algorithm: str = 'auto') → numpy.ndarray

Performs kmeans clustering on the embedding.

Parameters

embedding (numpy.ndarray) – An n x d array of vectors representing n labels in a d-dimensional space
n_clusters (int) – The number of clusters to form as well as the number of centroids to generate. Default 1
init (Union[str, numpy.ndarray]) –
Method for initialization, defaults to ‘k-means++’:

’k-means++’ : selects initial cluster centers for k-means clustering in a smart way to speed up convergence.

’random’: choose k observations (rows) at random from data for the initial centroids.

If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.
n_init (int) – Number of times the k-means algorithm will be run with different centroid seeds. The final result will be the best output of n_init consecutive runs in terms of inertia. Default 10
max_iter (int) – Maximum number of iterations of the k-means algorithm for a single run. Default 300
tolerance (float) – Relative tolerance with regard to inertia to declare convergence. Default 1e-4
precompute_distances (Union[bool, str]) –
Precompute distances (faster but takes more memory).

’auto’ : do not precompute distances if n_samples * n_clusters > 12 million. This corresponds to about 100MB overhead per job using double precision.

True : always precompute distances

False : never precompute distances
verbose (int) – Verbosity mode. Default 0
random_state (Optional[Union[int, numpy.random.RandomState]]) – Determines random number generation for centroid initialization. Use an int to make the randomness deterministic.
copy_x (Optional[bool]) – When pre-computing distances it is more numerically accurate to center the data first. If copy_x is True (default), the original data is not modified, and the data is guaranteed to be C-contiguous. If False, the original data is modified and restored before the function returns, but subtracting and then adding back the data mean may introduce small numerical differences. In that case the data is also not guaranteed to be C-contiguous, which may cause a significant slowdown.
n_jobs (Optional[int]) –
The number of jobs to use for the computation. This works by computing each of the n_init runs in parallel.

None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.
algorithm (str) – K-means algorithm to use. The classical EM-style algorithm is “full”. The “elkan” variation is more efficient by using the triangle inequality, but currently doesn’t support sparse data. “auto” chooses “elkan” for dense data and “full” for sparse data.

Returns

The cluster labels for each vector in the given embedding. The vector at index n in the embedding will have the label at index n in this returned array.

Return type

numpy.ndarray
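The parameters above correspond closely to scikit-learn's KMeans. The sketch below assumes that correspondence (including that tolerance maps to sklearn's tol, which this page does not confirm) and calls the estimator directly on made-up data. precompute_distances and n_jobs are left out because recent scikit-learn releases no longer accept them on KMeans.

```python
import numpy as np
from sklearn.cluster import KMeans

# Three well-separated blobs of 15 points each (toy data).
rng = np.random.default_rng(0)
embedding = np.vstack([
    rng.normal(0.0, 0.1, size=(15, 2)),
    rng.normal(3.0, 0.1, size=(15, 2)),
    rng.normal([0.0, 3.0], 0.1, size=(15, 2)),
])

model = KMeans(
    n_clusters=3,
    init="k-means++",  # default initialization described above
    n_init=10,         # run 10 times and keep the lowest-inertia result
    max_iter=300,
    tol=1e-4,          # assumed counterpart of `tolerance`
    random_state=0,    # an int makes the run deterministic
)
labels = model.fit_predict(embedding)

# Inertia is the sum of squared distances to the nearest centroid;
# it is the score n_init uses to pick the best run.
print(labels.shape, model.inertia_)
```

The label numbers themselves carry no meaning; only the grouping of rows does.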

topologic.embedding.clustering.wards_clustering(embedding: numpy.ndarray, num_clusters: int = 2, affinity: str = 'euclidean', memory: str = None, connectivity: numpy.ndarray = None, compute_full_tree: str = 'auto') → numpy.ndarray

Uses agglomerative clustering with Ward linkage.

Recursively merges the pair of clusters that minimally increases a given linkage distance.

Parameters

embedding (numpy.ndarray) – An n x d array of vectors representing n labels in a d-dimensional space
num_clusters (int) – The number of clusters to find. Default 2
affinity (str) – Metric used to compute the linkage. Can be “euclidean”, “l1”, “l2”, “manhattan”, “cosine”, or “precomputed”. If linkage is “ward”, as it is for this function, only “euclidean” is accepted. Default “euclidean”
memory (Optional[Union[str, joblib.Memory]]) – Used to cache the output of the computation of the tree. Accepts None, a string, or an object with the joblib.Memory interface. By default, no caching is done. If a string is given, it is the path to the caching directory.
connectivity (numpy.ndarray) – Connectivity matrix. Defines for each sample the neighboring samples following a given structure of the data. This can be a connectivity matrix itself or a callable that transforms the data into a connectivity matrix, such as one derived from kneighbors_graph. Default is None, i.e., the hierarchical clustering algorithm is unstructured.
compute_full_tree (Optional[str]) – A bool or ‘auto’. Controls whether construction of the tree stops early at num_clusters. Stopping early is useful to decrease computation time if the number of clusters is not small compared to the number of samples. This option is useful only when specifying a connectivity matrix. Note also that when varying the number of clusters and using caching, it may be advantageous to compute the full tree. Default ‘auto’

Returns

The cluster labels for each vector in the given embedding. The vector at index n in the embedding will have the label at index n in this returned array.

Return type

numpy.ndarray
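The description and parameters above match scikit-learn's AgglomerativeClustering with Ward linkage. Assuming this function delegates to that estimator (an assumption, not stated on this page), a minimal sketch of the merging behavior on made-up data looks like this. The affinity argument is left out because Ward linkage only works with Euclidean distance.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Two small, well-separated groups of three points each (toy data).
embedding = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],
])

# Ward linkage repeatedly merges the pair of clusters whose merge
# increases the total within-cluster variance the least, stopping
# once n_clusters clusters remain.
model = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = model.fit_predict(embedding)

# The first three rows share one label and the last three share the other.
print(labels)
```

Under the same assumption, the call with this package would be wards_clustering(embedding, num_clusters=2), following the signature documented above.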