topologic.embedding package

topologic.embedding.adjacency_embedding(graph: networkx.classes.graph.Graph, maximum_dimensions: int = 100, elbow_cut: Optional[int] = 1, weight_column: str = 'weight', svd_seed: Optional[int] = None, num_iterations: int = 5, power_iteration_normalizer: str = 'QR', num_oversamples: int = 10) → topologic.embedding.embedding_container.EmbeddingContainer[source]

Generates a spectral embedding based upon the adjacency matrix of the graph.

See also: https://csustan.csustan.edu/~tom/Clustering/GraphLaplacian-tutorial.pdf

Parameters:
  • graph (networkx.Graph) – A networkx Graph object containing no more than one connected component. Note that if the graph is a directed graph, the resulting dimensionality of the embedding will be twice that of an undirected graph.
  • maximum_dimensions (int) – Maximum dimensions of embeddings that will be returned - defaults to 100. Actual dimensions of resulting embeddings should be significantly smaller, but will never be over this value.
  • elbow_cut (Optional[int]) – Scree plot elbow detection usually detects many elbows. This value specifies which elbow to use prior to filtering out extraneous dimensions. If None, an embedding of size maximum_dimensions will be returned.
  • weight_column (str) – The weight column to use in the Graph.
  • svd_seed (Optional[int]) – If not provided, a random seed is used each time, making consistent results difficult. Set this to an int if you want consistency between executions over the same graph.
  • num_iterations (int) – The number of iterations to be used in the svd solver.
  • num_oversamples (int) – Additional number of random vectors to sample the range of M so as to ensure proper conditioning. The total number of random vectors used to find the range of M is n_components + n_oversamples. A smaller number can improve speed but can negatively impact the quality of the approximation of singular vectors and singular values.
  • power_iteration_normalizer (Optional[str]) –

    Whether the power iterations are normalized with step-by-step QR factorization (‘QR’, the slowest but most accurate), not normalized at all (‘none’, the fastest, but numerically unstable when num_iterations is large, e.g. 5 or larger), or normalized with LU factorization (‘LU’, numerically stable but slightly less accurate). The ‘auto’ mode applies no normalization if num_iterations <= 2 and switches to LU otherwise.

    Options: ‘auto’, ‘QR’ (default), ‘LU’, ‘none’

Returns:

An EmbeddingContainer containing a matrix, which itself contains the embedding for each node. The tuple also contains a vector of the corresponding vertex labels for each row in the matrix. The matrix and vector are positionally correlated.

Return type:

EmbeddingContainer
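
Example: a minimal usage sketch (illustrative, not from the library's documentation; the karate club graph and unit edge weights are assumptions, any weighted networkx Graph with a single connected component will do):

import networkx as nx
from topologic.embedding import adjacency_embedding

graph = nx.karate_club_graph()  # illustrative single-component graph
for _, _, attributes in graph.edges(data=True):
    attributes["weight"] = 1.0  # the default weight_column is 'weight'

container = adjacency_embedding(graph, svd_seed=1234)
print(container.embedding.shape)    # (number of nodes, selected dimensions)
print(container.vertex_labels[:5])  # labels aligned with the matrix rows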

class topologic.embedding.EmbeddingContainer(embedding, vertex_labels)[source]

Bases: tuple

embedding

Alias for field number 0

to_dictionary()[source]
vertex_labels

Alias for field number 1
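
Example: EmbeddingContainer behaves like a named tuple, so fields are available by name, by index, and through unpacking (the values below are illustrative):

import numpy as np
from topologic.embedding import EmbeddingContainer

container = EmbeddingContainer(
    embedding=np.array([[0.1, 0.2], [0.3, 0.4]]),
    vertex_labels=["a", "b"],
)
matrix, labels = container                # tuple unpacking
assert matrix is container.embedding      # field number 0
assert labels is container.vertex_labels  # field number 1
mapping = container.to_dictionary()       # presumably keyed by vertex label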

class topologic.embedding.EmbeddingMethod[source]

Bases: enum.Enum

An enum to represent which embedding method to use when generating an Omnibus embedding.

ADJACENCY_SPECTRAL_EMBEDDING = 0
LAPLACIAN_SPECTRAL_EMBEDDING = 1
topologic.embedding.find_elbows(iterable_to_search: Union[list, numpy.array], num_elbows: int = 1, threshold: int = 0) → numpy.array[source]

An implementation of profile likelihood as outlined in Zhu and Ghodsi.

Reference: Zhu, Mu and Ghodsi, Ali (2006), “Automatic dimensionality selection from the scree plot via the use of profile likelihood”, Computational Statistics & Data Analysis, Volume 51, Issue 2, pp. 918-930, November 2006.

Examples:
>>> input_data = [2, 3, 4, 5, 6, 7, 8, 9]
>>> result: np.array = find_elbows(input_data, num_elbows=1, threshold=0)
>>> result.size
1
Parameters:
  • iterable_to_search – An ordered or unordered list of values that will be used to find the elbows.
  • num_elbows – The number of elbows to return
  • threshold – Smallest value to consider. Nonzero thresholds will affect elbow selection.
Returns:

A numpy array containing elbows

topologic.embedding.generate_omnibus_matrix(matrices: List[Union[numpy.ndarray, scipy.sparse.csr.csr_matrix]]) → numpy.ndarray[source]

Generate the omnibus matrix from a list of adjacency or Laplacian matrices, as described in ‘A central limit theorem for an omnibus embedding of random dot product graphs.’

Given an iterable of matrices a, b, ..., n, the omnibus matrix is defined as:

[[           a, .5 * (a + b), ..., .5 * (a + n)],
 [.5 * (b + a),            b, ..., .5 * (b + n)],
 [         ...,          ..., ...,          ...],
 [.5 * (n + a), .5 * (n + b), ...,            n]]

The current iteration of this function operates in O(n) but a further optimization could take it to O(.5 * n)

See also:
The original paper - https://arxiv.org/abs/1705.09355
Parameters:
  • matrices (List[Union[numpy.ndarray, scipy.sparse.csr_matrix]]) – The list of matrices from which to generate the Omnibus matrix

Returns:

An Omnibus matrix
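
Example: a small sketch of the omnibus block structure for two 2x2 matrices (the inputs are illustrative):

import numpy as np
from topologic.embedding import generate_omnibus_matrix

a = np.array([[0.0, 1.0], [1.0, 0.0]])
b = np.array([[0.0, 2.0], [2.0, 0.0]])

omni = generate_omnibus_matrix([a, b])
print(omni.shape)  # (4, 4): blocks [[a, .5 * (a + b)], [.5 * (b + a), b]]
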
topologic.embedding.laplacian_embedding(graph: networkx.classes.graph.Graph, maximum_dimensions: int = 100, elbow_cut: Optional[int] = 1, weight_column: str = 'weight', svd_seed: Optional[int] = None, num_iterations: int = 5, power_iteration_normalizer: str = 'QR', num_oversamples: int = 10) → topologic.embedding.embedding_container.EmbeddingContainer[source]

Generates a spectral embedding based upon the Laplacian matrix of the graph.

See also: https://csustan.csustan.edu/~tom/Clustering/GraphLaplacian-tutorial.pdf

Parameters:
  • graph (networkx.Graph) – A networkx Graph object containing no more than one connected component. Note that if the graph is a directed graph, the resulting dimensionality of the embedding will be twice that of an undirected graph
  • maximum_dimensions (int) – Maximum dimensions of embeddings that will be returned - defaults to 100. Actual dimensions of resulting embeddings should be significantly smaller, but will never be over this value.
  • elbow_cut (Optional[int]) – Scree plot elbow detection usually detects many elbows. This value specifies which elbow to use prior to filtering out extraneous dimensions. If None, an embedding of size maximum_dimensions will be returned.
  • weight_column (str) – The weight column to use in the Graph.
  • svd_seed (Optional[int]) – If not provided, a random seed is used each time, making consistent results difficult. Set this to an int if you want consistency between executions over the same graph.
  • num_iterations (int) – The number of iterations to be used in the svd solver.
  • num_oversamples (int) – Additional number of random vectors to sample the range of M so as to ensure proper conditioning. The total number of random vectors used to find the range of M is n_components + n_oversamples. A smaller number can improve speed but can negatively impact the quality of the approximation of singular vectors and singular values.
  • power_iteration_normalizer (Optional[str]) –

    Whether the power iterations are normalized with step-by-step QR factorization (‘QR’, the slowest but most accurate), not normalized at all (‘none’, the fastest, but numerically unstable when num_iterations is large, e.g. 5 or larger), or normalized with LU factorization (‘LU’, numerically stable but slightly less accurate). The ‘auto’ mode applies no normalization if num_iterations <= 2 and switches to LU otherwise.

    Options: ‘auto’, ‘QR’ (default), ‘LU’, ‘none’

Returns:

An EmbeddingContainer containing a matrix, which itself contains the embedding for each node. The tuple also contains a vector of the corresponding vertex labels for each row in the matrix. The matrix and vector are positionally correlated.

Return type:

EmbeddingContainer
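
Example: a sketch showing that a fixed svd_seed makes runs over the same graph consistent, as noted above (the Florentine families graph and unit weights are illustrative assumptions):

import networkx as nx
import numpy as np
from topologic.embedding import laplacian_embedding

graph = nx.florentine_families_graph()  # small single-component graph
for _, _, attributes in graph.edges(data=True):
    attributes["weight"] = 1.0

first = laplacian_embedding(graph, svd_seed=42)
second = laplacian_embedding(graph, svd_seed=42)
assert np.allclose(first.embedding, second.embedding)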

topologic.embedding.node2vec_embedding(graph: networkx.classes.graph.Graph, num_walks: int = 10, walk_length: int = 80, return_hyperparameter: int = 1, inout_hyperparameter: int = 1, dimensions: int = 128, window_size: int = 10, workers: int = 8, iterations: int = 1, interpolate_walk_lengths_by_node_degree: bool = True) → topologic.embedding.embedding_container.EmbeddingContainer[source]

Generates a node2vec embedding from a given graph. It follows the word2vec algorithm to create the embedding.

Parameters:
  • graph (networkx.Graph) – A networkx graph. If the graph is unweighted, the weight of each edge will default to 1
  • num_walks (int) – Number of walks per source. Default is 10.
  • walk_length (int) – Length of walk per source. Default is 80.
  • return_hyperparameter (int) – Return hyperparameter (p). Default is 1.
  • inout_hyperparameter (int) – Inout hyperparameter (q). Default is 1.
  • dimensions (int) – Dimensionality of the word vectors. Default is 128.
  • window_size (int) – Maximum distance between the current and predicted word within a sentence. Default is 10.
  • workers (int) – Use this many worker threads to train the model. Default is 8.
  • iterations (int) – Number of epochs in stochastic gradient descent (SGD). Default is 1.
  • interpolate_walk_lengths_by_node_degree (bool) –

    Use a dynamic walk length that corresponds to each node's degree. If the node is in the bottom 20th percentile, use a walk length of 1. If it is in the top 10th percentile, use walk_length. If it is in the 20th-80th percentile range, linearly interpolate between 1 and walk_length.

    This keeps lower-degree nodes from biasing the resulting embedding. If a low-degree node takes the same number of walks as a high-degree node (which it will if this setting is off), its walks cover a smaller breadth of the graph, and those narrow walks end up dominating the walks from higher-degree nodes.

Returns:

A tuple containing a matrix, which itself contains the embedding for each node. The tuple also contains a vector of the corresponding vertex labels for each row in the matrix. The matrix and vector are positionally correlated.

Return type:

EmbeddingContainer
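
Example: a minimal node2vec sketch (the barbell graph and parameter values are illustrative; unweighted edges default to weight 1, as documented above):

import networkx as nx
from topologic.embedding import node2vec_embedding

graph = nx.barbell_graph(10, 2)  # illustrative unweighted graph

container = node2vec_embedding(
    graph,
    dimensions=64,
    num_walks=10,
    walk_length=40,
    workers=4,
)
print(container.embedding.shape)  # (number of nodes, 64)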

topologic.embedding.omnibus_embedding(graphs: List[networkx.classes.graph.Graph], maximum_dimensions: int = 100, elbow_cut: Optional[int] = 1, embedding_method: topologic.embedding.embedding_methods.EmbeddingMethod = <EmbeddingMethod.LAPLACIAN_SPECTRAL_EMBEDDING: 1>, svd_seed: Optional[int] = None, num_iterations: int = 5, power_iteration_normalizer: str = 'QR', num_oversamples: int = 10) → List[Tuple[topologic.embedding.embedding_container.EmbeddingContainer, topologic.embedding.embedding_container.EmbeddingContainer]][source]

Generates a pairwise omnibus embedding for each pair of graphs in a list of graphs. If given graphs A, B, and C, embeddings will be computed for the pairs (A, B) and (B, C).

There should be exactly the same number of nodes in each graph with exactly the same labels. The list of graphs should represent a time series and should be in an order such that time is continuous through the list of graphs.

If the labels differ between each pair of graphs, then those nodes will _not_ be found in the resulting embedding.

Parameters:
  • graphs (List[networkx.Graph]) – A list of graphs that will be used to generate the omnibus embedding. Each graph should have exactly the same vertices as each of the other graphs. The order of the graphs in the list matters. The first graph will be at time 0 and each following graph will increment time by 1.
  • maximum_dimensions (int) – Maximum dimensions of embeddings that will be returned - defaults to 100. Actual dimensions of resulting embeddings should be significantly smaller, but will never be over this value.
  • elbow_cut (int) – Scree plot elbow detection usually detects many elbows. This value specifies which elbow to use prior to filtering out extraneous dimensions.
  • embedding_method (topologic.embedding.EmbeddingMethod) – The embedding technique used to generate the Omnibus embedding.
  • svd_seed (Optional[int]) – If not provided, a random seed is used each time, making consistent results difficult. Set this to an int if you want consistency between executions over the same graph.
  • num_iterations (int) – The number of iterations to be used in the svd solver.
  • num_oversamples (int) – Additional number of random vectors to sample the range of M so as to ensure proper conditioning. The total number of random vectors used to find the range of M is n_components + n_oversamples. A smaller number can improve speed but can negatively impact the quality of the approximation of singular vectors and singular values.
  • power_iteration_normalizer (Optional[str]) –

    Whether the power iterations are normalized with step-by-step QR factorization (‘QR’, the slowest but most accurate), not normalized at all (‘none’, the fastest, but numerically unstable when num_iterations is large, e.g. 5 or larger), or normalized with LU factorization (‘LU’, numerically stable but slightly less accurate). The ‘auto’ mode applies no normalization if num_iterations <= 2 and switches to LU otherwise.

    Options: ‘auto’, ‘QR’ (default), ‘LU’, ‘none’

Returns:

A list of EmbeddingContainer pairs. Each container holds a matrix, which itself contains the embedding for each node, and a vector of the corresponding vertex labels for each row in the matrix. The matrix and vector are positionally correlated.

Return type:

List[(EmbeddingContainer, EmbeddingContainer)]
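
Example: a sketch over a three-snapshot time series (the random graphs are illustrative; real inputs must share the same vertex labels, as described above):

import networkx as nx
from topologic.embedding import EmbeddingMethod, omnibus_embedding

graphs = []
for seed in (0, 1, 2):  # times 0, 1, and 2 over the same vertex set
    graph = nx.gnp_random_graph(50, 0.2, seed=seed)
    for _, _, attributes in graph.edges(data=True):
        attributes["weight"] = 1.0
    graphs.append(graph)

# One (earlier, later) pair of containers per consecutive pair of graphs.
pairs = omnibus_embedding(
    graphs,
    embedding_method=EmbeddingMethod.ADJACENCY_SPECTRAL_EMBEDDING,
    svd_seed=1234,
)
for earlier, later in pairs:
    print(earlier.embedding.shape, later.embedding.shape)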

class topologic.embedding.OutOfSampleEmbeddingContainer(embedding, vertex_labels, vertex_labels_failing_inference, starting_index_of_oos_embedding, u, sigma)[source]

Bases: tuple

embedding

Alias for field number 0

sigma

Alias for field number 5

starting_index_of_oos_embedding

Alias for field number 3

to_dictionary()[source]
u

Alias for field number 4

vertex_labels

Alias for field number 1

vertex_labels_failing_inference

Alias for field number 2

topologic.embedding.pca(embedding: numpy.ndarray, num_components: Union[str, int], whiten: bool = False, svd_solver: str = 'auto', tolerance: float = 0.0, iterated_power: Union[int, str] = 'auto', random_state: Union[int, numpy.random.mtrand.RandomState, None] = None) → numpy.ndarray[source]

Principal component analysis (PCA)

Linear dimensionality reduction using Singular Value Decomposition of the data to project it to a lower dimensional space.

It uses the LAPACK implementation of the full SVD or a randomized truncated SVD by the method of Halko et al. 2009, depending on the shape of the input data and the number of components to extract.

Parameters:
  • embedding (numpy.ndarray) – The embedding to which PCA will be applied
  • num_components (Union[str, int]) –

    If num_components == 'mle' and svd_solver == 'full', Minka’s MLE is used to guess the dimension. Use of num_components == 'mle' will interpret svd_solver == 'auto' as svd_solver == 'full'.

    If 0 < num_components < 1 and svd_solver == 'full', select the number of components such that the amount of variance that needs to be explained is greater than the percentage specified by num_components.

    If svd_solver == 'arpack', the number of components must be strictly less than the minimum of number of features and the number of samples.

  • whiten (bool) –

    When True (False by default) the components_ vectors are multiplied by the square root of n_samples and then divided by the singular values to ensure uncorrelated outputs with unit component-wise variances.

    Whitening will remove some information from the transformed signal (the relative variance scales of the components) but can sometimes improve the predictive accuracy of the downstream estimators by making their data respect some hard-wired assumptions.

  • svd_solver (str) –

    auto: The solver is selected by a default policy based on X.shape and num_components: if the input data is larger than 500x500 and the number of components to extract is lower than 80% of the smallest dimension of the data, then the more efficient ‘randomized’ method is enabled. Otherwise the exact full SVD is computed and optionally truncated afterwards.

    full: Run exact full SVD calling the standard LAPACK solver via scipy.linalg.svd and select the components by postprocessing.

    arpack: Run SVD truncated to num_components calling the ARPACK solver via scipy.sparse.linalg.svds. It requires strictly 0 < num_components < min(X.shape).

    randomized: Run randomized SVD by the method of Halko et al.
  • tolerance (float) – Tolerance for singular values computed by svd_solver == ‘arpack’. A float value >=0 with default 0
  • iterated_power (Union[int, str]) – Number of iterations for the power method computed by svd_solver == ‘randomized’.
  • random_state (Optional[int]) – If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random. Used when svd_solver == ‘arpack’ or ‘randomized’.
Returns:

A np.ndarray of principal axes in feature space, representing the directions of maximum variance in the data. The components are sorted by variance.

Return type:

numpy.ndarray
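
Example: a sketch applying pca to an embedding matrix (the random input stands in for the .embedding field of an EmbeddingContainer; shapes are illustrative):

import numpy as np
from topologic.embedding import pca

rng = np.random.RandomState(0)
embedding = rng.normal(size=(100, 32))

reduced = pca(embedding, num_components=8, random_state=0)
print(reduced.shape)  # dimensionality reduced to 8 components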

topologic.embedding.sample_graph_by_edge_weight(graph, weight_column='weight', weight_cutoff=None, percentage=0.1, nodelist=None)[source]
topologic.embedding.sample_graph_by_vertex_degree(graph, degree_cutoff=None, percentage=0.1, nodelist=None)[source]
class topologic.embedding.SampleMethod[source]

Bases: enum.Enum

An enumeration.

EDGE_WEIGHT = 1
VERTEX_DEGREE = 0
topologic.embedding.tsne(embedding: numpy.ndarray, num_components: int = 2, perplexity: float = 30.0, early_exaggeration: float = 12.0, learning_rate: float = 200.0, num_iterations: int = 1000, num_iterations_without_progress: int = 300, min_grad_norm: float = 1e-07, metric: str = 'euclidean', init: str = 'random', verbose: int = 1, random_state: Union[int, numpy.random.mtrand.RandomState, None] = None, method: str = 'barnes_hut', angle: float = 0.5) → numpy.ndarray[source]

t-distributed Stochastic Neighbor Embedding.

t-SNE is a tool to visualize high-dimensional data. It converts similarities between data points to joint probabilities and tries to minimize the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and the high-dimensional data. t-SNE has a cost function that is not convex, i.e. with different initializations we can get different results.

It is highly recommended to use another dimensionality reduction method (e.g. PCA for dense data or TruncatedSVD for sparse data) to reduce the number of dimensions to a reasonable amount (e.g. 50) if the number of features is very high. This will suppress some noise and speed up the computation of pairwise distances between samples.

Parameters:
  • embedding (numpy.ndarray) – The embedding to which t-SNE will be applied
  • num_components (int) – Dimension of the embedded space. Default 2
  • perplexity (float) – The perplexity is related to the number of nearest neighbors that is used in other manifold learning algorithms. Larger datasets usually require a larger perplexity. Consider selecting a value between 5 and 50. The choice is not extremely critical since t-SNE is quite insensitive to this parameter. Default 30.0
  • early_exaggeration (float) – Controls how tight natural clusters in the original space are in the embedded space and how much space will be between them. For larger values, the space between natural clusters will be larger in the embedded space. Again, the choice of this parameter is not very critical. If the cost function increases during initial optimization, the early exaggeration factor or the learning rate might be too high. Default 12.0
  • learning_rate (float) – The learning rate for t-SNE is usually in the range [10.0, 1000.0]. If the learning rate is too high, the data may look like a ‘ball’ with any point approximately equidistant from its nearest neighbours. If the learning rate is too low, most points may look compressed in a dense cloud with few outliers. If the cost function gets stuck in a bad local minimum increasing the learning rate may help. Default 200.0
  • num_iterations (int) – Maximum number of iterations for the optimization. Should be at least 250. Default 1000
  • num_iterations_without_progress (int) – Maximum number of iterations without progress before we abort the optimization, used after 250 initial iterations with early exaggeration. Note that progress is only checked every 50 iterations so this value is rounded to the next multiple of 50. Default 300
  • min_grad_norm (float) – If the gradient norm is below this threshold, the optimization will be stopped. Default 1e-7
  • metric (Union[str, Callable]) – The metric to use when calculating distance between instances in a feature array. If metric is a string, it must be one of the options allowed by scipy.spatial.distance.pdist for its metric parameter, or a metric listed in pairwise.PAIRWISE_DISTANCE_FUNCTIONS. If metric is “precomputed”, X is assumed to be a distance matrix. Alternatively, if metric is a callable function, it is called on each pair of instances (rows) and the resulting value recorded. The callable should take two arrays from X as input and return a value indicating the distance between them. The default is “euclidean” which is interpreted as squared euclidean distance. Default ‘euclidean’
  • init (Union[string, numpy.ndarray]) – Initialization of embedding. Possible options are ‘random’, ‘pca’, and a numpy array of shape (n_samples, num_components). PCA initialization cannot be used with precomputed distances and is usually more globally stable than random initialization. Default ‘random’
  • verbose (int) – Verbosity level. Default 1
  • random_state (Optional[Union[int, numpy.random.RandomState]]) – If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random. Note that different initializations might result in different local minima of the cost function.
  • method (str) – By default the gradient calculation algorithm uses Barnes-Hut approximation running in O(N log N) time. method=’exact’ will run the slower, but exact, algorithm in O(N^2) time. The exact algorithm should be used when nearest-neighbor errors need to be better than 3%. However, the exact method cannot scale to millions of examples. Default ‘barnes_hut’
  • angle (float) – Only used if method=’barnes_hut’. This is the trade-off between speed and accuracy for Barnes-Hut t-SNE. ‘angle’ is the angular size (referred to as theta in [3]) of a distant node as measured from a point. If this size is below ‘angle’ then it is used as a summary node of all points contained within it. This method is not very sensitive to changes in this parameter in the range of 0.2 - 0.8. Angle less than 0.2 has quickly increasing computation time and angle greater than 0.8 has quickly increasing error. Default 0.5
Returns:

A np.ndarray containing the embedded points in the low-dimensional space, with one row per input sample and num_components columns.

Return type:

numpy.ndarray
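
Example: a sketch projecting a high-dimensional embedding to 2 dimensions for plotting (the random input is illustrative; PCA-reduce very high-dimensional inputs first, as recommended above):

import numpy as np
from topologic.embedding import tsne

rng = np.random.RandomState(0)
high_dimensional = rng.normal(size=(200, 50))

points_2d = tsne(high_dimensional, num_components=2, random_state=0)
print(points_2d.shape)  # (200, 2)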