topologic.io package

topologic.io.consolidate_bipartite(csv_dataset: topologic.io.datasets.CsvDataset, vertex_column_index: int, pivot_column_index: int) → networkx.classes.graph.Graph[source]
class topologic.io.CsvDataset(source_iterator: Union[TextIO, Iterator[str]], has_headers: Optional[bool] = None, dialect: Union[str, csv.Dialect, None] = None, use_headers: Optional[List[str]] = None, sample_size: int = 50)[source]

Bases: object

FIELD_SIZE_LIMIT = 2147483647
dialect() → Union[_csv.Dialect, csv.Dialect][source]

Note: return type information is broken due to typeshed issues with the csv module.

Returns

Dialect used within this CsvDataset for the csv.reader.

Return type

Union[_csv.Dialect, csv.Dialect]

headers() → List[str][source]
Returns

Returns a copy of the headers.

Return type

List[str]

reader() → Iterator[List[str]][source]
Returns

Returns a properly configured csv reader for a given dialect

Return type

Iterator[List[str]]
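Conceptually, CsvDataset wraps Python's standard csv machinery: it samples the stream, sniffs a dialect and header row, and exposes a configured reader. A dependency-free sketch of that behavior (the helper name `sketch_csv_dataset` is hypothetical, not part of topologic):

```python
import csv
import io
from itertools import chain
from typing import Iterator, List, Tuple

def sketch_csv_dataset(source: io.TextIOBase, sample_size: int = 50) -> Tuple[List[str], Iterator[List[str]]]:
    """Illustrative stand-in for CsvDataset: sniff dialect and headers
    from a bounded sample, then expose a csv.reader over the full stream."""
    sample_lines = [line for _, line in zip(range(sample_size), source)]
    sample = "".join(sample_lines)
    sniffer = csv.Sniffer()
    dialect = sniffer.sniff(sample)
    has_headers = sniffer.has_header(sample)
    # Re-chain the sampled lines so sniffing does not consume the iterator.
    rows = csv.reader(chain(sample_lines, source), dialect=dialect)
    headers = next(rows) if has_headers else []
    return headers, rows

headers, rows = sketch_csv_dataset(io.StringIO("name,age\nalice,30\nbob,25\n"))
```

Note how the sample is re-chained onto the remaining stream, mirroring the documented guarantee that sampling does not advance the underlying iterator.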

topologic.io.find_edges(csv_dataset: topologic.io.datasets.CsvDataset, common_values_count: int = 20, rare_values_count: int = 20)[source]
topologic.io.from_dataset(csv_dataset: topologic.io.datasets.CsvDataset, projection_function_generator: Callable[[networkx.classes.graph.Graph], Callable[[List[str]], None]], graph: Optional[networkx.classes.graph.Graph] = None) → networkx.classes.graph.Graph[source]

Load a graph from a source csv

The most important part of this function is selecting the appropriate projection function generator. Each generator produces a function which, given the target nx.Graph, returns the actual function we use to project the source CsvDataset, row by row, into that graph.

The provided projection function generators fall into 3 groups:

  • edges we don’t want any metadata for (note that there is no vertex version of this - if you don’t want vertex metadata, don’t provide a vertex_csv_dataset or function!)

  • edges or vertices we want metadata for, but the file is ordered sequentially and we only want the last metadata to be available in the graph

  • edges or vertices we want metadata for, and we wish to keep track of every record of metadata for the edge or vertex in a list of metadata dictionaries

You can certainly provide your own projection function generators for specialized needs; just ensure they follow the type signature of Callable[[nx.Graph], Callable[[List[str]], None]]
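The closure pattern behind that type signature can be sketched as follows. To keep the example dependency-free, a minimal dict-based `TinyGraph` stands in for nx.Graph, and `weighted_edge_projection` is a hypothetical generator, not one of the generators topologic ships:

```python
from typing import Callable, List

class TinyGraph:
    """Minimal stand-in for nx.Graph, used only to keep this sketch self-contained."""
    def __init__(self):
        self.edges = {}
    def add_edge(self, u, v, **attrs):
        self.edges.setdefault((u, v), {}).update(attrs)

def weighted_edge_projection(
    source_idx: int, target_idx: int, weight_idx: int
) -> Callable[[TinyGraph], Callable[[List[str]], None]]:
    """Hypothetical projection function generator following the
    Callable[[nx.Graph], Callable[[List[str]], None]] contract."""
    def generator(graph: TinyGraph) -> Callable[[List[str]], None]:
        def project(row: List[str]) -> None:
            graph.add_edge(row[source_idx], row[target_idx],
                           weight=float(row[weight_idx]))
        return project
    return generator

graph = TinyGraph()
project = weighted_edge_projection(0, 1, 2)(graph)
for row in [["a", "b", "2.0"], ["b", "c", "1.5"]]:
    project(row)
```

The outer call binds configuration, the middle call binds the graph, and the innermost function is what from_dataset invokes on each row.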

Parameters
  • csv_dataset (CsvDataset) – the dataset to read from row by row

  • projection_function_generator (Callable[[nx.Graph], Callable[[List[str]], None]]) – The projection function generator function. When called with a nx.Graph, it will return the actual projection function to be used when processing each row of data.

  • graph (nx.Graph) – The graph to populate. If not provided a new one is created of type nx.Graph. Note that from_dataset can be called repeatedly with different edge or vertex csv_dataset files to populate the graph more and more. If you seek to take this approach, ensure you use the same Graph object from the previous calls so that it is continuously populated with the updated data from new files

Returns

the graph object

Return type

nx.Graph

topologic.io.from_file(edge_csv_file: TextIO, source_column_index: int, target_column_index: int, weight_column_index: Optional[int] = None, edge_csv_has_headers: Optional[bool] = None, edge_dialect: Union[str, csv.Dialect, None] = None, edge_csv_use_headers: Optional[List[str]] = None, edge_metadata_behavior: str = 'none', edge_ignored_values: Optional[List[str]] = None, vertex_csv_file: Optional[TextIO] = None, vertex_column_index: Optional[int] = None, vertex_csv_has_headers: Optional[bool] = None, vertex_dialect: Union[str, csv.Dialect, None] = None, vertex_csv_use_headers: Optional[List[str]] = None, vertex_metadata_behavior: str = 'single', vertex_ignored_values: Optional[List[str]] = None, sample_size: int = 50, is_digraph: bool = False) → networkx.classes.graph.Graph[source]

This function weaves a lot of graph materialization code into a single call.

The only required arguments are those necessary for the bare minimum of creating a graph from an edge list. However, it is definitely recommended to specify whether any data files use headers, and which dialect they use; this way we can avoid relying on the csv module's sniffing ability to detect them for us. We only use a modest number of records to discern the likelihood of headers existing or what to use for column separation (tabs or commas? quotes or double quotes?). It is better to specify your own dialect than hope for the best, but the capability exists if you want to throw caution to the wind.
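The sniffing described above is done with the standard library's csv.Sniffer. A small illustration of what detection on a text sample looks like (stdlib only; the sample data is made up):

```python
import csv

# A tab-separated sample with a header row, like the start of an edge file.
sample = "source\ttarget\tweight\na\tb\t1.0\nb\tc\t2.0\n"

sniffer = csv.Sniffer()
dialect = sniffer.sniff(sample)          # infers the delimiter from the sample
has_header = sniffer.has_header(sample)  # guesses whether row one is a header
```

Sniffing works well on clean, regular samples like this one, but it is a heuristic; explicit `edge_csv_has_headers` and `edge_dialect` arguments remove the guesswork.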

The entire vertex metadata portion is optional; if no vertex_csv_file is specified (or it is set to None), no attempt will be made to enrich the graph node metadata. The resulting vertex_metadata_types dictionary in the NamedTuple will be an empty dictionary and can be discarded.

Likewise, if no metadata is requested for projection by the edge projection function, the edge_metadata_types dictionary in the NamedTuple will be an empty dictionary and can be discarded.

Lastly, it is important to note that the options for edge_metadata_behavior can only be the 3 string values specified in the documentation - see the docs for that parameter for details. This is also true for the vertex_metadata_behavior - see the docs for that parameter as well.

Parameters
  • edge_csv_file (typing.TextIO) – A csv file that represents the edges of a graph. This file must contain at minimum two columns: a source column and a target column. It is suggested there also exist a weight column with some form of numeric value (e.g. 30 or 30.0)

  • source_column_index (int) – The column index the source vertex will be in. Columns start at 0.

  • target_column_index (int) – The column index the target vertex will be in. Columns start at 0.

  • weight_column_index (Optional[int]) – The column index the edge weight will be in. Columns start at 0. If no weight_column_index is provided, we use a count of the number of VertexA to VertexB edges that exist and use that as the weight.

  • edge_csv_has_headers (Optional[bool]) – Does the source CSV file contain headers? If so, we will skip the first line. If edge_csv_use_headers is a List[str], we will use those as headers for mapping any metadata. If it is None, we will use the header row as the headers, i.e. edge_csv_use_headers will take precedence over any headers in the source file, if applicable.

  • edge_dialect (Optional[Union[csv.Dialect, str]]) – The dialect to use when parsing the source CSV file. See https://docs.python.org/3/library/csv.html#csv.Dialect for more details. If the value is None, we attempt to use the csv module’s Sniffer class to detect which dialect to use based on a sample of the first 50 lines of the source csv file. String values can be used if you provide the strings “excel”, “excel-tab”, or “unix”

  • edge_csv_use_headers (Optional[List[str]]) – Optional. Headers to use for the edge file either because the source file does not contain them or because you wish to override them with your own in a programmatic fashion.

  • edge_metadata_behavior (str) –

    Dictates what extra data, aside from source, target, and weight, that we use from the provided edge list.

    • ”none” brings along no metadata.

    • ”single” iterates through the file from top to bottom; any edges between VertexA and VertexB that had metadata retained during edge projection will be overwritten with the newest row corresponding with VertexA and VertexB. See also: Clobbering

    • ”collection” iterates through the file from top to bottom; all new metadata detected between VertexA and VertexB is appended to the end of a list. All metadata is kept for all edges unless pruned via normal graph pruning mechanisms.
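The difference between "single" and "collection" can be sketched with plain dictionaries (illustrative only; topologic actually stores this metadata on the graph's edge attributes):

```python
# Two metadata rows for the same (a, b) edge, in file order.
rows = [
    ("a", "b", {"type": "email"}),
    ("a", "b", {"type": "phone"}),
]

# "single": later rows clobber earlier metadata for the same edge.
single = {}
for u, v, meta in rows:
    single[(u, v)] = meta

# "collection": every row's metadata is appended to a per-edge list.
collection = {}
for u, v, meta in rows:
    collection.setdefault((u, v), []).append(meta)
```

With "single" only the last row survives; with "collection" every row is retained in order.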

  • edge_ignored_values (List[str]) – Optional. A list of strings to reject retention of during projection, e.g. “NULL” or “N/A” or “NONE”. Any attribute value found to be one of these words will be ignored.

  • vertex_csv_file (Optional[typing.TextIO]) –

    A csv file that represents the vertices of a graph. This file should contain a column whose values correspond with the vertex ID in either the source or target column in the edges file. If no edge exists for a vertex, no metadata is retained.

    Note: If vertex_csv_file is None or not provided, none of the vertex_* arguments will be used.

  • vertex_column_index (Optional[int]) – The column index the vertex id will be in. Columns start at 0. See note on vertex_csv_file.

  • vertex_csv_has_headers (Optional[bool]) – Does the source CSV file contain headers? If so, we will skip the first line. If vertex_csv_use_headers is a List[str], we will use those as headers for mapping any metadata. If it is None, we will use the header row as the headers, i.e. vertex_csv_use_headers will take precedence over any header in the source file, if applicable.

  • vertex_dialect (Optional[Union[csv.Dialect, str]]) – The dialect to use when parsing the source CSV file. See https://docs.python.org/3/library/csv.html#csv.Dialect for more details. If the value is None, we attempt to use the csv module’s Sniffer class to detect which dialect to use based on a sample of the first 50 lines of the source csv file. String values can be used if you provide the strings “excel”, “excel-tab”, or “unix”.

  • vertex_csv_use_headers (Optional[List[str]]) – Optional. Headers to use for the vertex file either because the source file does not contain them or because you wish to override them with your own in a programmatic fashion. See note on vertex_csv_file.

  • vertex_metadata_behavior (str) –

    Dictates what we do with vertex metadata. Unlike edge metadata, there is no need to provide a vertex_metadata_behavior if you have no vertex metadata you wish to capture. No metadata will be stored for any vertex if it is not detected in the graph already; if there are no edges to or from VertexA, there will be no metadata retained for VertexA.

    • ”single” iterates through the file from top to bottom; any vertex that had already captured metadata through the vertex metadata projection process will be overwritten with the newest metadata corresponding with that vertex.

    • ”collection” iterates through the file from top to bottom; all new metadata detected for a given vertex will be appended to the end of a metadata list.

  • vertex_ignored_values (List[str]) – Optional. A list of strings to reject retention of during projection, e.g. “NULL” or “N/A” or “NONE”. Any attribute value found to be one of these words will be ignored.

  • sample_size (int) – The sample size to extract from the source CSV for use in Sniffing dialect or has_headers. Please note that this sample_size does NOT advance your underlying iterator, nor is there any guarantee that the csv Sniffer class will use every row extracted via sample_size. Setting this above 50 may not have the impact you hope for due to the csv.Sniffer.has_header function - it will use at most 20 rows.

  • is_digraph (bool) – True if the data represents a directed graph, False if undirected. Default is False.

Returns

The populated graph

Return type

nx.Graph

class topologic.io.GraphProperties(column_names, potential_edge_column_pairs, common_column_values, rare_column_values)[source]

Bases: object

column_names()[source]
common_column_values()[source]

Dictionary of column name to set of common values for that column and their counts

potential_edge_column_pairs()[source]
rare_column_values()[source]

Dictionary of column name to set of rare values for that column and their counts

topologic.io.load(edge_file: str, separator: str = 'excel', has_header: bool = True, source_index: int = 0, target_index: int = 1, weight_index: Optional[int] = None) → networkx.classes.graph.Graph[source]

Spartan, on-rails function to load an edge file.

Parameters
  • edge_file (str) – String path to an edge file on the filesystem

  • separator (str) – Valid values are ‘excel’ or ‘excel-tab’.

  • has_header (bool) – True if the edge file has a header line, False if not

  • source_index (int) – The column index for the source vertex (default 0)

  • target_index (int) – The column index for the target vertex (default 1)

  • weight_index (Optional[int]) – The column index for the edge weight (default None). If None, or if there is no column at weight_index, weights per edge are defaulted to 1.

Returns

the loaded graph

Return type

nx.Graph

class topologic.io.PotentialEdgeColumnPair(source, destination, score)[source]

Bases: object

destination()[source]
score()[source]
source()[source]
topologic.io.tensor_projection_reader(embedding_file_path: str, label_file_path: str) → Tuple[numpy.ndarray, List[List[str]]][source]

Reads the embedding and labels stored at the given paths and returns an np.ndarray and list of labels

Parameters
  • embedding_file_path (str) – Path to the embedding file

  • label_file_path (str) – Path to the labels file

Returns

An embedding and list of labels

Return type

(numpy.ndarray, List[List[str]])

topologic.io.tensor_projection_writer(embedding_file_path: str, label_file_path: str, vectors: numpy.ndarray, labels: Union[List[List[str]], List[str]], encoding: str = 'utf-8')[source]

Writes an embedding and labels to a given vector file path and label file path in a form that the Tensorboard embedding projector can read

Parameters
  • embedding_file_path (str) – Path that the embedding file will be written

  • label_file_path (str) – Path that the label file will be written

  • vectors (numpy.ndarray) – An embedding represented as an np.ndarray

  • labels (Union[List[List[str]], List[str]]) – A list of lists where each inner list is the data for a single row in the embedding or a list of strings where each string is the single label for that tensor. If you pass in a List[List[str]] you allow multiple labels for a single tensor.

  • encoding (str) – The encoding used to write the file
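The files involved here are plain tab-separated text, which is the format the Tensorboard embedding projector expects: one tab-separated row per tensor in the vector file, and one label per line in the label file. A minimal stdlib sketch of the round trip (the helper names `write_projection` and `read_projection` are hypothetical, and plain lists stand in for np.ndarray):

```python
import csv
import os
import tempfile
from typing import List, Tuple

def write_projection(vec_path: str, label_path: str,
                     vectors: List[List[float]], labels: List[str]) -> None:
    # One tab-separated row per tensor.
    with open(vec_path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f, delimiter="\t").writerows(vectors)
    # One label per line, aligned with the vector rows.
    with open(label_path, "w", encoding="utf-8") as f:
        f.write("\n".join(labels) + "\n")

def read_projection(vec_path: str, label_path: str) -> Tuple[List[List[float]], List[str]]:
    with open(vec_path, encoding="utf-8") as f:
        vectors = [[float(x) for x in row] for row in csv.reader(f, delimiter="\t")]
    with open(label_path, encoding="utf-8") as f:
        labels = [line.rstrip("\n") for line in f]
    return vectors, labels

tmp = tempfile.mkdtemp()
vec_path = os.path.join(tmp, "vectors.tsv")
label_path = os.path.join(tmp, "labels.tsv")
write_projection(vec_path, label_path, [[0.1, 0.2], [0.3, 0.4]], ["node_a", "node_b"])
vectors, labels = read_projection(vec_path, label_path)
```

Because the on-disk format is this simple, the reader's Tuple[numpy.ndarray, List[List[str]]] return is just the parsed vector rows plus the per-line labels.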