API Reference

Most functionality comes from the FuzzyCat class, with additional methods in the FuzzyData and FuzzyPlots scripts helping with pre- and post-analysis.

FuzzyCat

class fuzzycat.fuzzycat.FuzzyCat(nSamples, nPoints, directoryName=None, minIntraJaccardIndex=0.5, maxInterJaccardIndex=0.5, minStability=0.5, windowSize=None, checkpoint=False, verbose=0)

A class to represent the FuzzyCat algorithm.

FuzzyCat is a generalised method of producing a soft hierarchy of soft clusters from a series of existing clusterings that have been generated using different representations of the same point-based data set. The core concept of the algorithm is that a density-based clustering of these existing clusters can be found, since the Jaccard index (distance) between clusters endows the space of clusters with a similarity (metric). The resultant clusters of clusters thereby propagate any hidden effects of the process underlying the different representations of the data set.
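To make the similarity measure concrete, the following is a minimal sketch of the Jaccard index between two clusters. It is not FuzzyCat's internal implementation: the helper names are hypothetical, and the weighted form used for membership probabilities (sum of minima over sum of maxima) is an assumed generalisation.

```python
import numpy as np

def jaccard_index(a, b):
    """Jaccard index of two hard clusters given as arrays of point indices."""
    a, b = np.unique(a), np.unique(b)
    intersection = np.intersect1d(a, b, assume_unique=True).size
    union = a.size + b.size - intersection
    return intersection / union if union else 0.0

def soft_jaccard_index(p, q):
    """Weighted Jaccard index of two soft clusters given as membership
    probabilities over the same points (an assumed generalisation)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    denominator = np.maximum(p, q).sum()
    return np.minimum(p, q).sum() / denominator if denominator else 0.0

cluster_a = np.array([0, 1, 2, 3])
cluster_b = np.array([2, 3, 4, 5])
similarity = jaccard_index(cluster_a, cluster_b)   # 2 shared points out of 6
distance = 1.0 - similarity                        # Jaccard distance
```

When the probabilities are all 0 or 1, the weighted form reduces to the ordinary Jaccard index.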

Parameters

nSamples : int

The number of samples and therefore the number of clusterings that were generated.

nPoints : int

The number of points in the data set.

directoryName : str or None, default = None

The file path of the directory where the cluster files are stored. If None, the current working directory is used. The directory must contain a subdirectory called ‘Clusters’ that contains the cluster files. The cluster files must be numpy arrays saved with the ‘.npy’ extension and named according to the ‘XXX_A-B-C.npy’ format, where ‘XXX’ is the 0-padded sample number (an integer in [0, nSamples - 1]) and ‘A-B-C’ is the hierarchical cluster id of the cluster. Each cluster array must contain either the indices (integers) of the points that belong to the cluster or the membership probabilities (floats) of the points in the cluster.
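As an illustration of this layout, the sketch below writes one index-based cluster and one probability-based cluster into a temporary ‘Clusters’ subdirectory and then parses the file names back. The zero-padding width, the integer components of the hierarchical id, and the helper parse_cluster_file_name are assumptions made for the example rather than documented behaviour.

```python
import os
import tempfile
import numpy as np

n_samples, n_points = 100, 6
width = len(str(n_samples - 1))  # assumed zero-padding width

directory = tempfile.mkdtemp()
cluster_dir = os.path.join(directory, "Clusters")
os.makedirs(cluster_dir)

# Hard cluster: integer indices of member points (sample 7, cluster id 1-2).
hard = np.array([0, 2, 5])
np.save(os.path.join(cluster_dir, f"{7:0{width}d}_1-2.npy"), hard)

# Soft cluster: one membership probability per point (sample 8, cluster id 1).
soft = np.array([0.9, 0.1, 0.8, 0.0, 0.0, 0.7])
np.save(os.path.join(cluster_dir, f"{8:0{width}d}_1.npy"), soft)

def parse_cluster_file_name(file_name):
    """Split 'XXX_A-B-C.npy' into (sample number, hierarchical id tuple)."""
    stem = file_name[:-len(".npy")]
    sample, cluster_id = stem.split("_", 1)
    return int(sample), tuple(int(part) for part in cluster_id.split("-"))

parsed = [parse_cluster_file_name(name) for name in sorted(os.listdir(cluster_dir))]
```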

minIntraJaccardIndex : float, default = 0.5

The minimum Jaccard index that at least two clusters within a fuzzy cluster must have for it to be included in the final set of fuzzy clusters.

maxInterJaccardIndex : float, default = 0.5

The maximum Jaccard index that any two fuzzy clusters can have for them to be included in the final set of fuzzy clusters.

minStability : float, default = 0.5

The minimum stability that a fuzzy cluster must have to be included in the final set of fuzzy clusters.

windowSize : int or None, default = None

The size of the window that is used to compute the Jaccard index between clusters. If None, the Jaccard index is computed between all pairs of clusters in all clusterings. If windowSize is an integer, then the Jaccard index is computed between all clusters within windowSize-many clusterings of one another. The order of the clusters is determined by the lexicographical order of the cluster files in directoryName.
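The effect of windowSize on the number of comparisons can be sketched as follows. This is only one reading of "within windowSize-many clusterings of one another": the helper windowed_pairs is hypothetical, and whether clusters from the same clustering are compared with each other is an assumption.

```python
import numpy as np

def windowed_pairs(sample_numbers, window_size=None):
    """Return index pairs (i, j), i < j, of clusters whose clusterings are
    at most window_size apart; None compares every pair."""
    sample_numbers = np.asarray(sample_numbers)
    i, j = np.triu_indices(sample_numbers.size, k=1)
    if window_size is not None:
        keep = np.abs(sample_numbers[i] - sample_numbers[j]) <= window_size
        i, j = i[keep], j[keep]
    return np.column_stack((i, j))

# Five clusters from four clusterings (clustering 1 produced two clusters).
sample_numbers = [0, 1, 1, 2, 3]
all_pairs = windowed_pairs(sample_numbers)
near_pairs = windowed_pairs(sample_numbers, window_size=1)
```

In this toy case the window reduces the ten possible pairs to six, which is the kind of saving a window offers when many clusterings are present.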

checkpoint : bool, default = False

Whether to save the cluster file names, pairs, and edges arrays to the directory so that less work is needed if FuzzyCat is run again.

verbose : int, default = 0

The verbosity of the FuzzyCat class. If verbose is set to 0, then FuzzyCat will not report any of its activity. Increasing verbose will make FuzzyCat report more of its activity.

Attributes

clusterFileNames : numpy.ndarray of shape (n_clusters,)

The names of the cluster files that FuzzyCat has used. The files have been found in the subdirectory, directoryName + ‘Clusters/’.

jaccardIndices : numpy.ndarray of shape (n_clusters,)

The maximum Jaccard index the clusters share with any other cluster.

ordering : numpy.ndarray of shape (n_clusters,)

The ordering of the clusters in the ordered list of fuzzy structure. The ordered Jaccard plot can be created by plotting y = jaccardIndices[ordering] vs x = range(ordering.size).

fuzzyClusters : numpy.ndarray of shape (n_fuzzyclusters, 2)

The start and end positions of each fuzzy cluster as it appears in the ordered list, such that ordering[fuzzyClusters[i, 0]:fuzzyClusters[i, 1]] gives an array of the indices of the clusters within fuzzy cluster i.
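The indexing described for ordering and fuzzyClusters can be tried on small stand-in arrays. The values below are invented for illustration and are not output from FuzzyCat.

```python
import numpy as np

# Toy stand-ins for the attributes of a fitted FuzzyCat instance.
jaccardIndices = np.array([0.9, 0.2, 0.8, 0.85, 0.3, 0.7])
ordering = np.array([0, 3, 2, 5, 4, 1])
fuzzyClusters = np.array([[0, 3], [3, 5]])

# Coordinates of the ordered Jaccard plot.
x = np.arange(ordering.size)
y = jaccardIndices[ordering]

# Entries of `ordering` covered by each fuzzy cluster.
members = [ordering[start:end] for start, end in fuzzyClusters]
```

Passing x and y to any line-plotting routine then gives the ordered Jaccard plot, with each row of fuzzyClusters marking a contiguous interval on the x axis.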

stabilities : numpy.ndarray of shape (n_fuzzyclusters,)

The stability of each fuzzy cluster.

memberships : numpy.ndarray of shape (n_fuzzyclusters, nPoints)

The membership probabilities of each point to each fuzzy cluster.

memberships_flat : numpy.ndarray of shape (n_fuzzyclusters, nPoints)

The flattened membership probabilities of each point to each fuzzy cluster, excluding the hierarchy correction. If the fuzzy clusters are inherently hierarchical, then memberships_flat may be more easily interpretable than memberships since memberships_flat.sum(axis = 0) will be contained within the interval [0, 1].
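One way to use this property is sketched below on invented values. Treating 1 - memberships_flat.sum(axis = 0) as the probability that a point belongs to no fuzzy cluster is an interpretation assumed for this example, not something the class is documented to compute.

```python
import numpy as np

# Toy memberships_flat for 2 fuzzy clusters and 4 points.
memberships_flat = np.array([[0.7, 0.2, 0.0, 0.5],
                             [0.1, 0.6, 0.0, 0.5]])

total = memberships_flat.sum(axis=0)        # each entry lies in [0, 1]
unassigned = 1.0 - total                    # interpreted here as "no cluster"

# Hard labels: most probable fuzzy cluster, or -1 when "no cluster" wins.
labels = memberships_flat.argmax(axis=0)
labels[unassigned > memberships_flat.max(axis=0)] = -1
```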

fuzzyHierarchy : numpy.ndarray of shape (n_fuzzyclusters, n_fuzzyclusters)

Information about the fuzzy hierarchy of the fuzzy clusters. The value of fuzzyHierarchy[i, j] is the probability that fuzzy cluster j is a child of fuzzy cluster i.
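Given that indexing convention, the most probable parent of each fuzzy cluster can be read from the columns of the matrix. The matrix below is invented for illustration, and using -1 for "no parent" is a choice made for this sketch.

```python
import numpy as np

fuzzyHierarchy = np.array([[0.0, 0.9, 0.3],
                           [0.0, 0.0, 0.6],
                           [0.0, 0.0, 0.0]])

# Column j holds the probabilities that each cluster i is a parent of j.
most_likely_parent = fuzzyHierarchy.argmax(axis=0)
parent_probability = fuzzyHierarchy.max(axis=0)
most_likely_parent[parent_probability == 0] = -1   # no candidate parent
```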

groups : numpy.ndarray of shape (n_groups, 2)

Similar to fuzzyClusters, however groups includes all possible fuzzy clusters before they have been selected for with minIntraJaccardIndex, maxInterJaccardIndex, and minStability.

intraJaccardIndicesGroups : numpy.ndarray of shape (n_groups,)

The maximum Jaccard index value of each group in groups, such that intraJaccardIndicesGroups[i] corresponds to group i in groups.

interJaccardIndicesGroups : numpy.ndarray of shape (n_groups,)

The maximum Jaccard index value that each group shares with another group in groups, such that interJaccardIndicesGroups[i] corresponds to group i in groups.

stabilitiesGroups : numpy.ndarray of shape (n_groups,)

The stability of each group in groups, such that stabilitiesGroups[i] corresponds to group i in groups.
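The per-group arrays line up index by index with groups, so the three thresholds can be applied as a boolean mask. The sketch below uses invented values and assumes inclusive comparisons. It covers only the thresholding step and omits the further selection of the smallest qualifying groups that extractFuzzyClusters is described as performing.

```python
import numpy as np

groups = np.array([[0, 6], [0, 3], [3, 5], [1, 3]])
intraJaccardIndicesGroups = np.array([0.4, 0.8, 0.7, 0.9])
interJaccardIndicesGroups = np.array([0.1, 0.3, 0.6, 0.2])
stabilitiesGroups = np.array([0.9, 0.7, 0.8, 0.3])

minIntraJaccardIndex, maxInterJaccardIndex, minStability = 0.5, 0.5, 0.5

# A group is kept as a candidate only if it passes all three thresholds.
passes = ((intraJaccardIndicesGroups >= minIntraJaccardIndex)
          & (interJaccardIndicesGroups <= maxInterJaccardIndex)
          & (stabilitiesGroups >= minStability))
candidates = groups[passes]
```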

aggregate()

Aggregates the clusters together to form the ordered list whilst keeping track of groups.

Sorts cluster pairs into descending order of edge weight and aggregates the clusters while keeping track of structural information about the data.

This method requires the _pairs and _edges attributes to have already been created, via the computeSimilarities method or otherwise.

This method generates the jaccardIndices, ordering, groups, intraJaccardIndicesGroups, interJaccardIndicesGroups, and stabilitiesGroups attributes.

This method deletes the _pairs and _edges attributes.

computeSimilarities()

Computes the similarities between all pairs of clusters in the chosen directory.

This method requires the directory to contain a subdirectory called ‘Clusters/’ that contains the cluster files.

This method generates the _pairs and _edges attributes.

extractFuzzyClusters()

Classifies fuzzy clusters as the smallest groups that meet the minIntraJaccardIndex, maxInterJaccardIndex, and minStability requirements.

This method requires the ordering, groups, intraJaccardIndicesGroups, interJaccardIndicesGroups, stabilitiesGroups, and _sampleNumbers attributes to have already been created, via the aggregate() method or otherwise. It also requires the clusterFileNames attribute to have already been created, via the computeSimilarities method or otherwise. In addition, this method requires the directory to contain a subdirectory called ‘Clusters/’ that contains the cluster files.

This method generates the fuzzyClusters, intraJaccardIndices, interJaccardIndices, stabilities, memberships, memberships_flat, and fuzzyHierarchy attributes.

run()

Runs the FuzzyCat algorithm and produces fuzzy clusters from a directory containing a subdirectory, ‘Clusters/’, with existing cluster files.

This method runs computeSimilarities(), aggregate(), and extractFuzzyClusters().
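Calling the three stages individually can be useful when intermediate attributes need to be inspected between steps. The sketch below shows the call order on a hypothetical stand-in object. RecordingStub and run_stepwise are illustrative names only and are not part of fuzzycat.

```python
def run_stepwise(fc):
    """Call the three documented stages in order on a FuzzyCat-like object."""
    fc.computeSimilarities()     # generates _pairs and _edges
    fc.aggregate()               # builds ordering and groups, deletes _pairs/_edges
    fc.extractFuzzyClusters()    # selects fuzzy clusters and memberships
    return fc

class RecordingStub:
    """Hypothetical stand-in that records which stage was called."""
    def __init__(self):
        self.calls = []
    def computeSimilarities(self):
        self.calls.append("computeSimilarities")
    def aggregate(self):
        self.calls.append("aggregate")
    def extractFuzzyClusters(self):
        self.calls.append("extractFuzzyClusters")

stub = run_stepwise(RecordingStub())
```

With a real instance, the same function would be called on an object constructed from fuzzycat.fuzzycat.FuzzyCat in place of the stub.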

FuzzyData

fuzzycat.FuzzyData.clusteringsFromRandomSamples(P, covP, nSamples=100, directoryName=None, clusteringAlgorithm='astrolink', clusteringAlgorithmArgs=None, workers=-1)

Generates random samples of a fuzzy data set, runs a clustering algorithm on each sample, and saves the clusters as .npy files so that FuzzyCat can use them.

Parameters

P : numpy.ndarray

The mean values of the fuzzy data set from which to generate random samples.

covP : float or numpy.ndarray

The covariance matrix of the fuzzy data set from which to generate random samples. The random samples of points can be either homogeneous or heterogeneous and either spherically symmetric, axis-aligned, or multivariate.
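Two of these cases can be sketched directly with numpy. The array shapes and the reading of a scalar covP as a variance (rather than a standard deviation) are assumptions made for this example. The function itself may handle its inputs differently.

```python
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])   # (n_points, n_dims)
n, d = P.shape

# Homogeneous, spherically symmetric: one variance shared by all points and axes.
cov_scalar = 0.25
sample_spherical = P + np.sqrt(cov_scalar) * rng.standard_normal((n, d))

# Homogeneous, multivariate: one full (d, d) covariance matrix for every point.
cov_matrix = np.array([[0.25, 0.10], [0.10, 0.50]])
L = np.linalg.cholesky(cov_matrix)                    # cov_matrix = L @ L.T
sample_multivariate = P + rng.standard_normal((n, d)) @ L.T
```

Each call produces one random sample of the whole data set. Repeating the draw nSamples times gives the inputs to the clustering algorithm.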

nSamples : int, default is 100

The number of random samples to generate.

directoryName : str, default is None

The directory in which to save the clusters. If None, the current working directory is used.

clusteringAlgorithm : str or callable, default is ‘astrolink’

The clustering algorithm to use. If a string, the following are supported: ‘kmeans’, ‘gaussianmixture’, ‘hdbscan’, ‘astrolink’. If a callable, the function must take the following parameters:

  • P_sample : numpy.ndarray

    A random sample of the fuzzy data set.

  • iteration : int

    A uniquely identifying number corresponding to this random sample.

  • nSamples : int

    The total number of random samples that are being generated.

  • directoryName : str

    The directory in which to save the clusters.

  • **clusteringAlgorithmArgs : dict

    Additional keyword arguments to pass to the clustering algorithm.
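A callable with this parameter list might look like the sketch below. It assumes that the callable is responsible for writing its own cluster files into ‘Clusters/’, that cluster ids start at 1, and that the zero-padding width follows from nSamples. None of these is confirmed by the reference above, and sign_split_clustering is a toy algorithm invented for illustration.

```python
import os
import tempfile
import numpy as np

def sign_split_clustering(P_sample, iteration, nSamples, directoryName, axis=0):
    """Toy clustering: split the points by the sign of one coordinate and
    save each non-empty half as an index array in 'Clusters/'."""
    cluster_dir = os.path.join(directoryName, "Clusters")
    os.makedirs(cluster_dir, exist_ok=True)
    width = len(str(nSamples - 1))               # assumed zero-padding width
    halves = (P_sample[:, axis] < 0, P_sample[:, axis] >= 0)
    for cluster_id, mask in enumerate(halves, start=1):
        members = np.flatnonzero(mask)
        if members.size:
            file_name = f"{iteration:0{width}d}_{cluster_id}.npy"
            np.save(os.path.join(cluster_dir, file_name), members)

directory = tempfile.mkdtemp()
P_sample = np.array([[-1.0, 0.0], [2.0, 1.0], [-3.0, 4.0]])
sign_split_clustering(P_sample, iteration=3, nSamples=100,
                      directoryName=directory, axis=0)
saved = sorted(os.listdir(os.path.join(directory, "Clusters")))
```

Here axis plays the role of an entry in clusteringAlgorithmArgs, arriving as an ordinary keyword argument.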

clusteringAlgorithmArgs : dict, default is None

Additional keyword arguments to pass to the clustering algorithm.

workers : int, default is -1

The number of CPU cores to use. If -1, all available cores are used.

fuzzycat.FuzzyData.runAndSaveClustering(P_sample, iteration, nSamples=1000, directoryName=None, clusteringAlgorithm='astrolink', clusteringAlgorithmArgs=None)

Runs a clustering algorithm on a sample of a data set and saves the clusters as .npy files so that FuzzyCat can use them.

Parameters

P_sample : numpy.ndarray

A random sample of the fuzzy data set.

iteration : int

A uniquely identifying number corresponding to this random sample.

nSamples : int, default is 1000

The total number of random samples that are being generated.

directoryName : str, default is None

The directory in which to save the clusters. If None, the current working directory is used.

clusteringAlgorithm : str or callable, default is ‘astrolink’

The clustering algorithm to use. If a string, the following are supported: ‘kmeans’, ‘gaussianmixture’, ‘hdbscan’, ‘astrolink’. If a callable, the function must take the following parameters:

  • P_sample : numpy.ndarray

    A random sample of the fuzzy data set.

  • iteration : int

    A uniquely identifying number corresponding to this random sample.

  • nSamples : int

    The total number of random samples that are being generated.

  • directoryName : str

    The directory in which to save the clusters.

  • **clusteringAlgorithmArgs : dict

    Additional keyword arguments to pass to the clustering algorithm.

clusteringAlgorithmArgs : dict, default is None

Additional keyword arguments to pass to the clustering algorithm.

fuzzycat.FuzzyData.sqrtCovPCase6_njit(covP)

A fast implementation of the square root of the covariance matrices that describe heterogeneous multivariate Gaussian noise.

Parameters

covP : numpy.ndarray

The covariance matrices of the fuzzy data set.
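For reference, a plain numpy version of a batched matrix square root is sketched below using Cholesky factors. The (n, d, d) input shape and the choice of Cholesky factorisation are assumptions. The function documented here may use a different definition of the square root and a different array layout.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 4, 3

# One symmetric positive-definite covariance matrix per point: shape (n, d, d).
A = rng.standard_normal((n, d, d))
covP = A @ np.swapaxes(A, -1, -2) + 0.1 * np.eye(d)

# Cholesky factors, computed for all points at once.
sqrt_covP = np.linalg.cholesky(covP)

# Each factor reproduces its covariance matrix, and maps unit Gaussian noise
# to noise with that covariance.
reconstructed = sqrt_covP @ np.swapaxes(sqrt_covP, -1, -2)
noise = np.einsum("nij,nj->ni", sqrt_covP, rng.standard_normal((n, d)))
```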

FuzzyPlots

fuzzycat.FuzzyPlots.plotFuzzyLabelsOnX(fc, X, membersOnly=False, figsize=(8, 8), markerSize=5, save=True, show=False, dpi=None)

Creates a scatter plot of the data points in X and colours them according to the fuzzy clusters that have been determined by FuzzyCat.

Parameters

fc : FuzzyCat

An instance of the FuzzyCat class.

X : numpy.ndarray

The data points that are to be displayed in the figure.

figsize : tuple, default is (8, 8)

The size of the figure in inches.

markerSize : int or float, default is 5

The size of the markers in the scatter plot.

save : bool, default is True

If True, save the figure to the directory stored in fc.directoryName.

show : bool, default is False

If True, display the figure.

dpi : int or None, default is None

The resolution of the figure.

fuzzycat.FuzzyPlots.plotMemberships(fc, figsize=(8, 8), bins=None, save=True, show=False, dpi=400)

Creates a histogram of the memberships of the fuzzy clusters that have been determined by FuzzyCat.

Parameters

fc : FuzzyCat

An instance of the FuzzyCat class.

figsize : tuple, default is (8, 8)

The size of the figure in inches.

bins : array-like or None, default is None

The bins used in the histogram. If None, 10 bins are set to be 0.1 wide spanning the range [0, 1].
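The default binning described here corresponds to eleven equally spaced edges on [0, 1]. The sketch below builds those edges with numpy and applies them to a few invented membership values. It shows only the binning and not how the plotting function itself constructs the histogram.

```python
import numpy as np

bins = np.linspace(0.0, 1.0, 11)           # 10 bins, each 0.1 wide
values = np.array([0.05, 0.15, 0.95, 1.0, 0.55])
counts, edges = np.histogram(values, bins=bins)
```

The last bin is closed on the right in numpy.histogram, so a membership of exactly 1.0 is counted rather than dropped.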

save : bool, default is True

If True, save the figure to the directory stored in fc.directoryName.

show : bool, default is False

If True, display the figure.

dpi : int, default is 400

The resolution of the figure.

fuzzycat.FuzzyPlots.plotOrderedJaccardIndex(fc, figsize=(8, 8), linewidth=0.5, save=True, show=False, dpi=400)

Creates the ordered Jaccard index plot and overlays the fuzzy clusters that have been determined by FuzzyCat.

Parameters

fc : FuzzyCat

An instance of the FuzzyCat class.

figsize : tuple, default is (8, 8)

The size of the figure in inches.

linewidth : float, default is 0.5

The width of the line.

save : bool, default is True

If True, save the figure to the directory stored in fc.directoryName.

show : bool, default is False

If True, display the figure.

dpi : int, default is 400

The resolution of the figure.

fuzzycat.FuzzyPlots.plotStabilities(fc, figsize=(8, 8), bins=None, save=True, show=False, dpi=400)

Creates a histogram of the stabilities of the fuzzy clusters that have been determined by FuzzyCat.

Parameters

fc : FuzzyCat

An instance of the FuzzyCat class.

figsize : tuple, default is (8, 8)

The size of the figure in inches.

bins : array-like, str, or None, default is None

The bins used in the histogram. If None, 10 bins are set to be 0.1 wide spanning the range [0, 1].

save : bool, default is True

If True, save the figure to the directory stored in fc.directoryName.

show : bool, default is False

If True, display the figure.

dpi : int, default is 400

The resolution of the figure.