Data Utilities

Binary Policies

ExclusivePolicy

class BinaryPolicy.ExclusivePolicy(digraph: DiGraph, X: ndarray, y: ndarray, sample_weight=None)

Bases: BinaryPolicy

Implement the exclusive policy of the referenced paper.

__init__(digraph: DiGraph, X: ndarray, y: ndarray, sample_weight=None)

Initialize a BinaryPolicy with the required data.

Parameters
  • digraph (nx.DiGraph) – DiGraph which is used for inferring node relationships.

  • X (np.ndarray) – Features which will be used for fitting a model.

  • y (np.ndarray) – Labels which will be assigned to the different samples. Has to be a 2D array.

  • sample_weight (array-like of shape (n_samples,), default=None) – Array of weights that are assigned to individual samples. If not provided, then each sample is given unit weight.

get_binary_examples(node) → tuple

Gather all positive and negative examples for a given node.

Parameters

node – Node for which the positive and negative examples should be searched.

Returns

  • X (np.ndarray) – The subset with positive and negative features.

  • y (np.ndarray) – The subset with positive and negative labels.

negative_examples(node) → ndarray

Gather all negative examples corresponding to the given node.

This includes all examples except the positive ones.

Parameters

node – Node for which the negative examples should be searched.

Returns

negative_examples – A mask for which examples are included (True) and which are not.

Return type

np.ndarray

positive_examples(node) → ndarray

Gather all positive examples corresponding to the given node.

This only includes examples for the given node.

Parameters

node – Node for which the positive examples should be searched.

Returns

positive_examples – A mask for which examples are included (True) and which are not.

Return type

np.ndarray
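
The two masks described above can be illustrated with a minimal pure-Python sketch. It is not the library's implementation: it assumes each sample carries a single most specific label, and the names `labels`, `positive_mask` and `negative_mask` are hypothetical.

```python
# Toy data (hypothetical): the most specific label of each of six samples.
labels = ["A", "A1", "A2", "B", "B1", "B2"]


def positive_mask(node, labels):
    """True only for samples labeled exactly with `node`."""
    return [label == node for label in labels]


def negative_mask(node, labels):
    """True for every sample that is not a positive example."""
    return [not flag for flag in positive_mask(node, labels)]


positive = positive_mask("A", labels)
negative = negative_mask("A", labels)

# A binary training subset keeps every sample selected by either mask.
selected = [
    label
    for label, pos, neg in zip(labels, positive, negative)
    if pos or neg
]
```

Because the negatives are the complement of the positives, every sample ends up in the binary training subset under this policy.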


LessExclusivePolicy

class BinaryPolicy.LessExclusivePolicy(digraph: DiGraph, X: ndarray, y: ndarray, sample_weight=None)

Bases: ExclusivePolicy

Implement the less exclusive policy of the referenced paper.

__init__(digraph: DiGraph, X: ndarray, y: ndarray, sample_weight=None)

Initialize a BinaryPolicy with the required data.

Parameters
  • digraph (nx.DiGraph) – DiGraph which is used for inferring node relationships.

  • X (np.ndarray) – Features which will be used for fitting a model.

  • y (np.ndarray) – Labels which will be assigned to the different samples. Has to be a 2D array.

  • sample_weight (array-like of shape (n_samples,), default=None) – Array of weights that are assigned to individual samples. If not provided, then each sample is given unit weight.

get_binary_examples(node) → tuple

Gather all positive and negative examples for a given node.

Parameters

node – Node for which the positive and negative examples should be searched.

Returns

  • X (np.ndarray) – The subset with positive and negative features.

  • y (np.ndarray) – The subset with positive and negative labels.

negative_examples(node) → ndarray

Gather all negative examples corresponding to the given node.

This includes all examples except the examples for the current node and its children.

Parameters

node – Node for which the negative examples should be searched.

Returns

negative_examples – A mask for which examples are included (True) and which are not.

Return type

np.ndarray

positive_examples(node) → ndarray

Gather all positive examples corresponding to the given node.

This only includes examples for the given node.

Parameters

node – Node for which the positive examples should be searched.

Returns

positive_examples – A mask for which examples are included (True) and which are not.

Return type

np.ndarray
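
As a rough illustration of the description above, the sketch below keeps only the node's own samples as positives and drops the node's children from the negatives. The `parent` mapping, the toy labels and the helper names are assumptions made for this example, not part of the library.

```python
# Hypothetical hierarchy: each node maps to its parent (None for top level).
parent = {"A": None, "B": None, "A1": "A", "A2": "A", "B1": "B"}
# Most specific label of each sample (toy data).
labels = ["A", "A1", "A2", "B", "B1"]


def children(node, parent):
    """Return the direct children of `node`."""
    return {child for child, p in parent.items() if p == node}


def less_exclusive_masks(node, labels, parent):
    """Positives: the node only. Negatives: all but the node and its children."""
    excluded = {node} | children(node, parent)
    positive = [label == node for label in labels]
    negative = [label not in excluded for label in labels]
    return positive, negative


positive, negative = less_exclusive_masks("A", labels, parent)
```

Samples labeled `A1` and `A2` appear in neither mask here, so they would simply be left out when training the binary classifier for `A`.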


InclusivePolicy

class BinaryPolicy.InclusivePolicy(digraph: DiGraph, X: ndarray, y: ndarray, sample_weight=None)

Bases: BinaryPolicy

Implement the inclusive policy of the referenced paper.

__init__(digraph: DiGraph, X: ndarray, y: ndarray, sample_weight=None)

Initialize a BinaryPolicy with the required data.

Parameters
  • digraph (nx.DiGraph) – DiGraph which is used for inferring node relationships.

  • X (np.ndarray) – Features which will be used for fitting a model.

  • y (np.ndarray) – Labels which will be assigned to the different samples. Has to be a 2D array.

  • sample_weight (array-like of shape (n_samples,), default=None) – Array of weights that are assigned to individual samples. If not provided, then each sample is given unit weight.

get_binary_examples(node) → tuple

Gather all positive and negative examples for a given node.

Parameters

node – Node for which the positive and negative examples should be searched.

Returns

  • X (np.ndarray) – The subset with positive and negative features.

  • y (np.ndarray) – The subset with positive and negative labels.

negative_examples(node) → ndarray

Gather all negative examples corresponding to the given node.

This includes all examples, except the examples for the given node, its descendants and its ancestors.

Parameters

node – Node for which the negative examples should be searched.

Returns

negative_examples – A mask for which examples are included (True) and which are not.

Return type

np.ndarray

positive_examples(node) → ndarray

Gather all positive examples corresponding to the given node.

This includes examples for the given node and its descendants.

Parameters

node – Node for which the positive examples should be searched.

Returns

positive_examples – A mask for which examples are included (True) and which are not.

Return type

np.ndarray
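
The following sketch illustrates one reading of the inclusive policy on a toy hierarchy. It assumes the negatives leave out the node, its descendants and its ancestors. The `parent` mapping, the labels and all helper names are hypothetical.

```python
# Hypothetical hierarchy: each node maps to its parent (None for top level).
parent = {"A": None, "B": None, "A1": "A", "A2": "A", "A1x": "A1", "B1": "B"}
# Most specific label of each sample (toy data).
labels = ["A", "A1", "A1x", "A2", "B", "B1"]


def descendants(node, parent):
    """Return every node below `node` in the hierarchy."""
    found, frontier = set(), [node]
    while frontier:
        current = frontier.pop()
        kids = [child for child, p in parent.items() if p == current]
        found.update(kids)
        frontier.extend(kids)
    return found


def ancestors(node, parent):
    """Return every node on the path from `node` up to the top level."""
    found = set()
    while parent.get(node) is not None:
        node = parent[node]
        found.add(node)
    return found


def inclusive_masks(node, labels, parent):
    """Positives: node and descendants. Negatives: all but those and ancestors."""
    positive_nodes = {node} | descendants(node, parent)
    excluded = positive_nodes | ancestors(node, parent)
    positive = [label in positive_nodes for label in labels]
    negative = [label not in excluded for label in labels]
    return positive, negative


positive, negative = inclusive_masks("A1", labels, parent)
```

In this toy run the sample labeled with the ancestor `A` is neither positive nor negative for `A1`.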


LessInclusivePolicy

class BinaryPolicy.LessInclusivePolicy(digraph: DiGraph, X: ndarray, y: ndarray, sample_weight=None)

Bases: InclusivePolicy

Implement the less inclusive policy of the referenced paper.

__init__(digraph: DiGraph, X: ndarray, y: ndarray, sample_weight=None)

Initialize a BinaryPolicy with the required data.

Parameters
  • digraph (nx.DiGraph) – DiGraph which is used for inferring node relationships.

  • X (np.ndarray) – Features which will be used for fitting a model.

  • y (np.ndarray) – Labels which will be assigned to the different samples. Has to be a 2D array.

  • sample_weight (array-like of shape (n_samples,), default=None) – Array of weights that are assigned to individual samples. If not provided, then each sample is given unit weight.

get_binary_examples(node) → tuple

Gather all positive and negative examples for a given node.

Parameters

node – Node for which the positive and negative examples should be searched.

Returns

  • X (np.ndarray) – The subset with positive and negative features.

  • y (np.ndarray) – The subset with positive and negative labels.

negative_examples(node) → ndarray

Gather all negative examples corresponding to the given node.

This includes all examples except the examples for the given node and its descendants.

Parameters

node – Node for which the negative examples should be searched.

Returns

negative_examples – A mask for which examples are included (True) and which are not.

Return type

np.ndarray

positive_examples(node) → ndarray

Gather all positive examples corresponding to the given node.

This includes examples for the given node and its descendants.

Parameters

node – Node for which the positive examples should be searched.

Returns

positive_examples – A mask for which examples are included (True) and which are not.

Return type

np.ndarray
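
A minimal sketch of the less inclusive policy is given below, assuming the negatives are simply the complement of the positives (the node plus everything below it). The `parent` mapping, labels and helper names are hypothetical.

```python
# Hypothetical hierarchy: each node maps to its parent (None for top level).
parent = {"A": None, "B": None, "A1": "A", "A2": "A", "A1x": "A1"}
# Most specific label of each sample (toy data).
labels = ["A", "A1", "A1x", "A2", "B"]


def descendants(node, parent):
    """Return every node below `node` in the hierarchy."""
    found, frontier = set(), [node]
    while frontier:
        current = frontier.pop()
        kids = [child for child, p in parent.items() if p == current]
        found.update(kids)
        frontier.extend(kids)
    return found


def less_inclusive_masks(node, labels, parent):
    """Positives: node and descendants. Negatives: everything else."""
    positive_nodes = {node} | descendants(node, parent)
    positive = [label in positive_nodes for label in labels]
    negative = [not flag for flag in positive]
    return positive, negative


positive, negative = less_inclusive_masks("A1", labels, parent)
```

Unlike the stricter variant that also leaves out ancestors, the sample labeled `A` counts as a negative example for `A1` here.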


SiblingsPolicy

class BinaryPolicy.SiblingsPolicy(digraph: DiGraph, X: ndarray, y: ndarray, sample_weight=None)

Bases: InclusivePolicy

Implement the siblings policy of the referenced paper.

__init__(digraph: DiGraph, X: ndarray, y: ndarray, sample_weight=None)

Initialize a BinaryPolicy with the required data.

Parameters
  • digraph (nx.DiGraph) – DiGraph which is used for inferring node relationships.

  • X (np.ndarray) – Features which will be used for fitting a model.

  • y (np.ndarray) – Labels which will be assigned to the different samples. Has to be a 2D array.

  • sample_weight (array-like of shape (n_samples,), default=None) – Array of weights that are assigned to individual samples. If not provided, then each sample is given unit weight.

get_binary_examples(node) → tuple

Gather all positive and negative examples for a given node.

Parameters

node – Node for which the positive and negative examples should be searched.

Returns

  • X (np.ndarray) – The subset with positive and negative features.

  • y (np.ndarray) – The subset with positive and negative labels.

negative_examples(node) → ndarray

Gather all negative examples corresponding to the given node.

This includes all examples for nodes that have the same ancestors as the given node, as well as their descendants.

Parameters

node – Node for which the negative examples should be searched.

Returns

negative_examples – A mask for which examples are included (True) and which are not.

Return type

np.ndarray

positive_examples(node) → ndarray

Gather all positive examples corresponding to the given node.

This includes examples for the given node and its descendants.

Parameters

node – Node for which the positive examples should be searched.

Returns

positive_examples – A mask for which examples are included (True) and which are not.

Return type

np.ndarray
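
The siblings policy can be sketched on a toy hierarchy as follows. The sketch treats nodes that share the given node's parent as its siblings. The `parent` mapping, labels and helper names are assumptions for illustration only.

```python
# Hypothetical hierarchy: each node maps to its parent (None for top level).
parent = {
    "A": None, "B": None,
    "A1": "A", "A2": "A",
    "A1x": "A1", "A2x": "A2",
}
# Most specific label of each sample (toy data).
labels = ["A", "A1", "A1x", "A2", "A2x", "B"]


def descendants(node, parent):
    """Return every node below `node` in the hierarchy."""
    found, frontier = set(), [node]
    while frontier:
        current = frontier.pop()
        kids = [child for child, p in parent.items() if p == current]
        found.update(kids)
        frontier.extend(kids)
    return found


def siblings_masks(node, labels, parent):
    """Positives: node and descendants. Negatives: siblings and their descendants."""
    positive_nodes = {node} | descendants(node, parent)
    siblings = {n for n, p in parent.items() if p == parent[node] and n != node}
    negative_nodes = set(siblings)
    for sibling in siblings:
        negative_nodes |= descendants(sibling, parent)
    positive = [label in positive_nodes for label in labels]
    negative = [label in negative_nodes for label in labels]
    return positive, negative


positive, negative = siblings_masks("A1", labels, parent)
```

Samples labeled `A` and `B` fall outside both masks, so the classifier for `A1` would be trained only on its own subtree and the subtree of its sibling `A2`.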


ExclusiveSiblingsPolicy

class BinaryPolicy.ExclusiveSiblingsPolicy(digraph: DiGraph, X: ndarray, y: ndarray, sample_weight=None)

Bases: ExclusivePolicy

Implement the exclusive siblings policy of the referenced paper.

__init__(digraph: DiGraph, X: ndarray, y: ndarray, sample_weight=None)

Initialize a BinaryPolicy with the required data.

Parameters
  • digraph (nx.DiGraph) – DiGraph which is used for inferring node relationships.

  • X (np.ndarray) – Features which will be used for fitting a model.

  • y (np.ndarray) – Labels which will be assigned to the different samples. Has to be a 2D array.

  • sample_weight (array-like of shape (n_samples,), default=None) – Array of weights that are assigned to individual samples. If not provided, then each sample is given unit weight.

get_binary_examples(node) → tuple

Gather all positive and negative examples for a given node.

Parameters

node – Node for which the positive and negative examples should be searched.

Returns

  • X (np.ndarray) – The subset with positive and negative features.

  • y (np.ndarray) – The subset with positive and negative labels.

negative_examples(node) → ndarray

Gather all negative examples corresponding to the given node.

This includes examples for all nodes that have the same parent as the given node.

Parameters

node – Node for which the negative examples should be searched.

Returns

negative_examples – A mask for which examples are included (True) and which are not.

Return type

np.ndarray

positive_examples(node) → ndarray

Gather all positive examples corresponding to the given node.

This only includes examples for the given node.

Parameters

node – Node for which the positive examples should be searched.

Returns

positive_examples – A mask for which examples are included (True) and which are not.

Return type

np.ndarray
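
For comparison, a minimal sketch of the exclusive siblings policy on a toy hierarchy: only the node's own samples are positive, and only samples labeled with a sibling are negative. The `parent` mapping, labels and function name are hypothetical.

```python
# Hypothetical hierarchy: each node maps to its parent (None for top level).
parent = {
    "A": None, "B": None,
    "A1": "A", "A2": "A",
    "A1x": "A1", "A2x": "A2",
}
# Most specific label of each sample (toy data).
labels = ["A", "A1", "A1x", "A2", "A2x", "B"]


def exclusive_siblings_masks(node, labels, parent):
    """Positives: the node only. Negatives: nodes sharing its parent."""
    siblings = {n for n, p in parent.items() if p == parent[node] and n != node}
    positive = [label == node for label in labels]
    negative = [label in siblings for label in labels]
    return positive, negative


positive, negative = exclusive_siblings_masks("A1", labels, parent)
```

Descendants of either the node (`A1x`) or its sibling (`A2x`) are ignored in this sketch, which is what makes the policy the most restrictive of the sibling-based variants.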


Hierarchical Metrics

Precision

metrics.precision(y_true: ndarray, y_pred: ndarray, average: str = 'micro')

Compute hierarchical precision score.

Parameters
  • y_true (np.array of shape (n_samples, n_levels)) – Ground truth (correct) labels.

  • y_pred (np.array of shape (n_samples, n_levels)) – Predicted labels, as returned by a classifier.

  • average ({"micro", "macro"}, str, default="micro") –

    This parameter determines the type of averaging performed during the computation:

    • micro: The precision is computed by summing over all individual instances, \(\displaystyle{hP = \frac{\sum_{i=1}^{n}| \alpha_i \cap \beta_i |}{\sum_{i=1}^{n}| \alpha_i |}}\), where \(\alpha_i\) is the set consisting of the most specific classes predicted for test example \(i\) and all their ancestor classes, while \(\beta_i\) is the set containing the true most specific classes of test example \(i\) and all their ancestors, with summations computed over all test examples.

    • macro: The precision is computed for each instance and then averaged, \(\displaystyle{hP = \frac{\sum_{i=1}^{n}hP_{i}}{n}}\), where \(hP_i = | \alpha_i \cap \beta_i | / | \alpha_i |\) is the precision of test example \(i\), \(\alpha_i\) is the set consisting of the most specific classes predicted for test example \(i\) and all their ancestor classes, and \(\beta_i\) is the set containing the true most specific classes of test example \(i\) and all their ancestors.

Returns

precision – The proportion of positive identifications that were actually correct.

Return type

float
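
The difference between the two averaging modes can be seen in a small pure-Python sketch. It is an illustration of the formulas above, not the library's code: each row is assumed to list the label path of one example, and the function name and toy labels are hypothetical.

```python
# Each row is the label path of one test example, from the top level down
# to the most specific class (hypothetical toy data).
y_true = [["Animal", "Mammal", "Cat"], ["Animal", "Bird"]]
y_pred = [["Animal", "Mammal", "Dog"], ["Animal"]]


def hierarchical_precision(y_true, y_pred, average="micro"):
    """Hierarchical precision over label paths, micro- or macro-averaged."""
    # |alpha_i ∩ beta_i| and |alpha_i| for every example i.
    overlaps = [len(set(t) & set(p)) for t, p in zip(y_true, y_pred)]
    predicted = [len(set(p)) for p in y_pred]
    if average == "micro":
        return sum(overlaps) / sum(predicted)
    if average == "macro":
        return sum(o / n for o, n in zip(overlaps, predicted)) / len(predicted)
    raise ValueError(f"unknown average: {average!r}")


micro = hierarchical_precision(y_true, y_pred, "micro")  # (2 + 1) / (3 + 1)
macro = hierarchical_precision(y_true, y_pred, "macro")  # (2/3 + 1/1) / 2
```

Micro averaging weights each predicted label equally, so the longer first path dominates. Macro averaging weights each example equally.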


Recall

metrics.recall(y_true: ndarray, y_pred: ndarray, average: str = 'micro')

Compute hierarchical recall score.

Parameters
  • y_true (np.array of shape (n_samples, n_levels)) – Ground truth (correct) labels.

  • y_pred (np.array of shape (n_samples, n_levels)) – Predicted labels, as returned by a classifier.

  • average ({"micro", "macro"}, str, default="micro") –

    This parameter determines the type of averaging performed during the computation:

    • micro: The recall is computed by summing over all individual instances, \(\displaystyle{hR = \frac{\sum_{i=1}^{n}|\alpha_i \cap \beta_i|}{\sum_{i=1}^{n}|\beta_i|}}\), where \(\alpha_i\) is the set consisting of the most specific classes predicted for test example \(i\) and all their ancestor classes, while \(\beta_i\) is the set containing the true most specific classes of test example \(i\) and all their ancestors, with summations computed over all test examples.

    • macro: The recall is computed for each instance and then averaged, \(\displaystyle{hR = \frac{\sum_{i=1}^{n}hR_{i}}{n}}\), where \(hR_i = | \alpha_i \cap \beta_i | / | \beta_i |\) is the recall of test example \(i\), \(\alpha_i\) is the set consisting of the most specific classes predicted for test example \(i\) and all their ancestor classes, and \(\beta_i\) is the set containing the true most specific classes of test example \(i\) and all their ancestors.

Returns

recall – The proportion of actual positives that were identified correctly.

Return type

float
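
A matching sketch for recall, again only an illustration of the formulas above: the denominator switches from the predicted label set to the true label set. The function name and toy labels are hypothetical.

```python
# Each row is the label path of one test example (hypothetical toy data).
y_true = [["Animal", "Mammal", "Cat"], ["Animal", "Bird"]]
y_pred = [["Animal", "Mammal", "Dog"], ["Animal"]]


def hierarchical_recall(y_true, y_pred, average="micro"):
    """Hierarchical recall over label paths, micro- or macro-averaged."""
    # |alpha_i ∩ beta_i| and |beta_i| for every example i.
    overlaps = [len(set(t) & set(p)) for t, p in zip(y_true, y_pred)]
    actual = [len(set(t)) for t in y_true]
    if average == "micro":
        return sum(overlaps) / sum(actual)
    if average == "macro":
        return sum(o / n for o, n in zip(overlaps, actual)) / len(actual)
    raise ValueError(f"unknown average: {average!r}")


micro = hierarchical_recall(y_true, y_pred, "micro")  # (2 + 1) / (3 + 2)
macro = hierarchical_recall(y_true, y_pred, "macro")  # (2/3 + 1/2) / 2
```

The second prediction stops one level early, which costs recall but not precision.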


F-score

metrics.f1(y_true: ndarray, y_pred: ndarray, average: str = 'micro')

Compute hierarchical f-score.

Parameters
  • y_true (np.array of shape (n_samples, n_levels)) – Ground truth (correct) labels.

  • y_pred (np.array of shape (n_samples, n_levels)) – Predicted labels, as returned by a classifier.

  • average ({"micro", "macro"}, str, default="micro") –

    This parameter determines the type of averaging performed during the computation:

    • micro: The f-score is computed from the micro-averaged precision and recall, \(\displaystyle{hF = \frac{2 \times hP \times hR}{hP + hR}}\), where \(hP\) is the hierarchical precision and \(hR\) is the hierarchical recall.

    • macro: The f-score is computed for each instance and then averaged, \(\displaystyle{hF = \frac{\sum_{i=1}^{n}hF_{i}}{n}}\), where \(hF_i\) is the f-score of test example \(i\), obtained from its precision \(| \alpha_i \cap \beta_i | / | \alpha_i |\) and recall \(| \alpha_i \cap \beta_i | / | \beta_i |\), \(\alpha_i\) is the set consisting of the most specific classes predicted for test example \(i\) and all their ancestor classes, and \(\beta_i\) is the set containing the true most specific classes of test example \(i\) and all their ancestors.

Returns

f1 – Harmonic mean of the precision and recall.

Return type

float
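
The sketch below works through both averaging modes for the f-score on toy data. It follows the formulas above rather than the library's code. The helper `f_score` and the labels are hypothetical, and returning 0 when precision and recall are both 0 is an assumption of this example.

```python
# Each row is the label path of one test example (hypothetical toy data).
y_true = [["Animal", "Mammal", "Cat"], ["Animal", "Bird"]]
y_pred = [["Animal", "Reptile", "Snake"], ["Animal", "Bird"]]


def f_score(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


overlaps = [len(set(t) & set(p)) for t, p in zip(y_true, y_pred)]
predicted = [len(set(p)) for p in y_pred]
actual = [len(set(t)) for t in y_true]

# Micro: pool the counts first, then take a single harmonic mean.
micro = f_score(sum(overlaps) / sum(predicted), sum(overlaps) / sum(actual))

# Macro: one f-score per example, then the plain average.
per_example = [
    f_score(o / p, o / a) for o, p, a in zip(overlaps, predicted, actual)
]
macro = sum(per_example) / len(per_example)
```

Here the micro score is 3/5 while the macro score is 2/3, because the perfectly predicted second example counts as much as the poorly predicted first one under macro averaging.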


Datasets

Platypus diseases dataset

datasets.load_platypus(test_size=0.3, random_state=42)

Load platypus diseases dataset.

Parameters
  • test_size (float, default=0.3) – The proportion of the dataset to include in the test split.

  • random_state (int or None, default=42) – Controls the randomness of the dataset. Pass an int for reproducible output across multiple function calls.

Returns

List containing train-test split of inputs.

Return type

list

Raises

RuntimeError – If the dataset could not be accessed or processed.

Examples

>>> from hiclass.datasets import load_platypus
>>> X_train, X_test, Y_train, Y_test = load_platypus()
>>> X_train[:3]
     fever  diarrhea  stomach pain  skin rash  cough  sniffles  short breath  headache  size
220   37.8         0             3          5      1         1             0         2  27.6
539   37.2         0             6          1      1         1             0         3  28.4
326   39.9         0             2          5      1         1             1         2  30.7
>>> X_train.shape, X_test.shape, Y_train.shape, Y_test.shape
((572, 9), (246, 9), (572,), (246,))
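
The split sizes shown above are consistent with the default `test_size=0.3`. The short check below assumes the split follows the scikit-learn `train_test_split` convention of rounding the test share up. Whether the loader uses that function internally is an assumption, not something stated here.

```python
import math

n_train, n_test = 572, 246      # sizes reported in the example above
n_samples = n_train + n_test    # 818 samples in total
test_size = 0.3

# Assumed convention: round the test share up, give the rest to training.
expected_test = math.ceil(test_size * n_samples)
expected_train = n_samples - expected_test
```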

Hierarchical text classification dataset

datasets.load_hierarchical_text_classification(test_size=0.3, random_state=42)

Load hierarchical text classification dataset.

Parameters
  • test_size (float, default=0.3) – The proportion of the dataset to include in the test split.

  • random_state (int or None, default=42) – Controls the randomness of the dataset. Pass an int for reproducible output across multiple function calls.

Returns

List containing train-test split of inputs.

Return type

list

Raises

RuntimeError – If the dataset could not be accessed or processed.

Examples

>>> from hiclass.datasets import load_hierarchical_text_classification
>>> X_train, X_test, Y_train, Y_test = load_hierarchical_text_classification()
>>> X_train[:3]
38015                                Nature's Way Selenium
2281         Music In Motion Developmental Mobile W Remote
36629    Twinings Ceylon Orange Pekoe Tea, Tea Bags, 20...
Name: Title, dtype: object
>>> X_train.shape, X_test.shape, Y_train.shape, Y_test.shape
((28000,), (12000,), (28000, 3), (12000, 3))