Data Utilities

Binary Policies

ExclusivePolicy

class BinaryPolicy.ExclusivePolicy(digraph: DiGraph, X: ndarray, y: ndarray, sample_weight=None)

Bases: BinaryPolicy

Implement the exclusive policy of the referenced paper.

__init__(digraph: DiGraph, X: ndarray, y: ndarray, sample_weight=None)

Initialize a BinaryPolicy with the required data.

Parameters

digraph (nx.DiGraph) – Directed graph used to infer the relationships between nodes.
X (np.ndarray) – Features which will be used for fitting a model.
y (np.ndarray) – Labels assigned to the samples. Must be a 2D array.
sample_weight (array-like of shape (n_samples,), default=None) – Array of weights that are assigned to individual samples. If not provided, then each sample is given unit weight.

get_binary_examples(node) → tuple

Gather all positive and negative examples for a given node.

Parameters

node – Node for which the positive and negative examples should be searched.

Returns

X (np.ndarray) – The subset with positive and negative features.
y (np.ndarray) – The subset with positive and negative labels.

negative_examples(node) → ndarray

Gather all negative examples corresponding to the given node.

This includes all examples except the positive ones.

Parameters: node – Node for which the negative examples should be searched.
Returns: negative_examples – A mask for which examples are included (True) and which are not.
Return type: np.ndarray

positive_examples(node) → ndarray

Gather all positive examples corresponding to the given node.

This only includes examples for the given node.

Parameters: node – Node for which the positive examples should be searched.
Returns: positive_examples – A mask for which examples are included (True) and which are not.
Return type: np.ndarray
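
The two masks described above can be illustrated with a minimal sketch. It is not the library's implementation: it assumes each sample carries only its most specific label, and the names labels and exclusive_masks are hypothetical.

```python
# Hypothetical toy data: the most specific label of each sample.
labels = ["dog", "cat", "dog", "snake", "cat"]


def exclusive_masks(labels, node):
    """Return (positive, negative) boolean masks under the exclusive policy."""
    # Positive: only the samples labelled with the node itself.
    positive = [label == node for label in labels]
    # Negative: every sample that is not positive.
    negative = [not is_positive for is_positive in positive]
    return positive, negative


positive, negative = exclusive_masks(labels, "dog")
print(positive)  # [True, False, True, False, False]
print(negative)  # [False, True, False, True, True]
```

Under this policy the two masks are complementary, so every sample is used when training the binary classifier of a node.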

LessExclusivePolicy

class BinaryPolicy.LessExclusivePolicy(digraph: DiGraph, X: ndarray, y: ndarray, sample_weight=None)

Bases: ExclusivePolicy

Implement the less exclusive policy of the referenced paper.

__init__(digraph: DiGraph, X: ndarray, y: ndarray, sample_weight=None)

Initialize a BinaryPolicy with the required data.

Parameters

digraph (nx.DiGraph) – Directed graph used to infer the relationships between nodes.
X (np.ndarray) – Features which will be used for fitting a model.
y (np.ndarray) – Labels assigned to the samples. Must be a 2D array.
sample_weight (array-like of shape (n_samples,), default=None) – Array of weights that are assigned to individual samples. If not provided, then each sample is given unit weight.

get_binary_examples(node) → tuple

Gather all positive and negative examples for a given node.

Parameters

node – Node for which the positive and negative examples should be searched.

Returns

X (np.ndarray) – The subset with positive and negative features.
y (np.ndarray) – The subset with positive and negative labels.

negative_examples(node) → ndarray

Gather all negative examples corresponding to the given node.

This includes all examples except the examples for the current node and its children.

Parameters: node – Node for which the negative examples should be searched.
Returns: negative_examples – A mask for which examples are included (True) and which are not.
Return type: np.ndarray

positive_examples(node) → ndarray

Gather all positive examples corresponding to the given node.

This only includes examples for the given node.

Parameters: node – Node for which the positive examples should be searched.
Returns: positive_examples – A mask for which examples are included (True) and which are not.
Return type: np.ndarray

InclusivePolicy

class BinaryPolicy.InclusivePolicy(digraph: DiGraph, X: ndarray, y: ndarray, sample_weight=None)

Bases: BinaryPolicy

Implement the inclusive policy of the referenced paper.

__init__(digraph: DiGraph, X: ndarray, y: ndarray, sample_weight=None)

Initialize a BinaryPolicy with the required data.

Parameters

digraph (nx.DiGraph) – Directed graph used to infer the relationships between nodes.
X (np.ndarray) – Features which will be used for fitting a model.
y (np.ndarray) – Labels assigned to the samples. Must be a 2D array.
sample_weight (array-like of shape (n_samples,), default=None) – Array of weights that are assigned to individual samples. If not provided, then each sample is given unit weight.

get_binary_examples(node) → tuple

Gather all positive and negative examples for a given node.

Parameters

node – Node for which the positive and negative examples should be searched.

Returns

X (np.ndarray) – The subset with positive and negative features.
y (np.ndarray) – The subset with positive and negative labels.

negative_examples(node) → ndarray

Gather all negative examples corresponding to the given node.

This includes all examples, except the examples for the given node, its descendants and successors.

Parameters: node – Node for which the negative examples should be searched.
Returns: negative_examples – A mask for which examples are included (True) and which are not.
Return type: np.ndarray

positive_examples(node) → ndarray

Gather all positive examples corresponding to the given node.

This includes examples for the given node and its descendants.

Parameters: node – Node for which the positive examples should be searched.
Returns: positive_examples – A mask for which examples are included (True) and which are not.
Return type: np.ndarray
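
A minimal sketch of the positive rule above (the given node plus its descendants) follows. The toy hierarchy, the sample labels, and the helper names are assumptions made for illustration and are not taken from the library.

```python
# Hypothetical toy hierarchy, stored as parent -> children.
children = {
    "animal": ["mammal", "reptile"],
    "mammal": ["dog", "cat"],
    "reptile": ["snake"],
}
# Hypothetical most specific label of each sample.
labels = ["dog", "mammal", "snake", "cat", "reptile"]


def descendants(node):
    """Collect every node reachable from `node` by following child links."""
    found = set()
    stack = list(children.get(node, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(children.get(current, []))
    return found


def inclusive_positive(node):
    """Mask of samples labelled with `node` or any of its descendants."""
    wanted = {node} | descendants(node)
    return [label in wanted for label in labels]


print(inclusive_positive("mammal"))  # [True, True, False, True, False]
```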

LessInclusivePolicy

class BinaryPolicy.LessInclusivePolicy(digraph: DiGraph, X: ndarray, y: ndarray, sample_weight=None)

Bases: InclusivePolicy

Implement the less inclusive policy of the referenced paper.

__init__(digraph: DiGraph, X: ndarray, y: ndarray, sample_weight=None)

Initialize a BinaryPolicy with the required data.

Parameters

digraph (nx.DiGraph) – Directed graph used to infer the relationships between nodes.
X (np.ndarray) – Features which will be used for fitting a model.
y (np.ndarray) – Labels assigned to the samples. Must be a 2D array.
sample_weight (array-like of shape (n_samples,), default=None) – Array of weights that are assigned to individual samples. If not provided, then each sample is given unit weight.

get_binary_examples(node) → tuple

Gather all positive and negative examples for a given node.

Parameters

node – Node for which the positive and negative examples should be searched.

Returns

X (np.ndarray) – The subset with positive and negative features.
y (np.ndarray) – The subset with positive and negative labels.

negative_examples(node) → ndarray

Gather all negative examples corresponding to the given node.

This includes all examples except the examples for the current node and its children.

Parameters: node – Node for which the negative examples should be searched.
Returns: negative_examples – A mask for which examples are included (True) and which are not.
Return type: np.ndarray

positive_examples(node) → ndarray

Gather all positive examples corresponding to the given node.

This includes examples for the given node and its descendants.

Parameters: node – Node for which the positive examples should be searched.
Returns: positive_examples – A mask for which examples are included (True) and which are not.
Return type: np.ndarray

SiblingsPolicy

class BinaryPolicy.SiblingsPolicy(digraph: DiGraph, X: ndarray, y: ndarray, sample_weight=None)

Bases: InclusivePolicy

Implement the siblings policy of the referenced paper.

__init__(digraph: DiGraph, X: ndarray, y: ndarray, sample_weight=None)

Initialize a BinaryPolicy with the required data.

Parameters

digraph (nx.DiGraph) – Directed graph used to infer the relationships between nodes.
X (np.ndarray) – Features which will be used for fitting a model.
y (np.ndarray) – Labels assigned to the samples. Must be a 2D array.
sample_weight (array-like of shape (n_samples,), default=None) – Array of weights that are assigned to individual samples. If not provided, then each sample is given unit weight.

get_binary_examples(node) → tuple

Gather all positive and negative examples for a given node.

Parameters

node – Node for which the positive and negative examples should be searched.

Returns

X (np.ndarray) – The subset with positive and negative features.
y (np.ndarray) – The subset with positive and negative labels.

negative_examples(node) → ndarray

Gather all negative examples corresponding to the given node.

This includes all examples for nodes that have the same ancestors as the given node, as well as their descendants.

Parameters: node – Node for which the negative examples should be searched.
Returns: negative_examples – A mask for which examples are included (True) and which are not.
Return type: np.ndarray

positive_examples(node) → ndarray

Gather all positive examples corresponding to the given node.

This includes examples for the given node and its descendants.

Parameters: node – Node for which the positive examples should be searched.
Returns: positive_examples – A mask for which examples are included (True) and which are not.
Return type: np.ndarray

ExclusiveSiblingsPolicy

class BinaryPolicy.ExclusiveSiblingsPolicy(digraph: DiGraph, X: ndarray, y: ndarray, sample_weight=None)

Bases: ExclusivePolicy

Implement the exclusive siblings policy of the referenced paper.

__init__(digraph: DiGraph, X: ndarray, y: ndarray, sample_weight=None)

Initialize a BinaryPolicy with the required data.

Parameters

digraph (nx.DiGraph) – Directed graph used to infer the relationships between nodes.
X (np.ndarray) – Features which will be used for fitting a model.
y (np.ndarray) – Labels assigned to the samples. Must be a 2D array.
sample_weight (array-like of shape (n_samples,), default=None) – Array of weights that are assigned to individual samples. If not provided, then each sample is given unit weight.

get_binary_examples(node) → tuple

Gather all positive and negative examples for a given node.

Parameters

node – Node for which the positive and negative examples should be searched.

Returns

X (np.ndarray) – The subset with positive and negative features.
y (np.ndarray) – The subset with positive and negative labels.

negative_examples(node) → ndarray

Gather all negative examples corresponding to the given node.

This includes examples for all nodes that have the same parent as the given node.

Parameters: node – Node for which the negative examples should be searched.
Returns: negative_examples – A mask for which examples are included (True) and which are not.
Return type: np.ndarray

positive_examples(node) → ndarray

Gather all positive examples corresponding to the given node.

This only includes examples for the given node.

Parameters: node – Node for which the positive examples should be searched.
Returns: positive_examples – A mask for which examples are included (True) and which are not.
Return type: np.ndarray
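
The negative rule above (examples of nodes sharing the given node's parent) might be sketched as follows. This is an illustration under stated assumptions, not the library's code: the hierarchy is a tree stored as a hypothetical child -> parent mapping, each sample carries only its most specific label, and the given node itself is left out of its own siblings.

```python
# Hypothetical toy tree, stored as child -> parent.
parent = {
    "mammal": "animal",
    "reptile": "animal",
    "dog": "mammal",
    "cat": "mammal",
    "snake": "reptile",
}
# Hypothetical most specific label of each sample.
labels = ["dog", "cat", "snake", "reptile", "cat"]


def siblings(node):
    """Nodes that share the parent of `node`, excluding `node` itself."""
    if node not in parent:  # the root has no parent and thus no siblings
        return set()
    return {
        other
        for other, other_parent in parent.items()
        if other_parent == parent[node] and other != node
    }


def exclusive_siblings_negative(node):
    """Mask of samples labelled with one of the siblings of `node`."""
    wanted = siblings(node)
    return [label in wanted for label in labels]


print(exclusive_siblings_negative("dog"))     # [False, True, False, False, True]
print(exclusive_siblings_negative("mammal"))  # [False, False, False, True, False]
```

Compared with the exclusive policy, far fewer samples end up as negatives, which can shrink the training set of each binary classifier considerably.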

Hierarchical Metrics

Precision

metrics.precision(y_true: ndarray, y_pred: ndarray, average: str = 'micro')

Compute hierarchical precision score.

Parameters

y_true (np.array of shape (n_samples, n_levels)) – Ground truth (correct) labels.
y_pred (np.array of shape (n_samples, n_levels)) – Predicted labels, as returned by a classifier.
average ({"micro", "macro"}, str, default="micro") –
This parameter determines the type of averaging performed during the computation:
- micro: The precision is computed by summing over all individual instances, \(\displaystyle{hP = \frac{\sum_{i=1}^{n}| \alpha_i \cap \beta_i |}{\sum_{i=1}^{n}| \alpha_i |}}\), where \(\alpha_i\) is the set consisting of the most specific classes predicted for test example \(i\) and all their ancestor classes, while \(\beta_i\) is the set containing the true most specific classes of test example \(i\) and all their ancestors, with summations computed over all test examples.
- macro: The precision is computed for each instance and then averaged, \(\displaystyle{hP = \frac{\sum_{i=1}^{n}hP_{i}}{n}}\), where \(\displaystyle{hP_{i} = \frac{| \alpha_i \cap \beta_i |}{| \alpha_i |}}\) is the precision of test example \(i\), \(\alpha_i\) is the set consisting of the most specific classes predicted for test example \(i\) and all their ancestor classes, and \(\beta_i\) is the set containing the true most specific classes of test example \(i\) and all their ancestors.

Returns

precision – What proportion of positive identifications was actually correct?

Return type

float
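
The difference between the two averaging modes can be checked by hand with a minimal sketch. It works directly on hypothetical sets \(\alpha_i\) and \(\beta_i\) and does not call the library; the variable names are made up for the example.

```python
# Hypothetical ancestor-augmented label sets for three test examples.
# alpha[i]: predicted classes plus ancestors; beta[i]: true classes plus ancestors.
alpha = [
    {"animal", "mammal", "dog"},
    {"animal", "mammal", "cat"},
    {"animal", "reptile"},
]
beta = [
    {"animal", "mammal", "dog"},
    {"animal", "mammal", "dog"},
    {"animal", "reptile", "snake"},
]

# |alpha_i ∩ beta_i| for every example.
overlap = [len(a & b) for a, b in zip(alpha, beta)]

# micro: sum numerators and denominators over all examples, then divide.
micro_precision = sum(overlap) / sum(len(a) for a in alpha)

# macro: compute the precision of each example, then average.
macro_precision = sum(o / len(a) for o, a in zip(overlap, alpha)) / len(alpha)

print(round(micro_precision, 3), round(macro_precision, 3))  # 0.875 0.889
```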

Recall

metrics.recall(y_true: ndarray, y_pred: ndarray, average: str = 'micro')

Compute hierarchical recall score.

Parameters

y_true (np.array of shape (n_samples, n_levels)) – Ground truth (correct) labels.
y_pred (np.array of shape (n_samples, n_levels)) – Predicted labels, as returned by a classifier.
average ({"micro", "macro"}, str, default="micro") –
This parameter determines the type of averaging performed during the computation:
- micro: The recall is computed by summing over all individual instances, \(\displaystyle{hR = \frac{\sum_{i=1}^{n}|\alpha_i \cap \beta_i|}{\sum_{i=1}^{n}|\beta_i|}}\), where \(\alpha_i\) is the set consisting of the most specific classes predicted for test example \(i\) and all their ancestor classes, while \(\beta_i\) is the set containing the true most specific classes of test example \(i\) and all their ancestors, with summations computed over all test examples.
- macro: The recall is computed for each instance and then averaged, \(\displaystyle{hR = \frac{\sum_{i=1}^{n}hR_{i}}{n}}\), where \(\displaystyle{hR_{i} = \frac{|\alpha_i \cap \beta_i|}{|\beta_i|}}\) is the recall of test example \(i\), \(\alpha_i\) is the set consisting of the most specific classes predicted for test example \(i\) and all their ancestor classes, and \(\beta_i\) is the set containing the true most specific classes of test example \(i\) and all their ancestors.

Returns

recall – What proportion of actual positives was identified correctly?

Return type

float
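
As a minimal sketch, the two recall formulas can be evaluated on hypothetical sets \(\alpha_i\) and \(\beta_i\) of different sizes, which is where micro and macro averaging diverge. Nothing here calls the library; all names are illustrative.

```python
# Hypothetical ancestor-augmented label sets for three test examples.
alpha = [
    {"animal", "mammal", "dog"},
    {"animal", "mammal", "dog"},
    {"animal", "mammal"},
]
beta = [
    {"animal", "mammal", "dog"},
    {"animal", "mammal", "cat"},
    {"animal", "reptile"},
]

# |alpha_i ∩ beta_i| for every example.
overlap = [len(a & b) for a, b in zip(alpha, beta)]

# micro: divide the summed overlaps by the summed sizes of the true sets.
micro_recall = sum(overlap) / sum(len(b) for b in beta)

# macro: compute the recall of each example, then average.
macro_recall = sum(o / len(b) for o, b in zip(overlap, beta)) / len(beta)

print(round(micro_recall, 3), round(macro_recall, 3))  # 0.75 0.722
```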

F-score

metrics.f1(y_true: ndarray, y_pred: ndarray, average: str = 'micro')

Compute hierarchical f-score.

Parameters

y_true (np.array of shape (n_samples, n_levels)) – Ground truth (correct) labels.
y_pred (np.array of shape (n_samples, n_levels)) – Predicted labels, as returned by a classifier.
average ({"micro", "macro"}, str, default="micro") –
This parameter determines the type of averaging performed during the computation:
- micro: The f-score is computed from the micro-averaged precision and recall, \(\displaystyle{hF = \frac{2 \times hP \times hR}{hP + hR}}\), where \(hP\) is the micro-averaged hierarchical precision and \(hR\) is the micro-averaged hierarchical recall.
- macro: The f-score is computed for each instance and then averaged, \(\displaystyle{hF = \frac{\sum_{i=1}^{n}hF_{i}}{n}}\), where \(hF_{i}\) is the f-score of test example \(i\), computed from the hierarchical precision and recall of that example alone. Both depend on \(\alpha_i\), the set consisting of the most specific classes predicted for test example \(i\) and all their ancestor classes, and \(\beta_i\), the set containing the true most specific classes of test example \(i\) and all their ancestors.

Returns

f1 – Harmonic mean of the hierarchical precision and recall.

Return type

float
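
The two f-score variants can be compared with a minimal sketch on hypothetical sets \(\alpha_i\) and \(\beta_i\). It assumes that the per-example score \(hF_{i}\) is the harmonic mean of that example's own precision and recall; the helper f_score and the data are illustrative and not part of the library.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Hypothetical ancestor-augmented label sets for three test examples.
alpha = [
    {"animal", "mammal", "dog"},
    {"animal", "mammal", "cat"},
    {"animal", "reptile"},
]
beta = [
    {"animal", "mammal", "dog"},
    {"animal", "mammal", "dog"},
    {"animal", "reptile", "snake"},
]
overlap = [len(a & b) for a, b in zip(alpha, beta)]

# micro: combine the micro-averaged precision and recall.
micro_f = f_score(
    sum(overlap) / sum(len(a) for a in alpha),
    sum(overlap) / sum(len(b) for b in beta),
)

# macro: score each example on its own, then average the scores.
per_example = [
    f_score(o / len(a), o / len(b)) for o, a, b in zip(overlap, alpha, beta)
]
macro_f = sum(per_example) / len(per_example)

print(round(micro_f, 3), round(macro_f, 3))  # 0.824 0.822
```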

Datasets

Platypus diseases dataset

datasets.load_platypus(test_size=0.3, random_state=42)

Load platypus diseases dataset.

Parameters

test_size (float, default=0.3) – The proportion of the dataset to include in the test split.
random_state (int or None, default=42) – Controls the randomness of the dataset. Pass an int for reproducible output across multiple function calls.

Returns

List containing the train-test split of the inputs, in the order X_train, X_test, Y_train, Y_test.

Return type

list

Raises

RuntimeError – If the dataset cannot be accessed or processed.

Examples

>>> from hiclass.datasets import load_platypus
>>> X_train, X_test, Y_train, Y_test = load_platypus()
>>> X_train[:3]
     fever  diarrhea  stomach pain  skin rash  cough  sniffles  short breath  headache  size
220   37.8         0             3          5      1         1             0         2  27.6
539   37.2         0             6          1      1         1             0         3  28.4
326   39.9         0             2          5      1         1             1         2  30.7
>>> X_train.shape, X_test.shape, Y_train.shape, Y_test.shape
((572, 9), (246, 9), (572,), (246,))

Hierarchical text classification dataset

datasets.load_hierarchical_text_classification(test_size=0.3, random_state=42)

Load hierarchical text classification dataset.

Parameters

test_size (float, default=0.3) – The proportion of the dataset to include in the test split.
random_state (int or None, default=42) – Controls the randomness of the dataset. Pass an int for reproducible output across multiple function calls.

Returns

List containing the train-test split of the inputs, in the order X_train, X_test, Y_train, Y_test.

Return type

list

Raises

RuntimeError – If the dataset cannot be accessed or processed.

Examples

>>> from hiclass.datasets import load_hierarchical_text_classification
>>> X_train, X_test, Y_train, Y_test = load_hierarchical_text_classification()
>>> X_train[:3]
38015                                Nature's Way Selenium
2281         Music In Motion Developmental Mobile W Remote
36629    Twinings Ceylon Orange Pekoe Tea, Tea Bags, 20...
Name: Title, dtype: object
>>> X_train.shape, X_test.shape, Y_train.shape, Y_test.shape
((28000,), (12000,), (28000, 3), (12000, 3))