Parallel Training
Larger datasets require more time for training.
By default, HiClass trains its models on a single core,
but the local classifiers can be trained in parallel with the Ray library [1].
If Ray is not installed, HiClass falls back to Joblib for parallelism.
This example trains a hierarchical classifier in parallel by
setting the n_jobs parameter to the number of available cores. Training
is performed on a mock dataset from Kaggle [2].
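Before training, it can help to check which backend is importable and how many cores will be requested. The snippet below is a minimal sketch using only the standard library. The helper names `available_backend` and `all_cores` are hypothetical, and the check only mirrors the fallback described above; it is not how HiClass itself selects a backend.

```python
from importlib.util import find_spec
from os import cpu_count


def available_backend() -> str:
    """Return "ray" if the Ray package is importable, otherwise "joblib".

    Hypothetical helper: it only mirrors the documented fallback and is
    not HiClass's internal selection logic.
    """
    return "ray" if find_spec("ray") is not None else "joblib"


def all_cores() -> int:
    """Return the core count, falling back to 1 when it is undetermined.

    os.cpu_count() may return None on some platforms.
    """
    return cpu_count() or 1


backend = available_backend()
n_jobs = all_cores()
print(f"backend={backend}, n_jobs={n_jobs}")
```

Guarding against a `None` core count keeps the value passed as n_jobs a positive integer on platforms where the number of cores cannot be detected.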
INFO:hiclass.datasets:Downloading hierarchical text classification dataset..
INFO:LCPPN:Creating digraph from 28000 2D labels
INFO:LCPPN:Detected 6 roots
INFO:LCPPN:Initializing local classifiers
INFO:LCPPN:Fitting local classifiers
INFO:LCPPN:Cleaning up variables that can take a lot of disk space
import sys
from os import cpu_count
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from hiclass import LocalClassifierPerParentNode
from hiclass.datasets import load_hierarchical_text_classification
# Load train and test splits
X_train, X_test, Y_train, Y_test = load_hierarchical_text_classification()
# We will use logistic regression classifiers for every parent node
lr = LogisticRegression(max_iter=1000)
pipeline = Pipeline(
    [
        ("count", CountVectorizer()),
        ("tfidf", TfidfTransformer()),
        (
            "lcppn",
            LocalClassifierPerParentNode(local_classifier=lr, n_jobs=cpu_count()),
        ),
    ]
)
# Fixes bug AttributeError: '_LoggingTee' object has no attribute 'fileno'
# This only happens when building the documentation
# Hence, you don't actually need it for your code to work
sys.stdout.fileno = lambda: False
# Now, let's train the local classifier per parent node
pipeline.fit(X_train, Y_train)
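Local classifiers lend themselves to parallel training because each parent node's classifier depends only on the samples that fall under that node. The following is a toy sketch of that idea using only the standard library. It is not HiClass's implementation: the labels are made up, `fit_local` is a hypothetical stand-in for fitting one classifier, and a thread pool stands in for Ray or Joblib.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Toy two-level labels (parent, child); hypothetical data, not the Kaggle set.
labels = [
    ("Animal", "Cat"),
    ("Animal", "Dog"),
    ("Plant", "Tree"),
    ("Animal", "Cat"),
    ("Plant", "Flower"),
]

# Group the child labels observed under each parent node.
children_by_parent = defaultdict(list)
for parent, child in labels:
    children_by_parent[parent].append(child)


def fit_local(item):
    """Stand-in for fitting one local classifier.

    Returns the parent node and the sorted set of child classes it would
    have to distinguish. A real local classifier would fit a model here.
    """
    parent, children = item
    return parent, sorted(set(children))


# Each parent node is an independent task, so the tasks can run concurrently.
with ThreadPoolExecutor(max_workers=2) as pool:
    fitted = dict(pool.map(fit_local, children_by_parent.items()))

print(fitted)
```

Because no task reads another task's results, the work can be split across workers without coordination, which is what makes the n_jobs setting effective for this kind of model.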
Total running time of the script: (0 minutes 30.659 seconds)