Parallel Training

Training on larger datasets takes longer. By default, HiClass trains its models on a single core, but the local classifiers can be trained in parallel with the Ray library [1]. If Ray is not installed, HiClass falls back to Joblib for parallelism. This example shows how to train a hierarchical classifier in parallel by setting the parameter n_jobs to the number of available cores. Training is performed on a mock dataset from Kaggle [2].

INFO:hiclass.datasets:Downloading hierarchical text classification dataset..
INFO:LCPPN:Creating digraph from 28000 2D labels
INFO:LCPPN:Detected 6 roots
INFO:LCPPN:Initializing local classifiers
INFO:LCPPN:Fitting local classifiers
INFO:LCPPN:Cleaning up variables that can take a lot of disk space
Pipeline(steps=[('count', CountVectorizer()), ('tfidf', TfidfTransformer()),
                ('lcppn',
                 LocalClassifierPerParentNode(local_classifier=LogisticRegression(max_iter=1000),
                                              n_jobs=2))])



import sys
from os import cpu_count
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

from hiclass import LocalClassifierPerParentNode
from hiclass.datasets import load_hierarchical_text_classification

# Load train and test splits
X_train, X_test, Y_train, Y_test = load_hierarchical_text_classification()

# We will use logistic regression classifiers for every parent node
lr = LogisticRegression(max_iter=1000)

pipeline = Pipeline(
    [
        ("count", CountVectorizer()),
        ("tfidf", TfidfTransformer()),
        (
            "lcppn",
            LocalClassifierPerParentNode(local_classifier=lr, n_jobs=cpu_count()),
        ),
    ]
)

# Workaround for AttributeError: '_LoggingTee' object has no attribute 'fileno'
# This error only occurs when building the documentation,
# so you do not need the next line in your own code
sys.stdout.fileno = lambda: False

# Now, let's train the local classifier per parent node
pipeline.fit(X_train, Y_train)
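
Before choosing a value for n_jobs, it can help to check which backend is likely to be used and how many cores are actually available. The sketch below is a minimal illustration using only the standard library. The helper names ray_is_available and resolve_n_jobs are hypothetical and not part of HiClass, and the import check is an assumption about how Ray's presence could be detected; HiClass's internal detection may differ. Note that os.cpu_count() can return None on some platforms, so the helper falls back to a single worker in that case.

```python
from importlib.util import find_spec
from os import cpu_count


def ray_is_available() -> bool:
    """Return True if the `ray` package can be found in this environment."""
    # Assumption: the presence of the package is a reasonable proxy for
    # whether Ray would be used; HiClass's own check may differ.
    return find_spec("ray") is not None


def resolve_n_jobs(requested=None) -> int:
    """Pick a positive worker count no larger than the detected core count."""
    available = cpu_count() or 1  # cpu_count() may return None
    if requested is None:
        return available
    # Clamp the request to the range [1, available]
    return max(1, min(requested, available))


backend = "Ray" if ray_is_available() else "Joblib"
print(f"Expected backend: {backend}, n_jobs={resolve_n_jobs()}")
```

The resolved value could then be passed as n_jobs=resolve_n_jobs() when constructing LocalClassifierPerParentNode. Capping the request at the detected core count is a design choice of this sketch, not a HiClass requirement.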

Total running time of the script: (1 minute 18.958 seconds)
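
To see whether parallel training pays off on your own machine, you can time the fit step yourself. The following is a minimal sketch of a timing helper built on time.perf_counter. The name timed and the timings dictionary are hypothetical, and the summation inside the with block is only a stand-in for a call such as pipeline.fit(X_train, Y_train).

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock duration of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with timed("demo", timings):
    # Stand-in workload; replace with pipeline.fit(X_train, Y_train)
    sum(i * i for i in range(100_000))

print(f"demo took {timings['demo']:.3f} s")
```

Running the fit once with n_jobs=1 and once with a larger value, each inside its own timed block, gives a rough comparison. Actual speedups depend on the dataset, the number of local classifiers, and the overhead of starting the parallel backend.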
