Parallel Training

Training time grows with the size of the dataset. By default, HiClass trains its models on a single core, but the local classifiers can be trained in parallel with the Ray library [1]. If Ray is not installed, HiClass falls back to Joblib for parallelism. This example trains a hierarchical classifier in parallel by setting the parameter n_jobs to the number of available cores. Training is performed on a mock dataset from Kaggle [2].

1: https://www.ray.io/
2: https://www.kaggle.com/datasets/kashnitsky/hierarchical-text-classification

Out:

2024-05-10 16:35:11,985 WARNING services.py:2002 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67104768 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=1.74gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.

Pipeline(steps=[('count', CountVectorizer()), ('tfidf', TfidfTransformer()),
                ('lcppn',
                 LocalClassifierPerParentNode(local_classifier=LogisticRegression(max_iter=1000),
                                              n_jobs=2))])


import sys
from os import cpu_count

import pandas as pd
import requests
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

from hiclass import LocalClassifierPerParentNode


# Download training data
url = "https://zenodo.org/record/6657410/files/train_40k.csv?download=1"
path = "train_40k.csv"
response = requests.get(url)
response.raise_for_status()  # stop early if the download failed
with open(path, "wb") as file:
    file.write(response.content)

# Load training data into pandas dataframe
training_data = pd.read_csv(path).fillna(" ")

# We will use logistic regression classifiers for every parent node
lr = LogisticRegression(max_iter=1000)

pipeline = Pipeline(
    [
        ("count", CountVectorizer()),
        ("tfidf", TfidfTransformer()),
        (
            "lcppn",
            LocalClassifierPerParentNode(local_classifier=lr, n_jobs=cpu_count()),
        ),
    ]
)

# Select training data
X_train = training_data["Title"]
Y_train = training_data[["Cat1", "Cat2", "Cat3"]]

# Workaround for AttributeError: '_LoggingTee' object has no attribute 'fileno'
# The error only occurs when building the documentation,
# so you do not need this line in your own code
sys.stdout.fileno = lambda: False

# Now, let's train the local classifier per parent node
pipeline.fit(X_train, Y_train)
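
The call to cpu_count() in the script above returns None when the number of cores cannot be determined. A minimal sketch of a defensive alternative follows; resolve_n_jobs is a hypothetical helper, not part of HiClass, and the clamping policy is an assumption rather than documented library behavior.

```python
import os


def resolve_n_jobs(requested=None):
    """Return a positive worker count (hypothetical helper, not a HiClass API)."""
    # os.cpu_count() may return None, so fall back to a single worker
    available = os.cpu_count() or 1
    if requested is None:
        return available
    # Clamp the request to the range [1, available]
    return max(1, min(requested, available))


n_jobs = resolve_n_jobs()
print(f"Training with {n_jobs} worker(s)")
```

The result could then be passed as n_jobs=resolve_n_jobs() when constructing LocalClassifierPerParentNode.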

Total running time of the script: (1 minute 20.405 seconds)
