
I'm currently working on a Parallel and Distributed Computing project where I'm comparing the performance of both XGBoost and CatBoost when trained on CPU vs GPU. The goal is to demonstrate how GPU acceleration can improve training time, especially when tuned with appropriate parameters.

For XGBoost, despite trying multiple parameter combinations and using tree_method='gpu_hist' with predictor='gpu_predictor', I'm seeing minimal to no speedup over the CPU version. In some cases, the CPU even trains faster than the GPU.

For CatBoost, the difference is even more surprising. On the same dataset (~41,000 rows), the CPU version trains in about 5 seconds, while the GPU version takes over 105 seconds, even with GPU-optimized parameters like reduced iterations, lower max_bin, and higher learning_rate.

I suspect that the dataset size might be too small to fully utilize GPU parallelism, but I'm required to work with this dataset for the project. I'm trying to understand:

Is this lack of GPU speedup expected at this scale for XGBoost and CatBoost?

Are there specific parameter changes or data preparation steps that could better leverage GPU acceleration?

Is there any overhead in CatBoost GPU training (e.g., preprocessing or data transfer) that causes such a stark increase in training time for smaller datasets?

Any insights or recommendations would be greatly appreciated.
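For context, the training methods below use a `timer_decorator` helper and availability flags like `CATBOOST_AVAILABLE`, `CATBOOST_GPU_AVAILABLE`, and `XGB_GPU_AVAILABLE`. A minimal sketch of how these could be defined (simplified; the exact GPU-detection logic in my project may differ):

```
import time
from functools import wraps

import numpy as np
import xgboost as xgb

try:
    from catboost import CatBoostClassifier
    from catboost.utils import get_gpu_device_count
    CATBOOST_AVAILABLE = True
    CATBOOST_GPU_AVAILABLE = get_gpu_device_count() > 0
except ImportError:
    CATBOOST_AVAILABLE = False
    CATBOOST_GPU_AVAILABLE = False


def timer_decorator(func):
    """Print the wall-clock time taken by the wrapped training method."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        print(f"{func.__name__} took {time.perf_counter() - start:.2f} s")
        return result
    return wrapper


def _xgb_gpu_available():
    """Probe for a usable GPU by fitting a tiny model with tree_method='gpu_hist'."""
    try:
        probe = xgb.XGBClassifier(tree_method='gpu_hist', n_estimators=1)
        probe.fit(np.random.rand(32, 4), np.random.randint(0, 2, 32))
        return True
    except Exception:
        return False


XGB_GPU_AVAILABLE = _xgb_gpu_available()
```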

```
@timer_decorator
def train_xgboost_cpu(self, X_train, y_train):
    """
    Train XGBoost Classifier on CPU with parameters that perform less efficiently
    """
    print("Training XGBoost Classifier on CPU...")
    xgb_clf = xgb.XGBClassifier(
        n_estimators=1500,
        max_depth=15,
        learning_rate=0.01,
        subsample=0.9,
        colsample_bytree=0.9,
        objective='binary:logistic',
        tree_method='hist',
        n_jobs=self.n_jobs,
        random_state=42,
        max_bin=256,
        grow_policy='depthwise',
        verbosity=1,
        use_label_encoder=False
    )
    print(f"Training XGBoost CPU on data shape: {X_train.shape}")
    xgb_clf.fit(X_train, y_train)

    return xgb_clf


@timer_decorator
def train_xgboost_gpu(self, X_train, y_train):
    """
    Train XGBoost Classifier with GPU acceleration optimized for performance
    """
    if not XGB_GPU_AVAILABLE:
        print("XGBoost GPU support not available, falling back to CPU")
        return self.train_xgboost_cpu(X_train, y_train)

    # Initialize and train the model with GPU-optimized parameters
    print("Training XGBoost Classifier on GPU...")
    try:
        xgb_clf = xgb.XGBClassifier(
            n_estimators=1500,         
            max_depth=15,                
            learning_rate=0.01,          
            subsample=0.9,               
            colsample_bytree=0.9,        
            objective='binary:logistic',
            tree_method='gpu_hist',      
            predictor='gpu_predictor',   
            grow_policy='depthwise',     
            gpu_id=0,
            random_state=42,
            max_bin=256,                
            verbosity=1,
            use_label_encoder=False
        )
        xgb_clf.fit(X_train, y_train)
        return xgb_clf
    except Exception as e:
        print(f"XGBoost GPU training failed: {e}")
        print("Falling back to CPU training")
        return self.train_xgboost_cpu(X_train, y_train)

@timer_decorator
def train_catboost_cpu(self, X_train, y_train):
    """
    Train CatBoost Classifier on CPU
    """
    if not CATBOOST_AVAILABLE:
        print("CatBoost is not available")
        return None

    print("Training CatBoost Classifier on CPU...")
    try:
        catboost_clf = CatBoostClassifier(
            iterations=500,
            depth=6,
            learning_rate=0.05,
            loss_function='Logloss',
            eval_metric='F1',
            task_type='CPU',
            thread_count=self.n_jobs,  # Use all available CPU threads
            random_seed=42,
            verbose=False,
            bootstrap_type='Bernoulli',
            subsample=0.8,
            grow_policy='SymmetricTree',
            max_bin=254,
            min_data_in_leaf=1,
            rsm=0.8  # Feature sampling for CPU
        )
        catboost_clf.fit(X_train, y_train)
        return catboost_clf
    except Exception as e:
        print(f"CatBoost CPU training failed: {e}")
        return None

@timer_decorator
def train_catboost_gpu(self, X_train, y_train):
    """
    Train CatBoost Classifier with GPU acceleration (Optimized)
    """
    if not CATBOOST_GPU_AVAILABLE:
        print("CatBoost GPU support not available, falling back to CPU")
        return self.train_catboost_cpu(X_train, y_train)

    print("Training CatBoost Classifier on GPU...")
    try:
        # Convert data to float32 for GPU efficiency
        if isinstance(X_train, np.ndarray):
            X_train = X_train.astype(np.float32)

        # Optimized GPU parameters
        catboost_clf = CatBoostClassifier(
            iterations=300,            # Reduced from 500
            depth=8,                  # Increased depth for faster convergence
            learning_rate=0.15,       # Higher learning rate
            grow_policy='Depthwise',  # GPU-optimized growth policy
            loss_function='Logloss',
            eval_metric='F1',
            task_type='GPU',
            devices='0',              # Use first GPU
            thread_count=self.n_jobs, # Leverage CPU threads for preprocessing
            random_seed=42,
            verbose=50,
            bootstrap_type='Bernoulli',
            subsample=0.8,
            max_bin=64,              # Reduced for faster histogramming
            min_data_in_leaf=3,      # Reduced for faster splits
            rsm=0.7,                 # Feature sampling (use 70% of features per split)
            early_stopping_rounds=25,
            gpu_ram_part=0.8,        # Use more GPU memory
        )

        catboost_clf.fit(
            X_train, 
            y_train,
            verbose=False
        )
        return catboost_clf
    except Exception as e:
        print(f"CatBoost GPU training failed: {e}")
        print("Falling back to CPU training")
        return self.train_catboost_cpu(X_train, y_train)
```

Key Details:

- Dataset size: ~41,000 rows (small/medium-sized).
- Goal: compare CPU vs GPU training performance.
- Issue: despite trying many parameter combinations, the GPU version does not show a significant speedup over the CPU version.
- Observation: I suspect the dataset size might be too small to fully utilize the GPU, but I have to work with this dataset regardless.


1 Answer

In my experience GPU support can significantly boost XGBoost performance. But: the data must, of course, be big enough that transferring it to GPU memory is worthwhile, and memory management must be set up so that the GPU is continuously fed.

You report ~40,000 rows, which can be a lot or next to nothing, depending on the number of features.

To stream data to and from the GPU effectively, you should use xgboost's dedicated data containers (QuantileDMatrix, for example) and, most notably, have an xgboost build with RMM support (RAPIDS Memory Manager). Otherwise, in my experience, the data-transfer cost will choke any GPU performance gain.
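
For illustration, a rough sketch of that setup with the native training API (assuming xgboost >= 2.0 for `QuantileDMatrix` plus `device='cuda'`; the data and parameter values are placeholders, not tuned for your problem):

```
import numpy as np
import xgboost as xgb

# Placeholder data roughly matching the reported size (~41,000 rows).
X = np.random.rand(41_000, 50).astype(np.float32)
y = np.random.randint(0, 2, size=41_000)

# Optional: route XGBoost allocations through RMM. This only helps if
# xgboost was built with RMM support and the rmm package is installed.
# import rmm
# rmm.reinitialize(pool_allocator=True)
# xgb.set_config(use_rmm=True)

# QuantileDMatrix pre-bins the features once and avoids keeping an extra
# full copy of the raw data, reducing memory use and transfer overhead.
dtrain = xgb.QuantileDMatrix(X, label=y, max_bin=256)

params = {
    'objective': 'binary:logistic',
    'tree_method': 'hist',
    'device': 'cuda',   # older xgboost (< 2.0): use tree_method='gpu_hist' instead
    'max_depth': 8,
    'learning_rate': 0.1,
}
booster = xgb.train(params, dtrain, num_boost_round=300)
```

With only ~41,000 rows you may still see a modest gain at best, but this at least removes most of the avoidable transfer overhead.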
