LLM Optimization

Post-Training Pruning of LLMs

Experiments on pruning LLaMA models using ShortGPT and angular distance methods, with fine-tuning recovery strategies.

llm pruning model-compression llama fine-tuning

Overview

Note: This research was conducted in early 2025 and documented approximately one year later. The repository and notebooks are not fully organized, and some experimental results were lost over time. What remains represents our key findings and published models.

This research explores post-training pruning of Large Language Models, specifically the LLaMA 3.1 and 3.2 model families. We investigated how to compress LLMs efficiently by removing entire transformer blocks while preserving performance through strategic fine-tuning.

Collaborators: Rahat Kabir, Sharukh Khan

Pruning Methods

We explored multiple approaches to identify and remove redundant layers:

  • Magnitude Pruning: Initial experiments with weight magnitude-based pruning before moving to block-level methods
  • ShortGPT / Block Importance: Removes the transformer blocks with the lowest Block Influence scores, i.e. the blocks that change their input hidden states the least
  • Angular Distance: Scores each block by the angular distance between its input and output hidden states (a minimal scoring sketch follows this list)
  • Attention Norm Analysis: Identifies redundant layers through attention weight norms
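The scoring idea behind these block-level methods can be illustrated in a few lines. The snippet below is a minimal sketch, not our exact research code: it scores every block by the angular distance between its input and output hidden states on a tiny calibration set, so the lowest-scoring blocks become pruning candidates. The model name and calibration text are placeholders.

```python
# Minimal sketch: rank transformer blocks by how little they change the hidden states.
# A ShortGPT-style Block Influence score would use 1 - cosine similarity instead of
# the angular distance computed here. Model name and calibration text are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).eval()

calibration_texts = ["Large language models contain redundant layers."]  # placeholder data

num_layers = model.config.num_hidden_layers
scores = torch.zeros(num_layers)

with torch.no_grad():
    for text in calibration_texts:
        inputs = tokenizer(text, return_tensors="pt")
        # hidden_states has num_layers + 1 entries: embeddings, then one output per block
        hidden = model(**inputs, output_hidden_states=True).hidden_states
        for i in range(num_layers):
            h_in, h_out = hidden[i].float(), hidden[i + 1].float()
            cos = torch.nn.functional.cosine_similarity(h_in, h_out, dim=-1)
            # angular distance in [0, 1]; small values mean the block barely changes its input
            scores[i] += (torch.arccos(cos.clamp(-1.0, 1.0)) / torch.pi).mean()

scores /= len(calibration_texts)
print("Pruning candidates (least important first):", torch.argsort(scores).tolist())
```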

Key Results

Performance Across Benchmarks

Comparison of the base LLaMA 3.2-3B model vs. pruned variants (4L, 5L, 6L, 7L, 12L removed) across standard benchmarks.

Benchmark Performance

Extended Results

Perplexity Recovery

A critical finding: fine-tuning dramatically recovers the perplexity lost to pruning. The fine-tuned model with 5 layers removed shows significantly lower perplexity than the pruned variants that were not fine-tuned.

Perplexity Analysis
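For reference, the perplexity reported here is the exponential of the mean token-level cross-entropy on held-out text. Below is a minimal sketch of how it can be measured with Transformers; the repository id and evaluation text are placeholders.

```python
# Sketch: perplexity = exp(mean cross-entropy) of a causal LM on held-out text.
# The repository id and the evaluation text are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-username/llama-3.2-3b-pruned-5L"  # placeholder HF repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).eval()

text = "Held-out evaluation text goes here."  # placeholder
enc = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # passing labels makes the model return the mean token-level cross-entropy loss
    loss = model(**enc, labels=enc["input_ids"]).loss

print("perplexity:", torch.exp(loss).item())
```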

MMLU Subfield Analysis

Detailed performance breakdown across MMLU subfields including Medical Genetics, Professional Law, Philosophy, College Mathematics, High School Physics, and Machine Learning.

MMLU Subfields

Published Models

All pruned models are available on HuggingFace:

HuggingFace Models

  • LLaMA 3.2-3B pruned variants (4L, 5L, 6L, 7L, 12L removed)
  • LLaMA 3.1-8B pruned variants selected using the angular block-influence metric
  • Fine-tuned versions of pruned models

Skills & Knowledge Gained

Through this research, I developed hands-on experience in:

Model Compression & Pruning

  • Implementing block/layer removal using importance metrics (ShortGPT block influence, angular distance, attention norms); see the removal sketch after this list
  • Understanding transformer architecture internals and layer redundancy
  • Analyzing trade-offs between model size and performance
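A minimal sketch of what block removal looks like on a Hugging Face LLaMA checkpoint, assuming the blocks to drop have already been chosen by one of the importance metrics above. The layer indices and paths are illustrative, not the ones used in our experiments.

```python
# Sketch: drop selected decoder blocks from a LLaMA checkpoint and save the result.
# The layer indices and paths below are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B", torch_dtype=torch.bfloat16
)

layers_to_remove = {20, 21, 22, 23, 24}  # e.g. five contiguous low-importance blocks

# LlamaForCausalLM keeps its decoder blocks in model.model.layers (an nn.ModuleList)
kept = [blk for i, blk in enumerate(model.model.layers) if i not in layers_to_remove]
model.model.layers = torch.nn.ModuleList(kept)
model.config.num_hidden_layers = len(kept)

# recent transformers versions track a per-layer index for the KV cache; re-index to be safe
for new_idx, blk in enumerate(model.model.layers):
    if hasattr(blk.self_attn, "layer_idx"):
        blk.self_attn.layer_idx = new_idx

model.save_pretrained("llama-3.2-3b-pruned-5L")  # placeholder output path
```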

Fine-Tuning & Recovery

  • Applying LoRA (Low-Rank Adaptation) to recover pruned model performance; see the configuration sketch after this list
  • Working with PEFT library for efficient fine-tuning
  • Iterating on hyperparameters to optimize recovery
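A minimal sketch of attaching LoRA adapters with PEFT to a pruned checkpoint before recovery training. The rank, alpha, and target modules shown are illustrative defaults, not the tuned hyperparameters from the experiments.

```python
# Sketch: wrap a pruned checkpoint with LoRA adapters for recovery fine-tuning.
# Hyperparameters and paths are illustrative placeholders.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("llama-3.2-3b-pruned-5L")  # placeholder path

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # LLaMA attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the low-rank adapters are trainable

# ...train with a standard Trainer on a recovery corpus, then merge the adapters
# back into the base weights before evaluation:
# model = model.merge_and_unload()
```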

LLM Evaluation

  • Benchmarking models using lm-evaluation-harness; see the usage sketch after this list
  • Interpreting results across diverse tasks: HellaSwag, MMLU, ARC, BoolQ, LAMBADA
  • Understanding perplexity as a model quality metric
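A minimal sketch of running these benchmarks through the lm-evaluation-harness Python API (v0.4-style); the repository id is a placeholder and the task list mirrors the benchmarks above.

```python
# Sketch: evaluate a pruned checkpoint with lm-evaluation-harness (v0.4-style Python API).
# The repository id is a placeholder.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=your-username/llama-3.2-3b-pruned-5L,dtype=bfloat16",
    tasks=["hellaswag", "mmlu", "arc_challenge", "boolq", "lambada_openai"],
    batch_size=8,
)

for task, metrics in results["results"].items():
    print(task, metrics)
```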

Model Deployment

  • Publishing trained models to the HuggingFace Hub; see the upload sketch after this list
  • Documenting model cards and usage instructions
  • Managing model versions and variants
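A minimal sketch of pushing a finished checkpoint to the HuggingFace Hub with Transformers; the local path and repository name are placeholders, and prior authentication (e.g. via huggingface-cli login) is assumed.

```python
# Sketch: upload a pruned / fine-tuned checkpoint and its tokenizer to the Hub.
# The local path and repository name are placeholders; requires prior authentication.
from transformers import AutoModelForCausalLM, AutoTokenizer

local_dir = "llama-3.2-3b-pruned-5L-finetuned"
repo_id = "your-username/llama-3.2-3b-pruned-5L-finetuned"

model = AutoModelForCausalLM.from_pretrained(local_dir)
tokenizer = AutoTokenizer.from_pretrained(local_dir)

model.push_to_hub(repo_id)
tokenizer.push_to_hub(repo_id)
```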

Technologies

PyTorch, Hugging Face Transformers, PEFT (LoRA), lm-evaluation-harness, Lightning AI, Google Colab, Kaggle


Repository | Models

Research conducted: Early 2025 | Documented: January 2026