Published on Oct 7, 2023 by Vitor Sousa

LoRA and DoRA Implementation from Scratch

Concept art showing matrix overlays representing LoRA and DoRA adaptations

Check the repo: LoRA and DoRA Implementation.

Goal

Implement Low-Rank Adaptation (LoRA) and Weight-Decomposed Low-Rank Adaptation (DoRA) layers from scratch in PyTorch, applied to Multi-Layer Perceptron models. This project helps understand the internals of parameter-efficient fine-tuning (PEFT) techniques that have become essential for adapting large language models.

Why Parameter-Efficient Fine-Tuning?

Fine-tuning large language models with billions of parameters is computationally expensive and requires significant memory. Parameter-efficient fine-tuning techniques address this by updating only a small subset of parameters while keeping the pretrained weights frozen.

Key benefits:

  • Reduced memory footprint during training
  • Faster training iterations
  • Lower risk of catastrophic forgetting
  • Ability to maintain multiple task-specific adaptations

LoRA: Low-Rank Adaptation

The Core Idea

LoRA introduces trainable low-rank matrices alongside frozen pretrained weights. Instead of updating the full weight matrix W, we learn a low-rank decomposition that captures the task-specific adaptations.

Mathematical Formulation

For a pretrained weight matrix W₀ ∈ ℝᵈˣᵏ, LoRA modifies the forward pass as:

h = W₀x + ΔWx = W₀x + BAx

Where:

  • B ∈ ℝᵈˣʳ and A ∈ ℝʳˣᵏ are the low-rank matrices
  • r << min(d, k) is the rank (typically 4, 8, or 16)
  • Only A and B are trained; W₀ remains frozen

The effective weight update becomes:

W' = W₀ + α · B · A

Where α is a scaling factor that controls the magnitude of the adaptation. In the implementation below it is applied as α / r, which keeps the update's scale roughly comparable when the rank is changed.
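
As a quick sanity check (a minimal sketch with arbitrary toy dimensions), the factored update B·A has the same shape as W₀, and applying it to x through the factors gives the same result as forming ΔW explicitly:

import torch

d, k, r = 6, 4, 2              # output dim, input dim, rank (toy values)
W0 = torch.randn(d, k)         # frozen pretrained weight
B = torch.randn(d, r)          # low-rank factor
A = torch.randn(r, k)          # low-rank factor
alpha = 1.0
x = torch.randn(k)

h_factored = W0 @ x + alpha * (B @ (A @ x))    # never materializes ΔW
h_explicit = (W0 + alpha * (B @ A)) @ x        # materializes ΔW = BA
print(torch.allclose(h_factored, h_explicit))  # True (up to float error)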

Implementation

import math

import torch
import torch.nn as nn

class LoRALayer(nn.Module):
    def __init__(
        self,
        in_features: int,
        out_features: int,
        rank: int = 4,
        alpha: float = 1.0
    ):
        super().__init__()
        self.rank = rank
        self.alpha = alpha

        # Low-rank matrices
        self.A = nn.Parameter(torch.zeros(rank, in_features))
        self.B = nn.Parameter(torch.zeros(out_features, rank))

        # Initialize A with Kaiming and B with zeros
        nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))
        nn.init.zeros_(self.B)

        # Scaling factor
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Compute low-rank adaptation: B @ A @ x
        return (x @ self.A.T @ self.B.T) * self.scaling
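
Because B is zero-initialized, the LoRA branch contributes nothing at the start of training, so a wrapped model initially behaves exactly like the frozen base model. A quick check (a minimal sketch):

layer = LoRALayer(in_features=16, out_features=8, rank=4, alpha=8.0)
x = torch.randn(2, 16)
print(layer(x).abs().max())    # tensor(0.) -- B starts at zero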

Applying LoRA to a Linear Layer

class LinearWithLoRA(nn.Module):
    def __init__(
        self,
        linear: nn.Linear,
        rank: int = 4,
        alpha: float = 1.0
    ):
        super().__init__()
        self.linear = linear
        self.lora = LoRALayer(
            in_features=linear.in_features,
            out_features=linear.out_features,
            rank=rank,
            alpha=alpha
        )
        # Freeze the original weights
        self.linear.weight.requires_grad = False
        if self.linear.bias is not None:
            self.linear.bias.requires_grad = False

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Original output + LoRA adaptation
        return self.linear(x) + self.lora(x)
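
A minimal usage sketch (the MLP sizes are illustrative): wrap each nn.Linear of a small model and confirm that only the LoRA matrices remain trainable.

model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 10),
)

# Swap each linear layer for a LoRA-wrapped version
for i, module in enumerate(list(model)):
    if isinstance(module, nn.Linear):
        model[i] = LinearWithLoRA(module, rank=8, alpha=16.0)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")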

Parameter Efficiency

For a weight matrix of size d × k:

  • Full fine-tuning: d × k parameters
  • LoRA with rank r: r × (d + k) parameters

Example: For d = k = 4096 and r = 8:

  • Full: 16,777,216 parameters
  • LoRA: 65,536 parameters (0.39% of original)
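
The numbers above follow directly from the two formulas; a quick computation:

d, k, r = 4096, 4096, 8
full = d * k                     # full fine-tuning updates every entry of W
lora = r * (d + k)               # B is d x r, A is r x k
print(full, lora, f"{100 * lora / full:.2f}%")   # 16777216 65536 0.39%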

DoRA: Weight-Decomposed Low-Rank Adaptation

The Innovation

DoRA extends LoRA by decomposing weights into magnitude and direction components, mimicking the nuanced adjustments observed in full fine-tuning.

Mathematical Formulation

DoRA decomposes the weight update as:

W' = m · (W₀ + BA) / ‖W₀ + BA‖_c

Where:

  • m is a learned magnitude vector (one scalar per output dimension)
  • W₀ + BA represents the directional component
  • ‖·‖_c denotes a vector-wise norm of the weight matrix; in the implementation below it is taken over each output neuron's weight vector (dim=1 of the (out_features, in_features) matrix), matching the shape of m
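
A minimal numerical sketch of the decomposition, following that per-output-row convention: splitting a weight matrix into per-row norms (magnitude) and unit-norm rows (direction) reconstructs it exactly.

W = torch.randn(8, 16)                      # (out_features, in_features)
m = W.norm(p=2, dim=1, keepdim=True)        # magnitude: one scalar per row
direction = W / m                           # unit-norm rows (direction)
print(torch.allclose(m * direction, W))     # True: W is exactly recovered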

Key Insight

The analysis in the DoRA paper shows that full fine-tuning adjusts magnitude and direction relatively independently, whereas standard LoRA tends to change the two in a tightly coupled, proportional way. Decomposing the weight into these components gives DoRA the freedom to make more expressive, fine-tuning-like adaptations.

Implementation

class DoRALayer(nn.Module):
    def __init__(
        self,
        in_features: int,
        out_features: int,
        rank: int = 4,
        alpha: float = 1.0
    ):
        super().__init__()
        self.rank = rank
        self.alpha = alpha
        self.scaling = alpha / rank

        # Low-rank matrices (same as LoRA)
        self.A = nn.Parameter(torch.zeros(rank, in_features))
        self.B = nn.Parameter(torch.zeros(out_features, rank))

        # Learnable magnitude vector
        self.m = nn.Parameter(torch.ones(out_features))

        # Initialize
        nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))
        nn.init.zeros_(self.B)

    def forward(
        self,
        x: torch.Tensor,
        base_weight: torch.Tensor
    ) -> torch.Tensor:
        # Compute the adapted weight
        lora_update = self.B @ self.A * self.scaling
        adapted_weight = base_weight + lora_update

        # Normalize each output neuron's weight vector to unit norm (direction)
        weight_norm = adapted_weight.norm(p=2, dim=1, keepdim=True)
        direction = adapted_weight / (weight_norm + 1e-8)

        # Apply magnitude scaling
        final_weight = self.m.unsqueeze(1) * direction

        return x @ final_weight.T

Applying DoRA to a Linear Layer

class LinearWithDoRA(nn.Module):
    def __init__(
        self,
        linear: nn.Linear,
        rank: int = 4,
        alpha: float = 1.0
    ):
        super().__init__()
        self.linear = linear
        self.dora = DoRALayer(
            in_features=linear.in_features,
            out_features=linear.out_features,
            rank=rank,
            alpha=alpha
        )

        # Initialize magnitude from pretrained weights
        with torch.no_grad():
            weight_norm = linear.weight.norm(p=2, dim=1)
            self.dora.m.copy_(weight_norm)

        # Freeze original weights
        self.linear.weight.requires_grad = False
        if self.linear.bias is not None:
            self.linear.bias.requires_grad = False

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        output = self.dora(x, self.linear.weight)
        if self.linear.bias is not None:
            output = output + self.linear.bias
        return output
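
Since B starts at zero and m is initialized from the pretrained row norms, a freshly wrapped layer should reproduce the original layer's output; a quick verification sketch:

linear = nn.Linear(16, 8)
dora_linear = LinearWithDoRA(linear, rank=4, alpha=1.0)

x = torch.randn(2, 16)
# At initialization BA = 0, so the adapted weight is W0 and m holds its row
# norms; the decomposition reconstructs W0 and the outputs match.
print(torch.allclose(dora_linear(x), linear(x), atol=1e-6))   # True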

Magnitude Initialization

A crucial detail in DoRA is initializing the magnitude vector from the pretrained weights:

# Initialize m as the L2 norm of each row of the pretrained weight W0
m_init = W0.norm(p=2, dim=1)

This ensures that at initialization, when BA = 0, the decomposition m · W₀ / ‖W₀‖ reproduces W₀ exactly, so the wrapped layer starts out identical to the pretrained one.


Comparison: LoRA vs DoRA

Aspect        | LoRA              | DoRA
Parameters    | r × (d + k)       | r × (d + k) + d
Components    | Direction only    | Magnitude + Direction
Expressivity  | Limited           | Higher (mimics full FT)
Complexity    | Simple            | Slightly more complex
Performance   | Good              | Better on many tasks

Usage with PEFT Library

For production use, Hugging Face’s PEFT library provides optimized implementations:

from peft import LoraConfig, get_peft_model, TaskType

# LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"]
)

# Apply to model
model = get_peft_model(base_model, lora_config)

# Check trainable parameters
model.print_trainable_parameters()
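
Recent versions of PEFT (roughly 0.9 and later) expose DoRA through the same configuration object; assuming an installed version that supports it, it is enabled with a single flag:

# DoRA via PEFT -- assumes a PEFT version with DoRA support
dora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    use_dora=True,
)
model = get_peft_model(base_model, dora_config)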

Project Structure

lora-dora-implementation/
├── dora_implementation/
│   ├── __init__.py
│   ├── lora.py          # LoRA layer implementation
│   └── dora.py          # DoRA layer implementation
├── notebooks/
│   └── demo.ipynb       # Usage examples
├── config.py            # Configuration settings
├── pyproject.toml       # Poetry dependencies
└── README.md

Key Takeaways

  1. LoRA provides efficient fine-tuning by learning low-rank updates to frozen weights
  2. DoRA extends this by separating magnitude and direction components
  3. Both techniques dramatically reduce trainable parameters (often <1% of original)
  4. Understanding these implementations helps debug and customize PEFT strategies
