PySIFT

A pure-Python, open-source GPU-resident implementation of SIFT built on CuPy and Numba CUDA kernels — faster and yet more accurate. Zero-copy DLPack interop to PyTorch frees your downstream DL steps from CPU PCIe bottlenecks.

arXiv Paper GitHub PyPI Docs Kaggle HF Space
pip install staysift

What is PySIFT?

PySIFT is a pure-Python, open-source GPU-resident implementation of the Scale-Invariant Feature Transform (SIFT) built on CuPy and Numba CUDA kernels which is faster and yet more accurate. It runs the entire detection-to-descriptor pipeline on your NVIDIA GPU with zero-copy DLPack interop to PyTorch so that your downstream DL steps will be free from CPU PCIe bottlenecks.

Unlike OpenCV's SIFT (compiled C++ running on CPU), PySIFT keeps tensors in VRAM from the moment you load the image until your downstream pipeline (matching, RANSAC, bundle adjustment, rendering) consumes it. For pipelines in medical imaging, drone stitching, SLAM, robotics, and 3D reconstruction, this eliminates the bottleneck that makes CPU-based SIFT a liability.

GPU-Resident

All 5 pipeline stages run on GPU. Descriptors stay in VRAM — no CPU round-trips. Zero-copy handoff to PyTorch, CuPy, or any DLPack consumer.

Deterministic

Bitwise-identical output across runs. No non-deterministic CUDA ops. Warp-shuffle reductions replace atomicAdd for reproducibility.

Drop-in Compatible

Returns cv2.KeyPoint objects and NumPy descriptors by default. Swap cv2.SIFT_create() for PySIFT() — everything else stays the same.

Benchmark Results

Tested against OpenCV SIFT on 4 standard benchmarks (RTX 3050 Laptop GPU, 4 GB VRAM):

BenchmarkMetricPySIFTOpenCVDelta
HPatchesMMA@100.7030.681+2.2pp
IMC PhototourismInliers/pair303205+47%
MegaDepth-1500AUC@100.5030.447+5.6pp
ROxford5KmAP (Medium)0.4550.380+7.5pp

Speed

StagePySIFT (GPU)OpenCV (CPU)Speedup
Detection + Description88 ms111 ms1.26x
BF Matching (1000 kp)2.1 ms8.4 ms4.0x
End-to-End (2 images)178 ms241 ms1.35x

Quick Install

# 1. Install CUDA dependencies (order matters)
pip install cupy-cuda12x
pip install torch --index-url https://download.pytorch.org/whl/cu124

# 2. Install PySIFT
pip install staysift

The 5-Stage GPU Pipeline

PySIFT implements Lowe's SIFT algorithm (IJCV 2004) entirely on the GPU. Every stage — from image preprocessing to descriptor output — executes as CUDA kernels via CuPy and Numba. The image never leaves VRAM.

1. CLAHE
Preprocessing
2. Gaussian
Pyramid
3. DoG
Extrema
4. Orientation
Assignment
5. Descriptor
Computation

Stage 1: CLAHE Preprocessing

Adaptive histogram equalization (CLAHE) normalizes local contrast before building the scale-space pyramid. This yields 40-80% more usable keypoints on night, haze, and low-contrast images vs. raw input — critical for drone imagery at dawn/dusk and medical scans with uneven illumination.

Stage 2: Gaussian Scale-Space Pyramid

The image is progressively blurred and downsampled across octaves. PySIFT supports fp16_pyramid=True for half-precision storage (2x memory savings on VRAM-constrained GPUs like Jetson or RTX 3050), with full fp32 precision during DoG subtraction to avoid catastrophic cancellation.

Stage 3: DoG Extrema Detection

Difference-of-Gaussians (DoG) volumes are computed by subtracting adjacent Gaussian levels. Local extrema in 3D (x, y, scale) are candidate keypoints. Sub-pixel refinement via Taylor expansion and edge-response rejection (Hessian ratio test) filter the candidates.

Stage 4: Orientation Assignment

A gradient orientation histogram (36 bins) around each keypoint determines its dominant orientation, making descriptors rotation-invariant. PySIFT also supports learned orientation via OriNet (orientation='orinet').

Stage 5: Descriptor Computation

A 128-dimensional histogram of oriented gradients over a 16x16 patch (4x4 spatial bins, 8 orientation bins). PySIFT applies RootSIFT normalization by default (L1 + sqrt), which improves matching over raw L2-normalized SIFT. With dsp=True, multi-scale descriptor pooling further boosts robustness.

Zero-copy output: Use gpu_output=True to keep keypoints and descriptors as CuPy arrays in VRAM. Pass directly to GPU-based matchers without any PCIe transfer.

GPU-Residency: Why It Matters

In a typical vision pipeline (detect → match → estimate → reconstruct), CPU-based SIFT forces a GPU→CPU→GPU round-trip at every stage boundary. PySIFT eliminates this overhead entirely:

PipelinePCIe TransfersGPU Idle Time
OpenCV SIFT + PyTorch4 per image pair~40% of pipeline
PySIFT + CuPy matching0 (until final output)<5%

Keypoint Detection

PySIFT detects scale-invariant keypoints across a Gaussian pyramid. The default configuration (4 octaves, 3 scales per octave, contrast threshold 0.04) closely matches OpenCV SIFT's behavior while running entirely on the GPU.

PySIFT keypoint detection and matching on landmark images
PySIFT feature matching on landmark images. Green = PySIFT inliers, Red = OpenCV inliers.

Basic Usage

from pysift import PySIFT
import cv2

sift = PySIFT()
gray = cv2.imread("image.jpg", cv2.IMREAD_GRAYSCALE)

# Returns OpenCV-compatible KeyPoints + NumPy descriptors
keypoints, descriptors = sift.detectAndCompute(gray, None)
print(f"Detected {len(keypoints)} keypoints")
print(f"Descriptor shape: {descriptors.shape}")  # (N, 128)

GPU-Resident Output

# Keep everything on GPU as CuPy arrays
kp_gpu, desc_gpu = sift.detectAndCompute(gray, None, gpu_output=True)
# kp_gpu: CuPy (N, 4) -- [x, y, size, angle]
# desc_gpu: CuPy (N, 128) -- ready for GPU matching

Tunable Parameters

ParameterDefaultEffect
n_octaves4More octaves = more scale coverage, more keypoints
n_scales3Scale levels per octave (S in Lowe 2004)
contrast_thresh0.04Lower = more keypoints on low-contrast regions
edge_thresh10.0Higher = keep more edge-like keypoints
double_imageTrue2x upsample (octave -1), matches OpenCV default
rootsiftTrueL1+sqrt normalization (improves matching)
Adaptive contrast: PySIFT automatically adjusts contrast_thresh based on image statistics (night scenes get lower thresholds). You can override this by setting the parameter explicitly.

Feature Matching

PySIFT descriptors are L2-compatible (or RootSIFT by default), so any standard matcher works. For best results, combine Lowe's ratio test with mutual nearest-neighbor (MNN) filtering.

CPU Matching (OpenCV BFMatcher)

bf = cv2.BFMatcher(cv2.NORM_L2)
raw = bf.knnMatch(desc_a, desc_b, k=2)

# Lowe's ratio test
good = [m for m, n in raw if m.distance < 0.85 * n.distance]

GPU Matching (Zero PCIe Transfer)

When using gpu_output=True, descriptors are CuPy arrays in VRAM. Match entirely on GPU using matmul-based brute-force:

import cupy as cp

def gpu_bf_match_mutual(desc_a, desc_b, ratio=0.85):
    """Mutual nearest-neighbor matching on GPU."""
    ratio_sq = ratio * ratio

    # L2 distance matrix via matmul
    a_sq = cp.sum(desc_a * desc_a, axis=1, keepdims=True)
    b_sq = cp.sum(desc_b * desc_b, axis=1, keepdims=True).T
    dist_sq = a_sq + b_sq - 2.0 * cp.matmul(desc_a, desc_b.T)
    cp.maximum(dist_sq, 0.0, out=dist_sq)

    # Forward: Lowe's ratio test (A -> B)
    top2 = cp.argpartition(dist_sq, 1, axis=1)[:, :2]
    rows = cp.arange(dist_sq.shape[0])
    d0 = dist_sq[rows, top2[:, 0]]
    d1 = dist_sq[rows, top2[:, 1]]
    swap = d0 > d1
    best_f = cp.where(swap, top2[:, 1], top2[:, 0])
    d_best = cp.where(swap, d1, d0)
    d_second = cp.where(swap, d0, d1)
    pass_ratio = (d_best < ratio_sq * d_second) & (d_second > 0)

    # Backward: nearest neighbor (B -> A)
    best_b = cp.argmin(dist_sq, axis=0)

    # Mutual check
    mutual = pass_ratio & (best_b[best_f] == rows)
    q = cp.asnumpy(cp.where(mutual)[0])
    t = cp.asnumpy(best_f[mutual])
    return q, t

Why Mutual Nearest-Neighbor?

One-Way Matching

A→B only. A keypoint in image A finds its best match in B, but that match in B might prefer a different keypoint in A. This produces ~30-40% false matches that poison RANSAC.

Mutual NN Matching

Requires A→B and B→A to agree. Only keeps matches where both images pick each other as closest. Eliminates most outliers while preserving true matches.

RANSAC: From Matches to Geometry

# MAGSAC++ with aggressive settings
F, mask = cv2.findFundamentalMat(
    pts_a, pts_b,
    cv2.USAC_MAGSAC,
    ransacReprojThreshold=2.0,
    confidence=0.9999,
    maxIters=10000
)

DSP-SIFT: Multi-Scale Descriptor Pooling

DSP-SIFT (Dong & Soatto, CVPR 2015) computes descriptors at multiple Gaussian scale radii around each keypoint and averages the histograms. This makes each descriptor an implicit mixture over scales, dramatically improving robustness to scale estimation errors and viewpoint changes — especially important for wide-baseline stereo in 3D reconstruction and drone aerial imagery with varying altitude.

How DSP Pooling Works

  1. For each keypoint, compute the standard descriptor patch radius r
  2. Recompute the descriptor at 3 scaled radii: r/√2, r, r√2
  3. Average the 3 histograms element-wise
  4. Apply L2 + clip(0.2) + L2 normalization (then RootSIFT)

The output is still a 128-dimensional descriptor — same size, better quality. No change to matching code required.

Enable DSP-SIFT

# One flag enables DSP pooling
sift = PySIFT(dsp=True)

# Everything else stays the same
kp, desc = sift.detectAndCompute(gray, None)
# desc.shape = (N, 128) -- same dimensionality

Impact on Accuracy

BenchmarkStandard SIFTDSP-SIFTDelta
IMC Phototourism (inliers)245303+24%
MegaDepth AUC@100.4570.503+4.6pp
HPatches MMA@100.6950.703+0.8pp
When to use DSP: Always enable for matching tasks (image retrieval, SfM, visual localization). The overhead is small (<10% slower) and the accuracy gain is consistent across benchmarks. Skip only if you need maximum detection speed and matching quality is not critical.

GPU Implementation

PySIFT fuses all DSP scales into a single CUDA kernel launch per octave. Each thread block processes one keypoint, looping over the 3 scale multipliers and accumulating into shared memory. This avoids the naive approach of launching 3 separate descriptor kernels (3x overhead).

Who is PySIFT For?

Every vision pipeline that matches images — from surgical navigation to autonomous flight — starts with feature extraction. PySIFT replaces the CPU bottleneck at the foundation of these pipelines with a GPU-resident alternative that requires no retraining, no C++ compilation, and no domain-specific tuning.

🏥

Medical Imaging

Register histopathology whole-slide images, align retinal fundus scans, fuse multi-modal MRI/CT volumes. SIFT's physics-based scale-space is domain-agnostic — no retraining needed on medical data, unlike learned detectors (SuperPoint) trained on street-level photos.

# Register two histopathology slide regions
from pysift import PySIFT
sift = PySIFT(dsp=True)
kp1, d1 = sift.detectAndCompute(slide_a)
kp2, d2 = sift.detectAndCompute(slide_b)
medical image registration · histopathology · retinal alignment · multi-modal fusion
🚁

Drone / UAV Stitching

Real-time aerial mosaic construction on edge GPUs. PySIFT runs on 4 GB VRAM (Jetson Orin compatible), handles altitude-varying scale changes via DSP-SIFT pooling, and produces deterministic output for certifiable flight systems.

# Stitch two aerial frames on Jetson
from pysift import GPUPyStitch
stitcher = GPUPyStitch()
panorama = stitcher.stitch(frame_left, frame_right)
drone image stitching · UAV mosaic · aerial survey · edge GPU · Jetson
🤖

SLAM & Visual Odometry

GPU-resident features feed directly into visual odometry without PCIe stall. Zero-copy DLPack handoff means PySIFT descriptors are immediately available to PyTorch-based pose estimators, loop closure detectors, and map optimizers.

# GPU-resident features for VO pipeline
kp, desc = sift.detectAndCompute(
    frame, None, gpu_output=True
)
# desc is CuPy array in VRAM -- zero-copy
torch_desc = torch.from_dlpack(desc)
visual SLAM · visual odometry · loop closure · pose estimation · real-time localization
🦾

Robotics

Pick-and-place object localization, bin picking, visual servoing. Deterministic output means identical features on identical input — repeatable grasps, auditable decisions, and regression-testable perception stacks.

robotic vision · bin picking · visual servoing · object localization · deterministic perception
🏗

3D Reconstruction (NeRF / 3DGS / SfM)

Every NeRF and 3D Gaussian Splatting pipeline starts with COLMAP, which uses CPU SIFT. PySIFT is a drop-in replacement that eliminates the COLMAP preprocessing bottleneck. GPU-resident features flow directly into bundle adjustment without PCIe transfers.

# Replace COLMAP's CPU SIFT extraction
sift = PySIFT(dsp=True)
for img in scene_images:
    kp, desc = sift.detectAndCompute(
        img, None, gpu_output=True
    )
NeRF preprocessing · 3D Gaussian Splatting · COLMAP GPU · Structure from Motion · bundle adjustment
🛰

Satellite & Remote Sensing

Gigapixel mosaics from satellite strips. PySIFT's speed advantage grows with resolution (3.2x faster at 4K, expected 4-5x at 8K). No retraining needed — SIFT's scale-space theory is inherently domain-agnostic, working on any spectral band.

satellite image matching · remote sensing registration · gigapixel mosaic · multi-spectral alignment

Why PySIFT Over Learned Detectors for Downstream Pipelines?

PropertyPySIFT (Classical GPU SIFT)SuperPoint / Learned
Domain transferWorks on any imagery (medical, satellite, microscopy)Trained on MS-COCO street photos — degrades out-of-domain
DeterminismBitwise identical across runsGPU float non-determinism
Scale invarianceTrue multi-scale via DoG pyramidSingle 8x grid, no scale-space
Model weights0 MB (physics-based)614 MB (SuperPoint)
CertifiabilityDeterministic + inspectable = auditableBlack box
4GB GPU supportYes (tested on RTX 3050)Tight on 4GB with large images

Try PySIFT

Kaggle (Free GPU)

Run PySIFT on a free T4 GPU. Our competition notebook processes 4,941 image pairs in ~30 minutes.

Open Competition Notebook →

Local Install

Requires NVIDIA GPU with CUDA 12.x. Tested on RTX 3050 (4GB), RTX 4090 (24GB), Tesla T4 (16GB).

pip install staysift

Full Pipeline Example

from pysift import PySIFT
import cv2
import numpy as np

# 1. Detect with DSP-SIFT
sift = PySIFT(dsp=True)
img_a = cv2.imread("scene_left.jpg", cv2.IMREAD_GRAYSCALE)
img_b = cv2.imread("scene_right.jpg", cv2.IMREAD_GRAYSCALE)

kp_a, desc_a = sift.detectAndCompute(img_a, None)
kp_b, desc_b = sift.detectAndCompute(img_b, None)

# 2. Match with ratio test
bf = cv2.BFMatcher(cv2.NORM_L2)
matches = bf.knnMatch(desc_a, desc_b, k=2)
good = [m for m, n in matches if m.distance < 0.85 * n.distance]

# 3. Estimate fundamental matrix
pts_a = np.float32([kp_a[m.queryIdx].pt for m in good])
pts_b = np.float32([kp_b[m.trainIdx].pt for m in good])

F, mask = cv2.findFundamentalMat(
    pts_a, pts_b, cv2.USAC_MAGSAC, 2.0, 0.9999, 10000
)
inliers = mask.ravel().sum()
print(f"Matches: {len(good)}, Inliers: {inliers}")

Advanced: GPU-Resident Pipeline

import cupy as cp

# Detect -- descriptors stay in VRAM
kp_a, desc_a = sift.detectAndCompute(img_a, None, gpu_output=True)
kp_b, desc_b = sift.detectAndCompute(img_b, None, gpu_output=True)

# Match -- GPU matmul, zero PCIe transfer
q, t = gpu_bf_match_mutual(desc_a, desc_b, ratio=0.85)

# Only small coordinate arrays transfer to CPU for RANSAC
pts_a = cp.asnumpy(kp_a[q, :2])
pts_b = cp.asnumpy(kp_b[t, :2])

Resources

ResourceLink
arXiv Paper arxiv.org/abs/2605.17869
GitHub Repository github.com/SivaIITM/PySIFT
PyPI Package pypi.org/project/staysift
Documentation pysift-gpu.readthedocs.io
Kaggle Competition IMC 2026 Warm-Up Sprint
HuggingFace Space PySIFT Interactive Demo
GPU required: PySIFT needs an NVIDIA GPU with CUDA 11.x or 12.x. For CPU-only environments, use Kaggle (free T4 GPU) or Google Colab. pip install staysift on a CPU-only machine will install but fail at runtime.