Skip to content

artuppp/EllipseFitCUDA

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

10 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation


EllipseFitCUDA

A CUDA GPU-accelerated implementation for fitting ellipses to 2D point data using a least-squares approach.


🌟 Introduction

Welcome to EllipseFitCUDA! This repository provides an efficient CUDA-based solution for fitting an ellipse to a set of 2D points. The core algorithm employs a least-squares method to determine the best-fit ellipse parameters. This implementation is heavily inspired by and based on the robust MATLAB function fit_ellipse by Ohad Gal.

By leveraging the parallel processing power of NVIDIA GPUs through CUDA, CUBLAS, and CUSOLVER, this project aims to deliver high-performance ellipse fitting for applications requiring speed and accuracy.

✨ Features

  • 🚀 GPU Acceleration: Utilizes CUDA, CUBLAS, and CUSOLVER for significant performance gains in ellipse fitting.
  • 📐 Least-Squares Fitting: Implements a numerically stable least-squares approach to find optimal ellipse parameters.
  • 📄 Based on Proven Algorithm: The methodology is adapted from Ohad Gal's widely used fit_ellipse MATLAB implementation.
  • 💡 Performance Optimization Tip: Includes guidance for handling small linear systems for potentially even faster execution on specific hardware.
  • 🔧 CUDA Libraries: Relies on standard NVIDIA CUDA Toolkit libraries for linear algebra operations.

📚 Core Algorithm

The ellipse fitting is based on the direct least-squares method for conic sections. Given a set of points (x, y), we seek to find the coefficients (A, B, C, D, E, F) of the general conic equation: Ax² + Bxy + Cy² + Dx + Ey + F = 0 such that it best represents an ellipse passing through or near these points. This problem is typically formulated as a constrained least-squares problem.

This implementation draws its core logic from:

⚡ Performance Tip for Small 5x5 Systems

For solving the small 5x5 linear system that arises in this specific least-squares formulation, direct matrix operations might outperform a generic cuSolver call on some hardware. This is often due to the overhead associated with cuSolver's internal auto-tuning mechanisms for generalized matrix operations, which might not be optimal for very small, fixed-size problems.

The code block below demonstrates an alternative approach using CUBLAS for matrix multiplications and inversion to solve the system (D^T * D)^-1 * D^T * S (where D is the design matrix and S is the scatter vector, simplified here for concept).

Consider testing both this direct method and a cuSolver (e.g., cusolverDnSgesv) approach to determine the faster option for your specific GPU and use case.

// Example: Solving a 5x5 system using direct CUBLAS operations
// Assume:
// - cublas_handle is initialized
// - d_a: device pointer to the design matrix (size x 5)
// - d_b: device pointer to the scatter vector/matrix (size x 1 or 5x5 for the trick below)
// - d_a2inv: device pointer for (A^T*A) and its inverse (5x5)
// - d_x: device pointer for the solution (5x1 or 5x5)
// - size: number of points
// - alpha = 1.0f, beta = 0.0f

// Timer t; // Assuming a Timer class for benchmarking

// Solving through PSEUDOINVERSE (D^T*D)^-1 * D^T * S
// Let A_prime = D^T*D in the comments below for brevity
cudaDeviceSynchronize();
// t.start();

// Calculate D^T*D (Result is 5x5, stored in d_a2inv)
// cublasSgemm(handle, transa, transb, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc)
// C = alpha*op(A)*op(B) + beta*C
// D(size x 5), D^T(5 x size) => D^T*D (5x5)
cublasSgemm(cublas_handle, CUBLAS_OP_T, CUBLAS_OP_N, 
            5, 5, size, 
            &alpha, d_design_matrix, size,  // op(D) = D^T, lda = size (leading dim of D)
            d_design_matrix, size,          // op(D) = D,   ldb = size
            &beta, d_DtD_matrix, 5);        // d_DtD_matrix (5x5), ldc = 5

// Inverse (D^T*D)^-1 (Result stored in d_DtD_inv_matrix, can be same as d_DtD_matrix if done in-place carefully or use separate)
// This example uses cublasSmatinvBatched for a single matrix; a direct LU decomposition + inversion might be more typical for a single matrix.
// For simplicity, let's assume d_DtD_inv_matrix now holds the inverse.
// Example using cublasSmatinvBatched (requires d_array_of_DtD_pointers and d_array_of_DtD_inv_pointers):
// float *A_array[1] = {d_DtD_matrix};
// float *C_array[1] = {d_DtD_inv_matrix};
// cudaMemcpy(d_A_array_dev, A_array, sizeof(float*), cudaMemcpyHostToDevice);
// cudaMemcpy(d_C_array_dev, C_array, sizeof(float*), cudaMemcpyHostToDevice);
// cublasSmatinvBatched(cublas_handle, 5, d_A_array_dev, 5, d_C_array_dev, 5, d_info, 1);

// Here's the user's provided snippet for the A^T*A based system:
// Note: The variable names (d_a, d_a2inv, d_b, d_x) in the user's snippet
// might refer to different stages or simplified representations
// of the full D, D^T*D, (D^T*D)^-1, D^T*S etc.
// The following is the user's code, adapt variable names for your full implementation.

// Assuming d_a is the "Design Matrix D" or a part of it, and d_b is related to "S"
// And the system is simplified to A*x = b where A might be D^T*D

// Calculate A_prime = A^T*A (if 'd_a' is the original design matrix D)
// Here, user's 'd_a' seems to be already a 5xN matrix.
// If 'd_a' is the (5 x num_points) design matrix D, then D^T (num_points x 5)
// D (5 x num_points) * D^T (num_points x 5) -> (5x5)
// cublasSgemm(cublas_handle, CUBLAS_OP_N, CUBLAS_OP_T, 5, 5, size, &alpha, d_a, 5, d_a, 5, &beta, d_a2inv, 5);

// Let's use the variable names from the user's example:
// d_a (5xN matrix for A), d_a2inv (5x5 for A^T*A and its inverse), d_b (5xN or 5x1 for b), d_x (5xN or 5x1 for solution)
// 'size' here seems to be 'N' (number of columns in d_a, potentially number of points if d_a is D^T)
// Constructor

// blobFinder = BlobFinder(5, 20, 4); // This line seems to be from the previous BlobFinder example, remove if not relevant to EllipseFitCUDA

// cv::Mat imageRawWB2; // OpenCV related, remove if not relevant
// cv::resize(imageRawWB, imageRawWB2, cv::Size(), 0.25, 0.25, cv::INTER_AREA); // OpenCV related
// auto reflexes = blobFinder.blob_log(imageRawWB2, 5, 20, 4, 0.1, false); // BlobFinder related
// Logger::Log("WavefrontAnalyzer::processHS", "end blob finder"); // Logging related
// std::transform(reflexes.begin(), reflexes.end(), reflexes.begin(), [](std::tuple<int, int, float>& x) { // BlobFinder related
//     std::get<0>(x) = std::get<0>(x) * 4;
//     std::get<1>(x) = std::get<1>(x) * 4;
//     std::get<2>(x) = std::get<2>(x) * 4;
//     return x;
// }); // BlobFinder related

// --- Start of User's Provided Snippet (Interpreted) ---
// Assuming d_a is a (5 x M) matrix, and d_b is a (5 x M) matrix representing M right-hand sides (or M=1 for a single vector)
// And the goal is to solve for x in (A^T*A)x = A^T*b where A is `d_a` (5xM design matrix for M points)
// and b is `d_b` (Mx1 vector of ones, usually for conic fitting Ax^2+...+F=0 form)
// Let's reinterpret 'd_a' as the Design Matrix D (M rows, 5 cols) and 'd_b' as the vector of -1s (M rows, 1 col)

// This section needs clarification on what d_a, d_b, and size represent in the context of ellipse fitting.
// The provided snippet seems to be a generic Ax=b solver using pseudoinverse for a 5-parameter system.

// If d_a is the Design Matrix (num_points x 5)
// d_b is the target vector (num_points x 1)
// d_DtD (5x5) for D^T * D
// d_DtD_inv (5x5) for (D^T * D)^-1
// d_Dt_b (5x1) for D^T * b
// d_params (5x1) for the solution (D^T*D)^-1 * D^T * b

// Calculate D^T*D
// cublasSgemm(cublas_handle, CUBLAS_OP_T, CUBLAS_OP_N, 5, 5, num_points, &alpha, d_design_matrix, num_points, d_design_matrix, num_points, &beta, d_DtD, 5);
// Invert D^T*D (using batched inverse for one matrix)
// cublasSmatinvBatched(cublas_handle, 5, &d_DtD_ptr, 5, &d_DtD_inv_ptr, 5, d_info, 1);
// Calculate D^T * b
// cublasSgemv(cublas_handle, CUBLAS_OP_T, num_points, 5, &alpha, d_design_matrix, num_points, d_b_vector, 1, &beta, d_Dt_b_vector, 1);
// Multiply (D^T*D)^-1 * (D^T*b)
// cublasSgemv(cublas_handle, CUBLAS_OP_N, 5, 5, &alpha, d_DtD_inv, 5, d_Dt_b_vector, 1, &beta, d_ellipse_params, 1);


// The user's original snippet:
// Solving through PSEUDOINVERSE (A^T*A)-1 * A^T * b (with A_prime = A^T*A)
// Here, 'd_a' likely represents the Design Matrix D.
// 'size' is likely num_points.
// 'd_a2inv' is for D^T*D and its inverse.
// 'd_b' is the vector of observations (e.g., all -1s or related to constraints).
// 'd_x' is the resulting ellipse parameters.

cudaDeviceSynchronize();
// t.start(); // Your timer start

// Calculate A^T*A (here A is d_a, result in d_a2inv)
// d_a: design matrix (num_points x 5)
// d_a2inv: (D^T*D) (5x5)
cublasSgemm(cublas_handle, CUBLAS_OP_T, CUBLAS_OP_N, 
            5, 5, num_points, /* k=num_points */
            &alpha, d_a, num_points, /* A=d_a, lda=num_points */
            d_a, num_points, /* B=d_a, ldb=num_points */
            &beta, d_a2inv, 5); /* C=d_a2inv, ldc=5 */

// Inverse (A^T*A)^-1 (in-place in d_a2inv)
// For SmatinvBatched, you need an array of pointers
float * p_d_a2inv = d_a2inv;
float * p_d_a2inv_inv = d_a2inv; // For in-place (or use a different buffer)
cudaMemcpyAsync(d_arrayOfMatrices, &p_d_a2inv, sizeof(float*), cudaMemcpyHostToDevice, stream);
cudaMemcpyAsync(d_arrayOfInverseMatrices, &p_d_a2inv_inv, sizeof(float*), cudaMemcpyHostToDevice, stream);
cublasSmatinvBatched(cublas_handle, 5, d_arrayOfMatrices, 5, d_arrayOfInverseMatrices, 5, d_info_gpu, 1);
// (Ensure d_arrayOfMatrices, d_arrayOfInverseMatrices, d_info_gpu are allocated on device)

// Temp matrix for (A^T*A)^-1 * A^T (let's call it P_temp, 5xN)
// P_temp = (d_a2inv_inv) * (d_a^T)
// d_a2inv_inv (5x5), d_a^T (5 x num_points) -> P_temp (5 x num_points)
cublasSgemm(cublas_handle, CUBLAS_OP_N, CUBLAS_OP_T, 
            5, num_points, 5, /* m=5, n=num_points, k=5 */
            &alpha, d_a2inv, 5,     /* A=d_a2inv_inv (now in d_a2inv), lda=5 */
            d_a, num_points,        /* B=d_a (use op_T), ldb=num_points */
            &beta, d_P_temp, 5);    /* C=d_P_temp, ldc=5 */ (Allocate d_P_temp: 5 * num_points)

// Multiply P_temp * b to get parameters x
// x = P_temp * d_b
// d_P_temp (5 x num_points), d_b (num_points x 1) -> d_x (5x1)
cublasSgemv(cublas_handle, CUBLAS_OP_N,
            5, num_points,      /* m=5, n=num_points */
            &alpha, d_P_temp, 5, /* A=d_P_temp, lda=5 */
            d_b, 1,              /* x=d_b (vector), incx=1 */
            &beta, d_x, 1);      /* y=d_x (vector), incy=1 */


// The "trick" mentioned by user for 5x5 matrix-vector by doing 5x5 * 5x5 and taking first column:
// This implies d_b was shaped as a 5x5 identity or similar to make the mat-mat product work.
// If d_b is indeed a 5x1 vector, Sgemv is more direct.
// If the intention was ( (A^T*A)^-1 * A^T ) * B_prime where B_prime is 5x5:
// cublasSgemm(cublas_handle, CUBLAS_OP_N, CUBLAS_OP_N, 5, 5, 5, &alpha, d_P_temp_if_5x5, 5, d_b_if_5x5, 5, &beta, d_x_if_5x5, 5);
// Then extract the first column from d_x_if_5x5. This is less direct than Sgemv for a single vector 'b'.

cudaDeviceSynchronize();
// time = t.elapsed();
// cout << "Time solve system: " << time / 10e6 << " ms" << endl;
// printBMatrix<<<1, 1>>>(d_x); // Custom kernel to print result

About

Ellipse Fit Implementation in CUDA

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published