A CUDA GPU-accelerated implementation for fitting ellipses to 2D point data using a least-squares approach.
Welcome to EllipseFitCUDA! This repository provides an efficient CUDA-based solution for fitting an ellipse to a set of 2D points. The core algorithm employs a least-squares method to determine the best-fit ellipse parameters. This implementation is heavily inspired by and based on the robust MATLAB function `fit_ellipse` by Ohad Gal.
By leveraging the parallel processing power of NVIDIA GPUs through CUDA, CUBLAS, and CUSOLVER, this project aims to deliver high-performance ellipse fitting for applications requiring speed and accuracy.
- 🚀 GPU Acceleration: Utilizes CUDA, CUBLAS, and CUSOLVER for significant performance gains in ellipse fitting.
- 📐 Least-Squares Fitting: Implements a numerically stable least-squares approach to find optimal ellipse parameters.
- 📄 Based on Proven Algorithm: The methodology is adapted from Ohad Gal's widely used `fit_ellipse` MATLAB implementation.
- 💡 Performance Optimization Tip: Includes guidance for handling small linear systems for potentially even faster execution on specific hardware.
- 🔧 CUDA Libraries: Relies on standard NVIDIA CUDA Toolkit libraries for linear algebra operations.
The ellipse fitting is based on the direct least-squares method for conic sections. Given a set of points (x, y), we seek to find the coefficients (A, B, C, D, E, F) of the general conic equation:
Ax² + Bxy + Cy² + Dx + Ey + F = 0
such that it best represents an ellipse passing through or near these points. This problem is typically formulated as a constrained least-squares problem.
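To make this concrete, here is a minimal host-side sketch of how the design matrix can be assembled (the struct, function name, and layout are illustrative assumptions, not this repository's actual API): normalizing the conic so the constant term moves to the right-hand side, each point contributes one row [x², xy, y², x, y] of a design matrix D, and the fit reduces to a 5-parameter least-squares problem D·p ≈ b.

```cpp
#include <cstddef>
#include <vector>

struct Point2f { float x, y; };

// Build the design matrix D (num_points x 5) in column-major order, matching
// the lda = num_points convention used by the cuBLAS calls further below,
// together with a constant right-hand-side vector b.
void buildDesignMatrix(const std::vector<Point2f>& pts,
                       std::vector<float>& D,   // num_points * 5 floats, column-major
                       std::vector<float>& b)   // num_points floats
{
    const std::size_t n = pts.size();
    D.resize(n * 5);
    b.assign(n, 1.0f);                  // conic normalized so the constant sits on the RHS
    for (std::size_t i = 0; i < n; ++i) {
        const float x = pts[i].x, y = pts[i].y;
        D[0 * n + i] = x * x;           // column 0: x²
        D[1 * n + i] = x * y;           // column 1: xy
        D[2 * n + i] = y * y;           // column 2: y²
        D[3 * n + i] = x;               // column 3: x
        D[4 * n + i] = y;               // column 4: y
    }
}
```

Copying D and b to the device sets up the 5x5 normal system that the GPU code below solves.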
This implementation draws its core logic from:
- Gal, O. (2023). fit_ellipse (www.mathworks.com/matlabcentral/fileexchange/3215-fit_ellipse), MATLAB Central File Exchange.
For solving the small 5x5 linear system that arises in this least-squares formulation, direct matrix operations can outperform a generic cuSolver call on some hardware. This is often due to the overhead of cuSolver's internal auto-tuning for generalized matrix operations, which may not be optimal for very small, fixed-size problems.

The code block below demonstrates an alternative approach that uses CUBLAS matrix multiplications and a batched inverse to evaluate the pseudoinverse solution (D^T * D)^-1 * D^T * S, where D is the design matrix and S is the right-hand-side vector (simplified here for concept).

Consider testing both this direct method and a cuSolver-based solve (e.g., `cusolverDnSSgesv`, or `cusolverDnSgetrf` followed by `cusolverDnSgetrs`) to determine the faster option for your specific GPU and use case.
```cpp
// Example: solving the 5x5 system using direct CUBLAS operations,
// via the pseudoinverse  x = (D^T*D)^-1 * D^T * S.
// Assumptions:
// - cublas_handle : an initialized cublasHandle_t
// - size          : number of points (rows of the design matrix D)
// - alpha = 1.0f, beta = 0.0f
// - all d_* pointers below are device buffers of the indicated sizes
// Timer t;  // a Timer class is assumed for benchmarking
//
// First, the two core steps (normal matrix and its inverse) are shown with
// descriptive buffer names; the complete solve then follows using the
// project's original variable names (d_a, d_a2inv, d_b, d_x).
cudaDeviceSynchronize();
// t.start();

// Step 1: form D^T*D (5x5), stored in d_DtD_matrix.
// cublasSgemm computes C = alpha*op(A)*op(B) + beta*C (column-major), so with
// op(A) = D^T (5 x size) and op(B) = D (size x 5) the result is D^T*D (5x5).
cublasSgemm(cublas_handle, CUBLAS_OP_T, CUBLAS_OP_N,
            5, 5, size,
            &alpha, d_design_matrix, size,   // A = D, op(A) = D^T, lda = size
                    d_design_matrix, size,   // B = D, op(B) = D,   ldb = size
            &beta,  d_DtD_matrix, 5);        // C = D^T*D (5x5),    ldc = 5
// Step 2: invert D^T*D. cublasSmatinvBatched inverts a batch of small matrices
// (here a batch of one 5x5 matrix). It expects device arrays of pointers to the
// input and output matrices, and writes the inverse to a separate output buffer.
// For a single matrix, an explicit LU factorization plus inversion
// (getrfBatched/getriBatched, or cuSolver) is the more conventional route.
// float *A_array[1] = {d_DtD_matrix};
// float *C_array[1] = {d_DtD_inv_matrix};
// cudaMemcpy(d_A_array_dev, A_array, sizeof(float*), cudaMemcpyHostToDevice);
// cudaMemcpy(d_C_array_dev, C_array, sizeof(float*), cudaMemcpyHostToDevice);
// cublasSmatinvBatched(cublas_handle, 5, d_A_array_dev, 5, d_C_array_dev, 5, d_info, 1);
// After this call, d_DtD_inv_matrix holds (D^T*D)^-1 (check d_info for singularity).
// --- Complete solve using the project's original variable names ---
// d_a         : design matrix D (num_points x 5, column-major)
// d_a2inv     : 5x5 buffer for D^T*D
// d_a2inv_inv : separate 5x5 buffer for (D^T*D)^-1 (a distinct buffer is used
//               here rather than inverting in place)
// d_b         : right-hand-side vector (num_points x 1), e.g. the constant
//               vector coming from the normalized conic constraint
// d_x         : resulting ellipse parameters (5x1)
// d_P_temp    : 5 x num_points scratch buffer for (D^T*D)^-1 * D^T
// num_points  : number of points (the quantity called `size` above)
// The solve proceeds in four steps:
//   1. D^T*D             (5x5)             via cublasSgemm
//   2. (D^T*D)^-1        (5x5)             via cublasSmatinvBatched (batch of one)
//   3. (D^T*D)^-1 * D^T  (5 x num_points)  via cublasSgemm
//   4. (step 3) * d_b -> d_x (5x1)         via cublasSgemv
cudaDeviceSynchronize();
// t.start();

// Step 1: D^T*D -> d_a2inv (5x5); d_a is the design matrix (num_points x 5)
cublasSgemm(cublas_handle, CUBLAS_OP_T, CUBLAS_OP_N,
            5, 5, num_points,            // m = 5, n = 5, k = num_points
            &alpha, d_a, num_points,     // A = d_a, op(A) = D^T, lda = num_points
                    d_a, num_points,     // B = d_a, op(B) = D,   ldb = num_points
            &beta,  d_a2inv, 5);         // C = d_a2inv,          ldc = 5
// Step 2: invert D^T*D. cublasSmatinvBatched takes device arrays of pointers to
// the input and output matrices; the inverse is written to a separate 5x5
// buffer (d_a2inv_inv, assumed allocated) rather than in place.
float* p_d_a2inv     = d_a2inv;
float* p_d_a2inv_inv = d_a2inv_inv;
cudaMemcpyAsync(d_arrayOfMatrices,        &p_d_a2inv,     sizeof(float*), cudaMemcpyHostToDevice, stream);
cudaMemcpyAsync(d_arrayOfInverseMatrices, &p_d_a2inv_inv, sizeof(float*), cudaMemcpyHostToDevice, stream);
cublasSmatinvBatched(cublas_handle, 5, d_arrayOfMatrices, 5, d_arrayOfInverseMatrices, 5, d_info_gpu, 1);
// d_arrayOfMatrices, d_arrayOfInverseMatrices, and d_info_gpu must be device
// allocations; `stream` should be the stream set on cublas_handle.
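// Optional check (h_info is an assumed host variable): a non-zero value from
// the batched inverse reports a singular or ill-conditioned D^T*D.
// int h_info = 0;
// cudaMemcpyAsync(&h_info, d_info_gpu, sizeof(int), cudaMemcpyDeviceToHost, stream);
// cudaStreamSynchronize(stream);
// if (h_info != 0) { /* handle the degenerate case */ }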
// Step 3: P = (D^T*D)^-1 * D^T  (5 x num_points), stored in d_P_temp
// (d_P_temp must be allocated with 5 * num_points floats)
cublasSgemm(cublas_handle, CUBLAS_OP_N, CUBLAS_OP_T,
            5, num_points, 5,            // m = 5, n = num_points, k = 5
            &alpha, d_a2inv_inv, 5,      // A = (D^T*D)^-1,        lda = 5
                    d_a, num_points,     // B = d_a, op(B) = D^T,  ldb = num_points
            &beta,  d_P_temp, 5);        // C = d_P_temp,          ldc = 5
// Step 4: x = P * b, giving the five ellipse parameters in d_x
cublasSgemv(cublas_handle, CUBLAS_OP_N,
            5, num_points,               // m = 5, n = num_points
            &alpha, d_P_temp, 5,         // A = d_P_temp, lda = 5
            d_b, 1,                      // x = d_b (vector), incx = 1
            &beta, d_x, 1);              // y = d_x (vector), incy = 1
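// Usage note (h_params is an assumed host buffer): copy the five fitted conic
// coefficients back to the host once the solve has completed.
// float h_params[5];
// cudaMemcpyAsync(h_params, d_x, 5 * sizeof(float), cudaMemcpyDeviceToHost, stream);
// cudaStreamSynchronize(stream);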
// The "trick" mentioned by user for 5x5 matrix-vector by doing 5x5 * 5x5 and taking first column:
// This implies d_b was shaped as a 5x5 identity or similar to make the mat-mat product work.
// If d_b is indeed a 5x1 vector, Sgemv is more direct.
// If the intention was ( (A^T*A)^-1 * A^T ) * B_prime where B_prime is 5x5:
// cublasSgemm(cublas_handle, CUBLAS_OP_N, CUBLAS_OP_N, 5, 5, 5, &alpha, d_P_temp_if_5x5, 5, d_b_if_5x5, 5, &beta, d_x_if_5x5, 5);
// Then extract the first column from d_x_if_5x5. This is less direct than Sgemv for a single vector 'b'.
cudaDeviceSynchronize();
// time = t.elapsed();
// cout << "Time solve system: " << time / 10e6 << " ms" << endl;
// printBMatrix<<<1, 1>>>(d_x); // Custom kernel to print result
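For comparison, here is a minimal sketch of the cuSolver-based alternative mentioned above: it solves the same 5x5 normal system (D^T*D) x = D^T*b with an LU factorization. The helper function, buffer names, and omitted error checks are illustrative assumptions, not part of this project.

```cpp
#include <cuda_runtime.h>
#include <cusolverDn.h>

// Sketch: solve (D^T*D) x = D^T*b for the 5 conic coefficients with cuSolver.
// `solver` must come from cusolverDnCreate(); d_AtA (5x5, column-major) and
// d_Atb (5x1) are assumed to be prepared on the device, e.g. with the cuBLAS
// calls shown earlier. d_AtA is overwritten by its LU factors and d_Atb by
// the solution.
void solveNormalSystemCusolver(cusolverDnHandle_t solver, float* d_AtA, float* d_Atb)
{
    const int n = 5;
    int lwork = 0;
    cusolverDnSgetrf_bufferSize(solver, n, n, d_AtA, n, &lwork);

    float* d_work = nullptr;
    int*   d_ipiv = nullptr;
    int*   d_info = nullptr;
    cudaMalloc(&d_work, sizeof(float) * lwork);
    cudaMalloc(&d_ipiv, sizeof(int) * n);
    cudaMalloc(&d_info, sizeof(int));

    // LU factorization of the 5x5 normal matrix, then a forward/backward solve.
    cusolverDnSgetrf(solver, n, n, d_AtA, n, d_work, d_ipiv, d_info);
    cusolverDnSgetrs(solver, CUBLAS_OP_N, n, /*nrhs=*/1, d_AtA, n, d_ipiv, d_Atb, n, d_info);

    cudaFree(d_work);
    cudaFree(d_ipiv);
    cudaFree(d_info);
}
```

Benchmarking this path against the direct CUBLAS pseudoinverse above on your target GPU is the quickest way to pick the faster option for your point counts.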