Author: Tugbars Heptaskin
Date: 2025-10-24
Last Updated: 2025-11-05
This is a highly-optimized implementation of the Savitzky-Golay filter for data smoothing and differentiation. The implementation achieves 4-6x speedup over naive approaches through a comprehensive optimization strategy while maintaining bit-exact compatibility with MATLAB's sgolayfilt function (differences on the order of 10⁻⁶).
- ✅ Production-ready performance: Multiple layers of optimization for real-world applications
- ✅ Embedded-safe: No heap allocation, no VLAs, static buffers only
- ✅ MATLAB-validated: Output matches MATLAB's Savitzky-Golay filter to floating-point precision
- ✅ Comprehensive documentation: Extensive inline documentation of mathematical foundations and optimization techniques
- ✅ Configurable: Supports both non-causal (centered) and causal (past-only) filtering modes
The Savitzky-Golay filter performs polynomial least-squares fitting on a sliding window of data points. For each window position, it fits a polynomial of degree m through 2n+1 points (where n is the half-window size), then evaluates the fitted polynomial (or its derivative) at a target point.
-
Gram Polynomials F(k,d)
- Orthogonal polynomials over discrete points [-n, ..., +n]
- Computed using a three-term recurrence relation
- The derivative order d allows computing smoothed derivatives
-
Generalized Factorial GenFact(a,b)
- Product: a × (a-1) × (a-2) × ... × (a-b+1)
- Used for normalization factors in Gram polynomials
-
Filter Weights
- Each data point gets a weight determined by Gram polynomial projections
- Weights are precomputed and reused via convolution
-
Convolution
- Filtered value: y[j] = Σ(i=0 to 2n) w[i] × x[j-n+i]
This implementation employs five major optimization techniques for a cumulative 4-6x total speedup:
- Problem: Repeated expensive factorial-like computations
- Solution: Precompute all GenFact values into a 2D table at startup
- Trade-off: ~4KB memory for O(1) lookups instead of O(n) computation
- Impact: Eliminates repeated multiplications in weight calculation
- Problem: VLAs and heap allocation problematic for embedded systems
- Solution: Fixed-size stack buffers with compile-time known maximum sizes
- Benefits:
- No stack probing overhead
- Better cache locality
- Safe for resource-constrained environments
Multiple sub-optimizations:
- Rolling buffer approach: O(d) space instead of O(k×d)
- Strength reduction: Hoist divisions (1/n) out of loops
- Branch elimination: Separate d=0 case to avoid conditionals in hot paths
- Optional memoization: Cache computed values when enabled
- Problem: Edge weights recomputed for every filter application
- Solution: Cache computed edge weights with parameter validation
- Key insight: Leading and trailing edges use same weights (different data order)
- Problem: Serial dependency chains prevent parallel execution
- Solution: Four independent accumulation chains exploit CPU superscalar execution
- Benefits:
- Intel/AMD CPUs: 2 FMA units → 2 parallel operations
- Apple M-series: 4 FMA units → 4 parallel operations
- Pairwise reduction for numerical stability
The implementation has been validated against MATLAB's sgolayfilt function:
- Accuracy: Differences on the order of 10⁻⁶ (floating-point precision limits)
- Test case: 350-point dataset, halfWindow=12, polynomialOrder=4
- See: Included validation plots showing bit-level agreement
#include "savgolFilter.h"
int main() {
// Your input data
double dataset[] = { /* your data */ };
size_t dataSize = sizeof(dataset) / sizeof(dataset[0]);
// Allocate input and output arrays
MqsRawDataPoint_t rawData[dataSize];
MqsRawDataPoint_t filteredData[dataSize];
// Initialize input data
for (size_t i = 0; i < dataSize; ++i) {
rawData[i].phaseAngle = dataset[i];
filteredData[i].phaseAngle = 0.0f;
}
// Configure filter parameters
uint8_t halfWindowSize = 12; // Window size = 2*12+1 = 25 points
uint8_t polynomialOrder = 4; // 4th-order polynomial fit
uint8_t targetPoint = 0; // Center point (non-causal)
uint8_t derivativeOrder = 0; // 0 = smoothing, 1 = 1st derivative, etc.
// Apply filter
clock_t tic = clock();
int result = mes_savgolFilter(rawData, dataSize, halfWindowSize,
filteredData, polynomialOrder,
targetPoint, derivativeOrder);
clock_t toc = clock();
if (result == 0) {
printf("Elapsed: %f seconds\n", (double)(toc - tic) / CLOCKS_PER_SEC);
// Use filteredData...
} else {
printf("Filter failed with error code: %d\n", result);
}
return 0;
}uint8_t targetPoint = 0; // Uses both past and future data- Filter centered on current point
- Uses symmetric window: [t-n, ..., t, ..., t+n]
- Best for offline processing where all data is available
- Produces smoothest results
uint8_t targetPoint = halfWindowSize; // Uses only past data- Filter uses only present and past data
- Window extends backward: [t-2n, ..., t]
- Suitable for real-time applications
- Introduces phase delay but maintains causality
uint8_t derivativeOrder = 0; // Smoothing
uint8_t derivativeOrder = 1; // First derivative (velocity)
uint8_t derivativeOrder = 2; // Second derivative (acceleration)The implementation supports optional features via preprocessor defines:
// Enable memoization (trades ~13KB memory for speed when reusing same parameters)
#define ENABLE_MEMOIZATION
// Alternative: Runtime GenFact precomputation (lower memory, specific to one filter config)
#define OPTIMIZE_GENFACTRecommendation: Use default settings (lookup table + memoization) for best performance.
The filter validates all parameters at runtime:
- Window size: (2n+1) ≤ data size
- Polynomial order: m < window size (typically m ≤ 4 for most applications)
- Target point: 0 ≤ t ≤ 2n
- Half-window size: n > 0
- Maximum window: Configurable via
MAX_WINDOW(default: 65 points)
The filter applies different strategies for three regions:
-
Leading edge (first n points):
- Uses asymmetric windows with shifted target points
- Weights applied in reverse order to available data
- Ensures smooth transition to central region
-
Central region (points n to N-n-1):
- Full symmetric windows
- Optimal smoothing/differentiation
- Most computationally efficient
-
Trailing edge (last n points):
- Reuses leading edge weights (mirrored)
- Weights applied in forward order
- Maintains consistency with leading edge
The implementation preserves numerical accuracy through:
- Careful operation ordering in floating-point arithmetic
- Pairwise reduction in accumulation chains
- No premature rounding or truncation
- Validated against MATLAB reference implementation
- Lookup table: ~4KB (GenFact table, 65×65 floats)
- Memoization cache: ~13KB (when enabled)
- Stack per call: <2KB (fixed-size buffers)
- Edge cache: <8KB (weights for edge cases)
Total: ~27KB worst-case (all features enabled)
- Signal smoothing: Remove noise while preserving signal features
- Numerical differentiation: Compute derivatives with noise suppression
- Feature extraction: Detect peaks, valleys, inflection points
- Trend analysis: Extract underlying trends from noisy data
- Real-time filtering: Causal mode for online data processing
0 : Success
-1 : NULL pointer passed
-2 : Invalid parameters (constraint violation)- Scalar implementation with comprehensive optimizations
- Production-ready, extensively documented
- Best for general-purpose use and embedded systems
Potential extensions for specialized use cases:
- SIMD vectorization (AVX/SSE) for massive datasets
- GPU acceleration for real-time multi-channel processing
- Multi-threaded batch processing
- Original paper: Savitzky, A.; Golay, M.J.E. (1964). "Smoothing and Differentiation of Data by Simplified Least Squares Procedures". Analytical Chemistry 36 (8): 1627–1639.
- MATLAB documentation: sgolayfilt
MIT License
If you use this implementation in your research, please cite:
@software{savgol_optimized_2025,
author = {Heptaskin, Tugbars},
title = {High-Performance Savitzky-Golay Filter Implementation in C},
year = {2025},
url = {[Your repository URL]}
}For detailed implementation notes, see the extensive inline documentation in src/savgolFilter.c