Automatic Near Real-Time Outlier Detection and Correction in Cardiac Interbeat Interval Series for Heart Rate Variability Analysis: Singular Spectrum Analysis-Based Approach

Background: Heart rate variability (HRV) is derived from the series of R-R intervals extracted from an electrocardiographic (ECG) measurement. Ideally all components of the R-R series are the result of sinoatrial node depolarization. However, the actual R-R series are contaminated by outliers due to heart rhythm disturbances such as ectopic beats, which ought to be detected and corrected appropriately before HRV analysis. Objective: We have introduced a novel, lightweight, and near real-time method to detect and correct anomalies in the R-R series based on the singular spectrum analysis (SSA). This study aimed to assess the performance of the proposed method in terms of (1) detection performance (sensitivity, specificity, and accuracy); (2) root mean square error (RMSE) between the actual N-N series and the approximated outlier-cleaned R-R series; and (3) how it benchmarks against a competitor in terms of the relative RMSE. Methods: A lightweight SSA-based change-point detection procedure, improved through the use of a cumulative sum control chart with adaptive thresholds to reduce detection delays, monitored the series of R-R intervals in real time. Upon detection of an anomaly, the corrupted segment was substituted with the respective outlier-cleaned approximation obtained using recurrent SSA forecasting. Next, N-N intervals from a 5-minute ECG segment were extracted from each of the 18 records in the MIT-BIH Normal Sinus Rhythm Database. Then, for each such series, a number (a randomly drawn integer between 1 and 6) of simulated ectopic beats were inserted at random positions within the series and results were averaged over 1000 Monte Carlo runs.


Background on Heart Rate Variability
Oscillations in the time interval between successive heart beats, referred to as heart rate variability (HRV), have long been known to allow for insight into the intricate control mechanisms of the autonomic nervous system (ANS) [1-4]. Accordingly, research into HRV has attracted considerable attention over the past 4 decades, as evidenced by an exponential growth of published work [5,6].
In a nutshell, heart rhythm and rate can be said to be governed by the 2 competing divisions of the ANS, that is, the sympathetic nervous system and the parasympathetic nervous system. An increased sympathetic input to the sinoatrial (SA) node yields an increase in heart rate (HR) mediated by the release of adrenaline and noradrenaline and the subsequent activation of β-1-adrenoceptors, resulting in an accelerated diastolic depolarization. On the other hand, an increase in the parasympathetic outflow decreases the HR through the release of acetylcholine by the vagus nerve, which binds to and activates M2 muscarinic acetylcholine receptors in the SA node and eventually slows down the diastolic depolarization rate [4,7-9].
Various HRV measures have been established and are usually categorized as either time domain, frequency domain, or nonlinear analysis techniques [2,4]. To a large extent, the popularity of HRV is because of its easy acquisition and seemingly straightforward interpretation.

The Necessity of Preprocessing the R-R Series
It is crucial to realize that by virtue of its very definition, the HRV encompasses only oscillatory phenomena between heart beats resulting from the SA node depolarization [4,10]. While, ideally, all components of an R-R series originate from the SA node, the actual R-R series in both healthy and diseased subjects are contaminated by outliers due to artifacts and heart rhythm disturbances such as ectopic beats (ie, heart beats whose origin is outside of the SA node). Thus arises the necessity to ensure that the HRV analysis is performed on a series representing only the actual normal sinus rhythm (NSR) interbeat intervals, commonly referred to as N-N series (as in normal-to-normal).
The detrimental impact of ectopics on HRV measures is pronounced and well-documented [6,11-17]. In a recent study Stapelberg et al [18] examined the sensitivity of 38 time domain, frequency domain, and nonlinear HRV measures to the addition of artifacts in real and artificial 24-hour recordings. In accordance with previous findings, they concluded short-term time domain HRV measures to be more sensitive to the presence of artifacts than their long-term counterparts. Furthermore, frequency domain measures were found to generally be more sensitive than time domain measures, whereas the less commonly used nonlinear measures were shown to exhibit some inherent robustness properties.
Ectopic heart beats are ubiquitous phenomena and not limited to patients with cardiac disorders and diseases. Hingorani et al [19] retrospectively examined the prevalence of cardiac arrhythmias by scrutinizing 24-hour Holter recordings of 1273 healthy volunteers from 22 phase I clinical trials; note that this population is not representative of the overall population: consistent with the requirements of phase I clinical trials, subjects were extensively screened by history, physical examination, and laboratory tests and were therefore significantly healthier than the overall population. Crucially, all subjects underwent 12-lead pretrial electrocardiography (ECG) and were excluded if they had any cardiac conduction disorder or disease, including a personal or familial history thereof. Those exhibiting ≥2 consecutive ectopics as well as bigeminy, trigeminy, and quadrigeminy in their 12-lead ECG examination were also excluded. Despite these rigorous exclusion criteria, Hingorani et al [19], among other findings, found premature atrial complexes in 60.8% of the examined Holter records, followed by premature ventricular complexes (PVCs), which were observed in 43.4% of healthy volunteers. While multifocal PVCs occurred in 5.3%, 3.3% exhibited >200 PVCs per 24-hour ECG recording. A relatively high prevalence of occasional ectopic beats has been reported by numerous other authors as well, both in healthy and diseased subjects [20-23].
In summary, the detection and correction of ectopic beats in tachograms are not merely imperative for formal consistency with the HRV definition but arises from the impact of ectopics on HRV measures and the general prevalence of ectopics.

Prior Work and State-of-the-Art
Compared with the overall research interest directed toward HRV applications, the crucial issue of the discrepancy between most real-world R-R series and their respective N-N series has arguably not received as much attention. The actual gold standard of manual R-R series editing, advocated for in the field's relevant guideline by the European Society of Cardiology and the North American Society of Pacing and Electrophysiology Task Force [4], remains unaltered to this day [11,24].
It should be noted that, aiding and abetting the aforementioned exponential growth of the HRV-related (applied) research, a number of well-designed and widely used software packages for the HRV analysis have been developed and made available to the general public; these include, but are not limited to, Kubios [25,26], Nevrokard's aHRV [27], and others [28-30]. More often than not, these commercial software packages tend to be closed source and therefore provide only a limited benefit to algorithm developers. On the other hand, it ought to be acknowledged that they are well suited for the needs of practitioners, as they combine advanced preprocessing and analysis features, intuitive graphical interfaces, and support for various file formats. In automated approaches, the task of detecting and correcting outliers is usually tackled as a two-step process.
First, outliers in the R-R tachogram are detected, usually by straightforward thresholding based on a fixed percentage difference to previous intervals or their SD [31], although some more sophisticated methods have also been proposed [12].
Second, the detected (regions of) artifacts from step 1 are either discarded (thereby reducing the effective length of the tachogram), substituted by preceding uncorrupted intervals or by intervals obtained through interpolation, with linear and cubic spline interpolation being the most popular techniques [11].
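The conventional two-step procedure described above can be sketched as follows. This is an illustrative baseline (fixed-percentage thresholding followed by linear interpolation), not the method proposed in this study; the 20% threshold is merely a common choice, and all names and the toy series are invented for illustration:

```python
import numpy as np

def detect_outliers_threshold(rr, pct=0.2):
    """Step 1 (sketch): flag R-R intervals deviating by more than `pct`
    (20% by default) from the last accepted interval."""
    flags = np.zeros(len(rr), dtype=bool)
    prev = rr[0]
    for i in range(1, len(rr)):
        if abs(rr[i] - prev) > pct * prev:
            flags[i] = True   # outlier: keep `prev` as the reference
        else:
            prev = rr[i]      # accepted beat: update the reference
    return flags

def correct_by_interpolation(rr, flags):
    """Step 2 (sketch): replace flagged intervals by linear interpolation
    over the surrounding uncorrupted intervals."""
    rr = rr.astype(float).copy()
    good = ~flags
    idx = np.arange(len(rr))
    rr[flags] = np.interp(idx[flags], idx[good], rr[good])
    return rr

# Example: a steady ~0.8 s rhythm with one simulated ectopic pair
rr = np.array([0.80, 0.81, 0.79, 0.53, 1.07, 0.80, 0.81])
flags = detect_outliers_threshold(rr)
cleaned = correct_by_interpolation(rr, flags)
```

Cubic spline interpolation would follow the same pattern with a different interpolant.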

Objective of This Work
The objective of this study was to introduce a general framework of low computational complexity that allows for an automatic near real-time detection and correction of outliers in the R-R series based on the singular spectrum analysis (SSA). The main novel contributions of this work include (1) the application of a recently proposed model-free lightweight SSA change-point detection (l-SSA-CPD) algorithm [32]; (2) a modification of l-SSA-CPD through the use of an adaptive control limit sequential ranks (AC-SRC) control chart [33] to drastically reduce detection delays; and (3) upon detection of an anomaly, the substitution of the corrupted tachogram segment by an approximation obtained through recurrent SSA forecasting based on a small outlier-free tachogram segment.
An extensive simulation study comprising 198,000 5-minute long R-R series, obtained by randomly inserting varying amounts of simulated ectopic beats in records taken from the MIT-BIH Normal Sinus Rhythm Database (NSRDB), is conducted to assess the artifact detection and correction performance of the proposed algorithm.

Fundamentals of Basic Univariate Singular Spectrum Analysis and Singular Spectrum Analysis Based Change-Point Detection
Recently, this author proposed a low-complexity model-free approach based on SSA and nonparametric cumulative sum (CUSUM) control charts for real-time cardiac anomaly detection, referred to as l-SSA-CPD [32]. It was shown that l-SSA-CPD reliably detects anomalies even when directly applied to unprocessed (ie, no preprocessing was performed) raw ECG and photoplethysmographic records from common databases publicly available through Physiobank [34].
In this study, modifications to the original l-SSA-CPD algorithm will be introduced, and its capabilities will be expanded by adding a recurrent forecasting feature to generate appropriate substitutes of the corrupted tachogram entries. First, however, a brief restatement of SSA and l-SSA-CPD [32] is required.

Basic Singular Spectrum Analysis Algorithm
SSA is a powerful technique of time series analysis owing much of its appeal to an inherent model-free concept combining elements of conventional time series analysis, multivariate geometry and statistics, and signal processing [35]. The univariate basic SSA can be considered as the application of the principal component analysis to the Hankel matrix (obtained through an embedding of the original univariate time series) with the subsequent attempt to reconstruct the original series.
Consider N observations x = (x1, ..., xN) of a univariate time series and a lag integer or embedding dimension M (1 < M << N), also commonly referred to as the window length. The basic SSA algorithm then encompasses the following 4 steps:
1. Embedding: a trajectory matrix Χ is constructed by mapping x into a sequence of K = N − M + 1 lagged column vectors Xj = (xj, ..., xj+M−1)T, 1 ≤ j ≤ K (Equation 1); Χ = [X1 : ... : XK] is an M × K matrix with equal entries on its antidiagonals, that is, a Hankel matrix.
2. Singular value decomposition (SVD): the SVD of the trajectory matrix yields an expansion of Χ into elementary rank-1 matrices Χi, each associated with an eigentriple consisting of a singular value and the corresponding left and right singular vectors, ordered by decreasing singular values.
3. Grouping: the elementary matrices are partitioned into groups and summed within each group; in the simplest case, the l leading eigentriples are grouped to represent the signal of interest, with the remainder regarded as noise, yielding the expansion of Equation (5).
4. Diagonal averaging, as detailed below.

Diagonal Averaging
For perfectly separable signal components, all matrices in the expansion of Equation (5) are Hankel matrices, that is, they all have equal entries on their antidiagonals. In real-world problems, however, perfect separability is rarely achievable. While approximate separability usually suffices, the final fourth step of the SSA algorithm is required, as the matrices Χi from step 3 generally have unequal entries on their antidiagonals. Thus, the Hankelization of all Χi is performed, that is, compliance with the Hankel structure is enforced by taking the average of each antidiagonal and then replacing each element of that antidiagonal with the determined average. This yields Χ ≈ Σi X̃i, with X̃i being the diagonally averaged matrices.
One can then, for example, easily reconstruct an approximation of the signal of interest from the eigentriples with indices in a set I through the one-to-one correspondence between Hankel matrices and time series; this provides an approximation of either the entire time series or some of its components, depending on the particular choice of indices I.

Singular Spectrum Analysis Based Change-Point Detection Rationale
The sequential application of SSA for change detection is based on Moskvina and Zhigljavsky [40] (see [39] for an exhaustive discussion; see also [35] for an earlier account of the basic idea).
The rationale is to slide 2 (possibly intersecting) moving windows over the time series and to apply the first 3 stages of the basic SSA each time. Think of the first window as adaptively generating a low-dimensional base subspace representation (capturing the main structure in the series) and the second window, which contains at least M new data samples, as generating a test subspace. As both windows are slid over the series, some sort of distance between the base and the test subspace is monitored. If no change-point occurs, the distance remains small, whereas it spikes in the presence of a sudden significant change in the main structure of the series.
The algorithm by Moskvina and Zhigljavsky has been successfully applied to various change detection problems and some improved variations of it have been presented as well (see [32] for a detailed background). Computational complexity, however, ought to be acknowledged as a potentially significant limitation of the said algorithm. Recently, l-SSA-CPD [32] has been proposed as a lightweight alternative. Contrary to the algorithm by Moskvina and Zhigljavsky, which recalculates the SVD of 2 sliding windows each time new observations become available, l-SSA-CPD relies on a nominal low-dimensional signal subspace representation computed only once at startup. As new observations become available, l-SSA-CPD simply allocates them into a vector and calculates a test statistic based on the squared Euclidean distance and the angle between this new data vector and the previously determined nominal reference subspace.
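The subspace-comparison idea behind l-SSA-CPD can be illustrated as follows. Only the distance component of the test statistic is sketched here (the squared norm of the projection residual); the exact definition of the statistic, including the angle term, is given in [32], and all parameter values below are illustrative:

```python
import numpy as np

def nominal_subspace(x0, M, l):
    """Compute an l-dimensional nominal signal subspace once at startup
    from an (assumed anomaly-free) startup segment, via the first 3 SSA
    steps: embedding, SVD, and retention of the l leading eigenvectors."""
    K = len(x0) - M + 1
    X = np.column_stack([x0[j:j + M] for j in range(K)])
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :l]                        # M x l orthonormal basis

def subspace_distance(U, v):
    """Squared Euclidean distance between a new M-sample vector and its
    projection onto the nominal subspace (one plausible form of the
    monitored statistic; small in-control, large after a change)."""
    residual = v - U @ (U.T @ v)
    return float(residual @ residual)

# In-control data follows the nominal structure; a corrupted vector does not
t = np.arange(100)
x0 = np.sin(2 * np.pi * t / 10)
U = nominal_subspace(x0, M=20, l=2)
v_ok = np.sin(2 * np.pi * np.arange(100, 120) / 10)   # clean continuation
v_bad = v_ok.copy()
v_bad[10] += 3.0                                      # a sudden outlier
```

Because the subspace is computed only once, each monitoring step costs a single M × l matrix-vector product, which is the source of the method's low computational burden.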

Singular Spectrum Analysis Parameter Selection
The behavior of SSA is largely determined by 2 tuning parameters, namely the window length M and the number l of eigenvalues to be used in the embedding and grouping steps, respectively. While it is undisputed that the successful application of SSA hinges on an appropriate choice of M and l, practitioners are faced with the nuisance that no clear-cut rules as to how they ought to be determined exist.
For a series of length N and window length (or embedding dimension) 1 < M < N, as previously discussed, there is no additional benefit from using M > N / 2 (see, eg, [35] at 69 and [36] at 47). It is widely acknowledged that, as the appropriate window length M depends on both the underlying data and the particular application, no universal rule for determining it exists [35-39]. General recommendations vary between M ≈ N / 4 [37] and M ≈ N / 2 [35,36]. In the case of a periodic signal with period T, M must be at least equal to T for the SSA to capture the main structure of the series; furthermore, M should be proportional to T [35,36,38]. Of note, small windows act like a smoothing filter of width 2M − 1 [36].
Similarly, it is well known that selecting a proper l is also crucial, for issues are known to arise when eigentriple grouping comprises either an insufficient or an excessive number of eigenvalues. More specifically, if l is too small, the SSA is unable to capture the entire signal whereas if it is too large noise components are unwittingly captured. Since the contribution of a particular series component (associated with the corresponding eigentriple) to the entire series is proportional to the respective eigenvalue, it is common practice to select l by identifying the leading eigenvalues through visual inspection of the eigenvalue spectra or by setting a predefined eigenvalue share (see, eg, [35,36,38] for a more exhaustive discussion).
In addition to the above-referred monographs, interested readers are also referred to some fairly recent contributions by Hassani et al [41] and Khan and Poskitt [42,43], which approach the issue through the concept of separability and information theory, respectively.
In the light of the pragmatic suboptimal approach pursued here, a detailed discussion of the SSA parameter selection is, however, beyond the scope of this study and therefore omitted.

The Proposed Outlier Detection and Correction Framework
The proposed framework consisted of 2 separate stages, namely the detection and the correction of outlier corrupted segments of the tachogram.

Outlier Detection Using Lightweight Singular Spectrum Analysis Change-Point Detection and Adaptive Control Limit Sequential Ranks Control Charts
Let M, N, l, p, q be fixed integers such that l < M < N / 2 and 0 ≤ p < q. Then l-SSA-CPD proceeds as follows:
1. Initialization at n = 0: Akin to MZ, the first 3 steps of the basic SSA are applied on the interval [n + 1, n + N] to obtain a low-dimensional nominal subspace I = I(n=0), which captures the main structure of the series. This involves embedding as in Equation (1) to obtain a trajectory matrix, hereafter referred to as the base matrix, whose SVD yields the M × l matrix of eigenvectors spanning I. As new observations become available, they are allocated into test matrices ΧT(n), from which the test statistic Dn,I,p,q is computed against the nominal subspace I. Note that throughout this study, p = N, q = N + 1 is used, that is, the width Q = q − p of the test matrices ΧT(n) always equals 1. In other words, each test matrix is actually a single column vector containing M new observations. Q = 1 is primarily chosen to allow for an agile response time and to minimize the computational burden [32].
• Monitoring of D n,I,p,q using the AC-SRC Control Chart To reduce detection delays, instead of the sequential ranks CUSUM (SRC) by McDonald [44], as used in the conventional l-SSA-CPD [32], a sequential ranks-based CUSUM with adaptive control limits referred to as AC-SRC [33] is used. AC-SRC is inspired by a distribution-free CUSUM proposed by Chatterjee and Qiu [45] where instead of a fixed threshold a sequence of control limits, determined by the bootstrap estimate of the conditional distribution of the CUSUM test statistic given the last time it was zero, is used. AC-SRC carries the approach of Chatterjee and Qiu over to McDonald's SRC; the main advantage of the SRC (which is that it does not require training data and control limits can thus easily be determined in advance, eg, through Monte Carlo simulations) is retained while detection delays are significantly reduced because of the use of an adaptive sequence of control limits. Furthermore, owing to a trade-off favoring the lower computational burden and simplicity over the optimal performance, AC-SRC lacks some of the refined fine-tuning routines used in the Chatterjee and Qiu method. The author refrains from a detailed discussion of AC-SRC and refers the interested reader to previous studies [33,45].
Let the sequential rank of Dn,I,p,q among D1,I,p,q, ..., Dn,I,p,q be denoted as Rn. The AC-SRC then starts with the same recursive equation as the SRC, ie,
Cn = max (0, Cn−1 + Rn / (n + 1) − k) (10)
with C0 = 0 and k a reference constant. Let Tn denote the so-called "sprint length" expressing the time elapsed since Cn was last 0, ie, Tn = n − max {m ≤ n : Cm = 0}. Furthermore, let Yj be a random variable with distribution
Yj ∼ Cn | Tn = j (11)
Chatterjee and Qiu showed that the conditional distributions of Cn | Tn in Equation (11) are easier to handle than the unconditional distribution of Cn and, under some regularity conditions (see [45] and references therein for details), depend only on j and the underlying process generating distribution, but not on n. Then, for any positive integer jmax ≤ n, the distribution of Cn can be approximated by means of the conditional distributions in Equation (11) as
Cn ≈ Σj=1..jmax Yj · I(Tn = j) + Y* · I(Tn > jmax)
with Y* ∼ Cn | Tn > jmax and I being the indicator function.
It can be shown [44] that, given the observed process is in-control, the quantities Rn / (n + 1) in Equation (10) are independent and discrete uniform on {1 / (n + 1), 2 / (n + 1), ..., n / (n + 1)}. Herein lies the key advantage of AC-SRC: for some fixed jmax and k, one can determine a sequence of control limits h1(AC-SRC), ..., hjmax(AC-SRC) from simulations of Equation (10) for up to some NAC-SRC, wherein the quantities Rn / (n + 1) are substituted by random variables drawn as discrete uniform on {1 / (n + 1), 2 / (n + 1), ..., n / (n + 1)}. The sequence of control limits can then easily, and without the need for training data samples, be determined by means of Monte Carlo simulations as outlined elsewhere [33].
The AC-SRC signals a change to have occurred if Cn > hTn(AC-SRC) for Tn ≤ jmax, or if Cn > hjmax(AC-SRC) for Tn > jmax. To limit the computational burden, it is recommended to calculate control limits only up to a reasonably small jmax, after which the control limit is kept fixed at hjmax(AC-SRC). In addition, following recommendations by Chatterjee and Qiu, the reference constant k is set proportionate to the expected average sprint length E{Tn}. Throughout this work, k was empirically calibrated to yield the desired E{Tn}, whereas the average run length (ARL) was set to a nominal ARL0 = 500. A detailed discussion of the empirical calibration of control charts would, however, be beyond the scope of this work [33]. Note that as k increases, E{Tn} decreases, that is, the likelihood of Cn bouncing back to 0 increases [33,45]. For the particular use case examined here, the use of a small jmax to allow for agile change detection is recommended (see Tables 1-4 and the respective discussion in the next section).
Finally, as with the SRC in the conventional l-SSA-CPD algorithm, the control chart is reinitialized after a change has been detected (by setting C n =0 and T n =0) to allow for the detection of multiple and potentially nearby change-points [32].
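A simplified sketch of the AC-SRC chart follows, assuming the SRC recursion Cn = max(0, Cn−1 + Rn / (n + 1) − k) as in Equation (10). The Monte Carlo calibration below replaces Rn / (n + 1) with continuous uniform draws and uses a fixed quantile level in lieu of the ARL-based calibration of [33]; the constants k, jmax, and the quantile level are illustrative only:

```python
import numpy as np

def ac_src_limits(k, j_max, alpha=0.002, n_mc=20000, rng=None):
    """Estimate adaptive control limits h_1..h_{j_max} by Monte Carlo:
    simulate the in-control CUSUM (increments uniform on (0, 1) minus k,
    a continuous approximation of R_n/(n+1)) and take a high quantile of
    C_n conditional on the sprint length T_n. Simplified sketch of [33]."""
    rng = rng if rng is not None else np.random.default_rng(1)
    samples = [[] for _ in range(j_max + 1)]
    for _ in range(n_mc):
        c, t = 0.0, 0
        for _ in range(j_max):
            c = max(0.0, c + rng.uniform() - k)
            if c == 0.0:
                break                      # sprint ended at length t
            t += 1
            samples[t].append(c)
    return [float(np.quantile(s, 1 - alpha)) if s else np.inf
            for s in samples[1:]]

def ac_src_monitor(d_stream, k, limits):
    """Run the chart over a stream of test statistics; return the index
    of the first signal (None if no signal). Beyond j_max the control
    limit is kept fixed at h_{j_max}, as recommended above."""
    j_max = len(limits)
    history, c, t = [], 0.0, 0
    for n, d in enumerate(d_stream):
        history.append(d)
        # sequential rank of d among the statistics observed so far
        r = int(np.searchsorted(np.sort(history), d, side="right"))
        c = max(0.0, c + r / (len(history) + 1) - k)
        t = t + 1 if c > 0.0 else 0
        if t > 0 and c > limits[min(t, j_max) - 1]:
            return n
    return None

# Example: monitor a stream with a pronounced upward shift at index 30
rng = np.random.default_rng(7)
k = 0.6
limits = ac_src_limits(k, j_max=6, rng=rng)
stream = np.concatenate([rng.uniform(size=30), 10 + rng.uniform(size=60)])
signal_at = ac_src_monitor(stream, k, limits)
```

Because the in-control distribution of the sequential ranks is known, the limits can be precomputed without any training data, which is the property the text above highlights.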
Monitoring of the l-SSA-CPD test statistic Dn,I,p,q with N=20, M=10 by means of AC-SRC with jmax=8, E{Tn}=6 is showcased on a 5-minute tachogram excerpt of record 16273 of the MIT-BIH NSRDB, artificially contaminated with 2 and 8 ectopics in Figures 1 and 2, respectively. Note the remarkably low detection delay due to the use of AC-SRC.

Outlier Correction Using Recurrent Singular Spectrum Analysis Forecasting
Once an outlier has been detected at, say, n=τ, accounting for the inherent detection delay and the resulting uncertainty as to the exact outlier location, the interval [τ − Δ1, τ] is designated as the corrupted tachogram segment. Note that Δ1 depends on the (average) detection delay and should be chosen carefully; however, it need not be exact or otherwise optimal. A second interval of length Δ2 immediately preceding the corrupted one is then used to obtain an anomaly-free forecast with which to substitute the corrupted segment.
The rationale of the proposed framework is depicted in Figure 3, with green and yellow rectangles representing the designated uncorrupted and corrupted segments and the detected change highlighted in red.
As the proposed method operates sequentially as new observations arrive, an appropriate uncontaminated data segment (either uncorrupted to begin with or previously cleaned) to serve as a basis for forecasting is always available. While it may be beneficial to use large values for Δ2 (assuming sufficient past observations are available), keeping Δ1 and Δ2 as small as possible is recommended, as doing so allows for near real-time operation and significantly reduces the computational burden. For the purpose of this study, simplicity was prioritized over (more often than not unnecessary) optimality.
Figure 3. Schematic depiction of the proposed R-R series outlier detection and correction framework. Adaptive control limit sequential ranks cumulative sum with low detection delay is used to monitor lightweight singular spectrum analysis (SSA) change-point detection's test statistic; when it signals (red vertical bar), a segment of length Δ1 immediately preceding the detection is labeled as corrupted (yellow rectangle), and a larger, yet still reasonably small, segment of length Δ2 (green rectangle) is used to construct a trajectory matrix for the recurrent SSA forecasting, which eventually yields a short-term SSA forecast of length Δ1 with which the corrupted segment is then substituted.
Time series forecasting plays a crucial role in many scientific fields and arguably also represents one of the most popular (real-world) application areas of the SSA. Note that there exist different approaches for SSA forecasting [46], most notably recurrent and vector forecasting. Refraining from any detailed discussion (see [35][36][37][38] for excellent monographs on the subject matter), the recurrent SSA forecasting approach used in this study will briefly be outlined. As with other more advanced capabilities that go beyond the realm of basic SSA, forecasting and missing value imputation (note that the 2 are closely related with the latter being the more general problem incorporating forecasting as a special case) in a way sacrifice some of the model-free beauty and appeal of SSA by invoking certain model assumptions. On the other hand, this appears to be outweighed by the successful application of SSA forecasting in numerous areas [47][48][49][50][51][52]. Furthermore, Golyandina and Zhigljavsky [36] point out that for short-term forecasts, there is actually only very little use of the imposed model, that is, the series approximately satisfy some linear recurrent formulas. Two examples of the automatic detection and correction of outliers using the method proposed in this study are shown in Figures 4 and 5 and compared with the benchmark approach of replacing the corrupted segment with an immediately preceding block of R-R data, referred to as block replacement [28]. Note how in both cases, while all outliers have been detected and corrected, there are ectopic-free segments of tachogram which have also been corrected because of false alarms. 
Figures 4 and 5. Five-minute tachograms with artificially inserted ectopics using lightweight SSA change-point detection with N=20, M=10, Q=1, 75% of leading eigentriples, and adaptive control limit sequential ranks cumulative sum with jmax=6, E{Tn}=4, ARL0=500, corrected by means of the proposed method with Δ1=M, Δ2=N (top) or by replacing the corrupted segment of length Δ1 with the immediately preceding block of equal length (bottom).
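The recurrent SSA forecasting step used for the substitution can be sketched as follows. This is a minimal implementation of the standard recurrent forecasting formulas (derived in the monographs cited above); the window length, rank, and toy sine series are illustrative, not values from this study:

```python
import numpy as np

def ssa_recurrent_forecast(x, M, l, steps):
    """Recurrent SSA forecasting sketch: fit a rank-l subspace to the
    (assumed clean) segment `x`, derive the linear recurrent formula from
    the leading eigenvectors, and forecast `steps` values ahead."""
    N, K = len(x), len(x) - M + 1
    X = np.column_stack([x[j:j + M] for j in range(K)])
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Ul = U[:, :l]
    pi = Ul[-1, :]                        # last components of eigenvectors
    nu2 = float(pi @ pi)                  # verticality coefficient
    R = (Ul[:-1, :] @ pi) / (1.0 - nu2)   # recurrence coefficients
    # Reconstruct the rank-l approximation, then extend it recursively
    Xl = Ul @ np.diag(s[:l]) @ Vt[:l]
    rec, cnt = np.zeros(N), np.zeros(N)
    for j in range(K):
        rec[j:j + M] += Xl[:, j]
        cnt[j:j + M] += 1
    series = list(rec / cnt)
    for _ in range(steps):
        window = np.array(series[-(M - 1):])   # last M-1 values, in order
        series.append(float(R @ window))
    return np.array(series[N:])

# Example: forecast a pure sine 10 steps ahead from 100 clean samples
t = np.arange(110)
x = np.sin(2 * np.pi * t / 10)
fc = ssa_recurrent_forecast(x[:100], M=20, l=2, steps=10)
```

In the proposed framework, the forecast of length Δ1 produced this way from the Δ2-long clean segment is what replaces the corrupted tachogram segment.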

Performance Evaluation
The performance and utility of the proposed method are evaluated using records that are publicly available through Physiobank [34]. More specifically, excerpts of all 18 records in the MIT-BIH NSRDB are used. The European Society of Cardiology/North American Society of Pacing and Electrophysiology Task Force guideline [4] recommends performing the HRV analysis on ECG recordings of at least 5 minutes, and the said length can be considered the accepted standard for short-term HRV analysis. Accordingly, for each of the 18 records in the MIT-BIH NSRDB, excerpts corresponding to the first 5-minute period of NSR are collected. Note that the records in the MIT-BIH NSRDB are cardiologist-annotated, that is, they have been manually checked for errors in beat detection and labeling. Thus, one can circumvent the QRS complex detection step and use the R-R intervals as provided by the annotation files; the first 5 minutes of each record were free of ectopic beats and were therefore selected. It should be pointed out, however, that 4 records contained a few extreme outliers (most likely measurement errors), which were removed manually. Specifically, the 4 affected records were records 16272, 18177, 19088, and 19140, for which 2, 1, 2, and 1 data points were removed, respectively. Besides being readily discernible because they deviate by an order of magnitude from the surrounding interbeat intervals, these outliers are also marked as artifacts in the respective annotation files.
Next, the effect of a PVC on the R-R interval was simulated by decreasing the respective entry to 2/3 of its actual value and by increasing the following entry to 4/3 of its value (Textbox 1). That is, for a PVC at, say, n = τ, one sets xτ ← (2/3) · xτ and xτ+1 ← (4/3) · xτ+1, which is consistent with the actual signature of a single PVC (see, eg, [53]) in an R-R series: a premature beat (and accordingly a significantly smaller R-R interval), followed by a compensatory pause (longer R-R interval), followed by a return to baseline (NSR).
Extensive Monte Carlo simulations were then used to determine the performance characteristics of the proposed method, wherein a large number of artificial R-R series based on the 18 5-minute MIT-BIH NSRDB records were created by first randomly drawing an integer between 1 and 6 and then randomly selecting the position(s) within the series where the respective PVCs would be placed (with the constraint that PVCs be at least 5 samples apart).
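The PVC insertion described above can be sketched as follows; the series, seed, and helper name are invented for illustration, with positions drawn at least 5 samples apart as in the Monte Carlo setup:

```python
import numpy as np

def insert_pvcs(nn, n_pvc, min_gap=5, rng=None):
    """Simulate PVCs in an N-N series: the affected interval is scaled to
    2/3 of its value (premature beat) and the following one to 4/3
    (compensatory pause); positions are drawn at random with a minimum
    spacing of `min_gap` samples between PVCs."""
    rng = rng if rng is not None else np.random.default_rng()
    rr = nn.astype(float).copy()
    positions = []
    while len(positions) < n_pvc:
        tau = int(rng.integers(1, len(nn) - 1))
        if all(abs(tau - p) >= min_gap for p in positions):
            positions.append(tau)
    for tau in positions:
        rr[tau] *= 2 / 3        # premature beat: shortened interval
        rr[tau + 1] *= 4 / 3    # compensatory pause: lengthened interval
    return rr, sorted(positions)

# Example Monte Carlo draw: 1-6 PVCs inserted into a clean toy series
rng = np.random.default_rng(42)
nn = 0.8 + 0.02 * rng.standard_normal(300)
n_pvc = int(rng.integers(1, 7))
rr, pos = insert_pvcs(nn, n_pvc, rng=rng)
```

Repeating such draws over each record and averaging, as described above, yields the Monte Carlo performance estimates.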

Results
With the proposed method consisting of 2 stages (detection and correction), a separate analysis of each stage appears appropriate.

Detection Step
In reporting results pertaining to the detection performance of the first stage, established metrics commonly used in the literature are relied upon: the sensitivity (Se), specificity (Sp), and accuracy (Acc), which are defined as
Se = TP / (TP + FN), Sp = TN / (TN + FP), and Acc = (TP + TN) / (TP + TN + FP + FN)
with TP, FP, TN, and FN representing the number of true positives, false positives, true negatives, and false negatives, respectively. Intuitively, sensitivity quantifies the ability to correctly detect (actual) anomalies, specificity quantifies the proportion of nonabnormal segments that are correctly identified as such, and accuracy combines both of the aforementioned aspects. More formally, Se expresses the empirical statistical power, 1 − Sp the type I (false positive) error rate, and 1 − Se the type II (missed detection/false negative) error rate.
As reported previously [32], for an event occurring at a time instance n = τ, one allows for a certain detection delay Δ d and considers a signal from the AC-SRC control chart as true positive if it falls in the interval [τ, τ + Δ d ]. All results presented in this study were obtained using Δ d = Δ 1 = M.
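For concreteness, the three metrics can be computed directly from the confusion counts; the counts in the example below are made up for illustration and are not results from this study:

```python
def detection_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, and accuracy from the confusion counts
    TP, FP, TN, FN defined above."""
    se = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + fp + tn + fn)
    return se, sp, acc

# Example: 5 of 6 outliers detected, 3 false alarms among 294 clean segments
se, sp, acc = detection_metrics(tp=5, fp=3, tn=291, fn=1)
```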
Note that the design and calibration of a detection algorithm involves an inherent trade-off between false alarms and missed detections and that furthermore reducing the false alarm rate usually entails an increase in the detection delay. For this specific use case, high sensitivity and short detection delays are arguably of the highest priority; for one wants a high likelihood of detecting all actual outliers (to reliably "clean" the R-R series, thereby transforming it into the respective N-N series) with a manageable (ie, small) detection delay to be able to narrow down the corrupted segments in the series as much as possible. The latter is particularly relevant because the proposed method substitutes the (allegedly) corrupted segment by an approximation of the outlier-free segment obtained using recurrent SSA forecasting; one aims to keep the forecasting horizon small to obtain acceptable results without having to tweak the SSA parameters extensively. This comes at the expense of higher false alarm rates, as is evidenced by the results presented in Tables 1-4.
As, in the considered design, jmax is linked to the average sprint length E{Tn}, better agility is generally achieved with rather small jmax in a use case, such as the one considered here, where AC-SRC monitors for large deviations [33,45].
While AC-SRC achieved satisfactory performance over a wide range of values j max (results not reported here), upon examination of the results reported in Tables 1-4, which are based on 18,000 R-R series, j max =6 was chosen to proceed to the second stage of the proposed method (ie, the correction of corrupted segments of the R-R series).

Correction Step
As mentioned earlier, this design choice entails higher false alarm rates, as discernible in the 2 examples shown in Figures 4 and 5, which both exhibit actually uncorrupted segments that have been "corrected" by the proposed algorithm because of false alarms, thereby introducing some distortions. More importantly though, both examples also clearly show that all artificially inserted outliers due to PVCs were detected and appropriately corrected. Results pertaining to the second stage of the proposed method are reported in Table 5, wherein the root mean square error (RMSE) is used to quantify the discrepancy between the outlier-cleaned series and the actual (uncontaminated) series.

Let $\{NN_t\}_{t=1}^{T}$ be the uncontaminated N-N series (ie, the ground truth) and let $\{\widehat{RR}_t\}_{t=1}^{T}$ denote the allegedly cleaned R-R series.
Calculation of the RMSE is then straightforward:

$\mathrm{RMSE} = \sqrt{\tfrac{1}{T}\sum_{t=1}^{T}\left(NN_t - \widehat{RR}_t\right)^2}$

Furthermore, to better compare the RMSE of the proposed method with that of its competitor (the commonly used block replacement method serves as the benchmark here), consider also the relative RMSE (RRMSE) given by

$\mathrm{RRMSE} = \mathrm{RMSE}_{\text{proposed}} / \mathrm{RMSE}_{\text{benchmark}} \quad (18)$

Note that if the RRMSE, as given in Equation (18), is <1, it implies that the novel approach presented in this paper outperforms the competing benchmark approach by 100·(1 − RRMSE)%. It shall be emphasized, however, that the block replacement method, as it is applied here (to correct for outliers in the R-R series), also uses the first stage of the proposed method; its performance should thus be seen in the context of being applied to rather tight segments of the series that ought to be substituted, which is owed to the AC-SRC l-SSA-CPD algorithm of the first stage of the framework presented in this study.
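Both error measures translate directly into code. The following sketch assumes aligned, equal-length series; the function names are illustrative, not taken from the paper.

```python
import numpy as np

def rmse(nn, rr_hat):
    """Root mean square error between the true N-N series and a
    cleaned R-R series."""
    nn, rr_hat = np.asarray(nn, float), np.asarray(rr_hat, float)
    return float(np.sqrt(np.mean((nn - rr_hat) ** 2)))

def rrmse(nn, rr_proposed, rr_benchmark):
    """Relative RMSE as in Equation (18): proposed correction versus
    the block replacement benchmark. Values < 1 mean the proposed
    approach outperforms the benchmark by 100*(1 - RRMSE)%."""
    return rmse(nn, rr_proposed) / rmse(nn, rr_benchmark)

# Toy example (values in seconds): the first cleaned series deviates
# less from the ground truth than the second, so RRMSE < 1.
nn = [0.80, 0.82, 0.79, 0.81]
print(rrmse(nn, [0.80, 0.83, 0.79, 0.81], [0.80, 0.86, 0.79, 0.81]))
```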
That being said, the results reported in Table 5, which are based on 180,000 R-R series, show that the proposed method, which substitutes corrupted tachogram segments with an approximation of the respective outlier-free segment obtained by means of recurrent SSA forecasting, clearly outperforms the competing block replacement method. In fact, it outperforms its competitor for all 18 records (on which the extensive simulations were built), on average by almost 30%.

Background and Relevance
HRV has long been known to allow for valuable insights into the intricate control mechanisms of the ANS and has a long track record of exponentially growing research interest and output. A crucial prerequisite for any HRV analysis is the exclusion of all artifacts and abnormalities; that is, a clean N-N series is required. Such a clean N-N series is virtually never available, mostly because of various interferences, beats missed by the QRS-complex detector, and various highly prevalent heart rhythm disturbances.
Compared with the overall research effort directed toward HRV, the issue of automatic R-R series (pre)processing has received rather little attention. In fact, manual editing still represents the gold standard. It should be noted that every automatic approach comes with risks and benefits; the former lie in the inherently nonzero type I and type II error rates, whereas the latter include, but are not limited to, considerable savings in both cost and time. Considering that error rates for manual approaches are nonzero as well and that they are most likely an increasing function of time (whereas in automatic approaches they usually are not), the choice of which one, or which particular combination, to use should always be made on an individual basis, considering all requirements and circumstances of the particular case. Accordingly, while the proposed method could be construed as a potential alternative to manual editing, in light of the above considerations, this author would rather recommend it as a complementary method and/or a preprocessing tool.
The main contribution of this work is the introduction of a general framework of low computational complexity that allows for an automatic near real-time detection and correction of outliers in R-R series based on the SSA. While related work pertaining to the use of the SSA in ECG and R-R data processing exists [54], to the best of the author's knowledge, this study is the first to propose a general SSA-based framework that handles both detection and correction of (unwanted) anomalies in the R-R series.

Principal Findings
An extensive simulation study comprising 198,000 5-minute R-R series, obtained by randomly inserting varying numbers of simulated ectopic beats into records taken from the MIT-BIH NSRDB, was conducted to assess the artifact detection and correction performance of the proposed algorithm.
It was shown that the proposed algorithm reliably detects outliers in the R-R series, achieving an overall sensitivity of 96.6%, specificity of 98.4%, and accuracy of 98.4%. Furthermore, it compares favorably in terms of the discrepancy between the cleaned R-R series and the actual N-N series, outperforming a block replacement approach on average by almost 30%. It should also be emphasized that a deliberately pragmatic, suboptimal approach was pursued, refraining from an optimization of the various (SSA and AC-SRC) tuning parameters, which would most likely have resulted in further performance improvements at the expense of simplicity and higher computational complexity.
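The reported detection figures follow from the standard confusion-matrix definitions. The following sketch uses hypothetical beat-level counts for illustration, not the study's actual tallies.

```python
def detection_metrics(tp, fn, tn, fp):
    """Standard confusion-matrix metrics for a beat-level detector:
    tp/fn count outlier beats (detected/missed), tn/fp count normal
    beats (correctly passed/falsely flagged)."""
    sensitivity = tp / (tp + fn)                 # true-positive rate
    specificity = tn / (tn + fp)                 # true-negative rate
    accuracy = (tp + tn) / (tp + fn + tn + fp)   # overall agreement
    return sensitivity, specificity, accuracy

# Hypothetical counts chosen to mimic the order of magnitude of a
# 5-minute series study; not the actual tallies behind Tables 1-4.
print(detection_metrics(tp=966, fn=34, tn=9840, fp=160))
```

Note that with outliers being rare relative to normal beats, accuracy is dominated by specificity, which is why sensitivity is reported separately.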
Other important characteristics of the proposed method include its ability to operate in near real time, the almost entirely model-free nature of the framework, and its low computational complexity. Moreover, it should be pointed out that the proposed method is not limited to R-R series but can be applied broadly, as all of its components are deliberately kept as general as possible. This entails the additional benefit of not being limited to the removal of PVCs (although only this has been extensively investigated and tested so far) but rather being able to deal with all kinds of anomalies that may contaminate the observed series of interest.

Limitations and Future Research
This work has several limitations, some of which open (new) avenues for future research.
A major limitation stems from the lack of established, generally recognized, and validated procedures other than manual editing, which makes the performance comparison among competing methods a nontrivial task. Second, to increase readability and decrease length, the paper does not provide general recommendations and guidelines as to how the various tuning parameters should be chosen. Finally, the SSA represents a very active research field; over the last couple of years, several promising advances and new developments have been reported that have thus far not been considered in the research reported in this work. This, however, does not infringe on the relevance of this work, as the objective was to introduce a simple, general framework of low computational complexity and to investigate the use of SSA in both the detection and correction steps.
Several of the abovementioned recent developments, however, appear to open promising avenues for future research. In particular, the author intends to expand and improve the work presented here by focusing on the issue of SSA parameter selection and by considering the latest developments, especially those pertaining to SSA forecasting. For the latter, the interested reader is referred to recent publications by Hassani and Kalantari [55][56][57][58].