Measuring Heart Rate Variability in Free-Living Conditions Using Consumer-Grade Photoplethysmography: Validation Study

Background: Heart rate variability (HRV) is used to assess cardiac health and autonomic nervous system capabilities. With the growing popularity of commercially available wearable technologies, the opportunity to unobtrusively measure HRV via photoplethysmography (PPG) is an attractive alternative to electrocardiogram (ECG), which serves as the gold standard. PPG measures blood flow within the vasculature using color intensity. However, PPG does not directly measure HRV; it measures pulse rate variability (PRV). Previous studies comparing consumer-grade PRV with HRV have demonstrated mixed results in short durations of activity under controlled conditions. Further research is required to determine the efficacy of PRV to estimate HRV under free-living conditions.
Objective: This study aims to compare PRV estimates obtained from a consumer-grade PPG sensor with HRV measurements from a portable ECG during unsupervised free-living conditions, including sleep, and examine factors influencing estimation, including measurement conditions and simple editing methods to limit motion artifacts.
Methods: A total of 10 healthy adults were recruited. Data from a Microsoft Band 2 and a Shimmer3 ECG unit were recorded simultaneously using a smartphone. Participants wore the devices for >90 min during typical day-to-day activities and while sleeping. After filtering, ECG data were processed using a combination of discrete wavelet transforms and peak-finding methods to identify R-R intervals. P-P intervals were edited by deletion using methods based on outlier detection and by removing sections affected by motion artifacts. Common HRV metrics were compared, including mean N-N, SD of N-N intervals (SDNN), percentage of successive differences >50 ms (pNN50), root mean square of successive differences (RMSSD), low-frequency power (LF), and high-frequency power (HF). Validity was assessed using root mean square error (RMSE) and Pearson correlation coefficient (R).
Results: Data sets for 10 days and 9 corresponding nights were acquired. The mean RMSE was 182 ms (SD 48) during the day and 158 ms (SD 67) at night. R ranged from 0.00 to 0.66, with 2 of 19 trials (2 nights) considered moderate, 7 of 19 (2 days, 5 nights) fair, and 10 of 19 (8 days, 2 nights) poor. Deleting sections thought to be affected by motion artifacts had a minimal impact on the accuracy of PRV measures. Significant HRV and PRV differences were found for LF during the day and R-R, SDNN, pNN50, and LF at night. For 8 of the 9 matched day and night data sets, R values were higher at night (P=.08). P-P intervals were less sensitive to rapid R-R interval changes.
Conclusions: Owing to overall poor concurrent validity and inconsistency among participant data, PRV was found to be a poor surrogate for HRV under free-living conditions. These findings suggest that free-living HRV measurements would benefit from examining alternate sensing methods, such as multiwavelength PPG and wearable ECG.
Lam et al. JMIR Biomed Eng 2020;5(1):e17355. doi: 10.2196/17355. http://biomedeng.jmir.org/2020/1/e17355/


Motivation
With the growing ubiquity of commercially available wearable technologies, obtaining long-term physiological measurements under free-living conditions is feasible and permits longitudinal examination of ecologically valid patterns. This presents an opportunity for continuous patient monitoring under free-living conditions, including the potential to identify at-risk individuals (eg, patients with cardiac disease). Heart rate variability (HRV) is a well-established, powerful metric used to assess cardiac health, including autonomic nervous system function regulating cardiac activity. Compared with an individual's heart rate (HR) averaged over a short period, HRV measures variations in HR primarily as an indicator of the efforts of the sympathetic and parasympathetic nervous systems to achieve an optimal cardiac response under constantly changing stimuli [1]. Previous research has explored the use of HRV monitoring in predicting or detecting sleep quality [2], mental stress [3], chronic pain [4], posttraumatic stress disorder [5], bipolar disorder [5], and cardiac health [6].

Measuring HRV
The criterion (gold) standard for measuring HRV is an electrocardiogram (ECG), which provides a direct recording of cardiac electrical activity. On ECG, the R wave represents the maximum upward deflection of a normal QRS complex. The duration between two successive R waves defines the R-R interval [7], which is used to measure HR and HRV. Although wearable ECGs exist, they typically require electrodes affixed to the skin, which makes them obtrusive and can cause skin breakdown, and they are also prone to motion artifacts during day-to-day activities [8]. Alternatively, photoplethysmography (PPG) is an optical sensing method widely used to unobtrusively track mean HR, especially in wrist-worn devices (eg, Fitbit).

PPG for Pulse Rate Variability
PPG sensors measure changes in pulsatile blood flow within an individual's vasculature using color intensity signals [9]. Signal peaks associated with the flow of blood are used as indicators of HR, allowing for the calculation of peak-to-peak (P-P) intervals. PPG sensors do not directly measure HRV; instead, they measure pulse rate variability (PRV), the change in vessel pulse periods, from which P-P intervals denote a pulse rate (PR) [10]. PPG sensors can be placed at a variety of measurement sites including the fingers, wrist, brachia, ear, forehead, and esophagus without requiring additional equipment. This makes PPG especially convenient for pervasive cardiac monitoring [11], with well-validated use for mean HR measurements [4]. Although evidence examining PPG capabilities to accurately measure HRV shows promise, studies comparing PPG with gold standard ECG methods under free-living conditions remain limited.
The accuracy of PRV as a measure of HRV has been investigated with clinical devices under controlled, and often stationary, conditions [12][13][14][15][16][17]. Although these studies indicate that PRV may be useful as a proxy measure of HRV using medical-grade devices under controlled conditions, studies using wearable consumer-facing devices have shown mixed results. These few studies largely use short-term collections in controlled circumstances, and some of them do not simultaneously collect ECG [18,19]. A systematic review by Georgiou et al [20] found that wearable devices can provide accurate measurement of HRV measures at rest; however, accuracy declines as exercise and motion levels increase. The review also showed that heterogeneity in sensor position, detection algorithm, experimental settings, and analysis methods from existing studies limits the evidence. A review by Schäfer and Vagedes [21] found similar results, suggesting that physical activity and mental stressors lead to unacceptable deviations between PRV and HRV. Ultimately, further research is required to determine the efficacy of PRV in estimating HRV during free-living conditions in which individuals are unrestricted and engaging in their daily activities [22].

Limitations of PPG
PPG sensors have been found to be sensitive to motion artifacts, changes in blood flow caused by movement, compression and deformation of the vasculature arising from pressure disturbances at the interface between the sensor and the skin [11], and light leaking between the sensor and the skin [23]. Some studies have examined the removal of motion artifacts from PPG signals using signal processing techniques and acceleration as a reference [23][24][25][26][27][28]. For example, methods involving accelerometry have shown promise for improving coherence by editing signals likely influenced by motion artifacts [28,29]. Baek and Shin [30] collected PPG measurements over 24 hours using a custom device and filtering method, recommending a subset of HRV metrics as good targets for continuous HRV tracking using commercial devices. Morelli et al [28] conducted a study evaluating the accuracy of a consumer-grade PPG (Microsoft Band 2) for HRV estimation during less restrictive, but controlled, conditions (eg, sitting and walking) over 10-min trials. Errors likely caused by motion artifacts during walking were attenuated by using corresponding accelerometer signals to delete sections of the data corrupted by motion artifacts.

Objectives
Although HR and PR are correlated and closely related, the use of PRV to estimate HRV requires further research, especially under free-living conditions. In this study, the concurrent validity of PRV measurements from a consumer-facing PPG sensor is compared with HRV measurements from a portable ECG under 2 unsupervised conditions up to 4.5 hours each: (1) while engaging in regular activities of daily living and (2) during sleep. A secondary goal of this study is to examine factors influencing estimation errors of PRV for HRV, including motion artifacts, measurement conditions, and editing approaches.

Participants
A convenience sample of healthy individuals aged 18-65 years was recruited for the study. Individuals with a history of cardiac and/or sleep disorders were excluded to minimize the collection of irregular cardiac signals. Under these conditions, approval for this study was granted by the University of Waterloo Research Ethics Committee on September 5, 2017, filed under protocol #31197.

Device Setup
A total of 2 wearable devices were used to acquire cardiovascular signals in this study: (1) a commercially available optical PPG wearable device (Microsoft Band 2 or MB2, Microsoft) and (2) a portable ECG unit (Shimmer3 ECG, Shimmer), with both devices recording simultaneously to a smartphone. ECG electrode placement is shown in Figure 1 [31]. The participant's skin was prepared by shaving and sanitized using hospital-grade alcohol wipes before electrode placement. Electrodes were connected to a Shimmer3 ECG, worn at the waist with a strap, and all leads were taped to the chest to prevent tangling and static and to minimize motion artifacts. On the smartphone, the Multi-Shimmer Sync mobile app (Shimmer, Dublin, Ireland) was used to record ECG data from the Shimmer3 ECG. The MB2 was worn on the participant's wrist of choice as tightly as possible without causing discomfort. MB2 size (small or medium) was selected to fit the size of the participant's wrist. A third-party mobile app, Companion for Microsoft Band (released by Pain in My Processor, Google Play Store), was used to log data from the MB2 to the smartphone.

Participants' Instructions
Given the free-living nature of data collection, participants were instructed on how to set up and monitor device connection and logging status to facilitate troubleshooting. To ensure proper electrode placement, a trained researcher placed the electrodes in the 4-lead bipolar limb lead configuration (Figure 1) during the first data collection (ie, day). For the second collection (ie, night), electrodes were left on, replaced by the research assistant, and/or marked by location and replaced by the participant. Before the second collection, ECG signals were visually examined to ensure that the QRS complexes were clearly identifiable. Participants were instructed and encouraged to contact a researcher at any time in case of questions or concerns during data collection.

Postprocessing
Following data collection, all postprocessing and statistical analyses were conducted using MATLAB 2018a (MathWorks). Figure 2 outlines the steps taken in postprocessing.

Figure 2.
Postprocessing of data from the Shimmer and Microsoft Band. ECG: electrocardiogram; HRV: heart rate variability; P-P: time between 2 P peaks in a photoplethysmogram or peak-to-peak intervals; PRV: pulse rate variability; R-R: time between 2 R peaks in an ECG.

Synchronizing Devices
Shimmer3 and MB2 were coarsely synchronized by aligning triaxial acceleration peaks from tapping both devices simultaneously on a table. Each device was tapped 3 times in 2 orientations with 10 s of rest between orientations. Fine synchronization was performed using a cross-correlation method described below (cross-correlation synchronization).

ECG Data Processing
Both LA-RA and LL-RA ECG signals were filtered using a first-order bandpass Butterworth filter from 1 to 25 Hz. A maximal overlap discrete wavelet transform with a Daubechies least-asymmetric wavelet with 4 vanishing moments was used to enhance the R peaks in the ECG, followed by a threshold-based peak-finding function to identify the R peaks [32,33]. In one sample (participant 2, daytime), the wavelet detection algorithm more accurately and consistently detected the T wave of the ECG signal, which was therefore used as a proxy for the QRS complex; this approach has previously been shown to give results similar to those of R-peak detection [34]. A time series of R-R intervals was extracted from the detected R peaks, and outlier values outside the physiological range for a healthy individual at rest, walking, or during sleep (R-R<0.3 s or R-R>2.5 s) were removed [35][36][37]. To remove transients associated with artifacts or noise, only segments of at least 15 consecutive R-R intervals were included in the analysis. Longer segment thresholds (30, 60, and 120 consecutive R-R intervals) were tested with negligible effects on the results. Of the LA-RA and LL-RA electrode pair signals, the one that provided more R-R intervals was chosen for processing and analysis.
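The interval-level part of this pipeline (range check, then a minimum run length) can be sketched as follows. This is a minimal illustration in Python rather than the MATLAB code used in the study; the function names are hypothetical, and the interpretation that a run is broken by any out-of-range interval is an assumption. Filtering and wavelet-based peak detection are not reproduced here.

```python
RR_MIN_S = 0.3   # lower physiological bound stated in the text
RR_MAX_S = 2.5   # upper physiological bound stated in the text
MIN_RUN = 15     # minimum number of consecutive valid intervals


def rr_from_peaks(peak_times_s):
    """R-R intervals (s) from a sorted list of R-peak times (s)."""
    return [t1 - t0 for t0, t1 in zip(peak_times_s, peak_times_s[1:])]


def keep_valid_segments(rr_s, min_run=MIN_RUN):
    """Keep only intervals that sit in runs of >= min_run in-range intervals.

    Assumption: an out-of-range interval ends the current run.
    """
    kept, run = [], []
    for rr in rr_s:
        if RR_MIN_S <= rr <= RR_MAX_S:
            run.append(rr)
        else:
            if len(run) >= min_run:
                kept.extend(run)
            run = []
    if len(run) >= min_run:  # flush the final run
        kept.extend(run)
    return kept
```

For example, a run of 20 valid intervals followed by one 3.0 s outlier and 5 further valid intervals keeps only the first 20, because the trailing run is shorter than 15.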

PPG Data Processing
P-P intervals and corresponding time stamps were recorded directly from MB2 outputs as the time interval between 2 consecutive heartbeats [38]. Note that the temporal resolution of MB2 is limited to 10 ms. On the basis of existing literature reporting signal processing methods to edit R-R intervals and remove artifacts, three methods were used to identify and delete artifacts in the P-P intervals outputted by the MB2, resulting in 4 conditions of P-P data. Deletion was chosen as the editing technique (as opposed to interpolation) because motion artifacts would likely affect consecutive samples, making interpolation challenging. In addition, the long-term nature of data collection would mitigate one of the major concerns associated with deletion, the loss of samples [39]. The 4 processing conditions were as follows:
• None (condition A): This condition contains the raw P-P intervals.
• Threshold deletion (condition B): Removal of physiologically implausible P-P intervals.
• Moving average deletion (condition C): Threshold deletion (as described in B above) and removal of changes in P-P intervals faster than physiologically plausible, as indicated by a moving average filter. This was done following Morelli et al [28], discarding values for which $|PP_t - \mu_{10}| \geq 0.5\,\mu_{10}$, where $PP_t$ refers to the P-P interval data and $\mu_{10}$ is a 10 s moving average.
• Acceleration-based deletion (condition D): A series of threshold filters, moving average filters (described in C above), and an acceleration filter. Considering that low PPG signal quality may be attributable to movement, Morelli et al [28] removed signal segments affected by motion artifacts by estimating signal quality from the corresponding accelerometry time series, $W_t$, and then removing P-P intervals for which $W_t$ exceeded a threshold, $k$. Here, $W_t$ is the average of an accelerometry-derived quantity, $w_t$, over a window of duration $\tau$ (see Morelli et al [28] for the definition of $w_t$), and $k$ was identified by examining the correlation between $W_t$ and error. In this study, no significant correlation between $W_t$ and error was found. As such, a threshold of $k$=0.02 m/s² was used to filter the data with $\tau$=40 s (the same parameters as used by Morelli et al) [28].
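The three deletion steps can be sketched as below, in Python rather than the MATLAB used in the study. Several details are assumptions made only for illustration: the threshold bounds for condition B are borrowed from the R-R range (0.3 to 2.5 s), the 10 s moving average is taken as a window centered on each interval, and $W_t$ is assumed to be supplied already computed and aligned with the P-P samples. All function names are hypothetical.

```python
def threshold_delete(samples, lo=0.3, hi=2.5):
    """Condition B sketch. samples: list of (time_s, pp_s).

    Assumption: bounds reuse the R-R physiological range from the text.
    """
    return [(t, x) for t, x in samples if lo <= x <= hi]


def moving_average_delete(samples, window_s=10.0, frac=0.5):
    """Condition C sketch: drop PP_t when |PP_t - mu_10| >= 0.5 * mu_10.

    Assumption: mu_10 is the mean of all intervals whose time stamps lie
    within +/- window_s / 2 of the current one (quadratic-time search,
    kept simple for clarity).
    """
    kept = []
    for t, x in samples:
        neigh = [y for (u, y) in samples if abs(u - t) <= window_s / 2]
        mu = sum(neigh) / len(neigh)
        if abs(x - mu) < frac * mu:
            kept.append((t, x))
    return kept


def acceleration_delete(samples, motion, k=0.02):
    """Condition D sketch: drop intervals whose motion index W_t exceeds k.

    motion: precomputed W_t values (m/s^2), one per sample (assumed input).
    """
    return [s for s, w in zip(samples, motion) if w <= k]
```

Applied in sequence, `threshold_delete`, then `moving_average_delete`, then `acceleration_delete` mirror conditions B, C, and D, each building on the previous one.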

Data Synchronization
Following coarse synchronization of MB2 and Shimmer3, consistent delays between the 2 devices were observed. To identify the time shift giving the highest correlation between devices, a cross-correlation between P-P and R-R data was conducted. The estimated time shift was applied to the P-P data, similar to the method used by Pietilä et al [40]. P-P intervals were then matched to R-R intervals by pairing data points with the closest time stamps. If a data point did not have a matching interval within 1 s, the interval was deleted. The 1-s tolerance was chosen to accommodate delays in Bluetooth transmission and pulse transit time. After matching, the remaining data were divided into 2-min windows from which the HRV and concurrent validity metrics were calculated [41].
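A minimal sketch of the two steps, lag estimation and nearest-time-stamp matching, is given below in Python. It assumes that both interval series have already been resampled onto a common, evenly spaced time grid before the lag search, which the text does not specify; the function names and the `max_lag` parameter are hypothetical.

```python
import math
from bisect import bisect_left


def pearson(a, b):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / math.sqrt(va * vb)


def best_lag(ref, other, max_lag):
    """Lag (in samples) at which `other` best correlates with `ref`.

    A positive result means `other` trails `ref` by that many samples.
    """
    best, best_r = 0, -2.0
    for lag in range(-max_lag, max_lag + 1):
        pairs = [(ref[i], other[i + lag]) for i in range(len(ref))
                 if 0 <= i + lag < len(other)]
        if len(pairs) < 3:
            continue
        r = pearson([p[0] for p in pairs], [p[1] for p in pairs])
        if r > best_r:
            best, best_r = lag, r
    return best


def match_nearest(rr_times, pp_times, tol_s=1.0):
    """Pair each P-P index with the closest R-R index within tol_s seconds.

    rr_times must be sorted; returns a list of (rr_index, pp_index).
    """
    pairs = []
    for j, t in enumerate(pp_times):
        i = bisect_left(rr_times, t)
        cands = [c for c in (i - 1, i) if 0 <= c < len(rr_times)]
        if not cands:
            continue
        c = min(cands, key=lambda idx: abs(rr_times[idx] - t))
        if abs(rr_times[c] - t) <= tol_s:
            pairs.append((c, j))
    return pairs
```

In use, the lag returned by `best_lag` would be converted to seconds and subtracted from the P-P time stamps before calling `match_nearest`.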

HRV Metrics for Analysis
After postprocessing, the following time domain HRV and PRV features were extracted for each trial, where N-N refers to either R-R or P-P: mean N-N, SD of N-N intervals (SDNN), percentage of successive N-N differences >50 ms (pNN50), root mean square of successive differences (RMSSD), and the Poincaré plot descriptors SD1 and SD2.
For spectral measures, R-R and P-P intervals were converted to instantaneous HR (60/N-N, where N-N is interval time in seconds) and then interpolated to 4 Hz using a piecewise cubic Hermite interpolation (MATLAB function "pchip"). This ensured regular time intervals between data points, a prerequisite for estimating the Fourier transform and signal power. The Fourier transform was performed (using the MATLAB function "fft") on the entire data set for each participant. This allowed for the calculation of frequency domain HRV features such as LF (0.04-0.15 Hz) and HF (0.15-0.40 Hz). LF and HF were expressed in normalized units by dividing each by the sum of LF and HF. The ratio of LF to HF was also reported.
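The time domain features and the LF/HF normalization can be sketched as follows. This Python sketch uses conventional textbook definitions (sample SD for SDNN; SD1 and SD2 derived from SDNN and the SD of successive differences), which are assumed here and are not confirmed as the exact formulas used by the authors. Interpolation and the Fourier transform are not reproduced, and the function names are hypothetical.

```python
import math
import statistics


def time_domain_metrics(nn_s):
    """Conventional time-domain and Poincaré metrics from N-N intervals (s)."""
    diffs = [b - a for a, b in zip(nn_s, nn_s[1:])]
    sdnn = statistics.stdev(nn_s)    # sample SD of the intervals
    sdsd = statistics.stdev(diffs)   # sample SD of successive differences
    return {
        "mean_nn": statistics.fmean(nn_s),
        "sdnn": sdnn,
        "rmssd": math.sqrt(sum(d * d for d in diffs) / len(diffs)),
        # percentage of successive differences larger than 50 ms
        "pnn50": 100.0 * sum(abs(d) > 0.050 for d in diffs) / len(diffs),
        "sd1": math.sqrt(0.5) * sdsd,
        # guard against small negative values on short or alternating series
        "sd2": math.sqrt(max(2.0 * sdnn ** 2 - 0.5 * sdsd ** 2, 0.0)),
    }


def normalize_lf_hf(lf, hf):
    """Return (LF in normalized units, HF in normalized units, LF/HF)."""
    total = lf + hf
    return lf / total, hf / total, lf / hf
```

All outputs except pNN50 are in seconds when the inputs are in seconds; multiplying by 1000 gives the millisecond values commonly reported.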
To compare PPG-derived metrics across collection and processing conditions (ie, day- or nighttime collection, filtering condition), two-tailed paired t tests were used. Bland-Altman plots were generated to illustrate the agreement between R-R and P-P intervals. In the Bland-Altman plot, the difference between each matched P-P and R-R measurement is plotted against the mean of the two measurements [43].
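The agreement statistics used in this study (RMSE, Pearson correlation, and the Bland-Altman quantities) can be sketched for matched interval pairs as below. The 95% limits of agreement are computed as bias ± 1.96 SD, which is the conventional Bland-Altman choice and is assumed here rather than taken from the text; the function name is hypothetical.

```python
import math
import statistics


def agreement(rr_s, pp_s):
    """Agreement statistics for matched R-R and P-P intervals (s)."""
    diffs = [p - r for r, p in zip(rr_s, pp_s)]
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    bias = statistics.fmean(diffs)   # mean difference (P-P minus R-R)
    sd = statistics.stdev(diffs)

    mean_r, mean_p = statistics.fmean(rr_s), statistics.fmean(pp_s)
    cov = sum((r - mean_r) * (p - mean_p) for r, p in zip(rr_s, pp_s))
    var_r = sum((r - mean_r) ** 2 for r in rr_s)
    var_p = sum((p - mean_p) ** 2 for p in pp_s)
    corr = cov / math.sqrt(var_r * var_p) if var_r and var_p else float("nan")

    return {
        "rmse": rmse,
        "r": corr,
        "r_squared": corr ** 2,
        "bias": bias,
        "loa": (bias - 1.96 * sd, bias + 1.96 * sd),  # assumed 95% limits
        # Bland-Altman plot coordinates: (pair mean, pair difference)
        "points": [((r + p) / 2, p - r) for r, p in zip(rr_s, pp_s)],
    }
```

Plotting `points` with horizontal lines at `bias` and the two `loa` values gives the Bland-Altman plot described above.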

Overview
This section presents the results of (1) investigating the concurrent validity between R-R and P-P intervals across published filtering methods, (2) a comparison between ECG- and PPG-derived metrics of HRV, and (3) a comparison across free-living data collection conditions (ie, day and night). A total of 10 volunteers were recruited (3 men and 7 women, aged 20-61 years) for this study, each completing 1 day and 1 night collection. One participant's ECG night data were corrupted and therefore not analyzed or reported, leaving a total of 19 trials.
After processing, a large amount of data was lost. The number of matched and windowed N-N intervals is described in Table 1; all comparison statistics were calculated on the basis of these data. The percentages of compared intervals were calculated by dividing the number of matched and windowed samples by the total number of R-R or P-P intervals detected from the ECG or MB2, respectively. A larger data sample was acquired at night than that acquired during the day. Despite formal instructions and training on the operation and charging of the sensor systems, several technical barriers were frequently encountered that limited the number of samples in each trial. These included inadvertent misplacement of ECG electrodes or MB2, insufficient battery charging before night collection, and/or dropped Bluetooth stream to the mobile device.
Table 2 compares the concurrent validity of P-P data with that of the R-R data across all processing conditions, including RMSE and $R^2$. The largest differences were observed in the RMSE between the raw (A) and filtered (B, C, and D) conditions. The RMSE ranged between 46 and 285 ms across all conditions. Increased editing reduced the average error (RMSE). Under condition C, error was further examined by generating Bland-Altman plots comparing the P-P intervals with R-R intervals, as shown in Figure 3. Although the mean error is close to zero for both day and night conditions, the limits of agreement were greater than 200 ms.
Across all conditions, $R^2$ values ranged from 0 to 0.66. Editing did not have a large impact on $R^2$. Although $R^2$ improved at night, none of the correlations were considered strong; 2 of 19 (all night) were moderate, 7 (2 days, 5 nights) were fair, and 10 (8 days, 2 nights) were poor. Of the 19, 16 (9 days, 7 nights) paired t tests between R-R and P-P intervals under condition C yielded P=.01, indicating significant differences between ECG- and PPG-based methods.

Concurrent Validity Across the Editing Techniques
Under condition D, no data sets showed strong correlations. Only 3 (1 day, 2 nights) were moderate, 7 were fair (1 day, 6 nights), and 9 were poor (7 days, 2 nights). Paired t tests between matched R-R and P-P intervals edited under condition D were significant for 12 trials (5 days, 7 nights). Notably, condition D reduced the amount of data available for analysis, especially during the day. From condition C to D, the average sample loss was 40.18% (SD 29.59) during the day and 3.73% (SD 4.37) at night.
Compared with condition C, condition D improved RMSE and $R^2$ slightly during the day, although the effect varied by trial. The mean correlation between error and $W_t$ was 0.28 (SD 0.24), with a range of 0.13 to 0.70 for day data, and 0.29 (SD 0.21), with a range of 0.16 to 0.73 for night data. Figure 4 [44] shows the error and $W_t$ for a sample trial showing lower correlation between $W_t$ and error ($R^2$=0.16) and a sample trial with higher observed correlation ($R^2$=0.50).
Table 3 compares the HRV and PRV measures across participants under condition C, as this condition yielded the highest concurrent validity for most participants while retaining sample size. The findings in Table 3 are based on a mean of 3311.30 (SD 1316.13) matched samples for day data and 7303.00 (SD 4075.76) for night data. Under condition C, paired t tests revealed no significant differences between HRV and PRV measures. At night, SDNN, pNN50, RMSSD, SD1, SD2, LF, HF, and LF/HF ratio metrics were observed to be significantly different. Significant differences between HRV and PRV measures were observed in more measures at night, a condition during which motion artifacts are expected to be lower, allowing for collection of more accurate PRV data. Note that the temporal resolution of MB2 is limited to 10 ms, but many of the observed differences between R-R and P-P intervals are larger.
Compared with processing condition C, similar results were observed in condition D (Multimedia Appendix 1). Under condition D, paired t tests revealed significant differences between HRV and PRV measures for no measures during the day, but there were significant differences in R-R and pNN50 at night. Although this may be attributed to condition D using motion artifact editing, the large number of samples edited from condition C to D may partially explain these findings.
Given the large sample loss associated with condition D and a lack of strong correlation between $W_t$ and error, the remainder of this study focuses on the results from processing condition C (over D).

Comparison of HRV and PRV Measures
Time series plots of matched and edited R-R and P-P intervals (Figure 5) highlight several differences between the ECG and PPG methods. Similar to the mean N-N results, the data sets follow the same trends on average, but there are notable differences. First, P-P intervals seem to be less sensitive to changes in R-R intervals, as many shorter and longer intervals were not well matched. Fewer artifacts were observed in the R-R intervals that did not appear in the P-P interval signal, which may be attributable to less R-R interval editing.
Figure 5. Time series of matched R-R and P-P intervals for a single participant under processing condition C during (A) day and (B) night. P-P: time between 2 P peaks in a photoplethysmogram or peak-to-peak intervals; R-R: time between 2 R peaks in an electrocardiogram.
Poincaré plots for the same participant under condition C are shown in Figure 6. The P-P and R-R plots during the day appear qualitatively different. Although plots of night data demonstrate more similarities, a greater number of outliers for shorter P-P intervals were observed.
Figure 6. Poincaré plots for a single participant under processing condition C for (a) P-P intervals during the day, (b) R-R intervals during the day, (c) P-P intervals at night, and (d) R-R intervals at night. P-P: time between 2 P peaks in a photoplethysmogram or peak-to-peak intervals; R-R: time between 2 R peaks in an electrocardiogram.
Table 2 shows the difference in concurrent validity for night data versus day data under condition C. Closer examination of the data reveals further details. For 8 of 9 participants with a day and night data set, the average $R^2$ values were higher at night. The increase in $R^2$ is highlighted for one participant in Figure 7, where $R^2_{day}$=0.26 and $R^2_{night}$=0.40. The magnitude of $R^2$ improvements from day to night differed between participants, ranging from −0.03 to 0.60 with an average improvement of 0.22 (SD 0.31). Paired t tests comparing changes in $R^2$ were significant (P=.01). Night collections were found to have a slight decrease in RMSE, indicated by a mean decrease in RMSE of 24 (SD 45) ms, ranging from −89 ms to +40 ms difference across participants. For the participant highlighted in Figure 7, $RMSE_{day}$=148 ms and $RMSE_{night}$=138 ms. Paired t tests comparing changes in RMSE from day to night approached significance (P=.09).

Comparison of Free-Living Data Collection Conditions (Day vs Night)
Night data had more matched samples than day data, and an unpaired t test revealed that the difference between night and day samples was significant (P=.03). The mean percent increase in samples from day to night was 138.61% (SD 159). Differences in percentage loss of data owing to filtering (condition C vs condition A) were slightly higher during the day, averaging 13.31% (SD 11.47), versus night, 8.16% (SD 6.05). This difference did not reach significance under the unpaired t test (P=.25).
Tables 2 and 3 demonstrate that many PRV estimates of HRV measures were more accurate at night, with $|Error|_{avg}$ decreasing or remaining the same for NN, SDNN, RMSSD, SD1, SD2, and LF/HF ratio. $|Error|_{avg}$ for LF and HF remained approximately the same, whereas $|Error|_{avg}$ increased for pNN50 at night. Although $|Error|_{avg}$ generally decreased, paired t tests revealed more differences between PRV and HRV estimates across participants for night samples than for day samples.

Principal Findings
This paper examined the accuracy and concurrent validity of PRV measurements from a commercially available PPG sensor against HRV measurements obtained from a portable ECG sensor during unsupervised daytime and nighttime conditions. Accuracy and concurrent validity were examined across different editing methods and day and night collection conditions. In general, concurrent validity and HRV metrics were stronger at night compared with daytime conditions. Although collection during the night was more accurate with a lower mean error, this finding was not generalizable across all participants. Editing to remove outliers was effective in reducing noise, as reflected by the reduced RMSE for conditions B, C, and D. However, efforts to remove samples affected by motion artifacts using accelerometry (ie, condition D) were not as effective in this study compared with previous studies. The implications of these findings on ambulatory measurement of HRV using a commercially available PPG sensor to indicate health are discussed.
Although PPG sensors have strong mean HR measurement capabilities, the results from this study indicate poorer HRV capabilities. As expected, both ECG and PPG methods demonstrated similar mean R-R values with differences of less than 20 ms, reflecting established capabilities to estimate mean HR [17,20]. Examining beat-to-beat intervals using Bland-Altman plots, the mean error is close to zero (Figure 3). However, the wide variability of both under- and overestimated intervals indicates the presence of error-inducing factors, reflected in lower correlation ($R^2$) and large differences in calculated HRV metrics. Furthermore, Bland-Altman (Figure 3) …
The implications of PPG sensing errors on HRV metrics are highlighted in Table 3. pNN50 and LF/HF ratios were particularly sensitive to errors in point-to-point accuracy. PPG-derived estimates of pNN50 were poor, which corroborates previous reports of up to 30% error [12,21]. In addition, LF/HF ratio estimation errors were anticipated to be related to poorer HF estimates during the day arising from larger and more frequent (wrist) motion associated with regular activities of daily living. Across day and night collection conditions, SDNN estimates were similar when comparing ECG and PPG methods. SDNN has been shown to be associated with daytime occupational stress and has been hypothesized to demonstrate the parasympathetic autoregulation of the cardiac system in response to variations in cardiac output [45,46].

Day Versus Night Collection
When comparing day and night collection conditions, concurrent validity and HRV metrics indicate more accurate HRV estimates at night. Improved concurrent validity at night may be attributed to fewer errors related to ambient light changes at night [47,48], as presumably during sleep, the lighting conditions are consistently darker. An important distinction between day and night was the larger sample size at night, likely owing to a more consistent Bluetooth stream and reduced noise arising from stationary conditions at night (ie, sleeping, lying down). Although mean $R^2$ values (condition C: mean 0.34, SD 0.21; condition D: mean 0.34, SD 0.21) were highest during night collections, the range of improvements varied across participants. Conversely, paired t tests revealed greater differences between PRV and HRV metrics (Table 3) at night. This may be attributed to the larger variability observed during the day, likely associated with a larger set and magnitude of motion during activities of daily living (compared with night). Considering significantly stronger concurrent validity measures (Table 2), coupled with smaller mean differences in HRV measures, we consider PRV estimates to better reflect HRV metrics at night. Although data collected at night may have improved concurrent validity, it is important to note that individuals may not have gone to bed or fallen asleep immediately after beginning the data collection. As a result, metrics taken at night may have captured features of wakefulness, such as shorter N-N intervals. For example, unaccounted for time in bed while remaining awake may have skewed the shape of the Poincaré plots as well as metric SD2.

Impact of Editing
Simple editing methods to improve PPG signals were examined in this study. PPG recordings are known to be affected by motion artifacts, contact force, posture, and ambient temperature [11]. Owing to the free-living nature of the study, the latter factors were not controlled. By adopting methods established by Morelli et al [28], RMSE improved by removing physiologically implausible intervals (condition B) and concurrent validity improved by deleting areas with rapid changes (condition C) at night. However, screening for motion artifacts using accelerometry signals (condition D) was ineffective at improving PPG-derived signals and HRV estimates. This is consistent with the findings by Baek and Shin [30], who were also unsuccessful in obtaining accurate long-term free-living recordings of wrist PPG using a custom device, even when performing deletions based on acceleration and P-P intervals differing by more than 15%. These findings, along with those by Georgiou et al [20], suggest that under unrestricted conditions, PRV is a poor estimator of HRV. Although other studies have looked to improve HRV estimation using alternative editing and correction methods [39,44,49], an exhaustive investigation of correction methods is beyond the scope of this study.
Our finding that motion artifact compensation was relatively ineffective suggests that other factors affect PPG signals. For example, changes in respiration and peripheral vascular factors (ie, vascular volume, vasomotor activity, and vasoconstrictor waves) are known to affect the AC and DC frequency components of the PPG waveform [50]. In particular, peripheral vascular factors affect pulse transit time (PTT), or the time delay required for blood to travel between the heart and peripheral tissues [17]. Considering that the range of daily activities (eg, body position changes [51], stress, and physical activity) leads to fluctuations in blood pressure [52], an assumed constant PTT is a likely source of error in estimating PRV parameters.

Limitations
The primary limitations of this study were the sample population and technical limitations of the devices. This study used a convenience sample of 18- to 65-year-old participants with no known cardiac history. Although those with known cardiac conditions were excluded, the presence of underlying vascular disease in our cohort is unknown. As such, the findings may not be applicable to target disease populations. The impact of vascular conditions, such as atherosclerosis and cholesterol deposits in the arterial walls leading to decreased vessel compliance, which have been shown to alter the pulse waveform from the classic triphasic pattern to mono- or biphasic patterns [53], remains to be examined. Although the number of participants was relatively small (n=10), the large number of within-participant samples (>1500 matched interval points per participant) and analyses supports the overall questions regarding sensor comparisons to estimate HRV.
The devices used in this study were limited in several ways. The Shimmer3 and MB2 devices logged data using separate device clocks, with potential for drift (approximately 1-2 s) over the course of a single trial. The devices were synchronized using an external mechanical stimulus (ie, 3 taps in 2 orientations) and by applying a data-driven delay estimate (ie, cross-correlation). Although these procedures have been used in previous studies with good results, coupled with qualitative and quantitative observation of synchronized signals, the potential for dropped samples or desynchronization exists. The publicly available documentation for MB2 offers little to no insight into R-R interval processing or adjustment when faced with motion artifacts, and the device is no longer commercially available at the time of writing. Of note, signal drops were observed sporadically, including (1) during large amplitude arm movements and (2) when MB2 was out of Bluetooth range from the smartphone for long periods (>10 min). We interpret these signal drops as occurring in obvious situations in which motion artifacts are severe or wireless communication is challenged, with little to no impact on our findings. Furthermore, the resolution of R-R intervals reported by MB2 was 10 ms, limiting accuracy in a manner similar to quantization error (ie, round-off errors). Given the large number of samples, resolution limitations are unlikely to affect mean values (eg, mean RR) but may increase variability estimates (eg, RMSSD). However, the observed underestimation is unlikely to arise from quantization errors and is interpreted as a systematic error associated with the sensing method.
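The expected effect of a 10 ms resolution can be sketched with a short simulation. If rounding errors are assumed to be independent and uniformly distributed within one resolution step Δ, each interval carries an error variance of Δ²/12, each successive difference carries Δ²/6, and the squared RMSSD is inflated by roughly Δ²/6 (about 16.7 ms² for Δ=10 ms), while the mean is essentially unchanged. The synthetic interval series below (independent Gaussian intervals with an assumed mean of 850 ms and SD of 40 ms) is a crude stand-in for real data and is not drawn from the study.

```python
import math
import random


def rmssd(intervals_ms):
    """Root mean square of successive differences."""
    diffs = [b - a for a, b in zip(intervals_ms, intervals_ms[1:])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


def quantize(intervals_ms, step_ms=10.0):
    """Round each interval to the nearest multiple of step_ms."""
    return [round(x / step_ms) * step_ms for x in intervals_ms]


STEP_MS = 10.0            # resolution reported for MB2 in the text
rng = random.Random(0)    # fixed seed so the sketch is repeatable

# Synthetic intervals: assumed mean 850 ms, SD 40 ms, no beat-to-beat correlation.
rr = [rng.gauss(850.0, 40.0) for _ in range(20000)]
rr_q = quantize(rr, STEP_MS)

mean_shift = abs(sum(rr_q) / len(rr_q) - sum(rr) / len(rr))
rmssd_true = rmssd(rr)
rmssd_quantized = rmssd(rr_q)

# Prediction under the independent, uniform rounding-error assumption.
rmssd_predicted = math.sqrt(rmssd_true ** 2 + STEP_MS ** 2 / 6.0)
```

Under these assumptions, quantization pushes RMSSD slightly upward rather than downward, which is in line with the interpretation above that the observed underestimation has a different origin.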

Implications for Future Work
Wearable technologies are becoming more sophisticated with commercially available products capable of providing consumers access to information previously limited to clinical settings, including HRV and ECG data to identify arrhythmias [54]. With this in mind, it is important to understand when and if the data can be considered valid and reliable. This study provides evidence that the relationship between PRV and HRV varies throughout the day, likely attributable to dynamic changes in the peripheral vasculature. The study findings suggest that PPG-derived measures of HRV are reasonable under particular conditions (ie, at night), wherein this relationship is relatively stable for some HRV metrics (ie, SDNN and Poincaré axes). A deeper examination of factors modifying HRV estimation, particularly vascular factors, is yet to be conducted. As stated previously, our conclusions are drawn primarily from the HR and acceleration data. To further study these windows of high correlation in the future, other variables such as body temperature, cortisol levels, or a cognitive assessment of the participant's mental state may be beneficial.
In the future, examining more editing, correction, and interpolation techniques for interbeat intervals may enhance the interpretability and quality of the P-P intervals obtained from commercially available wearables [44,55-57]. This study found that published movement artifact reduction techniques did not significantly improve the quality of our data. As wearable technologies continue to become more advanced, future studies in this field would benefit from the use of improved hardware and more robust sensors. For example, PulseOn and Apple Watch, both commercially available wearable devices, use different strategies to improve the quality of their signals. PulseOn uses multiwavelength PPG to reduce the sensitivity to movement artifacts and ambient light disturbances [47,48], demonstrating 99.57% accuracy during sleep [47]. Considering the lack of peripheral vascular indicators to account for changes in PTT, the Apple Watch approach of directly acquiring R-R intervals using built-in or peripheral ECG sensors (Kardia Band, AliveCor) [58] is justifiable.

Conclusions
The objective of this study was to assess the validity of PRV measurements taken from a PPG sensor by comparing them with the HRV measurements taken from a portable ECG while individuals were engaged in activities of daily living and during sleep. Although PPG sensors demonstrated greater validity at night, overall concurrent validity was poor. HRV metrics pNN50 and LF/HF ratio were especially sensitive to errors in point-to-point accuracy. Increased editing via deletion improved the RMSE but had a small impact on R². In comparing editing and deletion methods, screening for motion artifacts using accelerometry signals to remove error-prone signals was largely ineffective in improving HRV estimates. The best results were obtained under condition C (moving average method) at night, with the highest mean R² values. Overall, the findings from this study suggest that PRV is a poor surrogate of HRV under free-living conditions. Findings from this study indicate that advances in hardware and wearable technologies, such as multiwavelength PPG sensors, are warranted to unleash the potential of PRV to serve as a proxy measure for HRV.