How Accurate are Wearables?

I write about wearables and surrounding concepts fairly regularly and am certainly a technophile, often a relatively early adopter. In fact, I also do some beta testing for companies at times and, given my background, help them with product development. Despite this, I am very cautious about ‘trusting’ wearables. The companies certainly play their role in fostering the belief that the metrics they output are gospel, but so does blind trust in them. This type of error is what prompted my recent writing about biomarkers and understanding their derivation, to avoid misinterpreting bad data (I’d argue bad data is often worse than no data). All this said, I am not a cynic in this space, but more a realist with an optimistic and hopeful support for people trying to build things (it’s hard!).
This is all a preface to an article about the accuracy of wearables, following a recently published umbrella review on the topic. Remember, the goal is for these tools to be helpful and additive to your experience and goals, not a burden or obligation. If you feel you can’t do the activity without them, you need time away from them. The data is just that: a single source of data. Your perception and experience are another source that shouldn’t be downplayed, but rather calibrated against the objective data.
The paper that prompted this piece is an umbrella review of systematic reviews. I will spare you the research methodology lesson, but in short: studies are done, similar ones are grouped into systematic reviews, and then those reviews are grouped for an umbrella review. This publication is particularly cool because it is “living” and thus will be updated (or at least that’s the intention).
How Much Data is There?
In short, an absolute bucketload (a metric bucketload, not imperial for the record).
The paper included 24 systematic reviews (after culling the ones that didn’t meet criteria for inclusion). These reviews subsequently included 249 studies (again, after removing studies that didn’t meet criteria) once duplicates were culled by the umbrella reviewers (because when looking at systematic reviews, a single article could easily appear in multiple of them). This yielded a total of 30,465 participants, 43% of whom were female (an astonishingly high number to me; I was pleasantly surprised, as I would have anticipated about half of this).
So this is a meaty thing, which should give us some indication of how thorough it is, though remember that its conclusions are segmented, as is the data. For example, insights gleaned on HRV are unlikely to come from all 249 studies and 30,465 participants; they’re almost certainly from a portion of this (probably a relatively small one, too).
How Many Wearables are Scientifically Validated?
The short answer is probably what the informed reader would intuit: not many. The article quotes as few as 11% having at least one biometric validated, which implies some have more than one, which is, frankly, somewhat surprising. Further, given many wearables capture multiple biometrics, the article states this represents only 3.5% of biomarkers being validated.
Why Don’t More Wearables Get Validated?
To be honest, there’s not a lot of incentive for them to do so. It’s a low return on investment with a high downside risk. Let me explain.
Getting validated is a relatively lengthy process. It will vary from biomarker to biomarker, but if I were consulting and asked the question, I’d advise a company to anticipate it taking at least a year to get published (and that’s optimistic; it’s probably longer).
It isn’t cheap. At best you’d be looking at providing free wearables for use in the study, but there’s also the potential that employee time would be needed, not to mention potential funding.
It may or may not increase sales. Ultimately this is the goal for the company, and it may not be achieved. A negative result may hurt sales, or at best not help. A positive result may or may not move the needle. Being the most accurate wearable in a category is often not a decision-making factor for buyers.
If you don’t believe me, just look at the signal from the market. One example is Ultrahuman’s ring: it is not independently validated (that I have read; apologies if I am wrong here, and please do share the research with me if so), yet they have just announced an 18-carat gold version.
Some things don’t need validation. Whilst it would be nice to have things like heart rate validated in all wearables, generally, if people are concerned, they’d opt for a secondary device to measure heart rate, e.g. pairing a watch with a heart rate monitor. Having said this, validation is required in certain circumstances, and in those cases it is a huge driver. An example is arrhythmia detection in wearables like the Apple Watch. This validation is necessary given its implications, BUT it also brings a level of endorsement of the feature (and thus the wearable). So you can see the calculus here.
Gold standards are hard at times. I have written about this previously in the above-linked biomarkers article (hat tip to Marco Altini on that one): a gold standard measure may not exist, and if it does, it may not be feasible to test against given the constraints.
So, How Accurate Are Wearables?
Heart Rate
One of the reviews included in the umbrella review (remember: studies were done, then grouped into systematic reviews, and the paper I am referencing is an umbrella review of those systematic reviews) found an underestimation of heart rate by 3.39 bpm. Further analysis suggested the underestimation was larger for cycling compared to activities of daily living, and that treadmill walking was a little more accurate.
Herein lies some learning: generally this software is built on data, tuned and even perhaps validated, in labs. These environments are quite sterile and may not represent the reality of measurement in the wild.
Of the other reviews included in the umbrella review, one found a measurement error of ±3% (testing multiple devices) and another found a mean absolute percentage error of <10%.
• Data Volume: 165 non-duplicate studies of 5816 participants ~ 50% female
• Take Home: Pretty accurate, measurement location and device capturing the data matters. Likely close enough for any use someone would have. More on heart rate, related metrics and measurement of it here.
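For those curious what a mean absolute percentage error (MAPE) of <10% actually means, it is simple to compute: average the absolute error of each reading as a fraction of the reference value. Here is a minimal sketch in Python; the heart-rate values are invented for illustration, not data from the paper.

```python
def mape(reference, device):
    """Mean absolute percentage error between reference readings
    (e.g. an ECG chest strap) and a wearable's readings, in percent."""
    assert len(reference) == len(device)
    return 100 * sum(abs(r - d) / r for r, d in zip(reference, device)) / len(reference)

# Illustrative readings in bpm (made up): reference vs wrist wearable
ecg = [60, 72, 85, 110, 140]
wearable = [58, 70, 88, 104, 135]

print(f"MAPE: {mape(ecg, wearable):.1f}%")  # about 3.7% for these values
```

Note that MAPE weights errors relative to the true value, so the same absolute error (a few bpm) counts for more at rest than at high exercise intensities.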
Heart Rate Variability (HRV)
The reviews in this realm found a high correlation between measures at rest and the gold standard: as high as 98-100% in one review, and a little lower, at 85-99%, in the second. It fell below 85% when exercising or moving. Remember, HRV can be very noisy data.
• Data Volume: 22 non-duplicate studies of 714 participants ~ 32% female
• Take Home: Accurate if measurement is done properly and capture is appropriate (Marco Altini’s Substack is the place to learn about intricacies, it is linked above).
Arrhythmia Detection
Great sensitivity (people with the condition are correctly flagged as having it: low false negatives) and specificity (people without the condition are correctly flagged as not having it: low false positives). Specifically, 100% sensitivity and 95% specificity; as a result, pooled accuracy was quoted as 97%.
There were other, more varied results from another systematic review included, which looked at smart watches and bands, the big difference being the algorithms. Part of this becomes about tuning the algorithm properly, and about the user experience. Sometimes sensitivity and specificity need to be traded off against each other, and there is a different tolerance or appetite for one over the other. Is it better to have more people incorrectly worried, or more people incorrectly reassured? These are the decisions being balanced.
• Data Volume: 85 non-duplicate studies of 877,127 participants ~ 42% female
• Take Home: Arrhythmia detection, provided the signal is good, is quite straightforward. Mostly because it is really one specific arrhythmia (atrial fibrillation) being detected, and it’s a relatively identifiable one (even for medical students).
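To make the sensitivity/specificity trade-off concrete, here is a minimal Python sketch of how the three numbers relate. The confusion-matrix counts are invented for illustration; they are not figures from the review.

```python
def sens_spec_accuracy(tp, fn, tn, fp):
    """Sensitivity, specificity and accuracy from confusion-matrix counts:
    tp/fn = people with the condition detected/missed,
    tn/fp = people without it correctly cleared/incorrectly flagged."""
    sensitivity = tp / (tp + fn)  # low false negatives -> high sensitivity
    specificity = tn / (tn + fp)  # low false positives -> high specificity
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy

# Illustrative counts: 100 people with AF, 900 without (made up)
sens, spec, acc = sens_spec_accuracy(tp=100, fn=0, tn=855, fp=45)
print(f"sensitivity {sens:.0%}, specificity {spec:.0%}, accuracy {acc:.1%}")
```

With these made-up counts, accuracy lands at 95.5%; in a screening context with few true cases, overall accuracy is dominated by specificity, which is exactly why algorithm tuning (how readily you alert) shapes the user experience.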
VO2 Max (aka Aerobic Capacity)
Methods are key in this context. A meta-analysis in the umbrella review showed wearables using resting measures were within ±15.24% of the actual value, while those using exercise tests were within ±9.83% of the true VO2 Max.
• Data Volume: 14 studies of 403 participants 45% female
• Take Home: Exercising measures are more accurate than resting measures, but none are particularly good. If you want to know your actual VO2 Max, get tested.
Blood Oxygen Saturation
Mean absolute differences were quoted as 2% with some reports of up to 5%.
• Data Volume: 5 studies of 973 participants 44% female
• Take Home: A few variables can impact this, including sample site and individual differences. Given the small range across which a concern could be raised (there’s a meaningful difference between 97% and 93%), using a finger-based measure makes sense, especially as they’re relatively cheap, easy to use and transportable. The range of use cases for this is pretty minimal too.
Step Count
Generally wearables were found to underestimate step counts, but results ranged from around −9% to +12% error in measurement.
• Data Volume: 184 non-duplicate studies of 6197 participants ~ 53% female
• Take Home: As discussed in my article on walking, there are many simple sources of error in these metrics. As such, step counts may be an indication of activity level but they will not be strictly accurate.
Energy Expenditure
Perhaps unsurprisingly for those who read my article on calories, wearables were not particularly good at estimating energy expenditure. Errors ranged from −21.27% to +14.76%.
• Data Volume: 218 non-duplicate studies of 7734 participants ~ 54% female
• Take Home: Not worth taking significant note of. Perhaps helpful as a rough guide to compare between similar training sessions at best when using the same wearable.
Sleep
Generally wearables were found to overestimate sleep, often by more than 10%.
• Data Volume: 95 non-duplicate studies of 2985 participants ~ 54% female
• Take Home: Device type, and thus algorithm, really matters here. I touched on personality types and the potential nocebo effect with sleep trackers in this article. Significant differences in total sleep time may be meaningful; sleep staging is much less accurate than sleep/wake detection.
When it comes to wearables, understanding the way they are measuring and sources of error may help navigate some inaccuracies. Broadly, however, it is worth considering why you want to track something and your personality, to ensure your relationship to the wearable device is a healthy one. These devices can be incredible vessels for behaviour change and health & performance improvements as a result, but there’s no guarantee.
If this article interested you, click through to the other articles I have linked throughout.
This article was written by Dr. David Lipman. Dr. Lipman holds degrees in exercise physiology, podiatry, and medicine, and brings a diverse background spanning elite sports coaching, clinical work, university teaching, and startup consulting, all fueled by his passion for exercise and writing as tools for learning and impact. Check out more of his articles here.