1. Detection diagnostic accuracy

1.1. Review question: What are the most accurate methods for detecting atrial fibrillation in people with cardiovascular risk factors for AF and/or symptoms suggestive of AF?

1.2. Introduction

Please see Evidence review A.

1.3. PICO table

For full details see the review protocol in Appendix A.

Table 1. PICO characteristics of review question.


1.4. Methods and process

This evidence review was developed using the methods and process described in Developing NICE guidelines: the manual.174 Methods specific to this review question are described in the review protocol in Appendix A.

1.5. Clinical evidence

1.5.1. Included studies

Seventy-four studies were included in this review.6, 7, 23, 24, 26, 36, 49, 58, 59, 63, 76, 77, 79, 82, 86, 90, 91, 96, 101, 104, 117, 123, 126, 128, 132, 133, 138, 140, 144, 145, 150, 153, 156, 160-162, 164, 165, 171, 172, 177, 184, 186, 195-197, 201, 208-210, 214, 218-220, 222, 233, 237, 240, 243, 253, 258, 265, 268, 271, 275, 278-281, 283, 284, 286, 288, 295

The characteristics of these studies are summarised in Table 2 and Table 3, and evidence from these studies is summarised in the clinical evidence summaries (Table 4 to Table 14). Further details are available in the study selection flow chart in Appendix C, sensitivity and specificity forest plots and receiver operating characteristics (ROC) curves in Appendix E, and study evidence tables in Appendix D.

Analysis was stratified by the gold standard used in the studies: 1) 12 lead ECG interpreted by an expert (such as a cardiologist or electrophysiologist) or 2) ambulatory monitoring for >24 hours (such as Holter). This stratification was based on the type of AF that would be detected. 12 lead ECG should detect persistent AF but will only pick up paroxysmal AF during specific intervals of time, and is therefore only a valid gold standard for persistent AF. Ambulatory monitoring for >24 hours may be more useful at picking up AF of both persistent and paroxysmal types and so can be used as a more valid gold standard for any type of AF. Table 2 provides details of the reference standards used.

For each of the above separate strata, pre-hoc sub-grouping strategies (conditional on observed heterogeneity) for any diagnostic test meta-analyses were by:

  • expertise of index test interpreter (automated reading / expert reader [such as cardiologist or electrophysiologist] / clinician [such as a nurse or GP who was not deemed to be an expert in analysis of ECG traces] / patient)
  • simultaneity of index and reference tests (yes/no).

Sub-grouping was only carried out for the ‘AliveCor’ test because this was the only analysis where heterogeneity was evident and where there would be at least 3 studies in a sub-group. For the ‘AliveCor’ test, sub-grouping was carried out using the ‘expertise’ strategy and not the ‘simultaneity’ variable because there was evidence from the data that only the former sub-grouping variable could explain the original heterogeneity.

Only 6 diagnostic meta-analyses were possible because at least 3 studies are required for a valid pooling of results, and for most index tests only one or two studies were available. Where diagnostic meta-analysis was possible for a particular test, data from the same study that involved different interpreters were considered as separate data points. Such data were therefore entered alongside each other in the meta-analysis. This was necessary because expertise of examiners had been classified as a ‘sub-grouping’ (conditional stratification) variable rather than a ‘stratification’ (unconditional stratification) variable in the protocol. This meant that we could only stratify the meta-analysis by the expertise of interpreters if there was observable heterogeneity in the initial non-stratified analysis. This inclusion of more than one data point from the same study in the meta-analysis was not deemed to be ‘double-counting’ for two reasons. Firstly, the use of interpreters of different expertise was felt to make data points from the same study sufficiently ‘different’ to each other to the extent that they could be regarded as being from ‘different studies’ for the purposes of meta-analysis. Secondly, in many cases the samples of patients used for different interpreters within the same study were different or only overlapped partially.

In the vast majority of studies the unit of analysis was the person being tested, and if AF was detected once in that person then this was counted as a positive test result (regardless of how many times AF was detected in that person using that test) in the 2x2 table. This reflects the purpose of the tests – to find out if a specific patient has AF or not, and as soon as AF has been detected a diagnosis may be made. However in 5 studies153, 171, 218, 268, 275, the unit of analysis was each of many separate measures done on each person (person-measures). Thus, if AF was detected on several occasions on one person, each event was considered a separate positive test. Since this may influence the strength of overall results, care should be taken with interpretation of these results. Therefore, where such results occur this has been highlighted (sections 1.5.6 and 1.5.7).
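The difference between the two units of analysis can be sketched in code. The following is a minimal illustration only: the records are hypothetical, and the rule that a person counts as reference-positive if any of their reference readings is positive is an assumption made for the sketch, not something stated in the review.

```python
def two_by_two(pairs):
    """Tally (index_positive, reference_positive) pairs into a 2x2 table."""
    table = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for index_pos, ref_pos in pairs:
        if index_pos and ref_pos:
            table["TP"] += 1
        elif index_pos:
            table["FP"] += 1
        elif ref_pos:
            table["FN"] += 1
        else:
            table["TN"] += 1
    return table


def per_person(records):
    """Person as unit: one detection of AF makes that person test-positive."""
    people = {}
    for person, index_pos, ref_pos in records:
        seen_index, seen_ref = people.get(person, (False, False))
        # Assumption for this sketch: any positive reference reading
        # makes the person reference-positive.
        people[person] = (seen_index or index_pos, seen_ref or ref_pos)
    return two_by_two(people.values())


def per_measure(records):
    """Person-measure as unit: every paired reading is its own data point."""
    return two_by_two((index_pos, ref_pos) for _, index_pos, ref_pos in records)


# Hypothetical readings: (person, index test detected AF, reference detected AF)
records = [
    ("p1", True, True), ("p1", True, True), ("p1", False, True),
    ("p2", False, False), ("p2", False, False),
    ("p3", True, False), ("p3", False, False),
]

person_table = per_person(records)
measure_table = per_measure(records)
print("per person:        ", person_table)
print("per person-measure:", measure_table)
```

With the same underlying readings the two approaches give different 2x2 tables, which is why results based on person-measures need separate, careful interpretation.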

Most studies did not include the exact protocol population. For example, some studies contained people without symptoms suggestive of AF. Such studies were included with a quality downgrade for ‘indirectness’, as stated in the protocol. This flexibility was useful because very few studies were available that exactly met the protocol’s population requirements. Furthermore, it was felt that the sensitivity and specificity of the devices would not be greatly influenced by variations in population characteristics, as it was felt implausible that any of these varying characteristics could significantly affect how easy it is to detect AF. It was accepted that different populations would have different prevalence of AF, and that this would therefore affect positive and negative predictive values. However, rather than to directly evaluate predictive values, the clinical aim of this review was to assess the sensitivity and specificity of tests, which independently measure their clinically important ability to differentiate people who have and who don’t have the condition. Nevertheless, it was recognised that positive and negative predictive values are of great importance to health economic analysis, and so these will be calculated from the sensitivity and specificity data from the studies in conjunction with established UK prevalence rates (rather than the prevalence rates in individual studies) if tools are found with strong evidence of adequate sensitivity and specificity. Similarly, although ‘screening’ is outside the remit of this review, diagnostic papers with a reference to screening were included if they contained useful data on the accuracy of tests. The rationale for this is that the determined accuracy of a single device would be similar, whether it is part of a screening strategy or not.
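The dependence of predictive values on prevalence can be made concrete with a short sketch. The function name, the accuracy values and the two prevalence figures below are hypothetical and chosen only for illustration; they are not taken from the included studies or from UK prevalence data.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at a given prevalence.

    Works on expected proportions of the tested population:
    true/false positives and true/false negatives.
    """
    for name, value in (("sensitivity", sensitivity),
                        ("specificity", specificity),
                        ("prevalence", prevalence)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie between 0 and 1")

    true_pos = sensitivity * prevalence
    false_neg = (1.0 - sensitivity) * prevalence
    true_neg = specificity * (1.0 - prevalence)
    false_pos = (1.0 - specificity) * (1.0 - prevalence)

    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Same hypothetical test accuracy, two hypothetical prevalence values.
low_prev = predictive_values(sensitivity=0.90, specificity=0.90, prevalence=0.02)
high_prev = predictive_values(sensitivity=0.90, specificity=0.90, prevalence=0.20)
print("prevalence 0.02: PPV=%.3f NPV=%.3f" % low_prev)
print("prevalence 0.20: PPV=%.3f NPV=%.3f" % high_prev)
```

Sensitivity and specificity are held fixed in both calls, yet the predictive values differ, which is the reason an agreed prevalence rather than each study's own prevalence would be used for such calculations.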

Finally, there were some features of some of the data that should be clarified.

  1. Occasionally, papers reported some data from the index test as unclear, and varied in whether they designated this as ‘AF’ or ‘non-AF’. For the purposes of this review, any such data were designated ‘non-AF’, regardless of how the paper designated the data. This approach was taken because this review is about detection of AF. If a data point is unclear then AF cannot be said to have been detected, so in a binary classification system it can only be designated ‘non-AF’. However, if unclear data in papers were only designated as AF, and there was insufficient information in the paper to allow re-calculation, those data were used.
  2. Sometimes a paper might have several index test interpreters who were at the same level of expertise (for example cardiologist 1, cardiologist 2, etc.) but their data were considered separately. In such cases only the first reported observer was included in this review, to avoid ‘double counting’ of similar data.
  3. Destegne, 201758 provided data for a sample including people with pacemakers or implanted cardiac monitors, as well as data for a sample with such people excluded. The latter sample was used for this review as people with pacemakers or implanted cardiac monitors were not part of the population in other studies, and their inclusion had a significant effect on results.

1.5.2. Excluded studies

Please see the excluded studies list in Appendix I.

1.5.3. Summary of clinical studies included in the evidence review (Gold standard = 12 lead ECG stratum)

Table 2. Summary of studies included in the evidence review for detection of atrial fibrillation.



1.5.5. Summary of clinical studies included in the evidence review (Gold standard = >24 hours ambulatory monitoring stratum)

Table 3. Summary of studies included in the evidence review for detection of atrial fibrillation.


See Appendix D: for full evidence tables.

1.5.6. Quality assessment of clinical studies included in the evidence review

For measurement of imprecision, clinical decision thresholds for sensitivity and specificity were set at 0.90 and 0.60.

STRATUM 1: 12 lead ECG interpreted by expert cardiologist/electrophysiologist as gold standard
Table 4. Clinical evidence summary: diagnostic test accuracy for mobile ECG devices (12 lead ECG interpreted by expert cardiologist/electrophysiologist as gold standard).


Table 5. Clinical evidence summary: diagnostic test accuracy for blood pressure monitors (12 lead ECG interpreted by expert cardiologist/electrophysiologist as gold standard).


Table 6. Clinical evidence summary: diagnostic test accuracy for pulse palpation (12 lead ECG interpreted by expert cardiologist/electrophysiologist as gold standard).


Table 7. Clinical evidence summary: diagnostic test accuracy for photoplethysmography (12 lead ECG interpreted by expert cardiologist/electrophysiologist as gold standard).


Table 8. Clinical evidence summary: diagnostic test accuracy for 3-lead tele ECG (12 lead ECG interpreted by expert cardiologist/electrophysiologist as gold standard).


Table 9. Clinical evidence summary: diagnostic test accuracy for 6 lead ECG (12 lead ECG interpreted by expert cardiologist/electrophysiologist as gold standard).


Table 10. Clinical evidence summary: diagnostic test accuracy for other non-12 lead ECG (12 lead ECG interpreted by expert cardiologist/electrophysiologist as gold standard).


Table 11. Clinical evidence summary: diagnostic test accuracy for 12 lead ECG interpreted by automated algorithm or non-expert interpreters (12 lead ECG interpreted by expert cardiologist/electrophysiologist as gold standard).


1.5.7. Quality assessment of clinical studies included in the evidence review

STRATUM 2: >24 hour ambulatory monitoring [such as Holter] as gold standard
Table 12. Clinical evidence summary: diagnostic test accuracy for blood pressure monitors (>24 hour ambulatory monitoring as gold standard).


Table 13. Clinical evidence summary: diagnostic test accuracy for <7 day Holter devices (7 day Holter as gold standard).


Table 14. Clinical evidence summary: other longer term devices (>24 hour ambulatory monitoring as gold standard).


1.6. Economic evidence

Please see evidence review A.

1.7. The committee’s discussion of the evidence

1.7.1. Interpreting the evidence

The outcomes that matter most

For the diagnostic accuracy review, the outcomes were sensitivity and specificity. For a test that is suitable to be used alone as a definitive diagnostic test (in place of 12 lead ECG), both sensitivity and specificity are of equal value, as a definitive test needs to have almost perfect sensitivity and specificity. High sensitivity is essential to avoid people with true AF being missed and therefore untreated, as this can lead to serious sequelae such as stroke. High specificity is equally important to prevent people without AF being misdiagnosed as having it, which may lead to unnecessary prescription of anticoagulants, antiarrhythmic drugs, or invasive procedures, all of which carry a burden of serious adverse effects.

In contrast, for tests that might be used as the first part of a two stage testing process (an example of such a two stage process is pulse palpation followed by 12 lead ECG in people who test positive) then sensitivity may be more important than specificity. Reasons for this are as follows. In a two-test scenario, the initial test is used as a filter to decide who goes on to the resource-intensive 12 lead ECG, and this could be achieved by either an extremely sensitive initial test or an extremely specific initial test. With a highly sensitive initial test, only initial positives go on to the next stage of testing (where the false positives resulting from the sub-optimal specificity of the initial test can be ‘weeded out’ by 12 lead ECG). Initial negatives can be safely discarded from the diagnostic process when the initial test has high sensitivity, because very high sensitivity means that the initial negatives should contain hardly any people with true AF. In contrast, with an extremely specific initial test, only initial negatives go on to further testing (where the false negatives resulting from the sub-optimal sensitivity of the initial test can be ‘weeded out’ by 12 lead ECG). The initial positives can be regarded as diagnostic in the presence of high specificity because high specificity means that almost all initial positives will have true AF. Because there are likely to be fewer initial positives than initial negatives, using a highly sensitive test is likely to lead to fewer people going on to the 12-lead test than use of a highly specific test. A highly sensitive initial test is therefore the preferable option for a two-stage process because the purpose of a two-stage process is to limit use of the resource-intensive 12 lead ECG.
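The argument above can be checked with simple expected counts. The sketch below is illustrative only: the cohort size, prevalence and the accuracy values of the two initial tests are hypothetical, and the function name is introduced just for this example.

```python
def initial_test_split(n_people, prevalence, sensitivity, specificity):
    """Expected split of a cohort after the initial (filter) test."""
    with_af = n_people * prevalence
    without_af = n_people - with_af

    true_pos = sensitivity * with_af
    false_neg = with_af - true_pos
    true_neg = specificity * without_af
    false_pos = without_af - true_neg

    return {
        "initial_positives": true_pos + false_pos,
        "initial_negatives": true_neg + false_neg,
        "false_negatives": false_neg,
        "false_positives": false_pos,
    }


# Hypothetical cohort: 1000 people, 10% of whom have AF.
# Option A: highly sensitive initial test; only initial positives go on to 12 lead ECG.
sensitive = initial_test_split(1000, 0.10, sensitivity=0.99, specificity=0.70)
referred_a = sensitive["initial_positives"]
missed_a = sensitive["false_negatives"]  # people with AF discarded as negatives

# Option B: highly specific initial test; only initial negatives go on to 12 lead ECG.
specific = initial_test_split(1000, 0.10, sensitivity=0.70, specificity=0.99)
referred_b = specific["initial_negatives"]
wrongly_diagnosed_b = specific["false_positives"]  # positives accepted without ECG

print(f"Option A refers about {referred_a:.0f} people, missing about {missed_a:.0f}")
print(f"Option B refers about {referred_b:.0f} people, "
      f"wrongly diagnosing about {wrongly_diagnosed_b:.0f}")
```

Under these assumed figures, far fewer people go on to 12 lead ECG with the highly sensitive initial test, because people without AF make up most of the cohort and therefore most of the initial negatives.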

Positive predictive value (PPV) and negative predictive value (NPV) are important for health economic considerations but are less important for evaluating clinical utility, and are often unreliable when calculated from study data as they are dependent on the prevalence which may not always be representative in studies. The aim had been to calculate PPV and NPV for any tools that had good evidence of adequate sensitivity and specificity in relation to an agreed prevalence rate of AF. However, this was not carried out because no tools were identified.

For the RCT review, outcomes were quality of life, mortality, stroke and thromboembolism, major bleeding, all-cause hospitalisation, confirmed diagnosis of AF and initiated anticoagulants for AF. All were regarded as critical by the committee, but quality of life, stroke/systemic embolism, mortality, and confirmed diagnosis of AF were deemed the most relevant for decision-making. These were prioritised over other critical outcomes because ‘quality of life’ was felt to provide the most comprehensive measure of benefit to the patient, ‘stroke and systemic thromboembolism’ was regarded as the major serious complication of AF, ‘mortality’ was felt to best characterise the harms of treatment, and ‘confirmed diagnosis of AF’ was thought to best characterise the benefits of treatment.

The quality of the evidence

For the diagnostic accuracy evidence, most data were rated as at serious or very serious risk of bias, because of a lack of simultaneity between index and reference tests, and because of a lack of blinding in some studies. Indirectness was also often rated as serious because the populations in studies differed from the protocol definition. Overall, most data were rated as low or very low. For the RCT evidence, a similar picture existed. Serious or very serious risk of bias was largely due to issues around selection and attrition bias, and again indirectness of populations was a major issue. Outcomes were therefore mostly rated as low or very low.

Benefits and harms

The diagnostic accuracy data for the different index test devices in relation to the gold standard of 12 lead ECG interpreted by a cardiologist/electrophysiologist were initially discussed. These devices included mobile ECG devices, HR monitors, blood pressure measurements, photoplethysmographic technique, pulse palpation, other ECG measures and 12 lead ECG not interpreted by an expert. The sensitivity and specificity of the majority of these devices were regarded by the committee as insufficiently high to permit their use as a single diagnostic test. Some devices, such as HR monitors, BP devices or plethysmographic devices, did approach 100% sensitivity and specificity, but these had often been tested in small samples leading to imprecise estimates. Alternatively, such estimates were from large but solitary studies. The committee noted that accuracy differed quite widely between different studies looking at the same test and they were therefore unable to make recommendations based on results from single studies.
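The effect of sample size on the precision of a sensitivity estimate can be illustrated as follows. This is a sketch under stated assumptions: the Wilson score interval is used here for convenience and is not necessarily the method used in the review, and the counts are hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion (z = 1.96)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1.0 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half_width, centre + half_width


# Hypothetical studies with the same observed sensitivity (95%)
# but different numbers of people with AF on the reference standard.
small_study = wilson_interval(successes=19, n=20)
large_study = wilson_interval(successes=190, n=200)

print("n=20:  %.3f to %.3f" % small_study)
print("n=200: %.3f to %.3f" % large_study)
```

Both hypothetical studies report the same point estimate, but the interval from the smaller study is much wider, so an estimate close to 100% from a small sample carries far less certainty than the point estimate suggests.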

Having decided that none of the tests could be used as an individual (definitive) diagnostic test, the committee discussed whether any of the tests could be used as a first-line test, prior to 12 lead ECG (please see ‘The outcomes that matter most’ above for an explanation of this process). The committee realised that such tests would need perfect or almost perfect sensitivity to avoid losing some people with AF from the diagnostic process (with enough specificity to allow a worthwhile reduction in the number going on to 12 lead testing compared to 12 lead testing used alone). The current recommendation is to use pulse palpation as the initial test, and thus an alternative test would need to have clear superiority in sensitivity over pulse palpation (with similar specificity) to justify replacement of pulse palpation. Some of the devices had sensitivity point estimates that exceeded those of pulse palpation, with upper 95% confidence limits that extended closer to maximal sensitivity than those for pulse palpation. This provided weak evidence that some of the devices might be of greater use as a first line test than pulse palpation. However, the confidence intervals of the devices overlapped with those of pulse palpation, demonstrating a level of uncertainty about such superiority in the population. The committee were of the opinion that, given this level of uncertainty, the evidence was insufficient to change the established practice of pulse palpation, which is a core clinical skill in widespread use, and which is extremely quick and low-cost to carry out. However, they felt that new devices had promise, which might be manifested in further high-quality research, and so a research recommendation was proposed, alongside a continuation of the current recommendation.

It is important to note a subtle change to the recommendations regarding the definitive test to be used if pulse irregularities are observed. In the previous guideline the recommendation had been to use ‘ECG’ as the definitive test, whereas in the present guideline we are specifying ‘12-lead ECG’ as the definitive test. This change was noted by the committee to be very important to prevent non-12 lead ECG such as lead I devices (which this review has shown to be lacking in adequate accuracy compared to 12 lead ECG) being used as the definitive test.

The diagnostic accuracy of the devices tested in relation to a longer-term gold standard (>24-hour ambulatory monitoring) was also considered by the committee. This evidence was regarded as particularly important as it was the only evidence able to inform the accuracy of detection of paroxysmal AF (12 lead ECG usually lasts only 10 seconds and so whilst it is perfectly good as a gold standard for detecting persistent AF it is often inadequate for detecting paroxysmal AF). The committee again noted that the evidence did not suggest that any specific test or device should be recommended but did note that the evidence clearly demonstrated that the accuracy of detection increased with the duration of testing. Therefore, the committee recommended that testing for suspected paroxysmal AF should be continued for as long as possible by any form of continuous or loop monitoring.

The committee agreed that the RCT review did not offer particularly useful evidence to inform recommendations, over and above the data provided by the diagnostic accuracy review. In particular, the committee highlighted that the follow up periods of the included studies were too short to allow a meaningful picture of downstream clinical outcomes. The RCT review was also noted to have serious gaps in terms of many of the available tests not having been studied.

1.7.2. Cost effectiveness and resource use

One cost-utility analysis was identified comparing single time point lead-I ECG devices with manual pulse palpation (MPP) followed by a 12-lead ECG in primary or secondary care for the detection of AF in people presenting to primary care with signs or symptoms of AF and who have an irregular pulse. This cost-utility analysis was conducted as part of the NICE Diagnostic Guidance DG35, published in 2019, for lead-I devices. The study found that in all base case scenarios (these varied the time to and location of confirmatory 12 lead ECG) Kardia Mobile, where treatment for AF is initiated following a positive result, ahead of the confirmatory 12-lead ECG test, was more cost-effective than the standard diagnostic pathway, where no treatment is initiated until 12 lead ECG testing is complete. Furthermore, Kardia Mobile dominated (less costly and more effective) all other lead-I devices included in the analysis. This study was partially applicable as it did not include all comparators in the protocol for this question. There were potentially serious limitations, primarily because the sensitivity and specificity data used in this analysis were from studies conducted in asymptomatic patients, and so this was indirect evidence. Furthermore, the economic evaluation is only relevant to primary care practices where patients have to wait at least 48 hours between an initial consultation with the GP and a 12-lead ECG.

In addition to this study, unit costs for different methods of detecting AF were presented, including current practice, which is manual pulse palpation followed by 12-lead ECG in those with an irregular pulse. The committee noted that although the lead-I devices do not appear particularly costly per use, they may add a significant resource burden in terms of the need for expert interpretation. This would either require training of GPs or would necessitate sending lead-I results to cardiologists for guidance and advice.

The committee considered the published health economic analysis alongside the clinical evidence and concluded that there was insufficient direct evidence to support replacing the current methods of detecting AF. In particular, the health economic evidence is based on indirect clinical evidence and there is uncertainty as to whether the sensitivity and specificity can be translated from an asymptomatic to a symptomatic AF population. This is in line with the guidance from DG35.

Overall, therefore, the committee have kept the previous recommendations, only adjusting the wording to make these clearer. As they represent current practice, no resource impact is anticipated.

1.7.3. Other factors the committee took into account

The committee noted that the benefit of anticoagulation for asymptomatic AF that has not been documented on 12 lead ECG is uncertain and research is currently being conducted.

The committee noted that the use of hand-held devices could improve diagnosis in people who find it impossible or difficult to access ECG services, for example people in care homes.

The committee acknowledged the importance of primary care networks, including nurses and pharmacists, in the detection of AF in the community.

The committee highlighted that stroke prevention is one of the five programme work streams in the National Stroke Programme, which underpins the Long Term Plan with actions specifically around better diagnosis and management of AF. Integrated Stroke Delivery Networks have been set up across England to deliver on these commitments locally, and to implement improvements across the pathway at a regional level and should support efforts to improve AF detection and management.

The committee noted the challenges to delivery of healthcare in the context of COVID-19. Alternatives to face-to-face consultations should be explored, with additional support to help people manage their condition. To mitigate the current obstacles to in-person AF detection and management, NHS England and Improvement’s NHS at Home initiative, for example, aims to support people to monitor their own health conditions remotely and to use technology that allows clinicians to monitor those conditions remotely.

The committee noted that opportunistic screening was outside of the remit for this guideline.


Appendix B. Literature search strategies

This literature search strategy was used for the following reviews:

  • What are the most accurate methods for detecting atrial fibrillation in people with cardiovascular risk factors for AF and/or symptoms suggestive of AF?

The literature searches for this review are detailed below and complied with the methodology outlined in Developing NICE guidelines: the manual.174

For more information, please see the Methods Report published as part of the accompanying documents for this guideline.

B.1. Clinical search literature search strategy (PDF, 323K)

B.2. Health Economics literature search strategy (PDF, 304K)

Appendix C. Clinical evidence selection

Figure 1. Flow chart of clinical study selection for the review (PDF, 162K)

Appendix D. Clinical evidence tables

Download PDF (1.4M)

Appendix E. Coupled sensitivity and specificity forest plots and sROC curves

Download PDF (383K)

E.1. ROC curves (PDF, 240K)

Appendix F. Health economic evidence selection

Figure 112. Flow chart of health economic study selection for the guideline (PDF, 275K)

Appendix G. Health economic evidence tables

Please see evidence review A.

Appendix H. QUADAS2 risk of bias assessment

Download PDF (394K)

Appendix I. Excluded studies

I.1. Excluded clinical studies

Download PDF (267K)

I.2. Excluded health economic studies


Appendix J. Research recommendations

J.1. Detection of persistent AF (PDF, 187K)


Diagnostic evidence review

Developed by the National Guideline Centre, Royal College of Physicians

Disclaimer: The recommendations in this guideline represent the view of NICE, arrived at after careful consideration of the evidence available. When exercising their judgement, professionals are expected to take this guideline fully into account, alongside the individual needs, preferences and values of their patients or service users. The recommendations in this guideline are not mandatory and the guideline does not override the responsibility of healthcare professionals to make decisions appropriate to the circumstances of the individual patient, in consultation with the patient and, where appropriate, their carer or guardian.

Local commissioners and providers have a responsibility to enable the guideline to be applied when individual health professionals and their patients or service users wish to use it. They should do so in the context of local and national priorities for funding and developing services, and in light of their duties to have due regard to the need to eliminate unlawful discrimination, to advance equality of opportunity and to reduce health inequalities. Nothing in this guideline should be interpreted in a way that would be inconsistent with compliance with those duties.

NICE guidelines cover health and care in England. Decisions on how they apply in other UK countries are made by ministers in the Welsh Government, Scottish Government, and Northern Ireland Executive. All NICE guidance is subject to regular review and may be updated or withdrawn.

Copyright © NICE 2021.
