During Run 2, the LHC achieved outstanding performance (Image: CERN CDS)
CERN’s Accelerator Fault Tracking (AFT) system aims to help answer questions such as “Why are we not doing physics when we should be?” and “What can we do to increase machine availability?”
People have tracked faults for many years, using numerous diverse, distributed and unrelated systems. As a result, and despite a lot of effort, it has been difficult to get a clear and consistent overview of what is going on, where the problems are, how long they last and what their impact is. This is particularly true for the LHC, where a fault may be followed by a long recovery time even after it has been fixed.
The AFT project was launched in February 2014 as a collaboration between the Controls and Operations groups, with stakeholders from the LHC Availability Working Group (AWG).
The project was initially divided into three phases. Phase 1 was completed on time, ahead of the LHC restart after Long Shutdown 1 (2013-2014), and delivered the means to capture LHC fault data consistently and coherently from an operational perspective. Phase 2 has been in progress during 2015-16, working on detailed fault classification and analysis for equipment groups. Phase 3 (pending) foresees extended integration with other systems, such as asset management tracking, to enable predictive failure analysis and the planning of preventive maintenance.
AFT supports teams across CERN, and output from the Web application regularly features in machine coordination and operations meetings. Furthermore, the AWG and equipment group representatives use AFT data and statistics to analyse the performance of their systems and target areas for improvement, as presented at conferences and workshops and summarized in regular AWG reports.
If a picture is worth a thousand words, then take a look at the AFT Cardiogram (Figure 1), which displays LHC faults that occurred in 2016 between Technical Stops 1 and 2, together with the machine activity data.
Figure 1 LHC Faults between 2016 Technical Stops 1 and 2
AFT can represent relationships between faults, such as child faults (shown in pink on the Cardiogram) and faults that block the resolution of another fault. With such data, availability can be analysed from different perspectives: raw system downtime, impact on machine availability (accounting for faults that occur in the shadow of ongoing faults) and root cause (assigning child fault downtime to parent faults). Figure 2 shows an example of such a comparison for a specific sub-domain of systems, as displayed in the AFT Web application.
Figure 2 Comparison of Fault Time from different perspectives for LHC Technical Services
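To make the three perspectives concrete, here is a minimal Python sketch. It is not AFT's actual implementation: the `Fault` record, the function names, the system names and the sample times are all hypothetical, and the shadowing rule (overlapping time is credited to the fault that started first) is an assumption made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Fault:
    """Hypothetical fault record; times are hours since the start of the period."""
    fault_id: str
    system: str
    start: float
    end: float
    parent_id: Optional[str] = None  # set when this fault is a child of another


def raw_downtime(faults):
    """Raw system downtime: every fault counts for its full duration."""
    totals = defaultdict(float)
    for f in faults:
        totals[f.system] += f.end - f.start
    return dict(totals)


def machine_impact(faults):
    """Impact on availability: time in the shadow of an earlier fault is not counted.

    Assumption: overlapping time is credited to the fault that started first.
    """
    totals = defaultdict(float)
    covered_until = float("-inf")  # latest end time of the faults seen so far
    for f in sorted(faults, key=lambda f: (f.start, f.end)):
        effective_start = max(f.start, covered_until)
        totals[f.system] += max(0.0, f.end - effective_start)
        covered_until = max(covered_until, f.end)
    return dict(totals)


def root_cause_downtime(faults):
    """Root cause view: a child's raw duration is assigned to its root ancestor.

    Assumes parent links are valid and contain no cycles.
    """
    by_id = {f.fault_id: f for f in faults}
    totals = defaultdict(float)
    for f in faults:
        root = f
        while root.parent_id is not None:
            root = by_id[root.parent_id]
        totals[root.system] += f.end - f.start
    return dict(totals)


# Hypothetical sample: a power fault triggers a cryogenics fault, and an
# unrelated cooling fault occurs entirely in the shadow of the cryogenics one.
faults = [
    Fault("F1", "Electrical Network", 0.0, 2.0),
    Fault("F2", "Cryogenics", 1.0, 6.0, parent_id="F1"),
    Fault("F3", "Cooling and Ventilation", 3.0, 4.0),
]

print("raw:       ", raw_downtime(faults))
print("impact:    ", machine_impact(faults))
print("root cause:", root_cause_downtime(faults))
```

With this sample data, the raw view charges Cryogenics 5 hours, the impact view charges it only 4 hours (the first hour overlaps the electrical fault) and charges the cooling fault nothing, and the root cause view moves all 5 cryogenics hours onto Electrical Network, for a total of 7. The impact totals sum to the 6 hours during which the machine was actually unavailable, which is why the three views can rank the same systems very differently.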
Other functionality includes fault searching, data export and a workflow for fault follow-up by different experts. As with most data-centric systems, the value of the infrastructure and tools is governed by the quality of the data, so the role of the AWG, which meets regularly to ensure the completeness and correctness of the AFT data, should not be underestimated.
The technologies involved are a database to persist fault data, a Java server with REST APIs for data exchange with the Operations team’s E-logbooks (and potentially other systems), and a dedicated Web application for data editing, visualization and analysis (shown in the screenshots above).
The AFT system was designed not to be LHC-specific, so it can track faults for other accelerators as well. Following the success of AFT for the LHC during 2015, the CERN Machine Advisory Committee proposed in 2016 that AFT be used for CERN’s Injector Complex. Work therefore started in late 2016 to prepare for AFT usage in the Injector Complex from the start of 2017 operation at the end of March.