## Status and Plans for PXD DAQ, Monitoring and SlowControl

## TRG/DAQ Workshop

29.11-2.12.2022, Nara

Björn Spruck

#### **PXD DAQ Overview Phase3**

- Detector Introduction
  - Technology and Readout
- DAQ
  - Components, ROI selection, HLT feedback
- "Slow" Control and Monitoring
- Beam abort Emergency off
- Data taking issues
- SEUs
- This talk will not cover details of module performance (efficiencies, HV currents, noise, alignment etc)

### **Combining Vertex Detector (One Half Shell)**



#### **DEPFET Pixel Detector Concept**

- Depleted P-channel Field-Effect Transistor pixels on fully depleted silicon bulk
- Fast charge collection (~ns) into internal gate
- Readout current is modulated by collected charge
- Internal amplification, large Signal-to-Noise
- Gate must be cleared after readout
- Low energy consumption and heat dissipation







#### PXD Module






#### **PXD Sensors**

- Mechanically self-supporting 75µm thin sensors
- Pixel size down to 50x55µm<sup>2</sup>
- **Rolling shutter read-out**  $\rightarrow$  low power
- $50 \text{kHz} \rightarrow 20 \ \mu \text{s}$  integration time
- Design: 1% occupancy in layer 1
- 3% occupancy limit (DHP, DAQ, tracking)
- Rad. hard sensor and ASICs
- 40 sensors, 250x768 pixels each
- Power is dissipated mainly in the ASICs at the end of stave
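The numbers above can be turned into a quick back-of-envelope estimate. The sketch below is illustrative only: it assumes the same occupancy on all 40 sensors, which is an upper bound because the 1% design figure is quoted for layer 1. The helper names are made up for this example.

```python
# Back-of-envelope numbers from the sensor slide; all helper names are hypothetical.
N_SENSORS = 40
PIXELS_PER_SENSOR = 250 * 768  # pixels per sensor as quoted on the slide


def integration_time_us(frame_rate_hz: float) -> float:
    """Rolling-shutter integration time = one full frame period, in microseconds."""
    return 1e6 / frame_rate_hz


def fired_pixels_per_frame(occupancy: float, n_sensors: int = N_SENSORS) -> float:
    """Fired pixels per readout frame, ASSUMING the same occupancy on every sensor."""
    return occupancy * n_sensors * PIXELS_PER_SENSOR


if __name__ == "__main__":
    print(integration_time_us(50e3))     # frame period at 50 kHz frame rate
    print(fired_pixels_per_frame(0.01))  # design occupancy (layer 1 figure)
    print(fired_pixels_per_frame(0.03))  # occupancy limit
```

Converting fired pixels into a data rate would additionally need the encoded size per hit, which is not given here and is therefore left out.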

- DHP: correction, digital processing, zero suppression, trigger and timing
- 2-phase CO<sub>2</sub> cooling









#### **A Little Bit of History**

- Phase 3 (2019-2022)
  - Currently only the inner layer + 2 outer ladders are installed, because low yield in ladder assembly delayed production
  - Full PXD replacement installation in LS1/2023
  - No significant change for DAQ/SC, full PXD DAQ chain/interfaces already installed/tested in 2019



#### **Module Control/DAQ System Layout**



#### **Concept of PXD Data Reduction**

- Challenge: PXD unfiltered raw data rate $\rightarrow$ 10x that of other Belle II detectors
- Most data is from (not-triggered) background  $\rightarrow$  Data reduction needed
- Concept:
  - Read out all triggered events
  - Store them in some buffer until HLT has decided
  - When HLT rejects event, scrap it
  - When HLT accepts event, use track intercepts (ROIs) to filter PXD data before sending it to Event Builder2
- Buffering and filtering happens on the "ONSEN" system
- Cope with unordered event sequence due to varying HLT latency
- Per definition: HLT output seq = ONSEN output seq = EB2 input seq (from HLT and ONSEN)
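The buffer-then-decide concept can be sketched in a few lines. This is a minimal illustration, not the ONSEN firmware: the class name, the hit layout `(sensor, row, column)` and the `in_roi` callback are assumptions made for this example. It shows how keying the buffer by event number copes with HLT decisions that arrive out of order, and why the output sequence follows the HLT sequence.

```python
from typing import Callable, Dict, List, Optional, Tuple

Hit = Tuple[int, int, int]  # (sensor id, row, column); illustrative layout only


class TriggeredEventBuffer:
    """Holds the raw hits of every triggered event until the HLT has decided."""

    def __init__(self) -> None:
        self._pending: Dict[int, List[Hit]] = {}

    def store(self, event_number: int, hits: List[Hit]) -> None:
        """Buffer the unfiltered hits of one triggered event."""
        self._pending[event_number] = list(hits)

    def on_hlt_decision(self, event_number: int, accepted: bool,
                        in_roi: Callable[[Hit], bool]) -> Optional[List[Hit]]:
        """Release one event; decisions may arrive in any order.

        Returns None for a rejected event, else only the hits inside ROIs.
        """
        hits = self._pending.pop(event_number)  # frees the buffer in both cases
        if not accepted:
            return None
        return [hit for hit in hits if in_roi(hit)]

    def pending(self) -> int:
        """Number of events still waiting for an HLT decision."""
        return len(self._pending)
```

Because events leave the buffer only when their decision arrives, the output order is the order of HLT decisions. Events whose decision never arrives stay in `_pending`, so a growing `pending()` count is a sign of missing decisions.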



#### **PXD DAQ Scheme**



- PXD unfiltered raw data rate  $\rightarrow$  10x that of other Belle II detectors
- Separate readout path
- Remove data not belonging to tracks before storage
- Data reduction in size: 1/10 by High Level Trigger based "Region Of Interest" calculation from CDC and SVD track information
- Data reduction by "rate": full event rejection from HLT (assumed 1/3 at design time)
- Feedback from HLT to PXD readout and selection of pixels within rectangular ROIs
- ROI calculation on HLT is always on but filtering is currently turned off as data rates are still low

#### **Online ROI Selection - Remarks**

- ROI selection needs reliable ROI calculation on High Level Trigger (alignment) as any data outside ROIs will be lost.
- Select pixels within rectangular regions. Implementation of cluster based selection has been finally dropped for current PXD-DAQ scheme.
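Selecting pixels within rectangular regions amounts to a simple bounds check per hit. The sketch below is only an illustration of that idea; the `Roi` fields and the `(sensor, row, column)` hit layout are assumptions, not the actual ROI data format.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

Hit = Tuple[int, int, int]  # (sensor id, row, column); illustrative layout only


@dataclass(frozen=True)
class Roi:
    """Rectangular region of interest on one sensor (bounds are inclusive)."""
    sensor: int
    row_min: int
    row_max: int
    col_min: int
    col_max: int

    def contains(self, sensor: int, row: int, col: int) -> bool:
        return (sensor == self.sensor
                and self.row_min <= row <= self.row_max
                and self.col_min <= col <= self.col_max)


def select_hits(hits: Iterable[Hit], rois: List[Roi]) -> List[Hit]:
    """Keep only hits that fall into at least one ROI; everything else is lost."""
    return [hit for hit in hits if any(roi.contains(*hit) for roi in rois)]
```

With an empty ROI list every hit is dropped, which is why a reliable ROI calculation on the HLT matters.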

Typical hit map occupancies during the spring 2019 run with online calculated ROIs. Now (2022) the hit maps are busier $\rightarrow$ see later slides



#### **PXD DAQ – Main Data Flow**

- Module (DHP ASIC)
- DHH
- ONSEN
- (HLT/DATCON  $\rightarrow$  ONSEN not shown)

- 160 opt links DHP  $\rightarrow$  DHE
- 32 opt links DHC  $\rightarrow$  ONSEN
- 32 Gbit Eth links ONSEN $\rightarrow$ EB2
- 1Gbit Eth HLT  $\rightarrow$  ONSEN



#### **DHH Setup and Scheme**

- Different boards involved
- DHI module control (JTAG, TRG)
- DHE data readout, trigger matching
- DHC data concentration
- Optimization
  - Outer layer < inner layer ; phi dependence:
    - → Load balancing (2 inner + 3 outer modules in one DHC), reduces mean rate on DHC output. Sub event building and distributing events on links.
    - → reduce number of links/boards for ONSEN and EB2
  - But: ONSEN and EB2 need to cope with that concept, too
  - The system needs to know which event data is on which selector and on which link each event arrives





#### **ONSEN and DHH - DAQ Hardware**

- Based on ATCA standard
  - Self-developed: ATCA Carriers with AMC cards which do the actual work. Rear Transition modules (RTM) for connections (DHH) and programming/debugging (ONSEN)
- Common monitoring by IPMI





Fig. 4. Layout of the AMC module used for the DHE and DHC cards.

DHH: Top of Belle



ONSEN: EHut



Advanced Telecommunication Computing Architectures (ATCA) Shelf



Compute Node Carrier Board (CNCB) 32+1 AMC (Virtex 5)



Advanced Mezzanine Card (AMC)



bjoern.spruck@belle2.org

#### **Slow Control and Monitoring**

- PXD Slow Control uses EPICS
- Interfaces to IPBus, IPMI, UNICOS, NSM2, ...
- E.g. IBBelle integration, DHH/ONSEN hardware monitoring by IPMI, etc.
- Huge system (number of PVs)
- Configuration stored in our PXD ConfigDB
  - Sophisticated sequences for powering the modules (ASICs), interchange of power up and configuration of ASICs
- PXD EPICS Archiver(s)
- Logging: DB with Elasticsearch, Elog, Rocket.Chat
  - (independent of the interface to ES/Kibana)
- Control and Monitoring GUI
  - Control System Studio (CSS), provided as a configured package including all needed plugins
  - Using the Elasticsearch log plugin, alarm panels, etc.
- Alarm System (BEAST)



#### **Slow Control Organization**

- Besides plain EPICS IOCs, we use many Python-based IOCs for shifter operation, monitoring, LocalDAQ/analysis and all calibrations.
- Slow Control IOCs run on the pxdioc server as (auto-)started system services within a screen session; their logging is sent automatically to our Elasticsearch database
  - IOCs are available as rpm packages for easy and defined versioning and installation with the system package manager. rpms are compiled from the repository on stash.
  - A few IOCs run on dedicated hardware or machines directly connected to the hardware
- Python based IOCs which depend on our "labframework" libraries are started by system service within a screen on the LocalDAQ machine
  - LocalDAQ, calibration, shifter tools (end of run elog, shifter login, shift report, jira ticketing)

#### **PXD** Alarm System

- PXD uses the BEAST alarm system integrated in CSS (Control System Studio)
- Alarm panel, table & hierarchical alarm tree
- Latching alarms
- Requires shifter acknowledgement
- Embedded actions (buttons) & guidance text
- Acoustic voice announcements
- Database for configuration
  - Allows complex conditions for alarms
    - e.g. n times within/outside given limits
- Alarm messages (together with other log messages) pushed to ElasticSearch database and RocketChat
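The combination of a latching alarm and an "n times outside given limits" condition can be sketched as follows. This is not the BEAST configuration or API; the class is hypothetical, and it assumes that "n times" means n consecutive samples outside the limits.

```python
class LatchingLimitAlarm:
    """Raises after n consecutive out-of-limit samples and stays raised.

    ASSUMPTION: 'n times outside limits' is read as n consecutive samples.
    """

    def __init__(self, low: float, high: float, n_required: int) -> None:
        self.low = low
        self.high = high
        self.n_required = n_required
        self._consecutive = 0
        self.latched = False

    def update(self, value: float) -> bool:
        """Feed one sample; returns the (latched) alarm state."""
        if self.low <= value <= self.high:
            self._consecutive = 0  # back in range, but a latched alarm stays on
        else:
            self._consecutive += 1
            if self._consecutive >= self.n_required:
                self.latched = True
        return self.latched

    def acknowledge(self) -> None:
        """Shifter acknowledgement clears the latch and the counter."""
        self.latched = False
        self._consecutive = 0
```

Latching means a short excursion cannot go unnoticed: the alarm remains visible until a shifter has acknowledged it, even if the value has already returned to normal.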

#### https://github.com/ControlSystemStudio/cs-studio/wiki/BEAST





- Pedestal upload or retaking pedestals is the main contributor to PXD down time
- Pedestal upload is necessary if the noise in PXD modules changes → more often than anticipated
  - Noisy areas change (→ pedestals change), esp. after a power cycle or emergency off (by diamond beam abort)
- Changes are only visible after a run has been started
- Pedestal taking can only be done in PEAK and when no run is ongoing.
- Tricky because it interfaces with Run Control, HV Control and LocalDAQ. The configuration needs to be changed twice and local triggering needs to be enabled. Fully automated now.
- Pedestal taking automated already since 2019
- Automatic upload if calibration was requested
- Minimal shifter interaction (single button + restart a run)
- Preventive measures to reduce down times
  - Take & upload new pedestals after a power cycle to avoid STOP/START after a few minutes of run time
  - Automatically upload new pedestals if they are too old (>4h)
  - Delay taking pedestals by 20s after Ramp Up (settle time)
    - $\rightarrow$ more stable pedestals, less noise
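The upload rules listed above can be summarised as a small decision function. This is a hedged sketch of the policy as described on the slide, not the actual calibration IOC: the function and parameter names are hypothetical, and the power-cycle rule is read here as "upload after a power cycle".

```python
MAX_PEDESTAL_AGE_S = 4 * 3600  # "too old" threshold of 4 h from the slide
SETTLE_TIME_S = 20             # delay after Ramp Up from the slide


def should_upload_pedestals(pedestal_age_s: float,
                            calibration_requested: bool,
                            after_power_cycle: bool) -> bool:
    """True if new pedestals should be uploaded before the next run start."""
    return (calibration_requested
            or after_power_cycle
            or pedestal_age_s > MAX_PEDESTAL_AGE_S)


def earliest_pedestal_time_s(ramp_up_done_s: float) -> float:
    """Earliest time at which pedestals may be taken after a ramp-up finished."""
    return ramp_up_done_s + SETTLE_TIME_S
```

Keeping the rule in one place makes the trade-off explicit: every `True` costs the ~50s upload, but avoids a STOP/START a few minutes into a run.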

Pedestal taking happens automatically at each run STOP (~3s), but uploading takes more time (~50s)

Automatic or on demand (issued by PXD shifter)

Delicate: Runs fully transparent and parallel to DAQ and main RunControl



Avoid uploading bad pedestals, esp if taken in bad condition (eg after SEU)

#### **PXD Shifter GUI**

- Shifter Login, Shift Report
- Calibration as one-button action (simplification: always calibrate all modules)



#### **Interlock, Emergency Shutdown, Trips**

- Besides the VLHI interlock (cooling, magnet humidity, ...), Diamond and CLAWS can trigger an emergency shutdown of the detector. This was introduced in late 2019 after the first serious damage to PXD.
- Override by software (and hardware buttons) is possible and needed to have detector in STANDBY during beam operation. Override is automatically set by HV control sequencer.
- A trip of single modules is an emergency off issued internally by the PS unit, which is triggered by the over-voltage protection.
- A trip of single modules will recover automatically
- $\rightarrow$  SEU identified as reason for triggering OVP



Screenshot: PXD power supply control (PSC) panel with the global state buttons (OFF, STANDBY, PEAK), the DCU and CLAWS override switches, and per-module rows showing the DHE and power supply status (all modules OFF).

#### **High Occupancy Events**

- Close to injection, events can get very "busy"
- Firmware automatically stops event building to avoid link congestion (4%), no DAQ error
- This affects a fraction of the PXD data (mostly inner +X modules)
- Small fraction of events affected (and often events where SVD/CDC is not useful, too)



- More tricky: when several of these events are triggered close together, the DHP FIFO/link fills up and the framing gets out of sync. This may affect several events until it is back in sync.
- This often happens after injection, when many triggers arrive in a short time interval due to background → full veto adjustment helped
- Todo: Firmware to suppress the readout trigger when occupancy gets too high ("internal veto")
- Gated Mode will/can not be used. Several tests in 2019-2021 showed a large negative impact on data quality (noise, efficiency)

#### **Troubles:** Missing Events ("Missing ROIs")

- (1) On EB2, events from HLT and PXD must be in sync. This requires that all output from HLT to PXD is in sync with output from HLT to EB2.
- (2) All triggered events are matched against HLT decision on PXD side. If HLT decision is missing, event data clogs up memory. Reported as error on run stop if more than "a few".
- Repeated issues with HLT silently dropping events in 2019-2022, for different reasons: Event-of-Doom-Buster, misconfiguration of workers, basf2 crash (unpacker) on HLT, last events not flushed/written out, etc.
- Even small changes in HLT script had undesired effects → automatic testing
- Monitoring has been added at different places. But excessive monitoring resulted in some slow down when too many events were missing.
- Matching algorithm on EB2 was modified to ignore missing PXD data.



- In 2019, several issues with the firmware were observed (overlapping-trigger FW only became available after the summer break). Since then only minor changes (more monitoring, gated mode, JTAG/IPBus improvements)
- In 2019, emergency off by diamond beam abort was introduced to prevent damage to detectors. → Needed full automation of the turn-on procedure including configuration (pedestal taking and upload). As this happened hundreds of times, the procedure needs to be rock-stable
  - Many iterations, as there were many edge cases and race conditions between HVC, RC and calibration
  - Need to remove HV states and other workarounds to allow turning on the PXD w/o accelerator interference.
- DHP link "drops": recovery is issued after 1-2s (reason: DHP issue due to high occupancy)
- Noise due to sync on the wrong turn (PXD frame = 2 turns): fixed by FTSW signal
- Open:
  - Regular "HV Trips": radiation-induced SEU in PSU logic on top of Belle (DAQ can continue after recovery to PEAK)
  - New in 2021/2022: SEUs in DHP and DCD ASICs on the modules.
  - Depending on affected register, may influence data quality or DAQ
  - High occupancy during injection

SEUs may become a major headache in the future, in ASICs as well as in non-rad-hard equipment (PSU, PCs, micro-controllers, RAM, ...)

#### **Design and Measured Limits**

- Trigger rate
  - No limit on trigger rate itself, but on resulting data rate
  - No problem for close-by triggers
- Trigger latency
  - Limited by hardware to a maximum of one readout frame (<20 µs). Actual number may be slightly lower.
- Occupancy
  - "3% limit" by ASIC design (next slide), but even 100% can be tolerated for a short time.
- Data rate
  - System designed for 1%@30kHz, with 3% as safety margin. All internal/optical links are designed with this number in mind.
- HLT latency
  - Memory buffers are limited. Usage depends on the event size, trigger rate and **mean** HLT latency.
- Trigger and all data rates have been tested with global DAQ beyond specifications (incl. link saturation and back pressure)
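Why the **mean** HLT latency matters follows from Little's law: the average number of events waiting in a buffer equals the arrival rate times the mean waiting time. A minimal sketch, with a hypothetical function name and purely illustrative input values (the event size and latency below are not measured numbers):

```python
def mean_buffer_usage_bytes(trigger_rate_hz: float,
                            mean_hlt_latency_s: float,
                            mean_event_size_bytes: float) -> float:
    """Average buffered data volume by Little's law: rate * mean latency * size."""
    return trigger_rate_hz * mean_hlt_latency_s * mean_event_size_bytes


if __name__ == "__main__":
    # ILLUSTRATIVE inputs only: 30 kHz triggers, 1 s mean latency, 100 kB events.
    usage = mean_buffer_usage_bytes(30e3, 1.0, 100e3)
    print(f"{usage / 1e9:.1f} GB buffered on average")
```

This gives only the average. Fluctuations in latency or event size need additional headroom on top of it.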



#### The 3% Myth

- ~1% (in the inner layer!) expected from simulation, hardware design with safety factor at 3%.
  - Same number applied for DHH/ONSEN etc
- Even below 3%, tracking becomes worse (wrong hit assignment), lower purity
- PXD-DAQ can stand 3% @ 30kHz, but the resulting data would not be optimal anymore.
- A 3% mean indicates that we have events with higher occupancy; injection noise with 30% (!) has been observed. Nowadays data is truncated at 4% +4%. Also within the detector the (local) occupancy varies.
  - Several events triggered with >3% may clog FIFO due to bandwidth limit. → workaround: reset pipeline automatically after sec of no-data





#### **"Upgrades" for LS1**

- PXD1 $\rightarrow$ PXD2
  - Deliver and install second half of DHH system (from DESY to KEK)
  - Power second half of ONSEN boards
  - DCS and GUI are already prepared for the full system. No big adjustments expected.
- Install improved DHH Carrier boards (double internal data rate, needed for full lumi)
- DHE firmware to prevent DHP issues by blocking readout-trigger for busy events
- GUI: CSS  $\rightarrow$  Phoebus
- Upgrade of plugins
- Improve alarm system: Immediate alarm ↔ maintenance "alarms"

#### Summary

- Introducing PXD and its DAQ concept
  - ROI selection with HLT feedback
- SC/Monitoring/DAQ stable
  - Issues have been solved.
  - HLT feedback needs to be monitored closely (latency, missing events)
- But: Detector/module issues outside DAQ/SC (HV currents, noise, damage)
- Current PXD-DAQ system is ready for full lumi/trigger rate (assuming DHH carrier replacement)
- Plans:
  - No fundamental changes planned in either DAQ or SC/Monitoring for the next years
    - "Evolution not Revolution"
  - PXD1 $\rightarrow$ PXD2 under control (from DAQ/SC side)
  - New PCIe hardware for clustering of pixels? Need different readout path and concept!

- We currently assume that LS2 may include an "upgrade" (=replacement) of the vertex detectors
- (even if we keep PXD1 and replace it in another shutdown period "LS $1\frac{1}{2}$")
- As PXD-DAQ design was assuming full luminosity and trigger rate, an upgrade of the current PXD readout is **not** foreseen. (Neither needed nor funded)
- Any replacement will probably come with a (completely) different DAQ, detector control and monitoring. But first need to decide about technology.





## Questions?

#### **Available Bandwidth**

- DHP  $\rightarrow$  DHE
  - 1.5Gbit/DHP  $\rightarrow$  6.0Gbit per module (allows for >3%@30kHz)
  - Overlapping triggers; data belonging to two triggers is transmitted only once
- DHE  $\rightarrow$  DHC
  - 40 x 2.5 Gbit (carrier design issue, hardware update in progress for 5 Gbit)
- DHC  $\rightarrow$  ONSEN
  - 32 x 6.25 Gbit ; 620 MB/s/link  $\leftarrow$  works with load balancing
- ONSEN  $\rightarrow$  EB2
  - 32 x 1 Gbit Ethernet. Expect only 30MB/s/link in worst case
  - 110MB/s tested with current system w/o any problems (back pressure to trigger)
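The link numbers can be cross-checked with simple arithmetic. The sketch below only uses figures quoted in this talk (160 DHP links for 40 modules, 1.5 Gbit per DHP, 32 Ethernet links with 30 MB/s expected and 110 MB/s tested); the function names are made up for illustration.

```python
N_MODULES = 40            # sensors/modules in the full PXD
N_DHP_LINKS = 160         # optical links DHP -> DHE
DHP_LINK_GBIT = 1.5       # per DHP link
N_EB2_LINKS = 32          # Gbit Ethernet links ONSEN -> EB2
EB2_WORST_CASE_MB_S = 30  # expected worst case per link
EB2_TESTED_MB_S = 110     # tested per link


def dhp_bandwidth_per_module_gbit() -> float:
    """Links per module times the rate per link."""
    return N_DHP_LINKS / N_MODULES * DHP_LINK_GBIT


def eb2_total_worst_case_mb_s() -> int:
    """Summed worst-case output towards Event Builder 2."""
    return N_EB2_LINKS * EB2_WORST_CASE_MB_S


def eb2_tested_headroom() -> float:
    """Ratio of tested to expected worst-case rate per link."""
    return EB2_TESTED_MB_S / EB2_WORST_CASE_MB_S
```

The per-module value reproduces the 6.0 Gbit quoted above, and the tested Ethernet rate is more than three times the expected worst case per link.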

- System designed for 3%, 30 kHz @ full luminosity (3% includes safety margin from expected 1.5%)
- Occupancy L2 < L1, worst case scenario is inner layer
- L1/L2 allows for "load balancing" $\rightarrow$ downscale ONSEN 40 $\rightarrow$ 32 subunits

**No issue with projected occupancy/trigger rate**

Todo: install DHH carrier fix and 2<sup>nd</sup> half of DHH

#### **Looking at Specific Bottlenecks**

- DHP output:
  - Increasing trigger rate → saturates/approach maximum (=continuous readout)
  - Due to the Poisson-like trigger distribution (overlaps), even at 50 kHz (frame rate) not yet at maximum
  - Limit by data rate: going beyond 30kHz → 100 kHz (or continuous) would reduce the acceptable (mean) occupancy to ~1.5% (depending on cluster topology)
- DHE/DHC
  - Design 3%@30kHz
  - Due to doubling of data of overlapping triggers, data rate scales linearly with trigger rate
  - Continuous readout (need new firmware!) equivalent to 50kHz trigger rate
- ONSEN
  - Design 3%@30kHz with guaranteed HLT processing time
    - No direct limit on trigger rate if HLT decides fast enough
  - ROI filtering would not work for continuous readout, different concept needed
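The statement that the DHP output is not yet saturated at 50 kHz can be illustrated with a simple model. This is an assumption-laden sketch, not a description of the ASIC: it assumes Poisson-distributed triggers, one 20 µs frame of data requested per trigger, and overlapping requests transmitted only once. Under these assumptions the busy fraction of the output is $1 - e^{-rT}$ for trigger rate $r$ and frame time $T$.

```python
import math

FRAME_TIME_S = 20e-6  # one readout frame (50 kHz frame rate), from the slides


def dhp_output_duty(trigger_rate_hz: float,
                    frame_time_s: float = FRAME_TIME_S) -> float:
    """Fraction of time the link is busy under the assumed model.

    ASSUMPTION: Poisson triggers, each requesting frame_time_s of data,
    with overlapping requests sent only once: 1 - exp(-rate * frame_time).
    """
    return 1.0 - math.exp(-trigger_rate_hz * frame_time_s)


if __name__ == "__main__":
    for rate_khz in (30, 50, 100):
        print(rate_khz, "kHz ->", round(dhp_output_duty(rate_khz * 1e3), 3))
```

In this model the duty factor only approaches 1 (continuous readout) asymptotically, which matches the remark that the ASIC can never send more data than in the continuous readout case.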

#### **Hardware Issues**

- 2x broken network switch (luckily after ehut power out, thus not during data taking). But mind that we had network switch trouble in HLT affecting PXD connection (ROIs/EB2).
- Special hardware failure (DHH, ONSEN boards, ATCA PSU), not affecting data taking
- Shelf monitor (Rpi) crash (SEU?), worked again after power cycle

#### **PXD-DAQ** Issues and Plan

- Only minor things need to be fixed/updated, but there is no large single package
- We have to keep an eye on:
  - Events vanishing on HLT (called missing events/ROIs, etc) ... not a PXD issue
  - HLT latency may become an issue
- Injection noise (truncated data...), but DAQ can handle it. Issues in DHP ASIC which we cannot fix.
- PSU radiation hardness, replace components?

#### Wish List

- Better way to handle data copy from bdaq net to DESY/KEK
- Calibration data, automatic analysis plot for DESY web site, backups etc
- User credentials and ssh-tunnel/port forwarding is working but complicated and often needs manual restart
- Spares: common handling of COTS equipment? Switches, hard drives?
- HV scheme & injection inhibit (now in progress)



#### **Gated Mode**





Depends on the distribution of hits on the sensor and cluster topology/size:

- 4 - all unrelated
- 3 - realistic for large occupancy
- 2<x<3 - for injection noise

Overlapping trigger effects are not included, thus numbers >20kHz are too pessimistic

We can never send out more data from the ASIC than in the continuous readout case, even if the trigger rate exceeds 30, 50, 100 kHz

#### **Understood: "Trips" due to Over Voltage Protection**

- PXD "HV Trip" rate increased since 2021a, rate ~0.5-1 per day
- Issued by the Over-Voltage-Protection board; hard-wired comparators
- Uniform channel distribution $\rightarrow$ no real voltage problem?
- Power supplies located on top of Belle $\rightarrow$ Hint to SEU in PSU?
- Proven by irradiating one PSU module at MAMI (without sensor!)
- Dosimetry at MAMI and Belle shows a similar rate per neutron dose
- ~1 OVP / 1mSv neutron dose per PS Unit
- Neutron rate on top of Belle scales with luminosity
- Expect the rate to increase with dose, and \*2 for PXD2
- One OVP board component is over-sensitive to radiation
- SEUs in electronic equipment at this position were not expected at that rate!
- Does not fit with observations of other equipment at the same position, even though we saw a few SEUs in micro-controllers (and memories)
- Possible mitigation (LS1)
  - Add neutron shielding
  - Try to find a rad-hard CPLD replacement









#### (Major) Downtimes in 2022

- BIIOPS-319
- Crash of a monitoring PC on top of Belle (SEU?) 74min due to access (+49min for SKB issue)
- BIIOPS-317
- Misconfigured ASIC prevented us from taking pedestals (assumed SEU) 39min
- Several small down times due to SEU recovery (power cycle/taking new pedestals)
- Not only down time, but impact on data quality if not identified and corrected in time
- After uncontrolled beam loss, several modules were unstable (ASICs damaged) → currents flipped even during a run leading to high occupancies and noise
- $\rightarrow$  several run stops to take new pedestals, degraded data quality





#### BPAC - DCS, 31.10-1.11.2022

#### **Single Event Upsets in ASICs**



- Triple redundant registers in DHP, DCD
- Bit flip by SEU unlikely but possible!
- Observed for DHP (register monitoring since 2020) and rate is increasing
- Mitigation: Re-setting registers on-the-fly during a run
- DCD SEU only observed by result (registers cannot be read). Tested blind write to registers every few minutes, but some unwanted effects in one module.
- DHP memories not (yet) monitored

#### **ASIC** monitoring

- Configurable ASICs of PXD are DHP and DCD
- Both provide internal mechanisms to prevent single event upsets
- Configuration is applied via JTAG from DHE/DHI during TURNINGON
- In May 2020, register monitoring was installed at KEK
  - Read 3 registers of the DHP (DCD is write-only) every 20s
  - Compare all PVs with their previous value and ConfigDB
  - Logged as warnings
    - later manually pushed to <u>BIIPXDH-554</u>



#### ASIC Monitoring


#### **SEU** correction

- Full rewrite of monitoring logic was necessary
- Changed DHP registers are reverted afterwards
- Only the DHP with changed registers is rewritten
- Additional information to the shifter if the register was recovered
- In case of multiple changes (>10) the module is shut down directly
  - From CR side, it looks like a TRIP
  - The module is ramped back to PEAK with fresh configuration
- Simon Reiter
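The correction logic described above can be sketched as a comparison of read-back registers against the reference configuration. This is an illustration only, not the monitoring IOC: the data layout (DHP name mapped to register name and value) and the function name are assumptions, while the threshold of 10 changes is taken from the slide.

```python
from typing import Dict, Tuple

Registers = Dict[str, int]          # register name -> value
ModuleState = Dict[str, Registers]  # DHP name -> its registers

MAX_CHANGES = 10  # more changed registers than this -> shut the module down


def check_module(readback: ModuleState, reference: ModuleState,
                 max_changes: int = MAX_CHANGES) -> Tuple[str, ModuleState]:
    """Compare read-back DHP registers with the reference configuration.

    Returns an action ('ok', 'rewrite' or 'shutdown') and, per affected DHP,
    the reference values of the registers that differ.
    """
    changed: ModuleState = {}
    for dhp, ref_regs in reference.items():
        diff = {name: ref for name, ref in ref_regs.items()
                if readback.get(dhp, {}).get(name) != ref}
        if diff:
            changed[dhp] = diff  # only DHPs with changed registers are listed
    n_changed = sum(len(regs) for regs in changed.values())
    if n_changed > max_changes:
        return "shutdown", changed  # looks like a trip; ramp back with fresh config
    if n_changed:
        return "rewrite", changed   # revert only the changed registers
    return "ok", changed
```

Returning the reference values for the changed registers keeps the rewrite limited to the affected DHP, as described above.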

#### **PXD DAQ Scheme Phase 3 minimal (full) Setup**

Block diagram: Trigger/FTSW, other detectors, DATCON ROI, HLT ROI, HLT, DHP, DHE, DHC, MERGER, SELECTOR, Event Builder 2, Storage; x4 (x8) PXD R/O units; HV Control, Run Control; run start reset by trigger. A data path to the local "Bonn" DAQ carries only a sub-set of events (pedestal and calibration).

B. Spruck, PXD Status, TRGDAQ, 26.8.2019, p. 42

Johannes Gutenberg-Universität Mainz (JGU)

#### **DQM Integration into PXD Alarm System and Logging**

#### **PXD DQM – Alarms, Logging and Shifter Feedback**

Just a reminder of the PXD alarm system

- Histogram "Status" and values exported to EPICS PVs
- Easy to put on Shifter OPI, trend plots, archive
- PV status enters the BEAST alarm system
- Alarm $\rightarrow$ Shifter: CSS Alarm panel, Annunciator (audio), Message History (PXD Elasticsearch server), RocketChat, B2 Elasticsearch server

