## **Overview of Belle II DAQ system**

#### R.Itoh, KEK



## 1. Introduction to Belle II DAQ

### Design Policy

- Conventional trigger-synchronized DAQ sequence.
- Deadtime-less design: pipelined trigger flow control.
- COPPER-based readout : combat proven scheme in Belle.
- Scalable back-end DAQ.
- Unified Software Framework
  - \* The same offline software framework (basf2) runs on every component in DAQ (even on COPPERs!)
- "Non-Stop" DAQ
  - \* Once DAQ is started, it runs all the time.
  - \* If trouble occurs in a detector FEE, stop the trigger distribution, fix the trouble locally, and restore the trigger without restarting other DAQ components.

## Trigger/Clock Timing Distribution to detector FEE

Nakao



## Distribute L1 trigger and system clock to ~1000 nodes

- Fast control reset broadcast, partitioning, command to single subsystem, collect status info
- Single type of module FTSW [Frontend Timing SWitch] with a few types of additional daughter cards

## Data Flow in Belle II DAQ

## "basf2" on Linux CPUs






## **COPPER and Belle2link**



#### 1000Base-T port x 2

Timing Receiver

## Belle2link



- In the FPGA on the detector front-end card, a "virtual FINESSE" is implemented, and it talks with the "Belle2link transmitter core".
- In COPPER, Belle2link receiver (HSLB) is implemented instead of digitizer FINESSE, and connected to front-end card via optical fibers.
- The receiver "remote controls" the "virtual FINESSE" (slow control) and receives the data stream via optical fibers as if the remote FINESSE is implemented on the COPPER.

developed under collaboration with IHEP China (Zhen'An Liu's group)

#### **Event Builder**

Current design



- Each ROPC sends data via single GbE
- The Layer 2 and Layer 3 switches are connected by **multiple** 10G links

## High Level Trigger (HLT)



Unit structure (~10 units, 320 CPU cores/unit):

- Reduces the number of output ports of the event builder.
- Keeps up with the gradual increase of accelerator luminosity.
- Fault-tolerant: each unit is completely independent.

Based on the parallel processing technology developed for basf2.

### Software Trigger Strategy on HLT




- One unit processes
  - \* 3-5kHz L1 rate (of total of 30kHz) with event size <100kB
  - \* 1/2 rate reduction with "Level 3" filter
    - Based on fast CDC tracking + ECL clustering.
    - Cut in the track |z| position and ECL energy sum
  - \* Full event reconstruction using all detector signals except PXD
  - \* Software trigger using physics event skim codes (Hadronic/tau event selection....) + Monitor trigger -> 1/3 reduction

#### Expected rate reduction : 1/3 = 1-2 kHz/unit at output.
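The per-unit numbers above can be cross-checked with simple arithmetic. The following is a minimal sketch that uses only the figures quoted on this slide (3-5 kHz L1 input per unit, event size below 100 kB, overall 1/3 rate reduction). The function name `unit_budget` is hypothetical and is not part of basf2.

```python
# Figures quoted on the slide; the helper and its names are illustrative only.
EVENT_SIZE_KB = 100.0          # upper bound on the event size
OVERALL_REDUCTION = 1.0 / 3.0  # expected overall rate reduction per unit


def unit_budget(l1_rate_khz, event_size_kb=EVENT_SIZE_KB,
                reduction=OVERALL_REDUCTION):
    """Return (input bandwidth in MB/s, output rate in kHz) for one HLT unit."""
    # kHz * kB = 1e3 events/s * 1e-3 MB = MB/s, so the factors cancel.
    input_mb_per_s = l1_rate_khz * event_size_kb
    output_rate_khz = l1_rate_khz * reduction
    return input_mb_per_s, output_rate_khz


for rate in (3.0, 5.0):
    bandwidth, out_rate = unit_budget(rate)
    print(f"L1 {rate:.0f} kHz -> input {bandwidth:.0f} MB/s, "
          f"output {out_rate:.2f} kHz")
```

With these inputs, one unit has to absorb roughly 300-500 MB/s at its input, and the output rate lands in the 1-2 kHz range quoted above.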

## **Integration of Pixel Detector**



- HLT performs special low-momentum tracking and obtains the "RoI" (region of interest) on the PXD surface for reconstructed tracks.
- The "RoI" is sent to the PXD readout box for events accepted by HLT.
- The PXD box associates PXD hits with the RoI by FPGA processing, and only associated hits are sent to the 2<sup>nd</sup> EVB.
  - -> ~1/10 reduction (data size) + 1/3 (rate) expected.
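The hit association described above can be pictured as a simple geometric filter. The sketch below is a software illustration only: the real selection runs in FPGA firmware, and the rectangular pixel-index region, the `ROI` class, and the `select_hits` function are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ROI:
    """Hypothetical rectangular region of interest on one PXD module."""
    module_id: int
    u_min: int
    u_max: int
    v_min: int
    v_max: int

    def contains(self, u, v):
        return self.u_min <= u <= self.u_max and self.v_min <= v <= self.v_max


def select_hits(hits, rois):
    """Keep (module_id, u, v) hits that fall inside an ROI on the same module."""
    rois_by_module = {}
    for roi in rois:
        rois_by_module.setdefault(roi.module_id, []).append(roi)
    return [
        (module_id, u, v)
        for (module_id, u, v) in hits
        if any(roi.contains(u, v) for roi in rois_by_module.get(module_id, []))
    ]


hits = [(1, 10, 10), (1, 200, 300), (2, 10, 10)]
rois = [ROI(module_id=1, u_min=0, u_max=50, v_min=0, v_max=50)]
print(select_hits(hits, rois))  # only the first hit lies inside the ROI
```

Hits on modules without any ROI are dropped entirely, which is where the data-size reduction comes from.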

## **Slow Control**



- Need to manage two different frameworks: NSM2 and EPICS.
- NSM2 is a home-grown slow control framework used in the main DAQ components.
- EPICS is used in some of detector subsystems and SuperKEKB accelerator.
- A transparent environment is being developed.



## SuperKEKB/Belle II: Operation History



- Phase 1 : Accelerator tuning / Vacuum scrubbing.
- Phase 2 : Test run with outer detectors + pilot VXD detector
- Phase 3 : Physics data taking with all (but not complete) detectors





## DAQ Status in Phase 3

- DAQ is basically running stably.
- Nominal L1 rate is around 3.5 kHz at L = $5.5 \times 10^{33}\,\mathrm{cm^{-2}\,s^{-1}}$
- The overall DAQ efficiency is still 80-85%, but when the beam and detector operation is stable, the efficiency is more than 90%.

- Injection veto distribution via FTSW is working stably.

#### Sources of DAQ dead time:

- PXD BUSY
  - \* Bad modules not sending data
- CDC ttlost/b2ldown
  - \* FE reprogramming required
- TOP BUSY
  - \* Still firmware debugging.
- ECL BUSY
  - \* Wave form readout
- TRG BUSY/ttlost
  - \* Still firmware/software debugging.

## **Detailed analysis of errors (non-PXD)**

S.Yamada

- ERRORs detected by COPPER and ROPCs
  - SVD : mainly after receiving large events
  - CDC : belle2link seems unstable in some links
  - TOP : event # jumps. FW work by TOP experts is ongoing.
  - ARICH : Stable. B2link errors in several COPPERs at the same time.
  - ECL : b2link is stable. Sometimes no events arrive at some COPPERs.
  - KLM : Stable.
  - TRG : Event # jump.
- Recovery of HSLB from large event errors
- COPPER CPU freeze

Investigation ongoing

#### Event size

- Currently, the event size is within the expectation.
  - (except for SVD with larger occupancy.)

- ECL
  - \* Mostly due to the switch to the full waveform readout -> caused BUSY many times
  - \* If it runs in normal mode, ECL is quite stable.
  - [Comment on full waveform readout from DAQ]
    - \* We assumed that the full waveform readout would be limited to calibration purposes, only during the injection veto trigger.
    - \* But we very recently recognized that the ECL group is planning to make it the default in DAQ for hadron ID.
    - \* It is not included in the original DAQ design.
    - -> The data size increase was found to be manageable, and we agreed to switch to this option in normal DAQ.
  - Further debugging is in progress -> ECL talk

#### - Problems related to TOP

- \* FTSW trouble -> lost 3 hours
- \* Problem in database interface application (daqdbprovider)
   -> lost 3 hours
   -> Still under investigation.



Figure: sub-detectors joining global DAQ, by date.



## Overall DAQ status in Phase 3 (by Yamada-san)

Livetime ratio last week, while detector HV was permitted by the accelerator: 85% (80% with all sub-systems).



#### <u>PXD</u>

#### Persistent issues

Data handling of high occupancy events

Occupancy of single modules drops

#### Slight drop

Occupancy drops to a much lower rate, spikes during injection were still visible

- DHH and ASICs are asynchronous
- Not all ASICs are affected

#### Full drop

Occupancy drops to 0 and no data come at all

- ASIC state machine might have got stuck

S.Reiter

#### Workaround

Increasing the injection veto vFull to 2 ms prevents occupancy spikes and stuck readout.

- Further improvements in DHH/DHI firmware were made and the problems seem to be fixed.

#### **ROI** selection

ONSEN selects only pixels within given ROIs by HLT extrapolation and DATCON calculation

- <u>Currently no ROI selection possible for all modules</u>
- ONSEN firmware design requires increasing module IDs within a single DHH
  - -> required for matching IDs between data and ROIs
- Possible solution: change the order in DHC firmware before sending to ONSEN (will be tested at the beginning of summer shutdown)

ROI selection for remaining modules verified with data of Local DAQ

Event selection not affected.

- RoI-based data reduction was not available in Phase 3.

\* Some mistake in the ordering of module ID in DHH? -> fixed by firmware update? or cabling?

# **CDC front end operation**

- CDC is basically stable
  - Various errors: rerr / fifoerr / timer / feer / semmbe / semcrc
    - SALS doesn't work and we need to re-program the FPGA on the FE
  - Unpacker error : need to re-program the FPGA on the FE
  - These issues can be fixed within 2-3 min just after excluding
- 7 FEs are masked during phase-3 physics run
  - #247,204,218 (b2llost) : fixed once ~2 years ago and appeared again
    - Replacing the board didn't work. It was fixed by swapping the Cat.7 cable
  - #37,193 (b2llost) : new
  - #97, 115 (ttlost/crc error/b2link error) : new. It occurred when we resumed operation with the Belle solenoid turned ON

\* Why is reprogramming of the FEE required? Unrecoverable SEU?



K.Nishimura

- Highest priority on addressing problems that require masking large parts of detector (e.g., full 128-channel module) and/or have long recovery times.
  - PS Lockup:
    - Continuing to address errors seen in feature extraction (noncompliant data from carriers).
    - Adding SEU monitoring in PL to assess... desired feature anyway.
  - Event number mismatch / invalid event number:
    - Maybe some progress in simulation?
    - Really could use some ideas here!
    - Is our experience completely unique? Anything to be learned from other subdetectors?
      - Still various bugs in firmware
         -> further effort during summer shutdown.

Raw data are saved for:

A.Kuzmin

1. A fraction of events (1/1000) to monitor FPGA logic
2. Random trigger events for background overlay
3. E > E\_thresh (50 MeV) for hadron/gamma separation

1 and 2 have no problems.

3 was implemented later and does not indicate any problems in steady running. It increases the data size by 5-10%.

In the case of background bursts, it causes a DAQ crash.

- The increase in the data size is not so much and within a manageable level.

-> DAQ group agreed to go with waveform readout for ECL.

A.Kuzmin

- We did not have problems in Phase 2, and we did not have problems before continuous injection.
- A burst produces a huge number of energetic hits (several thousand).
- If a burst event happens while the previous "raw data" event has not yet been read out, the buffer is overwritten. This causes a run crash. But often we have a crash of the ECL firmware. The latter is an ECL firmware bug and should be fixed.

- V.Zhulnov will work on it in August.
- Fix the firmware so that it does not crash
- Work to eliminate the readout of huge background events

- Try to identify the huge background events by trigger?
- Veto events with E\_tot > 20 GeV?
- It will allow us to store data now!

Further debugging of the firmware is required to be prepared for burst events.
Identification of background events by a veto trigger with E\_tot > 20 GeV
has already been discussed with the trigger group.
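The two energy cuts discussed for ECL (the 50 MeV waveform threshold and the proposed E\_tot > 20 GeV burst veto) can be illustrated with a minimal sketch. It assumes a per-event list of channel energies in GeV. The function names and this data layout are hypothetical and do not describe the actual ECL or trigger firmware.

```python
E_THRESH_GEV = 0.050    # waveform saved for channels above 50 MeV
E_TOT_VETO_GEV = 20.0   # proposed veto on the total deposited energy


def waveform_channels(energies_gev, e_thresh_gev=E_THRESH_GEV):
    """Indices of channels whose waveform would be saved (E > E_thresh)."""
    return [i for i, e in enumerate(energies_gev) if e > e_thresh_gev]


def is_burst(energies_gev, e_tot_veto_gev=E_TOT_VETO_GEV):
    """True if the summed energy exceeds the proposed veto threshold."""
    return sum(energies_gev) > e_tot_veto_gev


normal_event = [0.010, 0.060, 0.200]
burst_event = [0.100] * 3000   # several thousand energetic hits

print(waveform_channels(normal_event), is_burst(normal_event))
print(len(waveform_channels(burst_event)), is_burst(burst_event))
```

In this picture a burst event would request waveforms from thousands of channels at once, which is the situation the veto is meant to keep out of the readout.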

## <u>TRG</u>

## **DAQ crash caused by TRG**

- Busy
  - Recovered by SALS.
- Slow control stuck in NOTREADY or FATAL
  - Recover by restarting SC
  - TRG SC still unstable. Investigating reason.
- CDCTRG dataflow lost due to CDCFE
  - Recover by masking or reprogramming the FE
  - GDL stops generating L1
- At 10 kHz, TRG caused busy.
  - Didn't crash run, but trigger rate went down to ~ 2 kHz.
  - Didn't happen with GDL+GRL.

More debugging is needed to stabilize operation, especially in firmware and SC.

Figure: TRG slow control status panel (TRGGRL, TRGT2D0, TRGT2D1, LMTRG and other components shown as RUNNING).

#### H.Nakazawa

## <u>FTSW</u>

Sugiura-san is studying various correlations between FTSW and accelerator parameters.
Very useful to understand the beam condition.

R.Sugiura

## Trigger rate and luminosity



## Relation between trigger rate and deadtime



Exp 8 Run 1-1350

- The relation between trigger input rate and dead time is fitted with $y = a x^{b}$ and extrapolated to 30 kHz.

| Detector | Dead time extrapolated to 30 kHz [%] |
|----------|--------------------------------------|
| ARICH    | 0.01                          |
| CDC      | 0.03                          |
| ECL      | 0.0006                        |
| KLM      | 0.006                         |
| SVD      | 0.0002                        |
| TOP      | 0.009                         |
| TRG      | 0.0009                        |
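The slide does not say how the fit was performed. One common way to fit $y = a x^{b}$ is a straight-line least-squares fit in log-log space, sketched below under that assumption. The function names and the sample points are made up for illustration and are not the Exp 8 measurements.

```python
import math


def fit_power_law(x_vals, y_vals):
    """Least-squares fit of y = a * x**b in log-log space; returns (a, b).

    All x and y values must be strictly positive.
    """
    log_x = [math.log(x) for x in x_vals]
    log_y = [math.log(y) for y in y_vals]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    s_xx = sum((u - mean_x) ** 2 for u in log_x)
    s_xy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    b = s_xy / s_xx
    a = math.exp(mean_y - b * mean_x)
    return a, b


def extrapolate(a, b, rate_khz=30.0):
    """Dead time predicted by the fitted power law at the given input rate."""
    return a * rate_khz ** b


# Hypothetical (rate in kHz, dead time in %) points, not real measurements.
rates = [1.0, 2.0, 3.0, 4.0, 5.0]
dead_times = [1e-5 * r ** 2 for r in rates]
a, b = fit_power_law(rates, dead_times)
print(f"a = {a:.3g}, b = {b:.3g}, dead time at 30 kHz = {extrapolate(a, b):.3g} %")
```

Since the measured rates are far below 30 kHz, the extrapolated values are sensitive to the fitted exponent and should be read as order-of-magnitude estimates.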

## ttlost investigation

- monitor b2tt phase adjustment result
  - At every linkup, b2tt does phase scan to find a safe operation point of the clock/data phase difference
  - The result is not monitored so far, but it should be useful info
- special firmware to quantify the b2tt connection quality?
  - b2tt just gives 0 or 1 and hard to get the error rate
  - if the clock/data phase difference can be fixed manually from remote and scanned, an eye-diagram (or bath-tub) like plot can be generated
  - In this case, the data should be a random pattern (e.g. PRBS-31)
- another special firmware to test the connection quality?
  - since b2tt is based on 8b10b, the data may be well balanced
  - using PRBS-31 pattern may be a good test
- homework, possibly for this summer
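As a software illustration of the PRBS-31 test pattern mentioned above, the sketch below generates the sequence with a linear-feedback shift register for the standard polynomial $x^{31} + x^{28} + 1$ and counts bit errors between a sent and a received stream. It is not the b2tt firmware. The function names are hypothetical, and some implementations invert the output bits, which is not done here.

```python
def prbs31(nbits, seed=0x7FFFFFFF):
    """Generate nbits of PRBS-31 (x^31 + x^28 + 1) as a list of 0/1."""
    state = seed & 0x7FFFFFFF
    if state == 0:
        raise ValueError("seed must be non-zero for an LFSR")
    bits = []
    for _ in range(nbits):
        # Feedback from the 31st and 28th stages of the register.
        new_bit = ((state >> 30) ^ (state >> 27)) & 1
        bits.append(new_bit)
        state = ((state << 1) | new_bit) & 0x7FFFFFFF
    return bits


def bit_error_rate(sent, received):
    """Fraction of positions where the received bit differs from the sent one."""
    if len(sent) != len(received):
        raise ValueError("streams must have the same length")
    errors = sum(1 for s, r in zip(sent, received) if s != r)
    return errors / len(sent)


sent = prbs31(1000)
received = list(sent)
received[123] ^= 1   # inject a single bit error
print(bit_error_rate(sent, received))
```

Repeating such a measurement at each clock/data phase setting would give the error rate versus phase, i.e. the bath-tub curve mentioned in the list.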

## Current event size on COPPER (including DAQ overhead)

S.Yamada



#### Event size at ERECO

#### Current total event size is ~100kB/ev



#### Backend Processing (Event Building / HLT / Express Reco)

- Event building is running stably.
- At the very beginning of Phase 3, HLT was unstable because of incomplete Linux signal handling at STOP/ABORT.
   Fixed by Nils and now the operation is quite stable.
- The maximum processing rate with 5 HLT units is now >10kHz.
- Time to STOP/ABORT HLT has been reduced to 30 sec (from up to 5 min.) by the new DQM histogram storage scheme.
- HLT selection has been tested and confirmed. It was finally turned on and is now working stably.
- ExpressReco is also working stably.
- The DQM scheme using HLT and ExpressReco is in good shape.
   \* Reference histograms are superimposed for the comparison.
   \* Quality checking scheme has been established.

### Basically keep using the same DAQ configuration in fall run.

Plus HLT reinforcement during summer shutdown:

1. Use of cvmfs to share the same software among HLT units.
2. Addition of 5 more HLT units.
   - \* 2 units will be operated in the current configuration as a margin.
   - \* 2 units will be used for the test of the new ZeroMQ HLT framework.
     - -> will be prepared to be compatible with the existing framework and added to the global DAQ from time to time for the test.
   - \* 1 unit will be used for the test of the new framework on SL7.

Together with various improvements in all subsystems.