## ARICH

Yun-Tsung Lai On behalf of ARICH group

KEK

ytlai@post.kek.jp

Belle II TRG/DAQ workshop 2019

August 26, 2019





- General status.
- DAQ and firmware.
- Slow control.
- ARICH chiller problem.
- Summary.

- Performance check.

#### General status



- ARICH has 6 sectors, each corresponding to 1/6 of the HAPDs.
  - 1 sector: 12 Merger boards.
  - 1 Merger: 5 to 6 FEBs.
  - 420 HAPDs + FEBs and 72 Mergers in total.
- Fully functioning and stable DAQ system in phase3.
  - 2 Mergers have a problem with firmware downloading through FTSW. (2 PCs are prepared for them.)
- Summer shutdown: investigation of those Mergers.
  - Jul. 23: Cable disconnection.
  - Jul. 24: Endcap extraction.
  - Sep. 6: Cable reconnection for endcap push-in.

# SEU detection in FEB/Merger firmware

- At the target luminosity of Belle II: ~5 SEU/hr (0.2 uncorrectable/hr).
- FEB: Soft Error Mitigation (SEM) IP core for the Spartan-6 FPGA.
  - status\_heartbeat & !status\_uncorrectable
- Merger: a similar SEM IP core for the Virtex-5 FPGA.
  - Single-bit error, multiple-bit error, and CRC error SEU.
- Implemented at the end of March.
  - Can now be monitored through the HSLB register, SLC, and CSS.
  - In data: since May 8<sup>th</sup>, we keep the data from FEBs with an SEU.
    So far, nothing has gone wrong in the data flow when an SEU happens.
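The FEB condition `status_heartbeat & !status_uncorrectable` can be written as a small predicate. This is only a sketch: it assumes both flags have already been read out as booleans (the readout path is not shown), and the names `sem_ok` and `uncorrectable_fraction` are hypothetical. The second helper just divides the two rates quoted for the target luminosity.

```python
def sem_ok(status_heartbeat: bool, status_uncorrectable: bool) -> bool:
    """True when the SEM core is alive and has not flagged an uncorrectable error.

    Assumption: both flags are already sampled as booleans by the caller.
    """
    return status_heartbeat and not status_uncorrectable


def uncorrectable_fraction(total_rate_per_hr: float, uncorrectable_rate_per_hr: float) -> float:
    """Fraction of SEUs that are uncorrectable, from the two quoted rates."""
    if total_rate_per_hr <= 0:
        raise ValueError("total rate must be positive")
    return uncorrectable_rate_per_hr / total_rate_per_hr


if __name__ == "__main__":
    print(sem_ok(True, False))               # healthy
    print(sem_ok(True, True))                # uncorrectable error flagged
    print(uncorrectable_fraction(5.0, 0.2))  # rates quoted for target luminosity
```

With the quoted numbers (~5 SEU/hr, 0.2 uncorrectable/hr), about 4% of SEUs would be uncorrectable.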

[Figure: panels labeled MB SEU and FEB SEU.]

# SEU detection in phase3

- About 2 FEB SEUs happen per day.
- Merger SEU: happened only once.
  - The SLC of the COPPER became dead.
    Recovered by re-programming the Merger.
- Recovery can be done simply with the BOOT button in CSS (FEB re-configuration).
- The entire system is power-cycled every one to a few days.
  - Sometimes a power cycle of the LV is also effective for other problems (empty HAPD during platinum week).
- So far, no real problem has been observed due to SEU. The FEB data with SEU is still kept, and no fatal problem in the data flow (HLT unpacker) has been seen.
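As a rough illustration of why FEB SEUs are expected between power cycles, the quoted rate of about 2 FEB SEUs per day can be turned into a probability. This sketch assumes SEUs arrive as a Poisson process with a constant rate, which is an assumption made here and not a statement from the measurements; the function name is hypothetical.

```python
import math


def prob_at_least_one_seu(rate_per_day: float, days: float) -> float:
    """P(at least one SEU in `days`), assuming a Poisson process (assumption)."""
    if rate_per_day < 0 or days < 0:
        raise ValueError("rate and duration must be non-negative")
    return 1.0 - math.exp(-rate_per_day * days)


if __name__ == "__main__":
    # About 2 FEB SEUs per day, evaluated for a one-day interval.
    print(f"{prob_at_least_one_seu(2.0, 1.0):.3f}")
```

Under this assumption, the chance of at least one FEB SEU within a single day is roughly 86%.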

#### Yun-Tsung Lai (KEK) @ Belle II TRG/DAQ workshop



ARICH FEB SEU in phase3

- Local run scheme: threshold scan.
  - Quick and efficient way to check all HAPD, FEB-Merger links, and DAQ status.
- Original threshold scan scheme:



- It takes roughly 10 min for 100 steps \* 1000 events.
- SLC $\rightarrow$ HSLB parameter writing is done each time a threshold step is finished.
- New scheme with new FEB firmware:



Total 1000\*100 events.

When 1000 events are reached, vth is incremented and written to the ASICs inside the FEB.

- No stop in the middle: the increment of vth is done inside the FEB firmware.
- The entire data-taking process takes O(s).

#### Internal threshold scan mode

- Setup for initialization through SLC and HSLB register: Just needs to be done once.
  - Initial vth value: 2\*8bits
  - vth step size: 1
  - Number of events for 1 threshold: 2
  - Switch: 1
  - Total number of events/threshold: Set it by CSS.
     Number of events per step\*Number of steps
  - The initialization script is executed on each COPPER, which avoids too many ssh calls through the ROPC and saves time.
- Changes in software:
  - 100 files $\rightarrow$ 1 file
  - Check the sub-run ID of each file by EventMetaData.
    - $\rightarrow$ The threshold value is stored in the data header.
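The internal scan mode implies a simple mapping from event count to threshold: vth starts at the initial value and is incremented by the step size every time the configured number of events per step is reached. A minimal sketch, assuming events are counted from 0 and ignoring any register width limits; the function and parameter names are hypothetical and do not correspond to firmware registers.

```python
def vth_for_event(event_index: int, vth_init: int,
                  vth_step: int = 1, events_per_step: int = 1000) -> int:
    """Threshold in effect for a given event in the internal scan mode.

    Assumption: event_index counts from 0 and vth is incremented once every
    `events_per_step` events.
    """
    if event_index < 0:
        raise ValueError("event_index must be non-negative")
    if events_per_step <= 0:
        raise ValueError("events_per_step must be positive")
    return vth_init + vth_step * (event_index // events_per_step)


if __name__ == "__main__":
    # 100 steps * 1000 events, starting from an arbitrary example value of 10.
    for n in (0, 999, 1000, 99_999):
        print(n, vth_for_event(n, vth_init=10))
```

Offline software can use the same mapping as a cross-check against the threshold value stored in the data header.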

#### DAQ Problems in phase3

- FTSW 161 was somehow broken (it could not be recovered through JTAG).
  - Replaced with 172 on Feb. 28<sup>th</sup>.
- FTSW 159 kept producing ttlost every few tens of minutes from Feb. 28<sup>th</sup> to Mar. 1<sup>st</sup>.
  - The AVAGO of the trg output at 203 was broken and has been replaced.
- Failure of FTSW programming on 2 Mergers: 4\_6 and 4\_8.
  - Two laptops, a USB amplifier, and a JTAG adapter are used for them.
  - Investigation during the summer shutdown.
- ttlost/fifoerr:
  - When a long time (a few days) has passed since the last power cycle, random ttlost happens more frequently (about once every few hours).
  - The same holds for the fifoerr problem, but fifoerr happens only in 6\_1. $\rightarrow$ Hardware connection problem?
#### DAQ Problems in phase3: firmware and SLC

- The COPPER SLC state was stuck in UNKNOWN between May 20<sup>th</sup> and 22<sup>nd</sup>.
  - The temperature readout also showed problems during that period.
  - The Merger firmware had been updated right before that.
  - Might be due to a version difference between the local SLC in the ROPC and the central one.
- The firmware (in terms of HDL coding) should be fine, since it is not a constant problem.
  - Try adding timing constraints to all clocks, and use SmartXplorer to optimize the timing.
  - Looks fine so far?

|   | Strategy            | Host     | Output | Status | Timing Score | Run Time    | LUTs         | Slice Registers | WorstCaseSlack |
|---|---------------------|----------|--------|--------|--------------|-------------|--------------|-----------------|----------------|
|   | MapRunTime          | btrgpc05 | run1   | Done   | 0            | 00h 10m 17s | 14,015 (48%) | 13,702 (47%)    | 0.055ns        |
|   | MapLogicOpt         | btrgpc05 | run2   | Done   | 0            | 00h 12m 59s | 14,021 (48%) | 13,702 (47%)    | 0.041ns        |
|   | MapGlobOptIOReg     | btrgpc05 | run3   | Done   | 0            | 00h 14m 50s | 12,032 (41%) | 13,231 (45%)    | 0.030ns        |
| F | MapRegDup           | btrgpc05 | run4   | Done   | 0            | 00h 09m 24s | 14,014 (48%) | 13,702 (47%)    | 0.038ns        |
|   | MapExtraEffortIOReg | btrgpc05 | run5   | Done   | 0            | 00h 09m 55s | 14,016 (48%) | 13,654 (47%)    | 0.031ns        |

# DAQ Problems in phase3: initialization

- In a cold start, we need to run booths (re-programming the HSLB) after re-configuring the Merger.
  - Otherwise, an 8-bit data shift problem will happen.
  - As far as I have confirmed, only ARICH needs it. (The CDCFE also uses the Virtex-5 GTP.)

| Level | Subsystem | Time           | Host    | Message                                                                                                          |
|-------|-----------|----------------|---------|------------------------------------------------------------------------------------------------------------------|
| FATAL | ARICH     | 14/04 11:15:47 | ROPC405 | cpr4013 : ERROR_EVENT : B2LCRC16 (00ff) differs from one ( 53a9) calculated by PreRawCOPPERfromat class. Exiting |
| DEBUG | ARICH     | 14/04 11:15:47 | ROPC405 | cpr4013 : 00ff00ff 00ff00ff ff550000                                                                             |
| DEBUG | ARICH     | 14/04 11:15:47 | ROPC405 | cpr4013 : 00ff00ff                                                                                               |
| DEBUG | ARICH     | 14/04 11:15:47 | ROPC405 | cpr4013 : 00ff00ff                                                                                               |

- When the data shift happens, the FIFO stops being filled. 00ff is the pattern read from an empty FIFO inside the HSLB.
- Looking into the data, 00ff appears during the user data transmission, not during the protocol hand-shake stage.
- Target: can we detect it before the run starts (during hand-shake), and then solve it with a firmware-level approach (reset the GTP or modify the B2L protocol)?
  - Need to check the B2L TX and RX state.
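For offline inspection of dumped data, the empty-FIFO signature can be searched for directly. The sketch below assumes the data are available as a list of 32-bit words and treats a run of consecutive `00ff00ff` words as the signature, as seen in the DEBUG lines; the function name and the `min_run` parameter are hypothetical, and this is not the detection during hand-shake that the target above asks for.

```python
from typing import Iterable, Optional

EMPTY_FIFO_WORD = 0x00FF00FF  # pattern read from an empty FIFO inside the HSLB


def first_empty_fifo_run(words: Iterable[int], min_run: int = 2) -> Optional[int]:
    """Index where the first run of >= min_run empty-FIFO words starts, else None."""
    run = 0
    for i, word in enumerate(words):
        if word == EMPTY_FIFO_WORD:
            run += 1
            if run >= min_run:
                return i - run + 1  # start index of the run
        else:
            run = 0
    return None


if __name__ == "__main__":
    # Example words: the first is arbitrary, the rest follow the DEBUG dump.
    dump = [0x7F7F820C, 0x00FF00FF, 0x00FF00FF, 0xFF550000]
    print(first_empty_fifo_run(dump))
```

Requiring a run of at least two words reduces the chance of flagging a single legitimate data word that happens to equal the pattern.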

### DAQ Problems in phase3: initialization (cont'd)

• When this 00ff problem happens, the data is like:

 000020a8
 7f7f820c
 01c70800
 0000000
 02f69887
 5ca6b68a
 0400009
 (the all-zero word repeats for the remaining 74 lines of the dump)

- Does 00ff happen during the user data transmission, and not during the protocol pattern exchange (hand-shake) stage?

# DAQ Problems in phase3: initialization (cont'd)

- First check: hardware tendency.
  - After reproducing the problem several times, it seems to happen only on some specific Merger-HSLB pairs.
- Symptoms after re-configuring the Merger:
  - $\rightarrow$ 00ff happens without booths.
  - $\rightarrow$ "hslb is in bad state" appears sometimes.
  - $\rightarrow$ The "bad state" message can be cleared by booths, or by just waiting for ~1 min.
    - Neither "bad state" nor "00ff" happens 100% of the time.
    - Both can be cleared by re-connecting the dLC cable.
    - By swapping the dLC cables, both are found to be a problem on the Merger side.
      - $\rightarrow$ A firmware-level solution looks doable.
- Recent update:
  - The problem happens on specific Mergers, so there is no need to go through all the HSLBs, which saves time.
- More study will be done after ARICH is back.

```
booths_s.sh 4005 a
booths_s.sh 4006 c
booths_s.sh 4007 d
booths_s.sh 4008 a
booths_s.sh 4009 b
```

#### DAQ Problems in phase3: Merger 4\_6

- The response (idcode) from the Merger:
  - Correct case: c2a96093
  - 4\_6 for now: 8552c127 $\rightarrow$ shifted by 1 bit.
  - A 1-bit shift should not be caused by a broken Cat7 cable, so the FPGA might be broken.
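The 1-bit shift can be checked numerically: rotating the correct idcode c2a96093 left by one bit within 32 bits gives 8552c127. The sketch below uses a rotation because the most significant bit of c2a96093 is 1; a plain left shift with a 1 entering at the low end would give the same value, and which of the two actually occurs on the JTAG chain is not determined here. The helper names are hypothetical.

```python
from typing import Optional

MASK32 = 0xFFFFFFFF


def rotl32(value: int, n: int) -> int:
    """Rotate a 32-bit value left by n bits."""
    n %= 32
    value &= MASK32
    return ((value << n) | (value >> (32 - n))) & MASK32


def find_rotation(expected: int, observed: int) -> Optional[int]:
    """Smallest left rotation turning `expected` into `observed`, or None."""
    for n in range(32):
        if rotl32(expected, n) == observed:
            return n
    return None


if __name__ == "__main__":
    print(hex(rotl32(0xC2A96093, 1)))             # 0x8552c127
    print(find_rotation(0xC2A96093, 0x8552C127))  # 1
```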

[Figure: FTSW, patch panel, and Merger.]

- Try with a JTAG/RJ45 adapter and a laptop:




# DAQ Problems in phase3: Merger 4\_8

- For Merger 4\_8, the problem started at the beginning of phase3.
- The response (idcode) from the Merger is all 0.



4\_8: A bypass is made near the FTSW.

- We cannot use the "Digilent JTAG-HS3" as a download cable.
  - Only the Xilinx HW-USB-II-G works.
- No problem with adding a 12 m or 30 m USB extender cable for 4\_8.
  - OK to move the PCs outside of the radiation region.
- Investigation during summer:
  - We connected both 4\_6 and 4\_8 back to the FTSW, and firmware downloading worked properly for a few days.
  - However, physically touching the patch panel may bring the problem back; touching it again may recover it.
  - More checks are still needed after the ARICH system is back.
  - More JTAG-RJ45 adapters are under production now.

# Busy in global high rate test

- ARICH was once included in the global run on Dec. 14-15, 2018.
  - No critical problem in the system.
  - Busy happened when trigger rate > 15 MHz
- A high rate test with different occupancies is ongoing to see the dead-time response.
- Revisit the FEB/Merger firmware to check the FIFO design for further improvement.

[Figure: waveform view including the signals fifo re and fifo dout[127:0].]

#### FEB SEU recovery

- At the 33rd B2GM, Raffaele Giordano reported on FPGA self-repair, which uses a majority-voting method to repair the FPGA configuration.
- Implement a scrubber in the Merger board: majority-vote the frames from different FEBs and repair through partial configuration via JTAG.
- A testbench with Virtex-5 and Spartan-6 demo boards has been checked.
- Tests with ARICH FEB and Merger boards are ongoing.
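The idea of majority voting over configuration frames can be illustrated with a generic bitwise vote. This is a sketch of the general technique only, not the scheme from the B2GM report or the scrubber firmware: it assumes each replica frame is available as a byte string of equal length, and the function name is hypothetical.

```python
from typing import Sequence


def majority_vote(frames: Sequence[bytes]) -> bytes:
    """Bitwise majority over equal-length frames.

    A bit is set in the result only if it is set in a strict majority of the
    replicas, so with an even number of replicas a tie resolves to 0.
    """
    if not frames:
        raise ValueError("need at least one frame")
    length = len(frames[0])
    if any(len(frame) != length for frame in frames):
        raise ValueError("all frames must have the same length")
    threshold = len(frames) // 2 + 1
    voted = bytearray(length)
    for i in range(length):
        for bit in range(8):
            ones = sum((frame[i] >> bit) & 1 for frame in frames)
            if ones >= threshold:
                voted[i] |= 1 << bit
    return bytes(voted)


if __name__ == "__main__":
    good = bytes([0b10101010, 0xFF])
    upset = bytes([0b10101011, 0xFF])  # one flipped bit in one replica
    print(majority_vote([good, upset, good]) == good)
```

With 5 to 6 FEBs per Merger there are enough replicas for a single upset frame to be outvoted.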






#### Slow control status

- Automatic early warning system:
  - Based on constant monitoring of the EPICS parameters.
  - Monitors the Nsm2cad scripts to ensure that all the data are logged.
  - Covers HV / LV / FEB / temperatures / cooling water flows.
  - Different types of logging and levels of warnings.
  - Informs shifters.
- Threshold parameters monitoring:
  - The data are processed at qawkXX.
  - ROOT files $\rightarrow$ results displayed on a website.
- Slow monitor data analysis is also ongoing:
  - LV, HV, COPPER, RC
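A warning system with several levels can be sketched as a comparison of each monitored value against two nested ranges. Everything here is hypothetical: the level names, the function, and the example limits are placeholders chosen for illustration and are not the actual ARICH settings; reading the values from EPICS is not shown.

```python
def warning_level(value: float,
                  warn_low: float, warn_high: float,
                  alarm_low: float, alarm_high: float) -> str:
    """Classify a monitored value as "OK", "WARNING", or "ALARM".

    Assumption: the alarm range encloses the warning range.
    """
    if not (alarm_low <= warn_low <= warn_high <= alarm_high):
        raise ValueError("expected alarm_low <= warn_low <= warn_high <= alarm_high")
    if value < alarm_low or value > alarm_high:
        return "ALARM"
    if value < warn_low or value > warn_high:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    # Placeholder limits for a temperature-like quantity.
    for reading in (25.0, 36.0, 45.0):
        print(reading, warning_level(reading, 15.0, 35.0, 10.0, 40.0))
```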




#### Daily threshold scan for local run



- Automation of the procedure from data to web:
  1. Transfer data to KEKCC by using rsync from the QA server.
     - Data files can be selected based on the ARICH DAQ run record.
  2. Process the threshold scan data in KEKCC.
     - Validity of each channel: offset, gain, efficiency, etc.
     - List bad channels.
  3. Visualize the results on a web server.
     - HTTP server outside of KEK => Kitasato or TMU
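The channel validity check in step 2 can be pictured as a set of range cuts on the per-channel offset, gain, and efficiency. The sketch below is an illustration under stated assumptions: the `ChannelFit` structure, the function name, and all cut values are hypothetical placeholders, and the extraction of these quantities from the threshold scan data is not shown.

```python
from typing import Dict, List, NamedTuple, Tuple


class ChannelFit(NamedTuple):
    offset: float
    gain: float
    efficiency: float


def list_bad_channels(fits: Dict[int, ChannelFit],
                      offset_range: Tuple[float, float],
                      gain_range: Tuple[float, float],
                      min_efficiency: float) -> List[int]:
    """Return the sorted channel numbers failing any of the validity cuts."""
    bad = []
    for channel, fit in sorted(fits.items()):
        valid = (offset_range[0] <= fit.offset <= offset_range[1]
                 and gain_range[0] <= fit.gain <= gain_range[1]
                 and fit.efficiency >= min_efficiency)
        if not valid:
            bad.append(channel)
    return bad


if __name__ == "__main__":
    # Placeholder values only.
    fits = {
        0: ChannelFit(offset=0.0, gain=1.0, efficiency=0.99),
        1: ChannelFit(offset=0.0, gain=0.1, efficiency=0.99),  # low gain
        2: ChannelFit(offset=0.0, gain=1.0, efficiency=0.20),  # inefficient
    }
    print(list_bad_channels(fits, (-1.0, 1.0), (0.5, 1.5), 0.9))
```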

#### Slow I/F to LED driver



- Automation to run the LED from the control GUI:
- Current: log in to the VME CPU and issue commands for VME access to TT-IO.
  - The status of TT-IO cannot be monitored.
- Next: make a daemon on the VME CPU to start / stop TT-IO.
  - Should be an NSM2/EPICS-based system.
  - Calls the commands for TT-IO access.
- Future: replace with more intelligent hardware (flexible intensity etc.).

# Chiller problem on Jun. 25



- The water level became lower than the threshold due to evaporation (the cover was slightly open). In addition, the water level display was wrong.
- To be improved:
  - Alarm of the temperature interlock in the B3 shift room.
  - No interlock is connected to the chiller.
  - No display of the ARICH temperature in the GUI. Shifters should also be notified.
  - Longer-term issue: replacing the chiller.

#### Summary

- In phase3, the ARICH DAQ and firmware are basically stable.
  - Some problems are under investigation.
  - SEU detection is ready in firmware, SLC, etc.
  - A new scheme for the threshold scan.
  - FEB self-repair using the majority-voting method is under development.

- Slow control work:
  - Automatic alert.
  - Slow monitor data analysis.
  - Daily threshold scan scheme.
  - LED control.

# Performance


### ARICH performance

- prod8 & exp7 bucket6
  - $D^* \rightarrow D^0 \pi$
  - pValue > 0.001, pt > 0.1, |z0| < 8.0, d0 < 3.5, nCDCHits > 10, vertexTree()



### ARICH performance (cont'd)

- prod9 & exp7,8
  - $D^* \rightarrow D^0 \pi$
  - |z0| < 5, d0 < 2, nCDCHits > 0, $|\mathrm{DST\_M} - \mathrm{DST\_D0\_M} - 0.14543| < 0.0015$
  - p > 0.7 GeV && cosTheta > 0.85
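The selection for this slide can be written out as a single predicate. This is a sketch under assumptions: the mass-window cut is read as |DST_M - DST_D0_M - 0.14543| < 0.0015 (the minus signs are inferred, since 0.14543 GeV matches the known D*-D0 mass difference), masses and momentum are taken to be in GeV, and the function and argument names are hypothetical and not basf2 variable names.

```python
def passes_selection(z0: float, d0: float, n_cdc_hits: int,
                     dst_m: float, dst_d0_m: float,
                     p: float, cos_theta: float) -> bool:
    """Apply the prod9 & exp7,8 cuts listed above (operators partly inferred)."""
    track_ok = abs(z0) < 5 and d0 < 2 and n_cdc_hits > 0
    # Mass-difference window around the D*-D0 mass difference (GeV, assumed).
    mass_ok = abs(dst_m - dst_d0_m - 0.14543) < 0.0015
    arich_ok = p > 0.7 and cos_theta > 0.85
    return track_ok and mass_ok and arich_ok


if __name__ == "__main__":
    # Example candidate with placeholder values.
    print(passes_selection(z0=1.0, d0=0.5, n_cdc_hits=20,
                           dst_m=2.01026, dst_d0_m=1.86484,
                           p=1.0, cos_theta=0.9))
```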



## ARICH performance (cont'd)

- prod9 & exp7,8
  - $D^* \rightarrow D^0 \pi$
  - |z0| < 5, d0 < 2, nCDCHits > 0, $|\mathrm{DST\_M} - \mathrm{DST\_D0\_M} - 0.14543| < 0.0015$
  - p>0.7 GeV (all tracks within ARICH acceptance).



### ARICH performance (cont'd)

- prod9 & exp7,8
  - $D^* \to D^0 \pi$
  - |z0| < 5, d0 < 2, nCDCHits > 0, $|\mathrm{DST\_M} - \mathrm{DST\_D0\_M} - 0.14543| < 0.0015$
  - no p and cosTheta requirement.

All tracks from D0 within ARICH acceptance using ARICH+TOP+dEdx.



# Backup


# ARICH ttlost and FTSW problem

- Frequent ttlost in ARICH from Feb. 28<sup>th</sup> to Mar. 1<sup>st</sup>.
  - All the Mergers of FTSW 159 kept producing ttlost every few tens of minutes.

| 1=15900 | 3b1000fc | ttlost=72                 |
|---------|----------|---------------------------|
| 2=15903 | 5b100000 | ttlost=me [1_7 cpr4002c]  |
| 3=15904 | 5b100000 | ttlost=me [1_8 cpr4002d]  |
| 4=15905 | 5b100000 | ttlost=me [1_9 cpr4003a]  |
| 5=15906 | 5b100000 | ttlost=me [1_10 cpr4003b] |
| 6=15907 | 5b100000 | ttlost=me [1_11 cpr4003c] |
| 7=15908 | 5b100000 | ttlost=me [1_12 cpr4003d] |

- ARICH FTSW:
  - Connection between 159 and 203: 2 dLC (1 for clk, 1 for trg).
  - By swapping cables, the AVAGO of the trg output at 203 was found to be broken, and it was replaced.



- Another issue: FTSW 161 was somehow broken (it could not be recovered through JTAG) and was replaced with 172 on Feb. 28<sup>th</sup>.

#### HAPD status

Neutron fluence estimate: ~30 nA/APD = 1/1000 of the expectation for the whole Belle II operation.





- Dead HAPDs, apart from HV module problems:
  - SEU (data masking/solved).
  - Wrong parameter in initialization. (2 modules/run)
- Noisy HAPD channels: stable.
- HAPD inefficiency level ~2%: under control. More checks are needed to pin down the reasons.


