Automatic Compilation, Deployment & Debugging of DNNs on Cloud FPGAs: What are the DevOps besides the papers?

Burkhard Ringlein, IBM Research – Europe, Zurich, Switzerland

Presentation at cFDevOps22, 2022-09-01, Belfast

© 2022 IBM Corporation



## The mess we are in: Computing is running out of steam and energy...



[Figure: trend plots with a time axis up to 2050 (linear continuation of the current trend vs. predicted), ILSVRC top-1 accuracy in %, and the number of TOP500 systems with accelerators. Source: B. Ringlein, "Mapping of a Machine Learning … Representation to Distributed … FPGAs", University of Erlangen-…, 2022 (CC BY-SA 4.0).]



B. Ringlein / cloudFPGA Team / cFDevOps22 / © 2022 IBM Corporation

## ...FPGAs to the rescue?!?

- Computing is running out of steam and energy...
  - Especially for compute-demanding workloads like AI/ML and HPC
- Yeah, there are thousands of papers about using FPGAs for AI/ML...
- **But**:
  - Most of the accelerators are GPUs (~8 – 10% of global compute capacity [1])
  - Largest FPGA deployments (AFAIK):
    - 48 FPGAs at PC<sup>2</sup> ("production", Alveos)
    - 96 FPGAs at IBM Research Zurich ("experimental", cloudFPGA platform)
  - Cloud services hard to measure, but no large growth observable...
- So, why aren't there more FPGAs "in the real world"?

...maybe it has something to do with tools and Development & Operations support?

# Agenda: Our Journey to **DevOps** for DNNs on Cloud FPGAs

 $\rightarrow$  In this presentation, I will analyze the challenges of deploying a distributed AI inference application on FPGAs in the Cloud and present how we worked around them.



 $\rightarrow$ Goal: (1) Highlight blind spots of the current state of the art and (2) make you all eager to use our tools!


# ML Acceleration: Why FPGAs are becoming popular

- FPGAs have a performance penalty compared to specialized chips (i.e. ASICs such as "TPUs")
  - due to the resource "overhead" necessary to be reconfigurable
- On the other hand: ML algorithms and models, especially Deep Neural Networks (DNNs), change frequently
  - → FPGAs can adapt instantly, ASICs can't adapt at all
  - → This becomes even more relevant if the development time of ASICs is taken into account
- Likewise, the data types in use vary more and more





# ML Acceleration: Why FPGAs are necessary

- Some AI researchers point to a "Hardware Lottery" [3]:
  - Success of ML algorithms depends on their fit to current hardware, not on 'superior' concepts
  - Some novel concepts run faster on, e.g., a CPU than on a GPU or TPU
  - But: "Coding even simple algorithms on FPGAs remains very painful and time-consuming."
- We can still increase the efficiency of hardware for domain-specific tasks using reconfigurable computing





[3]

# Using FPGAs for non-domain experts: Current obstacles

- Despite consolidated tool chains and high-level synthesis, using FPGAs for HPC with current industry tools...
  - Is still not straightforward
  - Requires a high frustration tolerance
  - And still requires some architectural knowledge and re-coding of targeted kernels
- Research delivers far more narrow-scoped "proofs of concept" than end-to-end examples (luckily, this is starting to change...)
- In the end: FPGAs regularly beat CPUs by two orders of magnitude and can beat GPUs [6]
  - ... after investing months of optimization
  - not to mention debugging, deployment, operation, etc.


Optimisation steps of an FPGA port (from [6]):

| Description                 | Performance (GFLOPS) | % CPU performance | % theoretical performance |
|-----------------------------|----------------------|-------------------|---------------------------|
| 24 cores of Xeon (Cascade…  | 65.74                | -                 | -                         |
| Initial FPGA port           | 0.020                | 0.03%             | …                         |
| Optimised for dataflow      | 0.28                 | 0.43%             | …                         |
| Optimised memory access     | 0.42                 | 0.63%             | …                         |
| Optimise matrix multiplica… | 12.72                | 19.35%            | …                         |
| Ping-pong buffering         | 27.78                | 42.26%            | 45.54%                    |
| Remove pipeline stalls      | 59.14                | 89.96%            | 96.95%                    |
| Increase clock frequency t… | 77.73                | 118%              | …                         |

Cells and labels marked "…" are cut off or illegible in the source; the values 4.06% and 45.24% appear in the last column but cannot be assigned to a row reliably. Slide annotations: the initial port is a Von-Neumann based algorithm, the final version an optimised dataflow based algorithm, with a difference in performance of approx. 40… between them.

Performance and power comparison (from [6]):

| Description        | Performance (GFLOPS) | Power usage (Watts) | Power efficiency (GFLOPS/Watt) |
|--------------------|----------------------|---------------------|--------------------------------|
| 1 CPU core         | 5.38                 | 65.16               | 0.08                           |
| 24 CPU cores       | 65.74                | 176.65              | 0.37                           |
| V100 GPU           | 407.62               | 173.63              | 2.34                           |
| 1 FPGA kernel      | 74.29                | 45.61               | 1.63                           |
| 2 FPGA kernels     | 146.94               | 52.47               | 2.80                           |
| **4 FPGA kernels** | 289.02               | 71.98               | 4.02                           |
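The power-efficiency column is simply performance divided by power. As a quick sanity check of the numbers from [6], the following small sketch recomputes it and compares the 4-kernel FPGA design with the GPU. The variable and function names are ours and not part of the source.

```python
# Values copied from the performance/power figures above (source: [6]).
# name: (performance in GFLOPS, power in Watts)
measurements = {
    "1 CPU core": (5.38, 65.16),
    "24 CPU cores": (65.74, 176.65),
    "V100 GPU": (407.62, 173.63),
    "1 FPGA kernel": (74.29, 45.61),
    "2 FPGA kernels": (146.94, 52.47),
    "4 FPGA kernels": (289.02, 71.98),
}


def efficiency(gflops, watts):
    """Power efficiency in GFLOPS per Watt."""
    return gflops / watts


efficiencies = {name: efficiency(*vals) for name, vals in measurements.items()}

for name, eff in efficiencies.items():
    print(f"{name:15s} {eff:5.2f} GFLOPS/W")

# 4-kernel FPGA design relative to the GPU: efficiency and raw performance
fpga_vs_gpu_eff = efficiencies["4 FPGA kernels"] / efficiencies["V100 GPU"]
fpga_vs_gpu_perf = measurements["4 FPGA kernels"][0] / measurements["V100 GPU"][0]
print(f"efficiency ratio: {fpga_vs_gpu_eff:.2f}, performance ratio: {fpga_vs_gpu_perf:.2f}")
```

With these figures, the 4-kernel FPGA design reaches roughly 0.7 times the GPU's throughput at roughly 1.7 times its power efficiency. Differences in the last digit compared to the published column (e.g. for the GPU) presumably stem from rounding.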

## Example application: Accelerated Inference-as-a-Service



# Our tool: **DOSA**, automated compilation of CNNs to distributed FPGAs

- Large CNNs are automatically partitioned & distributed across FPGAs (e.g., in cloudFPGA)
  - Target-specific, transparent selection of optimal implementations across frameworks
  - Combination of different FPGA micro-architectures
- Imports the community standard ONNX and leverages published open source tools: TVM, hls4ml, haddoc2, VTA, ...
- Hardware-agnostic, heterogeneous communication framework
- Device support:
  - Current: cloudFPGA, x86 CPU
  - Upcoming: Alveo






 $\rightarrow$  best trade-off?




# Requirements for an accelerated INFaaS

- FPGA Cloud:
  - Debugging of resource errors
  - Operation resource abstraction (i.e. SRA)
  - Control plane integration of FPGAs
  - Deployment processes
  - Security
- Communication:
  - To and from the consumer
  - If distributed: communication & synchronization between FPGA nodes
  - Debugging of communication
- Accelerated inference application:
  - ML kernel implementation, data representation / quantization
  - If distributed: model partitioning
  - Debugging of inference kernel
  - Portability of the design

Covered "sufficiently" in literature

- System generation at compile time





## System generation

- We all know how to implement an FPGA application kernel...that's what we are here for
- But how to connect different kernels within one Role automatically?
- Common bus protocol between IP cores necessary
  - Calculate required bandwidth
  - Generate FIFOs/AXIs in VHDL and tcl
  - "Register" adapter within system design
- Automatic generation of wrappers is also important for generating debug logic (→ later)
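To illustrate the "calculate required bandwidth" step, here is a minimal sketch of how a stream width could be derived from a throughput target. It is not DOSA's actual code: the function names, the one-word-per-cycle assumption, and the restriction to power-of-two widths are ours. The 156.25 MHz default is taken from the Shell clock name `piSHL_156_25Clk` that appears in the generated debug code later in this deck, and 10 Gbit/s corresponds to the 10GbE network mentioned in the appendix.

```python
def required_tdata_width(bits_per_second, clk_hz=156_250_000):
    """Smallest power-of-two AXI4-Stream data width (in bits, at least 8)
    that sustains the given throughput at one word per clock cycle."""
    # Ceiling division on integers, avoids floating-point rounding
    min_bits = -(-bits_per_second // clk_hz)
    width = 8
    while width < min_bits:
        width *= 2
    return width


def stream_bitwidths(tdata_width):
    """Widths of the three FIFOs of one stream:
    one tkeep bit per data byte and a single tlast bit."""
    return {"tdata": tdata_width, "tkeep": tdata_width // 8, "tlast": 1}


widths = stream_bitwidths(required_tdata_width(10_000_000_000))
print(widths)  # {'tdata': 64, 'tkeep': 8, 'tlast': 1}
```

A 10 Gbit/s stream at 156.25 MHz therefore needs a 64-bit data path with an 8-bit `tkeep`, which matches the 64-bit FIFO width and the 64/8-bit probe widths visible in the generated Tcl on the slides.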

Excerpt of the generator code shown on the slide (Python emitting the VHDL component declarations; reconstructed from a cropped screenshot):

```python
def get_vhdl_entity_declaration(self):
    # Identifiers marked (?) are partly cut off in the source screenshot.
    if self.calc_bitw != self.bitwidth:      # attribute name (?)
        self.calculate_bitws()               # method name (?)
    # One FIFO component declaration; {name} and {width} are filled in below
    single_decl = ('component {name} is\n' +
                   '  port (\n' +
                   '     clk   : in std_logic;\n' +
                   '     srst  : in std_logic;\n' +
                   '     din   : in std_logic_vector({width} downto 0);\n' +
                   '     full  : out std_logic;\n' +
                   '     wr_en : in std_logic;\n' +
                   '     dout  : out std_logic_vector({width} downto 0);\n' +
                   '     empty : out std_logic;\n' +
                   '     rd_en : in std_logic );\n' +
                   'end component;\n')
    # One FIFO each for tdata, tkeep and tlast of the AXI4 Stream
    total_decl = (single_decl.format(name=self.name + '_tdata', width=(self.tdata_bitw-1)) +
                  '\n' +
                  single_decl.format(name=self.name + '_tkeep', width=(self.tkeep_bitw-1)) +
                  '\n' +
                  single_decl.format(name=self.name + '_tlast', width=(self.tlast_bitw-1)))
    return total_decl                        # return statement not visible in the source
```

Corresponding generated Tcl for the Vivado FIFO Generator IP (the name of the called procedure is cut off in the source):

```tcl
#------------------------------------------------------------------------------
# VIVADO-IP : FIFO Generator
#------------------------------------------------------------------------------
set ipModName "Fifo_input_0_tdata"
set ipName    "fifo_generator"
set ipVendor  "xilinx.com"
set ipLibrary "ip"
set ipVersion "13.2"
set ipCfgList [list CONFIG.Performance_Options {First_Word_Fall_Through} CONFIG.Input_Data_Width {64} \
                    CONFIG.Output_Data_Width {64} \
                    CONFIG.Input_Depth {512} CONFIG.Output_Depth {512} \
              ]

set rc [ …ize_ip ${ipModName} ${ipDir} ${ipVendor} ${ipLibrary} ${ipName} ${ipVersion} ${ipCfgList} ]

if { ${rc} != ${::OK} } { set nrErrors [ expr { ${nrErrors} + 1 } ] }
```
## Recap: Hardware Abstraction $\rightarrow$ Shell Role Architecture



### **SHELL (privileged logic)**

Abstracts hardware components of FPGA and exposes standard AXI(S) interface to user



## Portable system generation

- Usually, FPGA applications do not exist alone: see them as part of a (complex) application communication schema
- In the best case, we don't want to rewrite a compiler for every new platform
- Besides configuration & control registers, there are usually (one of) two communication channels:
  - address based: PCIe, via memory (AXI4 Full)
  - **stream based**: network or PCIe abstraction (AXI4 Stream)
- System generation should be able to adapt within this template
  - DOSA connects the IP cores based on their bus protocol specification and their dependencies on each other
  - Additionally: creates adapters if necessary
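The idea of connecting IP cores by their bus protocol specification and creating adapters only where needed can be sketched as follows. This is an illustration under our own assumptions, not DOSA's implementation: the classes `Port` and `IpCore`, the function `connect`, and all block and adapter names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Port:
    protocol: str  # e.g. "axis" (AXI4 Stream) or "axi4" (AXI4 Full)
    width: int     # data width in bits


@dataclass(frozen=True)
class IpCore:
    name: str
    inp: Port
    out: Port


def connect(chain):
    """Return the block names of the system after inserting adapters
    wherever the output of one core does not match the input of the next."""
    system = [chain[0].name]
    for src, dst in zip(chain, chain[1:]):
        if src.out.protocol != dst.inp.protocol:
            system.append(f"adapter_{src.out.protocol}_to_{dst.inp.protocol}")
        if src.out.width != dst.inp.width:
            system.append(f"width_conv_{src.out.width}_to_{dst.inp.width}")
        system.append(dst.name)
    return system


shell_rx = IpCore("shell_rx", Port("axis", 64), Port("axis", 64))
conv = IpCore("conv_layer", Port("axis", 64), Port("axis", 32))
dense = IpCore("dense_layer", Port("axis", 16), Port("axis", 64))

print(connect([shell_rx, conv, dense]))
# ['shell_rx', 'conv_layer', 'width_conv_32_to_16', 'dense_layer']
```

Because the decision is made purely from the port descriptions, the same procedure works regardless of which framework produced the individual IP cores.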

# Debugging of "operation resources"

- "Operation resources": resources offered to the application during operation: e.g. memory space, network access, configuration registers
- If...data doesn't arrive at the right time at the right place...*who to blame?* 
  - Bugs in the application?
  - Lost in the data center / communication fabric?
  - Lost in the Shell?
  - Memory failures?
- → Best: Provide control counters at the Shell-Role interface & verify the memory prior to deployment









# cFDK: "flight recorder data" and memory test

- At boot-time: Check physical health
  - DDR4 synchronization
  - ETH clock & synchronization
- Occasional complete memory tests
- Monitor live data in the FPGA
  - Compromise between amount (e.g. the last 5 packets) and overhead → focus on the most expressive data
  - Request via REST API

GET /clusters/{cluster\_id}/flight\_recorder\_data: Requests network runtime information of all instances

### Memory Test: Structure and Internals



### Memory Test Algorithm:

Test 512-bit-wide words sequentially from address 0 to MAX, in bursts of size BRST, for NT test iterations. If a Stop command is received, the test stops after the current iteration nt and finishes.
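As a software model of this loop, the following sketch runs the write/read-back test on a Python list. It only illustrates the control flow (bursts, iterations, stop after the current iteration); the real test runs in FPGA logic against DDR4. The test pattern, the `StuckBitMemory` fault model, and all function names are assumptions made for this example; `max_addr`, `brst` and `nt` mirror MAX, BRST and NT from the slide.

```python
WORD_BITS = 512
WORD_MASK = (1 << WORD_BITS) - 1


def pattern(addr, iteration):
    """Address- and iteration-dependent test word (arbitrary choice for this sketch)."""
    return ((addr + 1) * 0x9E3779B97F4A7C15 + iteration) & WORD_MASK


class StuckBitMemory(list):
    """Memory model in which bit 0 of one word is stuck at zero."""

    def __init__(self, size, bad_addr):
        super().__init__([0] * size)
        self.bad_addr = bad_addr

    def __setitem__(self, addr, value):
        if addr == self.bad_addr:
            value &= ~1
        super().__setitem__(addr, value)


def memory_test(mem, max_addr, brst, nt, stop_requested=lambda: False):
    """Run up to `nt` write/read-back iterations over words 0..max_addr-1,
    accessed in bursts of `brst` words.
    Returns (iterations_done, list of (iteration, faulty address))."""
    faulty = []
    done = 0
    for it in range(nt):
        # Write phase, one burst at a time
        for base in range(0, max_addr, brst):
            for addr in range(base, min(base + brst, max_addr)):
                mem[addr] = pattern(addr, it)
        # Read phase: compare every word against the expected pattern
        for base in range(0, max_addr, brst):
            for addr in range(base, min(base + brst, max_addr)):
                if mem[addr] != pattern(addr, it):
                    faulty.append((it, addr))
        done += 1
        # A stop command only takes effect after the current iteration
        if stop_requested():
            break
    return done, faulty


healthy = memory_test([0] * 64, max_addr=64, brst=16, nt=3)
broken = memory_test(StuckBitMemory(64, bad_addr=4), max_addr=64, brst=16, nt=2)
print(healthy)  # (3, [])
print(broken)   # (2, [(0, 4)])
```

The stuck bit is only detected in the iteration whose pattern sets that bit, which is one reason to vary the pattern between iterations.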

### Highlights

- Template structure
- Two configurations
  1. Free-running mode (from Xilinx)
  2. Command-controllable
- Top bandwidths (RD/WR for 16 MB)
  1. 78.3, 79.9 Gbit/s
  2. 76.9, 79.9 Gbit/s

```
"72": [
  "Rank: 12",
  "Size: 23",
  "Last RX port: 2718",
  "Last RX id: 22",
  "Last TX port: 2718",
  "Last TX id: 22",
  "RX packet count: 6026",
  "TX packet count: 4017",
  "cFDK/FMC version: 1.0",
  "FPGA uptime: 17:11:06",
  "current ROLE version: 318",
  "Layer 4 (TCP/UDP) is ENABLED.",
  "Layer 6 (Network Routing) is ENABLED.",
  "Layer 7 (ROLE) is ENABLED.",
  "UDP RX drop count: 0",
  "Invalid node-id/ip-address RX count: 0",
  "Invalid port TX count: 0",
  "Invalid node-id/ip-address TX count: 0",
  "Failed creation of TCP connections (TX) count : 0",
  "TCP RX notif drop count: 0",
  "TCP RX meta drop count: 0",
  "TCP RX data drop count: 0",
  "TCP RX CRC drop count: 0",
  "TCP RX Session drop count: 0",
  "TCP RX Out-of-Order drop count: 0"
],
```
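Since each entry of the flight recorder excerpt above is a plain "key: value" string, the counters are easy to post-process. The following sketch shows one possible way to do that; the function name and the split into counters and flags are our own choices and not part of the cFDK API.

```python
def parse_flight_recorder(entries):
    """Split "key: value" strings into a dict (integer values are converted).
    Lines without ": " (e.g. "Layer 7 (ROLE) is ENABLED.") are returned as flags."""
    counters, flags = {}, []
    for line in entries:
        key, sep, value = line.partition(": ")
        if not sep:
            flags.append(line)
            continue
        key, value = key.strip(), value.strip()
        counters[key] = int(value) if value.isdigit() else value
    return counters, flags


# A few lines taken from the excerpt above
sample = [
    "Rank: 12",
    "RX packet count: 6026",
    "TX packet count: 4017",
    "FPGA uptime: 17:11:06",
    "Layer 7 (ROLE) is ENABLED.",
    "UDP RX drop count: 0",
    "Failed creation of TCP connections (TX) count : 0",
]

counters, flags = parse_flight_recorder(sample)
# Any non-zero drop counter points to packet loss inside the Shell
drops = {k: v for k, v in counters.items() if "drop" in k and v != 0}
print(counters["RX packet count"], counters["TX packet count"], drops)
```

In this sample all drop counters are zero, so lost data would have to be searched for in the application or in the data center network rather than in the Shell.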

As context, our platform: The IBM cloudFPGA Platform (19"x2U w/64 FPGAs)

(more information at github.com/cloudfpga)






- Communication plan
  - and synchronous protocol (MPI)
- Hence, if combined $\rightarrow$ we can say where packets are missing
  - As part of the "flight recorder data" we know in which state the protocol engine is
  - (Too many packets can't be discovered easily, since retransmissions could occur)

### Communication plan of node 4:

| step | to | from | no.<br>packets  |
|------|----|------|-----------------|
| 2    |    | 3    | 20              |
| 3    | 5  |      | <del>5</del> -3 |
| 4    | 6  |      | 5               |
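Combining such a plan with the packet counters can be sketched as below. We read the struck-through entry in step 3 as "5 packets planned, 3 observed"; this interpretation, the data layout, the function names and the observed counter values are assumptions for illustration only.

```python
# Communication plan of node 4: (step, direction, peer node, expected packets)
plan_node_4 = [
    (2, "rx", 3, 20),
    (3, "tx", 5, 5),
    (4, "tx", 6, 5),
]


def expected_totals(plan, completed_steps):
    """Sum the packets the plan expects for all steps up to `completed_steps`."""
    totals = {"rx": 0, "tx": 0}
    for step, direction, _peer, packets in plan:
        if step <= completed_steps:
            totals[direction] += packets
    return totals


def missing_packets(plan, completed_steps, observed):
    """Expected minus observed packet counts; positive values mean missing packets."""
    expected = expected_totals(plan, completed_steps)
    return {d: expected[d] - observed.get(d, 0) for d in expected}


# Hypothetical counter values, as they might be read from the flight recorder data
observed = {"rx": 20, "tx": 3}
result = missing_packets(plan_node_4, completed_steps=3, observed=observed)
print(result)  # {'rx': 0, 'tx': 2}
```

Here node 4 received everything it expected but sent two packets too few in step 3, so the problem lies at or after node 4. Negative values (too many packets) are not conclusive, since retransmissions could occur.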

# Debugging generated by compiler

- Once we know which FPGA node "misbehaves", we still have to look into it
- Hence, DOSA automatically generates debug probes between IP cores
  - Because we use standardized interfaces between IP cores → easily generated by the compiler
  - In VHDL and tcl
- We deploy bitstreams using partial reconfiguration → debug bridge support
- ...then we still have to look at waveforms...

Generated debug logic as shown on the slide (reconstructed from a cropped screenshot; "…" marks parts that are cut off in the source).

VHDL instantiation of the ILA core:

```
-- Debug
DB… ila_dosa_role_0
  …
    clk        => piSHL_156_25Clk,
    probe0     => siNRC_Udp_Data_tdata,
    probe1     => siNRC_Udp_Data_tkeep,
    probe2(0)  => siNRC_Udp_Data_tvalid,
    probe3(0)  => siNRC_Udp_Data_tlast,
    probe4(0)  => siNRC_Udp_Data_tready,
    …
    probe52    => sMPE_Debug,
    probe53    => sZRLMPI_Wrapper_Debug,
    probe54(0) => sResetApps_n,
    probe55    => sToFifo_input_0_tdata_din,
    probe56(0) => sToFifo_input_0_tdata_full_n,
    probe57(0) => sToFifo_input_0_tdata_full,
    probe58(0) => sToFifo_input_0_tdata_write,
    probe59    => sToFifo_input_0_tkeep_din,
    probe60(0) => sToFifo_input_0_tkeep_full_n,
    probe61(0) => sToFifo_input_0_tkeep_full,
    probe62(0) => sToFifo_input_0_tkeep_write,
    probe63    => sToFifo_input_0_tlast_din,
    probe64(0) => sToFifo_input_0_tlast_full_n,
    …
```

Tcl configuration of the ILA core:

```
# VIVADO-IP : ILA Core
set ipModName "ila_dosa_role_0"
set ipName    "ila"
set ipVendor  "xilinx.com"
set ipLibrary "ip"
set ipVersion "6.2"
set ipCfgList [list CONFIG.C_NUM_OF_PROBES 112 \
                    CONFIG.C_DATA_DEPTH 2048 \
                    CONFIG.C_PROBE0_WIDTH {64} \
                    CONFIG.C_PROBE1_WIDTH {8} \
                    CONFIG.C_PROBE2_WIDTH … \
                    CONFIG.C_PROBE3_WIDTH … \
                    CONFIG.C_PROBE4_WIDTH … \
                    CONFIG.C_PROBE5_WIDTH … \
                    CONFIG.C_PROBE6_WIDTH … \
                    CONFIG.C_PROBE7_WIDTH {1} \
                    CONFIG.C_PROBE8_WIDTH …
```
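Because the interfaces between IP cores are standardized, emitting such probe code is mostly string generation from a list of signals. The following is a minimal sketch of that idea, not DOSA's actual generator: the function name and its signature are hypothetical, and the signal names assume that the underscores were lost when the slide was extracted. The `CONFIG.C_*` property names are the ones visible on the slide.

```python
def generate_ila(signals, data_depth=2048):
    """signals: list of (signal name, width in bits).
    Returns (VHDL port-map lines, Tcl CONFIG entries) for one ILA core."""
    vhdl = []
    tcl = [
        f"CONFIG.C_NUM_OF_PROBES {len(signals)}",
        f"CONFIG.C_DATA_DEPTH {data_depth}",
    ]
    for idx, (name, width) in enumerate(signals):
        # Single-bit signals are mapped to element 0 of a one-bit probe vector
        probe = f"probe{idx}(0)" if width == 1 else f"probe{idx}"
        vhdl.append(f"{probe} => {name}")
        tcl.append(f"CONFIG.C_PROBE{idx}_WIDTH {{{width}}}")
    return vhdl, tcl


signals = [
    ("siNRC_Udp_Data_tdata", 64),
    ("siNRC_Udp_Data_tkeep", 8),
    ("siNRC_Udp_Data_tvalid", 1),
]
vhdl_lines, tcl_entries = generate_ila(signals)
print(",\n".join(vhdl_lines))
print(" \\\n".join(tcl_entries))
```

Keeping the VHDL port map and the Tcl configuration in one function ensures that probe indices and widths cannot drift apart between the two generated files.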

## Deployment with "one-click"

- Configuring 10+ FPGAs manually can be time-consuming...
- We developed a resource manager that deploys clusters of FPGAs based on a JSON description
  - Combination of FPGA and CPU nodes possible
  - Automatic configuration of "firewall", routing tables etc.
- "cloudFPGA support package" as a command line tool
- Additionally, we use partial reconfiguration via network (TCP) to parallelize deployments

## \$ cfsp cluster post -description=file.json

| operation                                       | file size in MiB | total time in seconds | effective speed in kiB/s |
|-------------------------------------------------|------------------|-----------------------|--------------------------|
| JTAG config. of the compl. design               | 24.5             | 55.09                 | 455.39                   |
| JTAG partial reconfig. of Mantle                | 1.8              | 11.07                 | 166.43                   |
| JTAG partial reconfig. of app logic             | 12.8             | 30.85                 | 424.82                   |
| POST /configure of partial Mantle logic via TCP | 1.8              | 0.17                  | 10,788.41                |
| POST /configure of partial app logic via TCP    | 12.8             | 1.07                  | 12,215.09                |
# Conclusion?

- After "End of line": To increase performance, systems must become more efficient
  - $\rightarrow$ more specialization
  - $\rightarrow$ more reconfigurable computing
- FPGAs are a valuable option, because:
  - great flexibility, low costs
  - high performance, growing ecosystem
- But: not yet used at scale because
  - "still hard to use"
  - A lot of progress around "proof of concept" but not real end2end use cases
- ➔ better tools, frameworks, compilers necessary
- better re-usability and cooperation in the community
- more open source and research around DevOps!



…looking forward to <u>all</u> your questions!

Burkhard Ringlein

# Appendix






# A Brief Overview of Designing with FPGAs in the Cloud



## Portability: A system view



### **FPGA**



# **Evaluation:** Configuration

## Configuration times on cF:

| operation                                          | file size in MiB | total time in seconds | effective speed in kiB/s |
|----------------------------------------------------|-----------|------------|--------------------|
| JTAG config. of the compl. design                  | 24.5      | 55.09      | 455.39             |
| JTAG partial reconfig. of Mantle                   | 1.8       | 11.07      | 166.43             |
| JTAG partial reconfig. of app logic                | 12.8      | 30.85      | 424.82             |
| POST /configure of partial<br>Mantle logic via TCP | 1.8       | 0.17       | 10,788.41          |
| POST /configure of partial app logic via TCP       | 12.8      | 1.07       | 12,215.09          |

- Joint Test Action Group (JTAG) bus at 5 Mbit/s
- TCP based on 10GbE
- Result:
  - partial reconfiguration via network outperforms the classical JTAG approach by a factor of 28 to 65
  - $\rightarrow$  reduced switching-costs / provisioning-time of a service
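The stated factor range can be reproduced directly from the measured times in the table above. The short sketch below does this and also recomputes the effective speed from file size and time; the function and variable names are ours.

```python
# (file size in MiB, total time in seconds), copied from the table above
jtag = {"Mantle": (1.8, 11.07), "app logic": (12.8, 30.85)}
tcp = {"Mantle": (1.8, 0.17), "app logic": (12.8, 1.07)}


def speedup(part):
    """How many times faster partial reconfiguration via TCP is than via JTAG."""
    return jtag[part][1] / tcp[part][1]


def effective_speed_kib_s(size_mib, seconds):
    """Effective configuration speed in kiB/s."""
    return size_mib * 1024 / seconds


for part in jtag:
    print(f"{part}: {speedup(part):.1f}x faster via TCP")

print(f"JTAG app logic: {effective_speed_kib_s(*jtag['app logic']):.1f} kiB/s")
```

This yields about 65x for the Mantle and about 29x for the application logic. The recomputed effective speeds deviate slightly from the table, presumably because the file sizes there are rounded to one decimal.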



**FPGA** 

# **Evaluation:** Provisioning

- measured time of cold-boot until application execution by FPGA/CPU, based on requirements of FaaS framework
  - cF is about 40 times faster than the CPU and about 10 times faster than AWS F1 (for AWS: time of VM provisioning instead of hardware boot)
- modeled behavior for dynamic request scenarios; extent to which the systems can meet the requirements:
  - cF: 97.7%
  - AWS F1: 61.3%
  - CPU: 42.1%






### **PROVISIONING TIMES**

| boot of        | time in seconds |
|----------------|-----------------|
| CPU            | 271.10          |
| AWS EC2 F1     | 82.26           |
| cloudFPGA (cF) | 6.20            |

# Evaluation: Cold-boot and Switching Costs



- efficiency of FaaS depends (also) on boot and switching times  $\rightarrow$  our main goal
- measured total time to execute three different functions on one device from boot to power off
  - but application agnostic: replacing application execution time with placeholders (10, 25, and 45 sec)
- Result: Mantle architecture finishes before CPU is booted and while AWS F1 is executing the first app
  - cF with Mantle architecture spends close to 90% of the total time on execution

# **References and Notes**

[1] Semiconductor Research Corporation, 'Decadal plan for semiconductors - full report,' Semiconductor Research Corporation, Tech. Rep., Feb. 2021.

[2] Pictures from: Doug Burger, "Will Programmable Hardware Reach Scale", Keynote FPL 2020, September 2020.

[3] S. Hooker, 'The hardware lottery,' Commun. ACM, vol. 64, no. 12, pp. 58–65, Nov. 2021. DOI: 10.1145/3467017.

[4] Bernd Klauer, "The convey hybrid-core architecture", in High-Performance Computing Using FPGAs, Springer, New York, 2013.

[5] Screenshot from app.dimensions.ai/, August 2022.

[6] Both from: Nick Brown (EPCC at the University of Edinburgh), "Exploring the acceleration of Nekbone on reconfigurable architectures", Sixth International Workshop on Heterogeneous High-performance Reconfigurable Computing (H2RC'20), 2020. Presentation slides: https://h2rc.cse.sc.edu/2020/slides/03\_Brown.pdf

Partly, references and sources are given in the slides directly.

All remaining images are from IBM DAM or IBM Websites or created by the author.

Intel, Intel logo, and Intel Xeon are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

The registered trademark Linux is used pursuant to a sublicense from the Linux Foundation, the exclusive licensee of Linus Torvalds, owner of the mark on worldwide basis.

IBM and the IBM logo are trademarks of International Business Machines Corporation, registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on ibm.com/trademark .

## cloudFPGA: Further Reading

- B. Ringlein, F. Abel, D. Diamantopoulos, B. Weiss, C. Hagleitner, M. Reichenbach and D. Fey, "A Case for Function-as-a-Service with Disaggregated FPGAs" in Proceedings of the 2021 IEEE 14<sup>th</sup> International Conference on Cloud Computing (CLOUD 2021), 2021.
- B. Ringlein, F. Abel, A. Ditter, B. Weiss, C. Hagleitner and D. Fey, "Programming Reconfigurable Heterogeneous Computing Clusters Using MPI With Transpilation" in Proceedings of the IEEE/ACM International Workshop on Heterogeneous High-performance Reconfigurable Computing (H2RC), 2020.
- B. Ringlein, F. Abel, A. Ditter, B. Weiss, C. Hagleitner and D. Fey, "ZRLMPI: A Unified Programming Model for Reconfigurable Heterogeneous Computing Clusters" in 28th IEEE International Symposium On Field-Programmable Custom Computing Machines (FCCM), 2020.
- B. Ringlein, F. Abel, A. Ditter, B. Weiss, C. Hagleitner and D. Fey, "System architecture for network-attached FPGAs in the cloud using partial reconfiguration," in 29th International Conference on Field Programmable Logic and Applications (FPL), 2019.
- F. Abel, J. Weerasinghe, C. Hagleitner, B. Weiss, S. Paredes, "An FPGA Platform for Hyperscalers," in IEEE 25th Annual Symposium on High-Performance Interconnects (HOTI), Santa Clara, CA, pp. 29–32, 2017.
- F. Abel, "How do you squeeze 1000 FPGAs into a DC rack?" online at LinkedIn: https://www.linkedin.com/pulse/how-do-you-squeeze-1000-fpgas-dc-rack-francois-abel/
- The cloudFPGA project page at ZRL: https://www.zurich.ibm.com/cci/cloudFPGA/

