

# 高性能計算基盤

- High Performance Computing Platforms-#8

Analog Computing Implementations in Machine Learning

Renyuan Zhang 2020/06/25



Lab. of Computing Architecture

#1 Why consider AI (VLSI) chips?

## Efficient Processor in IoT

Smart chips: VLSI implementations of machine learning (on-chip learning)



#### How smart is our life?

• Smart life: Through advanced equipment, human being controls the life even the world easily. But what if the environment could respond smartly and automatically to individuals' needs and wishes? [M. Grady et al., *Science*, 2012]



Life style is changing

- Natural & Wild Life (hundreds years ago): do everything by hands
- Mechanical life (100 years): do something by driving machine
- Automatic life (50 years): Some machines operate without driving
- Smart life (soon?):
  - smart and soft agent like a bunch of stewards and servants



## An example

#### Smart home

D. Cook, Science, 2012

**Outer of house** 



inner of house

# To make environment sufficiently smart

 Key point is not automatic response, but make agents "think".



## IoT hardware on VLSI side

A large number of various VLSI devices are required

|                                        | Power      | size       | applications                                                |
|----------------------------------------|------------|------------|-------------------------------------------------------------|
| Stationary devices                     | Watt       | Large      | Displayer, Monitor, Mechanical driver,<br>Stationary Sensor |
| Nomadic devices<br>(Body Area Network) | Mill Watt  | Small/tiny | Mobile ~, Wearable ~, Body-Embedded ~                       |
| Autonomous<br>transducers              | Micro Watt | tiny       | Communication devices                                       |

#### **Central Processing Station (computer?)**

H. De Man, "Ambient Intelligence: Gigascale Dreams and Nanoscale Realities", *IEEE Int. Conf. Solid-State Circuits*, vol. 1, pp 29-35, Feb. 2005.

#2 How analog computing effects AI?

## #2.1 Fully Parallel Analog VLSIs for Implementing Machine Learning

#### VLSI On-chip Training

• What is "on-chip training"?

# $253 \div 11 = ?$ 10010111 + 11011 = ?Rules Faster Bossible but hard Faster Reliably

## VLSI On-chip Training

• What is "on-chip training"?



No existing rules of solution for computer.



#### VLSI On-chip Training

• What is "on-chip training"?

# Give some learning examples, make VLSI system pursue the solving rules by itself.

|   | Processing elements                | Element<br>size | Energy<br>use | Processing speed   | Style of computation     | Fault<br>tolerant | learns   | Intelligent,<br>conscious |
|---|------------------------------------|-----------------|---------------|--------------------|--------------------------|-------------------|----------|---------------------------|
|   | 10 <sup>14</sup><br>synapses       | $10^{-6} m$     | 30W           | 100 <i>Hz</i>      | Parallel,<br>distributed | yes               | yes      | usually                   |
| 2 | 10 <sup>8</sup><br>transistor<br>s | $10^{-6}m$      | (CPU)         | 10 <sup>9</sup> Hz | Serial,<br>centralized   | no                | A little | Not (yet)                 |

#### What is machine learning?

[Wikipedia] To learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.



Why analog?

**Fast calculation (real-time)** 

**Smaller chip area** 

Fully parallel cellular architecture can be built

Non-boolean

Non-clock based

**Error-tolerant** 

#### General review

#### VLSI implementation strategies

|                    | Hardware | Step control                 | Speed Chip size |     | Flexibility |
|--------------------|----------|------------------------------|-----------------|-----|-------------|
| Fully-serial       | Digital  | Clock-based<br>iteration     |                 | • • |             |
| Partially-parallel | Analog   | Clock-based<br>iteration     |                 |     |             |
| Hyper-Parallel     | Analog   | Non-clock<br>Freely-feedback |                 |     | ••          |

#### Outline

# Analog implementations of them:

#### Support Vector Machine

K-Quasi-Centers clustering
 On-line learning strategies
 Data domain description
 Intel Project

# Support Vector Machine (SVM)



# SVM for pattern recognition



# SVM for pattern recognition



# SVM with "Kernel trick" *Non-linear*





### Analog Implementation of SVM



**1.** Complex computation:  $A \cdot e$ 

→ Analog circuits to generate Gaussian function

2. Learning operation: large number of numerical iterations

Fully parallel architecture to avoid clock-based iterations

## Analog Implementation of SVM

## (with Gaussian function kernel)

|                           | Parallelism       | Chip<br>area  | Learning<br>speed    | Dimensi<br>ons | Chip<br>measurement |
|---------------------------|-------------------|---------------|----------------------|----------------|---------------------|
| SY. Peng et al.<br>(2008) | Fully<br>parallel | $\propto N^2$ | One<br>clock         | 2              | N.A                 |
| K. Kang et al.<br>(2010)  | Row-<br>parallel  | $\propto N$   | $\propto N \times i$ | 2              | Available           |
| Target of<br>this work    | Fully<br>Parallel | $\propto N$   | One<br>clock         | 64             | Available           |

**N**: number of learning samples

*İ*: number of numerical iterations

[1] S.-Y. Peng, B. A. Minch, and P. E. Hasler: *Proc. Int. Symp. Circuits Syst.*, 2008, pp. 860 - 863.
[2] K. Kang and T. Shibata: *IEEE Trans. Circuits Syst.*, Vol. 57, no. 7, pp. 1513 – 1524 (2010).

Traditional Gaussian-generation circuit: Bump circuit \*



\*K. Kang and T. Shibata: *IEEE Trans. Circuits Syst.*, Vol. 57, no. 7, pp. 1513 – 1524 (2010).

**Our Proposed Gaussian-generation circuit** 



**Our Proposed Gaussian-generation circuit** 



**Our Proposed Gaussian-generation circuit** 



#### Performance of Our Proposed Gaussian-generation circuit

Center value of Gaussian function feature can be dynamically programmed



#### Performance of Our Proposed Gaussian-generation circuit

Peak-height of Gaussian function feature can be dynamically programmed



#### Performance of Our Proposed Gaussian-generation circuit

Width of Gaussian function feature can be dynamically programmed



#### Performance of Our Proposed Gaussian-generation circuit

Reliability of proposed Gaussian generation circuit:



## Fully parallel architecture for SVM



S.-Y. Peng, B. A. Minch, and P. E. Hasler: Proc. Int. Symp. Circuits Syst., 2008, pp. 860 - 863.

## Fully parallel architecture for SVM



## Fully parallel architecture for SVM



- ✓ Fast learning and self-converging
- ✓Compact chip-area



#### Verification of SVM processor

#### Destailing hypersection DAC session

Simulation



### Verification of SVM processor

#### Chip measurement



#### Performance of SVM Processor

|                               | [1] ISCAS <sup>2008</sup> | $[2] TCAS^{2010}$                | This work                 |
|-------------------------------|---------------------------|----------------------------------|---------------------------|
| Technology                    | Simulation                | $0.18 \mu m \text{ CMOS}$        | $0.18 \mu m \text{ CMOS}$ |
| Operation                     | Learning/Classifying      | Learning/Classifying             | Learning/Classifying      |
| Learning parallelism          | Fully parallel            | Row parallel                     | Fully parallel            |
| Kernel function               | Gaussian                  | Gaussian                         | Gaussian                  |
| No. of kernels                | $4 \times 4 + 4$          | 12                               | $18.8^{*}$                |
| Input vector                  | Analog voltage            | Digital (8 bits)                 | Digital (8 bits)          |
| Number of samples             | 4                         | 12                               | 16                        |
| Number of dimensions          | 2                         | 2                                | $1 \sim 64$               |
| Learning time (ns)            | N/A                       | $12 \times l \times 60^{\sharp}$ | 100                       |
| Classifying speed (vectors/s) | N/A                       | $8.7 \times 10^5$                | $10^{6}$                  |

#### Comparisons

\*The number of kernels is evaluated from the viewpoint of chip area.

 $\ddagger l$  is the number of iterations for convergence.

[1] S.-Y. Peng, B. A. Minch, and P. E. Hasler: *Proc. Int. Symp. Circuits Syst.*, 2008, pp. 860 - 863.
[2] K. Kang and T. Shibata: *IEEE Trans. Circuits Syst.*, Vol. 57, no. 7, pp. 1513 – 1524 (2010).

# Support Vector Machine Summary

Proposed architecture can be actually applied in SVM, with improved performances

R. Zhang and T. Shibata, SSDM, 2011.

R. Zhang and T. Shibata, JJAP, 2012.

#### Outline

## Analog implementations of them:

#### Support Vector Machine

#### ≻K-Quasi-Centers clustering

On-line learning strategies
Data domain description
Intel Project

- Patterns are clustered into categories according to the similarity
- K-means algorithm is widely applied



### ➢Original K-means



(1)Select an initial cluster center

### ➢Original K-means



(2)Compute the distance from the cluster center

### >Original K-means



(3)Form cluster and update cluster center as centroid of patterns



(4) Repeat the step (2) and step(3) Until the stop condition is satisfied.

•Original K-means

Plenty of calculations among vectors, hardly implemented in parallel.

K-Quasi-Centers (KQC) is proposed as the parallel-architecture-friendly version of K-means.

### **>Our basic idea**



(1) Compute all the Euclidean distances and store them

### **>Our basic idea**



(1) Compute all the Euclidean distances and store them

### **≻Our basic idea**



(2) Initialize the clustering randomly

### ≻Our basic idea



(3) Updating the clustering form by scalar calculation

### ≻Our basic idea



(4) Repeating till converge

## K-Quasi-Centers clustering ≻Our basic idea

Learning based on the calculations among scalars Possible to implement in fully-parallel

> Hardware implementation (two clusters)

Switches reflect the cluster label of this pattern // Compute the average distance



Fully-parallel, free and real-time feedback, self-converge

### **Euclidean distance**







### >Translinear (divider)



## Fully parallel KQC processor ≻Image clustering



## Fully parallel KQC processor > Performance

#### Comparisons

|                      | [3]                  | [4]               | This work       |
|----------------------|----------------------|-------------------|-----------------|
| Implementation       | FPGA & CPU           | Digital           | Analog          |
| Number of devices    | N/A                  | 414K gates        | 20K Tran.s      |
| Distance measurement | Manhattan            | Euclidean         | Euclidean       |
| Number of dimensions | 2                    | $1 \sim 8$        | $1 \sim 64$     |
| Number of iterations | 25                   | 16                | self-converging |
| Speed $(vectors/s)$  | $< 4.93 \times 10^6$ | $1.38 	imes 10^6$ | $10 	imes 10^6$ |
| No. of samples       | 2905                 | $76.8 	imes 10^3$ | 16              |

[3] H. M. Hussain et al, NASA/ESA Conf. Adaptive Hardware and Systems, 2011, pp. 246-255.
[4] T.-W. Chen and S.-Y. Chien, IEEE Trans. VLSI Syst., vol. 11, no. 8, pp. 1336-1345 (2011).

# Summary

Proposed architecture can be applied in pattern clustering problems, with improved performances Problems

### Limited hardware resource

# V.S.

### Huge database

**Unpredictable database** 

#### Outline

## Analog implementations of them:

Support Vector Machine

≻K-Quasi-Centers clustering

#### ≻On-line learning strategies

> Data domain description

>Intel Project





Great performance

Easy to train

But difficult to train

But performance is poor

Hardware resource is limited

#### We wish:



**On-line learning strategy is developed for this purpose**  $_{64}$ 

≻Our proposal



#### Half-unsupervised

#### ≻Our proposal



Fortunately, it is possible only in fully-parallel system

#### ➤Our proposal



#### ≻Our proposal



#### ≻Our proposal



≻Our proposal



Learning again

≻Our proposal



*This strategy is suitable for fully parallel hardware implementation* 

#### **Hardware implementations**

#### **On-line SVM**



# **Hardware implementations**

# **On-line SVM**



# Hardware implementations

# **On-line KQC learning**



# **On-line K-means**

Which one is useless? How to cluster them?



Simulation

# Is this kind of hardware feasible to be used in the real-world applications?

An object tracking system was built by our group-member Mr. Zhao, employing an SVM chip fabricated in this work.

# **SVM** in tracking

*Target: find face in video* 



SVM: distinguish face and background



Framework of system was built by Mr. Zhao

Analog SVM chip was employed

# SVM in tracking



Framework of system was built by Mr. Zhao



FPGA: Tracking Framework Analog SVM chip

Analog SVM chip was employed

# **SVM** in tracking

#### Measurement



### **Object Tracking With On-Line Shape Learning**



P. Zhao, R. Zhang, and T. Shibata, "Real-time Object Tracking Algorithm Employing On-Line Support Vector Machine and Multiple Candidate Regeneration," ICAISC, 2012, Poland.

# Summary

- Proposed on-line learning strategy, which is on the basis of fully parallel architecture, helps to improve hardware flexibility and efficiency.
- ✓ Proposed hardware is feasible to use in realworld applications.

<u>R. Zhang</u> and T. Shibata, LNCS, 2012. <u>R. Zhang</u> and T. Shibata, *J. Analog Integrated Circuits and Signal Processing* (Springer), 2012.

P. Zhao, R. Zhang and T. Shibata, LNCS, 2012.

# Outline

# Analog implementations of them:

Support Vector Machine

≻K-Quasi-Centers clustering

>On-line learning strategies

#### Data domain description

Discussion on scalability

# **Standard SVM can do this well:**



### **Standard SVM can do this somehow:**



How about this?



### Support vector domain description (SVDD) was developed

But, expensive to train it



Find a sufficiently compact and fit boundary, including almost all samples  ${}_{86}$ 

**Algorithm:** 

#### Derivations in appendix 2

# To make: $(X_i - \mathbf{a})^T (X_i - \mathbf{a}) \le R^2 + \xi_i$ , With minimum R

Becomes: 
$$\min L = 1 - \sum_{i} \alpha_i^2 - \sum_{i \neq j} \alpha_i \alpha_j K(\mathbb{X}_i, \mathbb{X}_j)$$
  
with:  $\sum_{i} \alpha_i = 1$ 

$$\begin{cases} \alpha_i \leftarrow \frac{1}{2} (\lambda - \sum_{j \neq i} \alpha_j K_{ij}) \\ \lambda \leftarrow \frac{1}{N} (1 + \sum_i \sum_j \alpha_j K(\mathbb{X}_i, \mathbb{X}_j)) \end{cases}$$

# In the real-world applications:





# In the real-world applications:



# Using the fully parallel array:

#### Simulation







# Summary

SVDD is more similar to human perception, it can be implemented by the proposed architecture with mathematical tricks. Last dance

# Gap makes people hop

# End



# Thank you very much.