Graph Attention Networks (ICLR 2018)

5 분 소요

Paper

Abstract

Graph-structured data에서 작동하는 새로운 neural network architectures인 Graph Attention Networks (GATs)를 제시한다. GAT는 masked self-attentional networks 에 중점을 두고, graph convolutions에 기반한 이전 방법들의 단점을 해결한다.

GAT는 이웃의 features에 집중할 수 있는 nodes이 담긴 layers를 쌓아, 주변에 위치한 다른 노드들을 다른 가중치들로 implicitly specify 할 수 있다. 이는 inversion matrix 같은 costly matrix가 없어도 되고, 어떤 그래프 구조를 몰라도 된다. 이러한 방법으로 spectral-based graph neural networks의 핵심 challenges들을 다루고, inductive 와 transductive 문제들에 쉽게 적용 가능하다.

GAT 모델은 SOTA 결과를 달성했다.

Introduction

CNN

CNN은 image classification, semantic segmentation, machine translation 과 같은 문제들에서 성공적으로 적용되었는데, data representation은 grid와 비슷한 구조(grid-like structure)를 가져야 했다. input position 넣어 학습 가능한 파라미터로 local filter를 효율적으로 재사용할 수 있었다. 하지만, 많은 task 들에서 data가 grid-like 구조로 존재하지 않았다. 3D meshes, social networks, telecommunication networks, biological networks, brain connectomes 와 같은 경우, grid-like 구조로는 표현할 수 없고 graph의 형태로 표현할 수 있다.

GNN

그동안 graph 구조를 neural network로 다루는 시도들이 있어왔다. 초기엔 directed acyclic graph 와 같은 그래프 구조로 처리된 데이터들을 처리하는 RNN이 있었다. 이후 RNN의 generalization 으로 cyclic, directed, undirected graph를 다룰 수 있는 GNN이 도입되었다.

Convolution

RNN을 일반화하여 GNN이 소개되었듯이, Convolution을 일반화하려는 시도들도 생겨났다. 이러한 시도들은 크게 두 가지로 나뉘는데, spectral approaches 와 non-spectral approaches 이다.

먼저, Spectral approaches는 graph를 spectral representation으로 나타내는 것이다. 이 관점에서는 Graph convolution이 graph Laplacian의 eigendecomposition 을 계산하는 Fourier domain 으로 정의되는데, 이를 통해 intense한 연산이 가능하고 non-spatially localized filter를 만들 수 있다. 하지만, 이렇게 학습된 filters 은 Laplacian eigenbasis 에 의존하는데, 이는 곧 graph structure에 의존한다는 뜻이다. 그래서 spectral approaches는 특정 구조에서만 훈련된 모델은 다른 구조의 그래프에는 적용될 수 없다는 문제점이 존재한다.

반면에, Non-spectral approaches 에서는 convolution을 graph에서 공간적으로 가까운 이웃들의 그룹 내에서 계산하는 방법으로 정의한다. 다만 이 방법은 다른 크기의 이웃들에서도 작동하고 CNN의 weight sharing property를 유지하는 operator를 정의해야 하는 challenge가 있다. 이를 해결하기 위해 MoNet 과 GraphSAGE가 등장한다. 이 접근법들은 큰 규모의 inductive 벤치마크에서 좋은 성능을 보여주었다.

Attention

Attention mechanisms은 많은 sequence-base tasks 에서 기준이 되어 왔다. Attention은 다양한 크기의 inputs 을 다루면서, 가장 관련있는 input의 일부를 집중한다는 장점이 있다. Single sequence를 계산할 때는 보통 self-attention 이나 intra-attention을 이용하는데, RNN이나 convolution을 사용할 때는 self-attention이 여러 task 들에서 유용하다는 것이 증명되었다.

그래서 이 논문에서는 graph 구조로 된 데이터의 node classification을 수행하는 attention-based architecture 를 소개한다. 이 모델은 노드들의 hidden representation을 이웃들을 집중시키면서(by attending) self-attention을 계산하여 구한다. 병렬적이기에 연산이 효율적이며, 이웃들에게 다른 가중치들을 부여하기 때문에 다른 degree를 갖는 graph nodes 에도 적용시킬 수 있다. 또한, inductive learning problem에 직접적으로 적용될 수 있기 때문에 완전 처음 보는 graph 구조에도 일반화할 수 있다.

GAT Architecture

Graph Attentional Layer

Layer $\mathbf h \rightarrow \mathbf h’ \ (\R^F \rightarrow \R^{F’})$

Node features $\mathbf h = { \vec h_1, \vec h_2, …, \vec h_N }, \vec h_i \in \R^F$ ($N$: Node 수, $F$: Feature 수)
Layer outputs $\mathbf h’ = { \vec h_1’,\vec h_2’, …, \vec h_N’ }, \vec h_i’ \in \R^{F’}$ ($F’$: Hidden embedding 차원)

Masked Self-attention

MSA

Attention coefficients $e_{ij}=a(\mathbf W \vec h_i, \mathbf W \vec h_j )$ that indicates the importance of node $j \rightarrow i$
$j \in \mathcal N_i$ (Neighborhood of node $i$)
$\mathbf W \in \R^{F’ \times F}$: Learnable parameters
$a(\cdot)$: shared attentional mechanism $\R^{F’} \times \R^{F’} \rightarrow \R$ computes attention coefficients

single-layer feedforward neural network 를 사용함
그래프 구조를 넣어주기 위해 masked attention 을 이용함
Normalize: $\alpha_{ij} = \text{softmax}j(e{ij}) = \cfrac{\exp(e_{ij})}{\sum_{k\in \mathcal N_i} \exp(e_{ik})}$

정리하면 다음과 같다:

\[\alpha_{ij}= \frac{\exp \left(\text{LeakyReLU} \left( \vec{\mathbf a}^T [\mathbf W \vec h_i \| \mathbf W \vec h_j ]\right) \right)}{\sum_{k \in \mathcal N_i} \exp \left( \text{LeakyReLU} \left( \vec{\mathbf a}^T[ \mathbf W\vec h_i \| \mathbf W \vec h_k ] \right) \right)}\]

Slope of LeakyReLU = 0.2
$\cdot ^T$ represents transposition and $|$ is the concatenation operation

Multi-head Attention

MHA

Final output of node $\vec h_i’ = \sigma \left( \sum_{j\in \mathcal N_i} \alpha_{ij} \mathbf W \vec h_j \right)$
하지만 multi-attention 이 더 효율적이기에 다음과 같이 바뀔 수 있다.
\[\vec h_i' = \Big\Vert^K_{k=1} \sigma \left ( \sum_{j \in \mathcal N_i} \alpha_{ij} \mathbf W \vec h_j\right )\]
근데 이러면 $\mathbf h’ \in \R^{KF’}$ 가 되어버리고, 그냥 concatenation 시키는건 no longer sensible 하다.
그래서 Averaging 을 하고 final nonlinearity 를 delay applying 시켜 다음과 같이 사용한다.
\[\vec h_i' = \sigma \left( \frac{1}{K} \sum^K_{k=1} \sum_{j \in \mathcal N_i} \alpha^k_{ij} \mathbf W^k \vec h_j \right)\]
본 논문에서는 $K=3$을 사용하며, 그림에서 화살표를 구분하였다.

계산이 매우 효율적이다. 모든 edges 와 nodes 에 대해 병렬적으로 계산되고, eigendecompositions 나 similar costly matrix operations 이 필요하지 않다. Single GAT attention head 기준 계산 복잡도는 $O(

F F’+

F’)$ 으로, GCN과 거의 비슷하다. Multi-head attention 을 사용하면 storage 와 parameter requirements 가 $K$배 증가하긴 하나, 독립적이고 병렬적으로 계산된다.

GCN과 달리, 같은 이웃에 있는 노드들에게 다른 중요도를 할당한다. 이는 Model capacity를 늘릴 수 있으며 해석에도 유용하다.
Attention mechanism은 모든 edges 에 공유된 방식으로 적용되기 때문에, 전체적인 그래프 구조를 접근하지 않아도 학습할 수 있다. 이는 그래프가 꼭 undirected 일 필요는 없다는 말도 의미한다. ($\alpha_{ij}$는 $i \rightarrow j$ 를 의미하므로) 또한, 그래프를 다 보지 않고서도 평가할 수 있기에 inductive learning 에 직접적으로 활용될 수 있다.
이전 연구와 달리 이웃 전체를 볼 수 있고, 이웃 내 노드들의 순서를 정하지 않아도 된다.
MoNet에서 node 의 structure properties 를 사용하기 위해 그래프 구조를 알아야 했던 것과 달리, GAT에서는 node features 를 similarity computations 을 위해 사용하기에 그럴 필요가 없다.

sparse matrix operations 를 시도해보려 했으나 당시 tensor 를 계산해주는 framework 에서 rank-2 tensors 만 지원해서 시도하지 못하였다.

Evaluation

Results

Transductive tasks: mean classification accuracy 측정했는데 세 데이터셋에 대해 SOTA 성능을 냈음
Inductive tasks: 두 unseen test graph에 대해 micro-averaged $F_1$ score 를 측정했는데 SOTA 성능을 냈음
Const-GAT (constant attention mechanism) 와 비교했을 때 성능 개선이 있는 것으로 보아 다른 이웃들에게 다른 가중치를 부여하는 것이 중요하다는 것을 확인함

t-SNE

Node features 를 t-SNE 를 통해 2차원으로 시각화했는데, node classification 이 잘 되고 있음을 볼 수 있다.

Conclusions

Masked self-attentional layer를 활용하여 그래프 구조 데이터에 적용할 수 있는 새로운 convolutional-style neural networks 인 Graph Attention Networks (GATs)를 제시하였다. 이 모델은 계산이 매우 효율적이며, 다른 노드들에게 다른 중요도를 부여할 수 있고, 그래프의 전체 구조를 알지 않아도 학습이 가능하다. 이를 통해 transductive 와 inductive 둘 다 node classification benchmarks 에서 SOTA 성능을 냈다.

Future works

더 큰 batch sizes에 서 다룰 수 있도록 개선할 수 있다. (sparse matrix operation)
Attention mechanism을 통해 모델을 해석해볼 수 있다.
Edge features 가 있는 그래프를 다뤄볼 수 있도록 개선할 수 있다.

Twitter Facebook LinkedIn

Jiwon

Graph Attention Networks (ICLR 2018)

Paper

Abstract

Introduction

CNN