Motivation & background

probabilistic graphical models could be deployed to enhance structural consistency
confidence maps $\Leftrightarrow$ the unary potential functions
the graphical model could impose some learned pairwise potential functions on the initial confidence maps, thus enforcing spatial consistency of the body joints/keypoints
discrepancy
- GCN: $\mathbf{H}^{(l +１)}　= \sigma\left( \tilde{\mathbf{D}}^{-\frac{1}{2}} \tilde{\mathbf{A}}\tilde{\mathbf{D}}^{-\frac{1}{2}} \mathbf{H}^{(l)} \mathbf{W}^{(l)} \right) \in \mathbb{R}^{N \times C}$
- each graph node can be associated with a two dimensional confidence map
flattening the two dimensional confidence map to a single long vector
- very large feature size
- increase the computational complexity
- spatial information encoded in the confidence map would be corrupted
weight sharing is difficult to characterize different positional relationships for different pairs of neighbouring joints

Spatial information aware graph neural network

Graph $\mathcal{G} = {\mathcal{V}, \mathcal{E}}$，feature matrix $\mathbf{X} \in \mathbb{R}^{N \times W \times H}$, convolution kernel $\mathbf{F} \in \mathbb{R}^{|\mathcal{E}| \times w \times h}$, adjacent matrix $\mathbf{A} \in \mathbb{R}^{N \times N}$ $\mathbf{X}^{(l+1)} = \sigma \left( \hat{\mathbf{B}} \left( (\mathbf{C} \mathbf{X}^{(l)}) \star \mathbf{F}^{(l)} \right)\right)$

$\mathbf{C} \in \mathbb{R}^{ \mathcal{E} \times N}$: nodes to outcoming edge
- \[C_{ij} = \left\{ \begin{aligned} 1, ~&A_{jk} = 1\\ 0, ~&\mathrm{otherwise} \end{aligned} \right. (\mathrm{edge} ~ i, \mathrm{node} ~ j)\]
$\hat{\mathbf{B}} \in \mathbb{R}^{N \times \mathcal{E} }$: edges to incoming edges
- \[B_{ij} = \left\{ \begin{aligned} 1, ~&A_{ik} = 1\\ 0, ~&\mathrm{otherwise} \end{aligned} \right. (\mathrm{node} ~ i, \mathrm{edge} ~ j)\]
- \[\hat{\mathbf{B}} = \mathbf{D}^{-1} \mathbf{B}\]

初步思路

主要的思路是利用GNN、Transformer等方法（结合运动学和逆向运动学约束）解决手部姿态估计任务，旨在提出一种Graph Transformer for Hand Pose Estimation (HGT)。

输入：深度图
输出：手部关节点坐标

具体地，初步技术路线及步骤如下：

特征提取器：拟采用CNN进行特征提取 (常规做法，参见 Ref [1, 2, 5])
2D/3D检测器：特征图检测出关节热图 (常规做法，参见 Ref [1, 5])
构建图网络
- 每个热图作为一个图节点
- 通过节点->边->节点特征映射完成卷积，并学习边的权重（注意力模型，参见 Ref [3]）
图卷积层嵌入Transformer结合+Non-AutoRegression Decoding机制（待考虑，参见 Ref [4, 6]）

研究现状

Graph Transformer已经有多篇文章发表（如Ref [4]等），但是针对Hand Pose Estimation这项任务的还没有，针对Hand Pose Estimation的Transformer结构已经发表（Ref [5]），其思路旨在将深度图转化为点云处理。因此，我们可以借鉴以上论文思路，设计一种针对Hand Pose Estimation任务的Graph Transformer网络，同时在Position Encoding模块考虑融入诸如结构约束等手工特征（类似Ref [6]思路）。

TODO

考虑怎么融合Transformer+Graph+Hand Pose Estimation+…
继续调研文献获取思路…
另外，手部姿态估计任务与手部形状（网格）构建相结合也是一种趋向（Ref [1, 5]）

参考文献

[1] Monocular Real-time Hand Shape and Motion Capture using Multi-modal Data

[2] SRN: Stacked Regression Network for Real-time 3D Hand Pose Estimation

[3] SIA-GCN: A Spatial Information Aware Graph Neural Network with 2D Convolutions for Hand Pose Estimation

[4] A Generalization of Transformer Networks to Graphs

[5] 3D Hand Shape and Pose Estimation from a Single RGB Image (GNN)

[6] Hand-Transformer: Non-Autoregressive Structured Modeling for 3D Hand Pose Estimation

[7] Exploiting Spatial-temporal Relationships for 3D Pose Estimation via Graph Convolutional Networks (2D->3D+GNN)