Motivation & background
-
probabilistic graphical models could be deployed to enhance structural consistency
-
confidence maps $\Leftrightarrow$ the unary potential functions
-
the graphical model could impose some learned pairwise potential functions on the initial confidence maps, thus enforcing spatial consistency of the body joints/keypoints
-
discrepancy
-
GCN: \(\mathbf{H}^{(l +1)} = \sigma\left( \tilde{\mathbf{D}}^{-\frac{1}{2}} \tilde{\mathbf{A}}\tilde{\mathbf{D}}^{-\frac{1}{2}} \mathbf{H}^{(l)} \mathbf{W}^{(l)} \right) \in \mathbb{R}^{N \times C}\)
-
each graph node can be associated with a two dimensional confidence map
-
-
flattening the two dimensional confidence map to a single long vector
-
very large feature size
-
increase the computational complexity
-
spatial information encoded in the confidence map would be corrupted
-
-
weight sharing is difficult to characterize different positional relationships for different pairs of neighbouring joints
Spatial information aware graph neural network
Graph $\mathcal{G} = {\mathcal{V}, \mathcal{E}}$,feature matrix $\mathbf{X} \in \mathbb{R}^{N \times W \times H}$, convolution kernel $\mathbf{F} \in \mathbb{R}^{|\mathcal{E}| \times w \times h}$, adjacent matrix $\mathbf{A} \in \mathbb{R}^{N \times N}$ \(\mathbf{X}^{(l+1)} = \sigma \left( \hat{\mathbf{B}} \left( (\mathbf{C} \mathbf{X}^{(l)}) \star \mathbf{F}^{(l)} \right)\right)\)
-
$\mathbf{C} \in \mathbb{R}^{ \mathcal{E} \times N}$: nodes to outcoming edge - \[C_{ij} = \left\{ \begin{aligned} 1, ~&A_{jk} = 1\\ 0, ~&\mathrm{otherwise} \end{aligned} \right. (\mathrm{edge} ~ i, \mathrm{node} ~ j)\]
-
$\hat{\mathbf{B}} \in \mathbb{R}^{N \times \mathcal{E} }$: edges to incoming edges - \[B_{ij} = \left\{ \begin{aligned} 1, ~&A_{ik} = 1\\ 0, ~&\mathrm{otherwise} \end{aligned} \right. (\mathrm{node} ~ i, \mathrm{edge} ~ j)\]
- \[\hat{\mathbf{B}} = \mathbf{D}^{-1} \mathbf{B}\]
初步思路
主要的思路是利用GNN、Transformer等方法(结合运动学和逆向运动学约束)解决手部姿态估计任务,旨在提出一种Graph Transformer for Hand Pose Estimation (HGT)。
- 输入:深度图
- 输出:手部关节点坐标
具体地,初步技术路线及步骤如下:
- 特征提取器:拟采用CNN进行特征提取 (常规做法,参见 Ref [1, 2, 5])
- 2D/3D检测器:特征图检测出关节热图 (常规做法,参见 Ref [1, 5])
- 构建图网络
- 每个热图作为一个图节点
- 通过节点->边->节点特征映射完成卷积,并学习边的权重(注意力模型,参见 Ref [3])
- 图卷积层嵌入Transformer结合+Non-AutoRegression Decoding机制(待考虑,参见 Ref [4, 6])
研究现状
Graph Transformer已经有多篇文章发表(如Ref [4]等),但是针对Hand Pose Estimation这项任务的还没有,针对Hand Pose Estimation的Transformer结构已经发表(Ref [5]),其思路旨在将深度图转化为点云处理。因此,我们可以借鉴以上论文思路,设计一种针对Hand Pose Estimation任务的Graph Transformer网络,同时在Position Encoding模块考虑融入诸如结构约束等手工特征(类似Ref [6]思路)。
TODO
- 考虑怎么融合Transformer+Graph+Hand Pose Estimation+…
- 继续调研文献获取思路…
- 另外,手部姿态估计任务与手部形状(网格)构建相结合也是一种趋向(Ref [1, 5])
参考文献
[1] Monocular Real-time Hand Shape and Motion Capture using Multi-modal Data
[2] SRN: Stacked Regression Network for Real-time 3D Hand Pose Estimation
[3] SIA-GCN: A Spatial Information Aware Graph Neural Network with 2D Convolutions for Hand Pose Estimation
[4] A Generalization of Transformer Networks to Graphs
[5] 3D Hand Shape and Pose Estimation from a Single RGB Image (GNN)
[6] Hand-Transformer: Non-Autoregressive Structured Modeling for 3D Hand Pose Estimation
[7] Exploiting Spatial-temporal Relationships for 3D Pose Estimation via Graph Convolutional Networks (2D->3D+GNN)