【GNN】SIA-GCN - A Spatial Information Aware Graph Neural Network with 2D Convolutions for Hand Pose Estimation

Motivation & background

  • probabilistic graphical models could be deployed to enhance structural consistency

  • confidence maps $\Leftrightarrow$ the unary potential functions

  • the graphical model could impose some learned pairwise potential functions on the initial confidence maps, thus enforcing spatial consistency of the body joints/keypoints

  • discrepancy between the standard GCN formulation and confidence-map features

    • GCN: \(\mathbf{H}^{(l +1)} = \sigma\left( \tilde{\mathbf{D}}^{-\frac{1}{2}} \tilde{\mathbf{A}}\tilde{\mathbf{D}}^{-\frac{1}{2}} \mathbf{H}^{(l)} \mathbf{W}^{(l)} \right) \in \mathbb{R}^{N \times C}\)

    • in pose estimation, each graph node is instead associated with a two-dimensional confidence map

  • flattening the two-dimensional confidence map into a single long vector has three drawbacks

    • the feature size becomes very large

    • the computational complexity increases

    • the spatial information encoded in the confidence map is destroyed

  • with weight sharing, a single weight matrix can hardly characterize the different positional relationships between different pairs of neighbouring joints
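To make the discrepancy concrete, here is a minimal NumPy sketch of the GCN rule above applied to flattened confidence maps. It assumes the usual convention $\tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I}$ and ReLU as $\sigma$; the 3-node chain, the map sizes, and the names `gcn_layer`, `maps`, `flat_size`, and `dense_weights` are illustrative choices, not taken from the paper.

```python
import numpy as np

def gcn_layer(H, A, W):
    """One GCN step: ReLU(D~^{-1/2} A~ D~^{-1/2} H W)."""
    A_tilde = A + np.eye(A.shape[0])        # adjacency with self-loops
    d = A_tilde.sum(axis=1)                 # node degrees of A_tilde
    D_inv_sqrt = np.diag(d ** -0.5)
    A_norm = D_inv_sqrt @ A_tilde @ D_inv_sqrt
    return np.maximum(A_norm @ H @ W, 0.0)  # ReLU chosen as sigma (assumption)

# Toy 3-node chain 0 - 1 - 2 with tiny 4 x 4 confidence maps (hypothetical sizes).
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
maps = np.random.default_rng(0).random((3, 4, 4))

H0 = maps.reshape(3, -1)        # flatten to (N, 16): the 2D layout is lost
W0 = np.eye(H0.shape[1])        # one weight matrix shared by every node pair
H1 = gcn_layer(H0, A, W0)

# Cost of the same trick at a larger resolution (assumed 64 x 64 maps):
flat_size = 64 * 64             # feature size C after flattening
dense_weights = flat_size ** 2  # entries of a dense C x C weight matrix
```

The single matrix `W0` is applied identically to every neighbour, which is the weight-sharing limitation named in the last bullet.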

Spatial information aware graph neural network

Graph $\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}$, feature matrix $\mathbf{X} \in \mathbb{R}^{N \times W \times H}$, convolution kernel $\mathbf{F} \in \mathbb{R}^{|\mathcal{E}| \times w \times h}$, adjacency matrix $\mathbf{A} \in \mathbb{R}^{N \times N}$:

\[\mathbf{X}^{(l+1)} = \sigma \left( \hat{\mathbf{B}} \left( (\mathbf{C} \mathbf{X}^{(l)}) \star \mathbf{F}^{(l)} \right)\right)\]

Here $\star$ denotes a 2D convolution applied per edge: the confidence map carried by edge $i$ is filtered with the $i$-th $w \times h$ kernel of $\mathbf{F}^{(l)}$.

  • $\mathbf{C} \in \mathbb{R}^{|\mathcal{E}| \times N}$: maps nodes to their outgoing edges
    • \[C_{ij} = \left\{ \begin{aligned} 1, ~&\text{edge } i = (j, k) \text{ leaves node } j, \text{ i.e. } A_{jk} = 1\\ 0, ~&\mathrm{otherwise} \end{aligned} \right. \quad (\mathrm{edge} ~ i, \mathrm{node} ~ j)\]
  • $\hat{\mathbf{B}} \in \mathbb{R}^{N \times |\mathcal{E}|}$: maps edges to the nodes they enter
    • \[B_{ij} = \left\{ \begin{aligned} 1, ~&\text{edge } j = (k, i) \text{ enters node } i, \text{ i.e. } A_{ki} = 1\\ 0, ~&\mathrm{otherwise} \end{aligned} \right. \quad (\mathrm{node} ~ i, \mathrm{edge} ~ j)\]
    • \[\hat{\mathbf{B}} = \mathbf{D}^{-1} \mathbf{B}, \quad D_{ii} = \sum_j B_{ij}\]
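The layer can be sketched in plain NumPy as follows. This is an illustrative reading of the formula, not the authors' implementation. It assumes that an edge is an ordered pair (source, destination) with `A[source, destination] == 1`, that $\mathbf{D}$ holds the row sums of $\mathbf{B}$, that $\star$ is a zero-padded cross-correlation with odd kernel sizes, and that $\sigma$ is ReLU. The helper names `build_edge_maps`, `conv2d_same`, and `sia_gcn_layer` are hypothetical.

```python
import numpy as np

def build_edge_maps(A):
    """Return (edges, C, B_hat) for an adjacency matrix A of shape (N, N)."""
    N = A.shape[0]
    edges = [(j, k) for j in range(N) for k in range(N) if A[j, k] == 1]
    C = np.zeros((len(edges), N))
    B = np.zeros((N, len(edges)))
    for e, (src, dst) in enumerate(edges):
        C[e, src] = 1.0                     # edge e reads from its source node
        B[dst, e] = 1.0                     # edge e delivers to its destination
    deg = B.sum(axis=1, keepdims=True)      # number of incoming edges per node
    B_hat = B / np.maximum(deg, 1.0)        # D^{-1} B; isolated rows stay zero
    return edges, C, B_hat

def conv2d_same(x, f):
    """Zero-padded 2D cross-correlation, output shaped like x (odd kernels)."""
    h, w = f.shape
    xp = np.pad(x, ((h // 2, h // 2), (w // 2, w // 2)))
    out = np.zeros_like(x, dtype=float)
    for a in range(h):
        for b in range(w):
            out += f[a, b] * xp[a:a + x.shape[0], b:b + x.shape[1]]
    return out

def sia_gcn_layer(X, A, F):
    """X: (N, H, W) maps, F: (|E|, h, w) kernels, one kernel per edge."""
    edges, C, B_hat = build_edge_maps(A)
    edge_maps = np.tensordot(C, X, axes=1)  # (|E|, H, W): copy source maps
    filtered = np.stack([conv2d_same(m, f) for m, f in zip(edge_maps, F)])
    out = np.tensordot(B_hat, filtered, axes=1)  # average over incoming edges
    return np.maximum(out, 0.0)             # ReLU as sigma (assumption)

# Toy chain 0 - 1 - 2; edges are ordered (0,1), (1,0), (1,2), (2,1).
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
edges, C, B_hat = build_edge_maps(A)
X = np.zeros((3, 5, 5))
X[0, 2, 2] = 1.0                            # node 0 has a peak at (2, 2)
F = np.zeros((4, 3, 3))
F[:, 1, 1] = 1.0                            # identity kernels on every edge ...
F[0] = 0.0
F[0, 0, 1] = 1.0                            # ... except edge (0,1): shift down by one row
out = sia_gcn_layer(X, A, F)
```

In this toy run the peak of node 0 arrives at node 1 one row lower, at (3, 2), with weight 0.5 because node 1 averages its two incoming edges. Each edge owns its kernel, so each pair of neighbouring joints can encode its own spatial offset, and the maps are never flattened.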

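Preliminary idea

The main idea is to tackle hand pose estimation with GNNs, Transformers, and related methods (combined with kinematic and inverse-kinematic constraints), with the aim of proposing a Graph Transformer for Hand Pose Estimation (HGT).

  • Input: depth map
  • Output: hand joint coordinates

Concretely, the preliminary technical route and its steps are as follows:

  • Feature extractor: a CNN is planned for feature extraction (standard practice, see Ref [1, 2, 5])
  • 2D/3D detector: detect joint heatmaps from the feature maps (standard practice, see Ref [1, 5])
  • Build the graph network
    • each heatmap serves as one graph node
    • perform the convolution through a node -> edge -> node feature mapping, and learn the edge weights (attention model, see Ref [3])
  • Embed the graph convolution layers in a Transformer, combined with a non-autoregressive decoding mechanism (under consideration, see Ref [4, 6])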
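The route above produces one heatmap per joint, while the stated output is joint coordinates. The notes do not say how to bridge the two. One common option, shown here purely as an assumption, is a soft-argmax that takes the expected position under a softmax of each heatmap. The function name `soft_argmax_2d` and the sharpness parameter `beta` are hypothetical.

```python
import numpy as np

def soft_argmax_2d(heatmaps, beta=10.0):
    """Map heatmaps of shape (N, H, W) to (N, 2) expected (row, col) positions."""
    N, H, W = heatmaps.shape
    flat = heatmaps.reshape(N, -1) * beta
    flat = flat - flat.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(flat)
    p /= p.sum(axis=1, keepdims=True)               # softmax over all pixels
    p = p.reshape(N, H, W)
    rows = (p.sum(axis=2) * np.arange(H)).sum(axis=1)  # expected row index
    cols = (p.sum(axis=1) * np.arange(W)).sum(axis=1)  # expected column index
    return np.stack([rows, cols], axis=1)

# One joint with a single sharp peak at row 3, column 5.
heatmaps = np.zeros((1, 9, 11))
heatmaps[0, 3, 5] = 1.0
coords = soft_argmax_2d(heatmaps, beta=50.0)
```

Unlike a hard argmax, this readout is differentiable, so it could be trained end to end with the graph layers. A small `beta` pulls the estimate toward the map centre, so the value needs tuning.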

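Current state of research

Several Graph Transformer papers have already been published (e.g. Ref [4]), but none of them targets hand pose estimation. A Transformer architecture for hand pose estimation has also been published (Ref [5]); its idea is to convert the depth map into a point cloud for processing. We can therefore borrow from these papers to design a Graph Transformer network for the hand pose estimation task, while considering how to incorporate handcrafted features such as structural constraints into the position encoding module (similar to the idea in Ref [6]).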

TODO

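  • Work out how to combine Transformer + Graph + Hand Pose Estimation + …
  • Keep surveying the literature for ideas …
  • In addition, combining hand pose estimation with hand shape (mesh) reconstruction is also a trend (Ref [1, 5])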

References

[1] Monocular Real-time Hand Shape and Motion Capture using Multi-modal Data

[2] SRN: Stacked Regression Network for Real-time 3D Hand Pose Estimation

[3] SIA-GCN: A Spatial Information Aware Graph Neural Network with 2D Convolutions for Hand Pose Estimation

[4] A Generalization of Transformer Networks to Graphs

[5] 3D Hand Shape and Pose Estimation from a Single RGB Image (GNN)

[6] Hand-Transformer: Non-Autoregressive Structured Modeling for 3D Hand Pose Estimation

[7] Exploiting Spatial-temporal Relationships for 3D Pose Estimation via Graph Convolutional Networks (2D->3D+GNN)

Kanglei Zhou