Authors: Jingwei Xu, Zhenbo Yu, Bingbing Ni, Jiancheng Yang, Xiaokang Yang, Wenjun Zhang Description: For monocular 3D pose estimation conditioned on 2D detection, noisy/unreliable input is a key obstacle in this task. Simple structure constraints attempting to tackle this problem, e.g., symmetry loss and joint angle limit, could only provide marginal improvements and are commonly treated as auxiliary losses in previous researches. Thus it still remains challenging about how to effectively utilize the power of human prior knowledge for this task. In this paper, we propose to address above issue in a systematic view. Firstly, we show that optimizing the kinematics structure of noisy 2D inputs is critical to obtain accurate 3D estimations. Secondly, based on corrected 2D joints, we further explicitly decompose articulated motion with human topology, which leads to more compact 3D static structure easier for estimation. Finally, temporal refinement emphasizing the validity of 3D dynamic structure is naturally developed to pursue more accurate result. Above three steps are seamlessly integrated into deep neural models, which form a deep kinematics analysis pipeline concurrently considering the static/dynamic structure of 2D inputs and 3D outputs. Extensive experiments show that proposed framework achieves state-of-the-art performance on two widely used 3D human action datasets. Meanwhile, targeted ablation study shows that each former step is critical for the latter one to obtain promising results.