Authors: Kangkan Wang, Jin Xie, Guofeng Zhang, Lei Liu, Jian Yang Description: This work addresses the problem of 3D human pose and shape estimation from a sequence of point clouds. Existing sequential 3D human shape estimation methods mainly focus on the template model fitting from a sequence of depth images or the parametric model regression from a sequence of RGB images. In this paper, we propose a novel sequential 3D human pose and shape estimation framework from a sequence of point clouds. Specifically, the proposed framework can regress 3D coordinates of mesh vertices at different resolutions from the latent features of point clouds. Based on the estimated 3D coordinates and features at the low resolution, we develop a spatial-temporal mesh attention convolution (MAC) to predict the 3D coordinates of mesh vertices at the high resolution. By assigning specific attentional weights to different neighboring points in the spatial and temporal domains, our spatial-temporal MAC can capture structured spatial and temporal features of point clouds. We further generalize our framework to the real data of human bodies with a weakly supervised fine-tuning method. The experimental results on SURREAL, Human3.6M, DFAUST and the real detailed data demonstrate that the proposed approach can accurately recover the 3D body model sequence from a sequence of point clouds.