Pose and Action Recognition (ukita)

Human Pose and Action Recognition

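Images of people captured by a camera contain a variety of information about a person's physical state, such as pose and action. Among these, human pose is basic information that reflects a person's behavior and physical ability, and estimating it through image recognition would enable a wide range of applications. This research aims to estimate human body pose from ordinary images, such as those anyone can capture with a digital camera or digital video camera, and then to classify the person's action based on the estimated pose.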

Specific past and current research topics are listed below.

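Human Pose Estimation Evaluating Part Connectivity with Discriminatively Trained Local Oriented Contour Features
Human Pose Estimation Using Region Features from Binary Segmentation of Body Parts and Background
Training Sample Selection for Efficient Learning in Human Pose Estimation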
Iterative Action and Pose Recognition using Global-and-Pose Features and Action-specific Models
Occluded Appearance Modeling with Sample Weighting for Human Pose Estimation
Ensemble Convolutional Neural Networks for Pose Estimation
Semi- and Weakly-supervised Human Pose Estimation

Human Pose Estimation Evaluating Part Connectivity with Discriminatively Trained Local Oriented Contour Features

This paper proposes contour-based features for articulated pose estimation in single images. Most recent methods, including general object recognition using graphical models, are designed around tree-structured models that evaluate appearance only within the region of each part. While these models speed up global optimization when localizing all parts, they miss useful appearance cues between neighboring parts. Our work focuses on how to evaluate part connectivity using contour cues (Fig. 1). Unlike previous work, we evaluate part connectivity locally, only along the orientation between neighboring parts and within the region where they overlap. This adaptive localization of the features is required to suppress the adverse effects of nuisance edges, such as those from background clutter and clothing textures, and to reduce computational cost. Discriminative training of the contour features further improves estimation accuracy. Experimental results on a public dataset of images of people verify the effectiveness of our contour-based features (Fig. 2).

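**Fig. 1:** (Left) Result of the baseline method; estimation of the left arm fails. (Center) Contour image; the proposed method evaluates the contour connecting the torso to the left arm. (Right) Result of the proposed method; the left arm is estimated successfully.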

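**Fig. 2:** Pose estimation results of the proposed method (left, center) and of the baseline method (right).
The number below each image indicates the proportion of body parts that were estimated successfully.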
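
To make the idea of orientation-aware connectivity concrete, the following is a minimal sketch and not the authors' implementation. It assumes a precomputed edge-magnitude map and a map of contour tangent orientations, and it approximates the overlap region between two neighboring parts by the straight segment joining their centers. The function name `connectivity_score` and all of its parameters are hypothetical.

```python
import math

def connectivity_score(edge_mag, edge_ori, p, q, n_samples=20):
    """Mean edge strength along the segment p -> q, counting only edges
    whose orientation agrees with the segment direction.

    edge_mag, edge_ori: 2D lists indexed [y][x]; edge_ori holds the contour
        tangent orientation in radians (assumed input format).
    p, q: (x, y) centers of two neighboring parts (hypothetical inputs).
    n_samples: number of sample points along the segment (must be >= 2).
    """
    direction = math.atan2(q[1] - p[1], q[0] - p[0])
    height, width = len(edge_mag), len(edge_mag[0])
    total = 0.0
    for i in range(n_samples):
        t = i / (n_samples - 1)
        x = round(p[0] + t * (q[0] - p[0]))
        y = round(p[1] + t * (q[1] - p[1]))
        if 0 <= x < width and 0 <= y < height:
            # |cos| treats orientations as undirected: theta and theta + pi agree.
            agreement = abs(math.cos(edge_ori[y][x] - direction))
            total += edge_mag[y][x] * agreement
    return total / n_samples
```

Edges that run across the inter-part direction, as clutter and clothing texture often do, contribute little to this score, which is the suppression effect the abstract describes. Discriminative training of the feature weights is not shown here.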

Human Pose Estimation Using Region Features from Binary Segmentation of Body Parts and Background

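This work proposes a method for estimating human body pose from a single still image captured against a complex background. In this problem, it is important to correctly evaluate how likely each window in the image is to be each body part (head, torso, arms, etc.), that is, the similarity between the appearance of the window and the appearance of the part. Most previous methods compute the similarity for each part using only intensity-gradient-based features. As a result, they detect not only the body contours that are important for a similarity evaluation that is robust to differences between individuals, but also clothing textures and complex background patterns. To solve this problem, we propose a new two-stage process for evaluating part similarity. First, we propose a part-background segmentation method for extracting body part regions more accurately. In each candidate region of a part, this method binarizes the region into part and background by referring to prior knowledge of the part's shape, thereby extracting only the body contour as a region feature, and evaluates the similarity to each part based on this region feature. Second, we propose a part similarity evaluation that integrates the region feature with the conventional intensity-gradient-based features so that the two are used complementarily. An example result is shown in Fig. 3.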

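**Fig. 3:** Pose estimation results of the proposed method (left) and of the baseline method (right).
The number below each image indicates the proportion of body parts that were estimated successfully.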
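
The two-stage evaluation can be illustrated with a deliberately simplified sketch. It is not the segmentation method of this work. As stand-ins, it binarizes a candidate window by assigning each pixel to the nearer of two mean intensities taken from a binary shape prior, measures the region feature as the overlap (intersection over union) between the resulting mask and the prior, and combines the region and gradient scores with a fixed weight. All function names, the prior format, and the weight are assumptions made for illustration.

```python
def binarize_with_prior(window, prior):
    """Label each pixel as part (1) or background (0).

    window: 2D list of intensities.
    prior: 2D list of 0/1 giving the expected part shape (assumed format);
        it must contain at least one 1 and at least one 0.
    """
    fg = [v for wr, pr in zip(window, prior) for v, m in zip(wr, pr) if m]
    bg = [v for wr, pr in zip(window, prior) for v, m in zip(wr, pr) if not m]
    fg_mean = sum(fg) / len(fg)
    bg_mean = sum(bg) / len(bg)
    # Assign every pixel to whichever mean intensity it is closer to.
    return [[1 if abs(v - fg_mean) < abs(v - bg_mean) else 0 for v in row]
            for row in window]

def region_similarity(mask, prior):
    """Intersection over union between the binary mask and the shape prior."""
    pairs = [(a, b) for mr, pr in zip(mask, prior) for a, b in zip(mr, pr)]
    inter = sum(1 for a, b in pairs if a and b)
    union = sum(1 for a, b in pairs if a or b)
    return inter / union if union else 0.0

def part_similarity(region_score, gradient_score, weight=0.5):
    """Weighted combination of the two cues (the weight is an assumption)."""
    return weight * region_score + (1.0 - weight) * gradient_score
```

Because the mask ignores texture inside the part and patterns in the background, the region score responds mainly to the part's silhouette, while the gradient score keeps the detail that the binarization discards.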

Training Sample Selection for Efficient Learning in Human Pose Estimation

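When training a model for recognition, recognition performance improves as the number of samples increases, but the computational cost of training also grows with the number of samples. We therefore propose to speed up training by selecting training samples appropriately, without a large drop in estimation accuracy. This work targets human pose estimation. The proposed method differs from previous methods in that, in addition to a framework applicable to general recognition problems, it provides a further speed-up through dimensionality reduction specialized for human pose features. We propose three sample selection methods. The first is selection by clustering, which avoids redundant training. The second is selection by distance from the decision boundary, which focuses on updating the decision boundary efficiently. The third combines the two preceding methods and, for the time-consuming clustering and false-detection search, adds feature dimensionality reduction and pruning specialized for the human pose estimation problem. In experiments with these methods, training time was reduced by 79% while the drop in the accuracy of human pose estimation based on statistical parameters was kept to 3% or less. By selecting an appropriate model through fast training with the proposed method and then training that model on all samples, a large reduction in total training time can be combined with the same high recognition rate as before (Fig. 4).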

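**Fig. 4:** Reduction in training time with the proposed method. The red bars show the computation time needed to obtain the best model by training on all samples. The orange bars show the computation time needed to select the best model by training only on the selected samples with the proposed method and then to train that best model on all samples.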
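
The first two selection strategies can be sketched as follows. This is only an illustration under stated assumptions: a greedy "leader" clustering stands in for whatever clustering the work actually uses, and the classifier is assumed to be linear with known weights `w` and bias `b`. The function names and parameters are hypothetical, and the pose-specific dimensionality reduction and pruning of the third method are not shown.

```python
import math

def select_by_clustering(samples, radius):
    """Keep one representative per neighborhood to avoid redundant samples.

    Greedy leader clustering: a sample is kept only if it lies farther than
    `radius` from every sample kept so far.
    """
    kept = []
    for s in samples:
        if all(math.dist(s, k) > radius for k in kept):
            kept.append(s)
    return kept

def select_near_boundary(samples, w, b, max_dist):
    """Keep samples within `max_dist` of the hyperplane w . x + b = 0."""
    norm = math.hypot(*w)
    selected = []
    for s in samples:
        distance = abs(sum(wi * xi for wi, xi in zip(w, s)) + b) / norm
        if distance <= max_dist:
            selected.append(s)
    return selected
```

Samples far from the decision boundary rarely change it, so keeping only the nearby ones preserves most of the information that an update needs, while the clustering step removes near-duplicates regardless of where they lie.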

Iterative Action and Pose Recognition using Global-and-Pose Features and Action-specific Models

This work proposes an iterative scheme between human action classification and pose estimation in still images (Fig. 5). For initial action classification, we employ global image features that represent a scene (e.g. people, background, and other objects), which can be extracted without any difficult human-region segmentation such as pose estimation. This classification gives us the probability estimates of possible actions in a query image. The probability estimates are used to evaluate the results of pose estimation using action-specific models. The estimated pose is then merged with the global features for action re-classification. This iterative scheme can mutually improve action classification and pose estimation, as illustrated in Fig. 6. Experimental results with a public dataset demonstrate the effectiveness of global features for initialization, action-specific models for pose estimation, and action classification with global and pose features.

**Fig. 5:** Pose representation with 10 body parts.
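
The alternation between action classification and pose estimation described above might be organized as in the sketch below. It is a schematic under assumptions, not the published algorithm: `classify` and the entries of `pose_models` are hypothetical callables supplied by the caller, and weighting each pose score by the action probability through a simple product is an illustrative choice.

```python
def iterate_action_and_pose(image, global_features, classify, pose_models,
                            n_iters=3):
    """Alternate between action classification and pose estimation.

    classify(global_features, pose) -> {action: probability}; pose is None
        for the initial, global-features-only classification.
    pose_models: {action: fn(image) -> (pose, score)}, one action-specific
        pose estimator per action (hypothetical interface).
    """
    probs = classify(global_features, None)  # initialization from the scene
    pose = None
    for _ in range(n_iters):
        # Evaluate each action-specific pose estimate, weighted by how
        # probable that action currently is (assumed combination rule).
        candidates = []
        for action, model in pose_models.items():
            estimated_pose, score = model(image)
            weighted = probs.get(action, 0.0) * score
            candidates.append((weighted, action, estimated_pose))
        pose = max(candidates, key=lambda c: c[0])[2]
        # Re-classify the action using both global and pose features.
        new_probs = classify(global_features, pose)
        if new_probs == probs:
            break  # estimates have stopped changing
        probs = new_probs
    best_action = max(probs, key=probs.get)
    return best_action, pose, probs
```

Starting from global features avoids needing a pose before any action hypothesis exists, and each later pass lets the pose estimate and the action estimate correct one another.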