What is human pose estimation?
It is easy for a human to tell whether a person is squatting down or waving a hand. For a machine? Not so simple. When shown a photo or a video frame, a computer “sees” a collection of pixels.
Human pose estimation technology enables a computer to model human bodies and recognize body postures in images and videos, including real-time footage.
How does human pose estimation work?
Traditionally, human pose estimation relied on machine learning algorithms, specifically random forests. The essence of the ML-based approach boiled down to presenting a human body as a collection of parts arranged in a deformable structure, such as the one below.
But the classical, ML-based method struggled to identify occluded body parts and failed at multi-person pose estimation. That sparked the transition to the deep learning methods used for human pose estimation today.
There are many ways to interpret a human posture with deep learning (hence, the many tools and libraries, which we explore in the subsequent section). Here’s how a typical deep learning model for human pose estimation runs.
A standard deep learning model for human pose estimation has a convolutional neural network (CNN) at its base. And the CNN’s architecture comprises two key components—an encoder (in some approaches referred to as an estimator) and a decoder (or a detector).
An encoder extracts specific features, called keypoints, from an input image. The number of keypoints depends on the approach and usually ranges from 17 to 33. Examples of keypoints include a left elbow, a right knee, or the base of the neck. The encoder runs the input through a series of narrowing convolution blocks and extracts the features of all people in the image.
A decoder then creates a keypoint probability heatmap. It helps refine the encoder’s predictions and estimates how likely the extracted features are to be located in the marked areas of the image.
So, the CNN maps the pre-estimated keypoints to the heatmap and outputs those with the highest probability rate. It then groups the associated keypoints into a skeleton-like body structure.
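As a minimal illustration of the decoding step, here is a pure-Python sketch, with toy data and no real network, of picking a keypoint location as the highest-probability cell of its heatmap:

```python
# Toy sketch of decoding one keypoint from a heatmap: the decoder outputs a
# probability grid per keypoint, and the peak cell gives the keypoint location.
# Real models work on tensors and refine peaks with sub-pixel offsets.

def decode_keypoint(heatmap, threshold=0.5):
    """Return (row, col, score) of the highest-probability cell,
    or None if no cell clears the confidence threshold."""
    best = None
    for r, row in enumerate(heatmap):
        for c, score in enumerate(row):
            if best is None or score > best[2]:
                best = (r, c, score)
    if best is None or best[2] < threshold:
        return None
    return best

# A 4x4 heatmap for, say, the "left elbow" keypoint:
heatmap = [
    [0.01, 0.02, 0.01, 0.00],
    [0.03, 0.10, 0.85, 0.05],
    [0.02, 0.07, 0.12, 0.03],
    [0.00, 0.01, 0.02, 0.01],
]
print(decode_keypoint(heatmap))  # → (1, 2, 0.85)
```

The confidence threshold is what lets the model output "no left elbow visible" instead of guessing when a body part is occluded.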
For multi-person estimation, when there are several high-probability areas for each keypoint, say, two left elbows, an additional post-processing layer may be added to ensure each keypoint belongs to the right person.
The method described above follows a bottom-up pipeline. It means that a CNN locates every instance of a particular keypoint first and then groups the associated ones into a body structure.
A model may follow a top-down approach, too. In this case, a CNN first locates all the bodies in an image, putting them into bounding boxes, and then detects the keypoints within each bounding box.
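The top-down flow can be contrasted with the bottom-up one using a toy sketch. In the pure-Python illustration below, `detect_people` and `estimate_in_crop` are hypothetical stand-ins for real detector and single-person estimator models:

```python
# Simplified top-down pipeline: detect bodies first (bounding boxes), then
# estimate keypoints inside each box and shift them back to image coordinates.

def detect_people(image):
    # Stand-in detector: pretend two people were found at these boxes (x, y, w, h).
    return [(10, 20, 50, 100), (120, 30, 40, 90)]

def estimate_in_crop(crop_box):
    # Stand-in single-person estimator: keypoints relative to the crop origin.
    return {"nose": (25, 10), "left_elbow": (15, 45)}

def top_down_pose(image):
    poses = []
    for (x, y, w, h) in detect_people(image):
        local = estimate_in_crop((x, y, w, h))
        # Translate crop-local keypoints into full-image coordinates.
        poses.append({name: (x + kx, y + ky) for name, (kx, ky) in local.items()})
    return poses

print(top_down_pose(image=None))
```

Note the trade-off this structure implies: top-down models run the keypoint estimator once per detected person, so their cost grows with crowd size, while bottom-up models run once per image.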
A rundown of essential human pose estimation frameworks and libraries
For the purpose of this article, we’ll focus on CNN-based solutions for human pose estimation since they boast superior performance. Let’s review the most popular deep learning-based human pose detection tools and explore common open-source and proprietary libraries that could ease the implementation of human pose estimation.
The most notable and widely used human pose estimation frameworks
OpenPose
One of the earliest and most popular frameworks for real-time, multi-person 2D human pose estimation, OpenPose relies on a convolutional neural network with two branches. One branch creates confidence maps for each keypoint found in an image. The other estimates the degree of association between keypoints, predicting so-called part affinity fields. Combining the output of the two branches, OpenPose builds a model of the human skeleton, and subsequent layers of the network refine the prediction.
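The part-affinity idea can be sketched in a few lines. The toy code below (an illustration of the principle, not OpenPose's actual implementation) scores a candidate limb by how well affinity-field vectors sampled along the segment align with the limb's direction:

```python
import math

# Toy illustration of part affinity fields: a candidate limb (e.g. a particular
# shoulder paired with a particular elbow) is scored by sampling the 2D affinity
# field along the segment and measuring how well the field vectors align with
# the limb direction. The field here is a plain dict from grid cell to a unit
# vector; in OpenPose it is a dense per-limb vector map predicted by the CNN.

def limb_score(paf, p1, p2, samples=5):
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    norm = math.hypot(dx, dy)
    if norm == 0:
        return 0.0
    ux, uy = dx / norm, dy / norm           # limb direction (unit vector)
    total = 0.0
    for i in range(samples):
        t = i / (samples - 1)
        cell = (round(p1[0] + t * dx), round(p1[1] + t * dy))
        vx, vy = paf.get(cell, (0.0, 0.0))  # field vector at the sample point
        total += ux * vx + uy * vy          # dot product = alignment
    return total / samples

# A field pointing right along the row y=0: a horizontal limb there scores 1.0.
paf = {(x, 0): (1.0, 0.0) for x in range(11)}
print(limb_score(paf, (0, 0), (10, 0)))   # high score: the field supports this limb
print(limb_score(paf, (0, 5), (10, 5)))   # zero score: no field support here
```

Scoring every candidate pair this way and keeping the best-scoring matches is what lets a bottom-up model connect the right elbow to the right shoulder in a crowded image.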
YOLO-Pose
Based on the popular YOLO object detection framework, YOLO-Pose is a heatmap-free approach to joint detection and 2D multi-person human pose estimation. Although YOLO-Pose operates in a single shot like other bottom-up approaches, it doesn't use heatmaps: since each bounding box has an associated pose, there is no need for the heatmap post-processing that bottom-up methods use to parse detected keypoints into a human pose. This stands in contrast to multi-stage pose estimation techniques such as OpenPose, which identify body parts first and then assemble them into complete poses. And unlike top-down methods, YOLO-Pose doesn't need to crop individual instances from bounding boxes, as all people are localized along with their poses in a single inference pass.
YOLO-Pose works well in situations where simplicity and speed are key considerations. Thanks to its ability to detect body joint keypoints and deliver full-body skeletal pose information at high speed, YOLO-Pose suits live applications like gaming, surveillance, and fitness tracking.
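To illustrate why no heatmap decoding is needed, here is a hypothetical, deliberately simplified sketch of parsing a single flat detection vector into a box plus 17 keypoints; the real network's output layout and post-processing differ:

```python
# Illustration of the heatmap-free idea: each detection is one flat vector
# carrying the box, its confidence, and all 17 keypoints (x, y, confidence
# each), so the pose falls out of the detection directly, with no heatmap
# decoding or keypoint-grouping step. The layout below is an invented
# simplification, not the real network output format.

NUM_KEYPOINTS = 17

def parse_detection(vec):
    box = tuple(vec[0:4])   # (center_x, center_y, width, height)
    box_conf = vec[4]
    kpts = [
        (vec[5 + 3 * i], vec[6 + 3 * i], vec[7 + 3 * i])  # (x, y, conf)
        for i in range(NUM_KEYPOINTS)
    ]
    return {"box": box, "conf": box_conf, "keypoints": kpts}

# One fake detection: 4 box values + 1 confidence + 17 * 3 keypoint values.
vec = [50.0, 60.0, 30.0, 80.0, 0.9] + [float(i) for i in range(NUM_KEYPOINTS * 3)]
det = parse_detection(vec)
print(det["box"], det["conf"], det["keypoints"][0])
```

Because box and keypoints live in the same prediction, standard non-maximum suppression over boxes simultaneously deduplicates poses.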
AlphaPose
AlphaPose is a popular framework for human pose estimation designed to produce more accurate and robust pose estimations in complex, multi-person real-world conditions such as crowded environments and occlusions, where traditional methods may struggle to differentiate between individuals.
AlphaPose first identifies and locates people in an image using a region proposal network (RPN). Then it employs a single-person pose estimation (SPPE) module to estimate the keypoints of each individual. To further improve accuracy, AlphaPose integrates pose-guided proposals, which refine the bounding boxes around each detected person. To eliminate redundant or inaccurate predictions and ensure the final pose is assigned to the correct individual, the framework uses pose non-maximum suppression (Pose NMS), which filters out overlapping and duplicate pose estimations. Its modules for handling occlusions, pose refinement, and real-time processing make this human pose detection method particularly useful in applications such as sports analysis, fitness tracking, and surveillance, where robustness and precision are critical.
HRNet (High-Resolution Network)
One of the most accurate frameworks for both single-person and multi-person pose recognition is HRNet, a more recent technique that keeps high-resolution representations throughout the entire process. It stands out for its ability to process lower-resolution feature maps in parallel branches while maintaining high-resolution feature maps. This multi-scale processing enables HRNet to capture both fine-grained local features and coarse global context. HRNet's parallel branches facilitate the efficient integration of global and local features, making the body pose estimation model resilient to variations in pose complexity, occlusion, and scale.
Hourglass Networks
This is a popular 2D human pose estimation model that uses an hourglass-shaped network architecture to repeatedly down- and up-sample images, refining pose estimates at each stage. At the downsampling phase, the encoder reduces the spatial dimensions of the input feature maps to capture global context. At the upsampling phase, the decoder progressively increases the spatial resolution of the feature maps to restore the high-resolution details and refine the locations of keypoints. This results in increased accuracy, particularly in complex and challenging scenarios.
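The down-then-up idea can be shown with a toy 1D example. This is a deliberate simplification: real hourglass networks use learned convolutions rather than fixed pooling, and stack several such stages.

```python
# Toy 1D illustration of the hourglass idea: downsample to capture coarse
# context, upsample back to restore resolution, and merge with a skip
# connection so fine detail survives the round trip.

def downsample(signal):
    # Average adjacent pairs (2x pooling): coarser, more global view.
    return [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]

def upsample(signal):
    # Nearest-neighbour: repeat each value twice to restore resolution.
    return [v for v in signal for _ in range(2)]

def hourglass_stage(signal):
    coarse = upsample(downsample(signal))                 # global context path
    return [(a + b) / 2 for a, b in zip(signal, coarse)]  # skip-connection merge

x = [0.0, 0.0, 1.0, 0.9, 0.1, 0.0]
print(hourglass_stage(x))
```

Stacking such stages and supervising each one is what lets the network correct its own keypoint estimates progressively.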
PoseNet
PoseNet is a lightweight, real-time human pose estimation system primarily designed for mobile applications. Its ability to estimate keypoints directly from images or video streams makes it ideal for applications with limited computational resources. PoseNet identifies 17 body keypoints, such as the shoulders, elbows, wrists, hips, knees, ankles, and ears. These keypoints represent the key joints or landmarks on the human body used to understand posture and movement. While not as accurate as other human pose detection models, PoseNet is highly useful in applications where speed and efficiency matter more than precise keypoint localization.
Mask R-CNN
With the Mask R-CNN human pose estimation model, the localization of a human body and keypoint detection run independently. A standard convolutional neural network extracts the keypoints, while a Region Proposal Network (RPN) locates the body. The extracted features are then passed into the parallel branches of the network that refine the predictions and generate body segmentation masks. Based on these masks, the network builds skeleton-like models for each person in the image.
Libraries for speedier implementation of human pose estimation
BlazePose
BlazePose is a human pose detection library developed and supported by Google. The library allows crafting robust human pose estimation engines that run in real time. In contrast to similar approaches that typically rely on the COCO topology and detect 17 keypoints, BlazePose's skeleton structure features 33 keypoints, including hands and feet, which is particularly beneficial for fitness and sports applications.
Each keypoint is characterized by x, y, and z coordinates and a visibility score. Unlike similar libraries that rely on heatmaps alone for keypoint refinement, BlazePose uses a regression approach that combines heatmaps with offset prediction.
MoveNet
MoveNet is an open-source library for 2D human pose estimation that detects 17 keypoints. The library comes in two variants: Lightning and Thunder. The former is intended for latency-critical applications, while the latter suits apps that require high accuracy. Both Lightning and Thunder run faster than real time on desktop, laptop, and mobile devices.
The MoveNet architecture comprises a feature extractor and a set of four prediction heads—a person center heatmap, a keypoint regression field, a person keypoint heatmap, and a 2D keypoint offset field.
CenterNet
CenterNet is a point-based object detection framework that can be extended to human pose estimation. CenterNet models people as a single point corresponding to the center point of a body’s bounding box. The model then determines other object characteristics, such as size, 3D location, orientation, and pose.
Where is human pose estimation applied?
Recognizing human posture and movements has long been in focus for major industries, including sports, retail, and entertainment. Here’s a run-through of sectors where human posture estimation is in use.
Fitness
The pandemic has pushed more people to practice physical activities at home, making the demand for fitness apps with human pose estimation grow rapidly. HPE-powered fitness applications provide detailed live feedback on whether a user performs an exercise correctly. For that, a human pose estimation component compares a model extracted from camera footage with a benchmark, thus providing for a safer home workout routine. Fitness apps featuring human pose estimation cater to various activities, from yoga to dancing to weight lifting.
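One simple way such a comparison can work, sketched in pure Python (an illustrative approach, not any particular app's method), is to compute joint angles from the estimated keypoints and check them against a benchmark range for the exercise:

```python
import math

# Compare a user's pose to an exercise benchmark by the angle at a joint,
# computed from three keypoints returned by a pose estimation model.

def joint_angle(a, b, c):
    """Angle at point b (in degrees) formed by segments b->a and b->c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    n1, n2 = math.hypot(*v1), math.hypot(*v2)
    return math.degrees(math.acos(dot / (n1 * n2)))

# Squat depth check: the hip-knee-ankle angle should be close to 90 degrees
# (the threshold values here are illustrative, not medical guidance).
hip, knee, ankle = (0.0, 0.0), (0.0, 1.0), (1.0, 1.0)
angle = joint_angle(hip, knee, ankle)
feedback = "good depth" if 80 <= angle <= 100 else "adjust your squat"
print(round(angle), feedback)  # → 90 good depth
```

Working with angles rather than raw coordinates makes the feedback invariant to where the user stands in the frame and how far they are from the camera.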
If you are developing fitness apps, there are available solutions that you can integrate into your application. For example, MotionMind offers powerful pose estimation capabilities, which can be used in fitness, public safety, healthcare, and many other industries.
Professional sports
Human pose estimation technology can help athletes improve their performance and help judges rate athletes impartially. HPE-powered applications are applied to various tasks: from assessing the quality of figure skating elements to helping soccer players strike perfect kicks to allowing high jumpers to polish their technique.
Gaming and filmmaking
Character animation has long been a laborious and complex process. Today, it is facilitated with the help of human pose estimation. Graphics, textures, and other enhancements can now be easily applied to a tracked body, so the graphics render naturally even as the body moves.
In interactive video gaming, human pose estimation is used to capture players’ motions and render them into the actions of virtual characters as well.
Retail
Whether trying to curb the effect of the pandemic or realize their vision of a supermarket of the future, retailers have started turning to AR and real-time virtual effects. Human pose estimation backs up those aspirations, enabling such experiences as virtual try-on and real-time marketing. An HPE-powered app, whether running on a customer’s mobile phone or integrated into a fitting room’s mirror, allows scanning a person’s body and imposing 3D virtual elements on the estimated posture. And that works for trying out everything from clothes to shoes to jewelry.
Robot training
Traditionally, industrial robots were trained with the help of 2D vision systems that called for time- and effort-intensive calibration. Today, human pose estimation allows for faster, more responsive, and more accurate robot training. Instead of programming robots to follow set trajectories, one may teach a robot to recognize the pose and motions of a human. Having estimated the posture of the demonstrator, a robot then devises how it should move its actuators to perform the same motion.
Security and surveillance
Human pose estimation may be applied to analyze the footage from security cameras to prevent potentially alarming situations. Identifying a human posture and estimating its anomaly score, HPE-powered security software may predict suspicious actions or identify people who have fallen down or, say, potentially feel sick.
Implementing human pose recognition: the peculiarities to keep in mind
ITRex Group has recently helped a fitness tech startup create a fitness mirror powered by artificial intelligence and human pose estimation. We sat down to talk to Kirill Stashevsky, the ITRex CTO, to discuss the specifics of implementing human pose estimation technology that contribute a lot to the project’s success but are often overlooked.
— How does one embark on the HPE implementation journey to ensure they produce a top-notch solution? What should one beware of during project planning to ensure further development efforts are headed in the right direction?
Kirill: The decisions you make at the start of the project will have a significant impact on whether or not you create a successful human pose estimation product. One such decision is selecting the optimum implementation strategy—i.e., developing an HPE solution from scratch or using one of the many human pose estimation libraries.
To choose the best-fitting approach, you need to clearly understand, among other issues, what exactly you aim to achieve with your future product, which platforms it will run on, and how much time you have until releasing your product to the market. Once you’ve clarified the vision, weigh it against the available strategies.
Consider going the custom route if the task you are solving is narrow and non-trivial and requires the ultimate accuracy of human pose estimation. Keep in mind, however, that the development process is likely to be time- and effort-intensive.
In turn, if you are developing a product with a mass-market appeal or a product that caters to a typical use case, going for library-based development would help build a quality prototype faster and with less effort. Still, in many cases, you would have to adjust the given model to your specific use case by further training it on the data that best represents real-world scenarios.
— Suppose I decide to go for library-based development; what factors should I consider to choose the right one?
Kirill: You may go for a proprietary or an open-source library. Proprietary libraries could ensure more accurate pose estimation and require less customization. But you have to prepare a backup plan in case the vendor, say, discontinues the library support.
Open-source libraries, in turn, often require more effort to configure. But with an experienced team, it may be an optimum option balancing the quality of recognition, moderate development costs, and fair time-to-market.
Pay attention to the number of keypoints a library is able to recognize, too. A solution for dancers or yogis, for instance, may require identifying additional keypoints for hands and feet, so BlazePose might be a more reasonable option. If latency is critical, choose a library that runs at 30 FPS or higher, for example, MoveNet.
— Why aren’t the standard datasets lying at the base of most models enough for an accurately performing solution? What data should I then use to further train and test the model?
Kirill: A well-performing human pose estimation model should be trained on the data that is authentic and representative of reality. The truth is that even the most expansive datasets often lack diversity and fail to yield reliable outcomes in real-life settings. That’s what we faced when developing the human pose estimation component for our client’s fitness mirror.
To mitigate the issue, we had to retrain the model on additional video footage filmed specifically to reflect the surroundings the mirror will be used in. So, we compiled a custom dataset of videos featuring people of different heights, body types, and skin colors exercising in various settings—from poorly lit rooms to spacious fitness studios—filmed at a particular angle. That helped us significantly increase the accuracy of the model.
— Are there any other easy-to-overlook issues that still influence the accuracy of human pose estimation?
Kirill: Focusing on the innards of deep learning, development teams may fail to pay due attention to the cameras. Make sure that the characteristics of the camera used for filming training data (including its positioning, frame size, frame rate, shooting angle, and whether it shoots statically or dynamically) match those of the cameras real users might employ.
Let's recap
So, you are considering implementing a solution with human pose estimation features. Here are vital things to keep in mind to develop a winning application:
- Make sure that the strategy you choose supports your vision. Go custom if you're developing a solution for a specific, non-trivial task. Alternatively, consider building your future solution on a readily available library, either open-source or proprietary
- When choosing a human pose estimation library, pay particular attention to the number of keypoints the offered model can recognize and to its processing speed, especially if your future application is latency-critical
- Whether you decide to develop a human pose estimation application from scratch or opt for library-based development, you will have to train the model further to guarantee it performs accurately
- When training the model, use data that is as diverse and representative of reality as possible
- At the very start of the project, make sure you have enough data to test the model