Hand tracking and gesture recognition are easily confused, so it’s worth starting with a brief explanation of what each one is and how they differ. Both let users interact with computers using their hands, without touching a screen or holding a controller. Some hand tracking or gesture recognition systems rely on markers, gloves or worn sensors, but the ideal system doesn’t require the user to wear or touch anything. These systems have applications in medical interfaces, for example allowing a surgeon to view and interact with a screen without touching it, and in augmented reality headsets, where the wearer needs to interact with a digital overlay on the real world.
While there is some overlap, gesture recognition systems and hand tracking systems have one fundamental difference. In most cases, a gesture recognition system recognizes a specific set of gestures and only those gestures: for example, a thumbs up to indicate “ok” or a click, or a flat hand to indicate “stop.” A gesture-based system is usually limited to a small number of gestures, since people have a hard time remembering more than a few, but it will usually recognize that limited set of hand poses fairly robustly.
A hand tracking system, however, can usually support a wider and more open-ended range of interactions, since it tracks either the volume of the hand or the positions of the individual joints and fingers. This allows for a theoretically unlimited number of interactions with digital objects, just as we use our hands to interact with objects in the real world. The trade-off is that hand tracking can run into problems with occlusion and ambiguity: what does the system do when your fist is closed? Where are your fingers when one hand is behind the other? Is that your left hand palm up, or your right hand palm down? Because gesture systems are trained on a few specific gestures, they don’t have the same problems, but they also don’t have the same flexibility.
As an analogy for the difference between gesture recognition and hand tracking, think of the difference between a phone with a keypad and a modern touchscreen. The keypad has a limited number of defined buttons that perform a specific set of interactions very well. A touchscreen can perform an unlimited number of actions, but its more nuanced interactions make it easier to hit the wrong key or letter. Neither system is inherently better than the other; what matters is choosing the one that best fits your specific use case and user needs.
In the broadest sense, most modern hand tracking solutions use machine learning to build a robust system for detecting hand positions. In general, machine learning systems use known, labeled data to train a model that can then make predictions about new, similar data. For example, after training on hundreds of labeled images of cats and dogs, a system might be able to distinguish between cats and dogs with reasonably high accuracy. Training on labeled depth images of hands in a similar way makes it possible to detect finger positions with reasonably high accuracy.
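To make the idea of learning from labeled data concrete, here is a minimal sketch of a nearest-centroid classifier in Python. It is only an illustration, not the method used by any particular hand tracking product: the five-value "finger extension" features, the sample values and the labels are all hypothetical, and real systems typically train far larger models on depth images.

```python
from math import dist

def train_centroids(samples):
    """Average the feature vectors of each label.

    samples: list of (feature_vector, label) pairs.
    Returns a dict mapping each label to its mean feature vector.
    """
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def predict(centroids, features):
    """Return the label whose centroid is closest to the new sample."""
    return min(centroids, key=lambda label: dist(centroids[label], features))

# Hypothetical labeled data: one "extension" value per finger, 0 = curled, 1 = straight.
labeled = [
    ([0.9, 1.0, 0.95, 0.9, 0.85], "open_hand"),
    ([1.0, 0.9, 1.0, 0.95, 0.9], "open_hand"),
    ([0.1, 0.05, 0.1, 0.0, 0.1], "fist"),
    ([0.2, 0.1, 0.0, 0.1, 0.05], "fist"),
]

centroids = train_centroids(labeled)
# An unseen sample that resembles the open-hand examples.
print(predict(centroids, [0.8, 0.9, 0.85, 0.9, 0.8]))
```

The labeled examples play the role of the known data, and the final call stands in for predicting unknown but similar data.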
In this paper published by IEEE, the authors propose using Intel RealSense depth cameras combined with colored gloves to accurately create a dataset for hand segmentation, the crucial first step in building a hand tracking database. In the context of depth cameras, segmentation generally refers to separating a specific foreground object or objects from unimportant background elements. For example, background segmentation can be used to extract a person from their background without the need for a green screen.
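As a rough illustration of depth-based segmentation, the sketch below keeps only the pixels whose depth falls inside a chosen range. It assumes a depth frame already converted to millimeters and stored as a list of rows, with 0 meaning "no depth data"; the function name, the thresholds and the sample frame are hypothetical, and this is not the segmentation method from the paper.

```python
def segment_foreground(depth_mm, near_mm=200, far_mm=600):
    """Return a boolean mask that is True where a pixel is in the foreground.

    depth_mm: 2D list of depth values in millimeters (assumed layout).
    Pixels closer than near_mm (including 0, treated here as invalid)
    or farther than far_mm are considered background.
    """
    return [[near_mm <= d <= far_mm for d in row] for row in depth_mm]

# Hypothetical 3x3 depth frame: a hand about 0.4 m away in front of a wall at 1.5 m.
frame = [
    [0,   1500, 1480],
    [450, 430,  1510],
    [440, 420,  0],
]

mask = segment_foreground(frame)
for row in mask:
    # Print '#' for foreground pixels and '.' for background pixels.
    print("".join("#" if keep else "." for keep in row))
```

A fixed depth range only works when the hand is the closest object in view, which is one reason practical systems move on to learned segmentation.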
By using colored gloves, the authors of the paper were able to quickly and easily distinguish the left hand from the right, as well as tell individual fingers apart when they overlap or interact with each other, such as when fingers are interlaced. Their automatic annotation method reduces the need for human interaction with the data and should lead to more advanced and accurate hand tracking systems.
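One simple way to picture color-based automatic annotation is to assign each pixel to the nearest known glove color. The sketch below does that in plain Python. The palette, the part names and the distance threshold are hypothetical choices for illustration, and the paper's actual annotation pipeline may work quite differently.

```python
from math import dist

# Hypothetical glove palette: one distinct RGB color per hand part.
GLOVE_COLORS = {
    "thumb": (255, 0, 0),
    "index": (0, 255, 0),
    "middle": (0, 0, 255),
    "palm": (255, 255, 0),
}

def label_pixel(rgb, max_distance=120.0):
    """Label a pixel with the closest glove color, or 'background' if none is close."""
    name = min(GLOVE_COLORS, key=lambda part: dist(GLOVE_COLORS[part], rgb))
    return name if dist(GLOVE_COLORS[name], rgb) <= max_distance else "background"

def label_image(image):
    """Apply label_pixel to every pixel of a 2D list of RGB tuples."""
    return [[label_pixel(pixel) for pixel in row] for row in image]

# Tiny hypothetical image: slightly noisy glove colors next to a gray background.
image = [
    [(240, 20, 10), (128, 128, 128)],
    [(10, 230, 30), (250, 240, 20)],
]
print(label_image(image))
```

Because the labels come directly from the glove colors, nobody has to outline fingers by hand in each frame, which is the appeal of automatic annotation.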
For augmented reality glasses, it’s important that a user can easily interact with digital items without needing controllers or other physical interfaces. Ideally, every surface becomes a potential touchscreen and every pen could be a stylus. One of the more challenging aspects of hand tracking for augmented reality specifically is that users can still see their own physical hands, with zero latency, so any lag in the tracked hand is immediately noticeable and the estimation algorithm must be fast and low latency. In addition, since the environment and background are complete unknowns, the hand tracking system must be robust to a cluttered background.
In the paper “Real-Time Hand Model Estimation from Depth Images for Wearable Augmented Reality Glasses,” the team proposes using an Intel RealSense depth camera with an algorithm designed to perform best as part of an augmented reality headset. For example, the authors make assumptions such as the wrists always being located in the lower half of the depth image frame. With their algorithm, the authors of the paper were able to get hand tracking accuracy of 85 to 98%, depending on background objects.
There are a variety of reasons why using electromyography (EMG) sensors to track finger movement is desirable. Non-invasive, on-skin electrodes are used to register muscle activity. To date, while this is an interesting technique, it is not a very accurate one, in part because the electrodes must be placed accurately in the same position on the forearm every time, something difficult to achieve outside of laboratory conditions. In this paper, the team fixed 24 electrodes around the forearm of each test subject using 3 elastic bands with 8 electrodes on each. The experiment also included an Intel RealSense depth camera pointed at the subject’s hands as they moved through a series of predetermined motions. By combining the data from the array of sensors with the depth images as ground truth, they were able to correlate finger position with the electrical signals from the muscles, allowing them to create a public dataset for use in a variety of applications.
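The pairing of EMG signals with depth-derived ground truth can be sketched as a small regression problem. The example below fits a straight line from one EMG channel's amplitude to one finger joint angle using ordinary least squares. This is a deliberately simplified, hypothetical stand-in: the sample values are made up, only a single channel is used instead of 24, and the paper's own model is not described here.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical training pairs recorded at the same instants:
# normalized EMG amplitude from one electrode, and the finger joint
# angle in degrees measured from the depth camera (the ground truth).
emg_amplitude = [0.1, 0.2, 0.3, 0.4]
finger_angle_deg = [10.0, 30.0, 50.0, 70.0]

slope, intercept = fit_line(emg_amplitude, finger_angle_deg)

def estimate_angle(amplitude):
    """Estimate the finger angle from EMG alone, once the fit is known."""
    return slope * amplitude + intercept

print(round(estimate_angle(0.25), 1))
```

Once such a mapping has been learned, the depth camera is no longer needed at run time: the EMG signal alone drives the estimate.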
One useful application for EMG sensor-based systems is prosthetic hands. Many current systems require precise anatomical data and precise electrode placement, which limits the amount of control users have over their prostheses and restricts them to predefined grips or gestures. Again using a machine learning approach, and combining an Intel RealSense depth camera with the EMG sensors, the authors of this paper propose an alternative to conventional EMG sensor placement: an array EMG system that covers the user’s forearm, detects muscle movement deep within the forearm and tracks finger angles with more precision. As this work progresses, it could lead to increasingly nuanced prosthetic control.