LINK Gestural and Cinematic Overview

Input Styles

Keyboard : QWERTY keyboards and simpler TV-style remotes with cursor keys both have a useful role for data entry. We expect voice to replace them more and more over time.

Voice : Voice commands are less explicit than a keyboard, and preparing suitable command sets, potentially in multiple languages, can still be a challenge. As natural language processing and voice recognition improve, voice will become more flexible as an input type. Combining voice with a pointer input like gaze tracking or hand tracking makes a powerful combination, as voice commands can now be given a context on which to act.

This branches out into voice recognition, natural language processing and agents. The idea of an agent is to provide a user the ability to have a conversation in a specific subject domain and be able to complete tasks for the user. An agent in a car might support route finding or phone dialling.

Mouse : This is one of the most accurate pointing devices and has evolved to include secondary inputs like a right button and a wheel. Clicking and dragging form the basics of most interfaces and are considered the core interactions for pointing inputs. Another useful feature of a mouse is the ability to move over items without clicking. This ‘hover/inspection’ capability can be used to bring up information indicating what is under the cursor and to preview what the result of a click will be through hints like the mouse cursor icon.

Touch, Multi Touch & Near Touch : Touch and multi touch form the basic click/tap and drag function for most interfaces. Near touch improves this with hover/inspect capability. Measuring the area of the fingertip on the touch surface or direct pressure information can be used for enveloping. One challenge with touch is that generally all touch points are stateless and you do not know which hand or user they come from. This means all touches must be treated as equal.

Pen : Pen is generally more accurate than touch for positioning due to the smaller tip. It will often provide hover/inspect capability and pressure for enveloping. Some pens also provide tilt information. Pen and touch will become a very important combination, because touch can be used to manipulate (pan, scale or rotate) and pen can act (cut or draw). Shape and handwriting recognition can add value to sketching or filling out forms.

3D input : Consumer systems generally have only one sensor, so they generate 3D data from that single point of view. The sensor volume is a trapezoidal frustum starting some initial distance from the sensor and increasing in area until it reaches a maximum distance. The accuracy, noise level and type of information vary based on the technology used. There are three major configurations of 3D input.

1) Full body : Camera pointing at the full body from a distance, narrow field of view, good for detecting a simple skeleton, head and hand position.

2) Close range torso : Camera pointing at the user's torso, wide angle, good for detecting head direction, finger positions and hand poses.

3) Close range hands : Camera pointing up at the user's hands, wide angle, good for detecting accurate finger positions, angles, hand poses and a stylus.


Option 1 struggles to provide full pointing capability because, although the position of a hand/cursor can be determined with precision, there is not enough information to reliably determine the finger pose for detecting the equivalent of a click/tap. The workaround for this is to use x/y/z information only. For example, the alternatives to a click/tap in order of usability are a) dwell on a button, b) a vertical or horizontal swipe over a button, or c) z push towards the camera on a button.

Dragging is not easily supported, so scrolling and panning are usually managed by moving to the edges of an area and dwelling.

Option 2 can detect hand poses so it is simple and natural to use the spreading/closing of the hand. This means it can provide full pointer support with an open hand being a hover. The closing hand is a pointer down, a moving closed hand is a drag and an opening hand is a pointer up. Scrolling and panning can also benefit from the concept of moving to the edges of an area and dwelling.
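
The open/closed hand mapping above can be sketched as a small state machine. This is only an illustration: the function name, the event names and the per-frame input format (a closed flag plus a position) are assumptions made for the example and are not part of any specific sensor API.

```python
from typing import List, Tuple

Frame = Tuple[bool, Tuple[float, float]]  # (hand_is_closed, (x, y)), hypothetical format


def hand_to_pointer_events(frames: List[Frame]) -> List[str]:
    """Translate per-frame hand poses into hover/down/drag/up pointer events."""
    events = []
    was_closed = False
    last_pos = None
    for is_closed, pos in frames:
        if is_closed and not was_closed:
            events.append("pointer_down")   # hand has just closed
        elif is_closed and was_closed:
            if pos != last_pos:
                events.append("drag")       # closed hand that moved
        elif was_closed:
            events.append("pointer_up")     # hand has just opened
        else:
            events.append("hover")          # open hand
        was_closed = is_closed
        last_pos = pos
    return events


frames = [
    (False, (0, 0)),  # open hand
    (True, (0, 0)),   # hand closes
    (True, (5, 0)),   # closed hand moves
    (True, (5, 0)),   # closed hand held still, no event
    (False, (5, 0)),  # hand opens
]
print(hand_to_pointer_events(frames))
```

A real system would probably smooth the pose signal over several frames before feeding it to a mapping like this, since a single misdetected frame would otherwise produce a spurious pointer up.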

Option 3 can detect individual fingers in some detail. This means that while you could use the hand just like option 2, you can go further and use each finger.

This works by exploiting the angle of an individual finger to determine its up or down state. For example, a hand and fingers held flat/parallel over the sensor will have the fingers all in hover/inspect mode (shown as hollow ellipses). When a finger is pressed down at an angle relative to the palm it will be considered active (the ellipse becomes more opaque to indicate it is active/down).
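
As a rough sketch of the angle test described above, the finger can be compared against the palm and treated as down once it dips far enough below it. The function name, the pitch-angle inputs and the 20 degree threshold are all assumptions chosen for illustration.

```python
def finger_state(finger_pitch_deg: float, palm_pitch_deg: float,
                 down_threshold_deg: float = 20.0) -> str:
    """Return 'down' when the finger bends below the palm by the threshold, else 'hover'."""
    # Positive bend means the finger points further down than the palm.
    bend = palm_pitch_deg - finger_pitch_deg
    return "down" if bend >= down_threshold_deg else "hover"


print(finger_state(0.0, 0.0))     # flat hand: hover
print(finger_state(-30.0, 0.0))   # finger pressed 30 degrees below the palm: down
```

Measuring the finger relative to the palm, rather than relative to the sensor, means a tilted but flat hand still reads as hover.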


While there is no surface to press against to generate a variable input for enveloping there is tilt and motion information from the finger. So the velocity or angular velocity can be used to determine how ‘aggressive’ this down and up motion was and provide modulation.
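
One possible way to turn the finger motion into a modulation value is to normalise its angular velocity into the range 0 to 1. The function name and the 600 degrees per second ceiling are assumptions for this sketch, not measured values.

```python
def modulation_from_angular_velocity(angle_start_deg: float, angle_end_deg: float,
                                     duration_s: float,
                                     max_deg_per_s: float = 600.0) -> float:
    """Map how fast the finger angle changed to a modulation value in [0, 1]."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    speed = abs(angle_end_deg - angle_start_deg) / duration_s  # degrees per second
    return min(speed / max_deg_per_s, 1.0)                     # clamp aggressive motions


# A 30 degree press over a quarter of a second is a fairly gentle motion.
print(modulation_from_angular_velocity(0.0, -30.0, 0.25))
```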

Because the technique used to identify the position of the fingertip measures its dimensions and angle, a stylus can be determined as different from a finger. This allows for pen style interactions.

Basic gesture terms


Click/Tap : A tap is only triggered when a pointer down and a pointer up occur with movement below our default threshold (e.g. 5 mm in any direction).
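
A minimal sketch of this test, assuming the pointer path between down and up is available as a list of (x, y) positions in the same units as the threshold. The function name and input format are hypothetical.

```python
import math
from typing import List, Tuple


def is_tap(path: List[Tuple[float, float]], threshold: float = 5.0) -> bool:
    """True if every point from pointer down to pointer up stays within the threshold."""
    x0, y0 = path[0]  # pointer down position
    return all(math.hypot(x - x0, y - y0) < threshold for x, y in path)


print(is_tap([(0, 0), (1, 1), (2, 0)]))    # small jitter
print(is_tap([(0, 0), (10, 0), (0, 0)]))   # wandered away and came back
```

Checking the whole path, rather than only the down and up positions, rejects a pointer that travelled away and returned to where it started.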

Release : Similar to a tap but allows the pointer to move anywhere as long as the pointer up occurs on the original object.

Drag : This is where the pointer is pressed down and then moved. Drag events should be triggered immediately and flicks calculated as a secondary event. When the user releases the pointer the drag stops. In some cases physics and hysteresis may come into play to continue the motion under momentum, or magnetic snapping may be applied.

Hover : This is where the pointer is moved but not pressed. Hover events provide useful inspection capability to give users feedback on the purpose of the item under the pointer. Additional information may only be displayed after dwelling for a set period of time.

Pointerlock : This is a behaviour rather than a gesture. It describes how some objects, when manipulated, appear physically locked to the pointer until the pointer is released. The object is then free to spin, move or slow down based on the velocity of the user's gesture. Reaching out to press the object again will stop its motion as it locks back onto the pointer.

Flick : This is where the user moves the pointer in one direction. A flick is detected when the movement is far enough (say 30 pixels) and fast enough (say 30 pixels in 200ms). The event is then raised along with the direction of the flick. This is usually followed by an animation of the object succeeding or failing to be flicked, depending on the application.
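
The distance and speed thresholds can be combined into one check by measuring how far the pointer travelled within the final time window before release. This is a sketch under assumptions: the sample format (time in ms, x, y), the function name and the four direction labels are invented for the example, and y is assumed to grow downwards as in most screen coordinate systems.

```python
import math
from typing import List, Optional, Tuple

Sample = Tuple[float, float, float]  # (t_ms, x, y), oldest first


def detect_flick(samples: List[Sample], min_distance: float = 30.0,
                 window_ms: float = 200.0) -> Optional[str]:
    """Return the flick direction, or None if the motion was too short or too slow."""
    if not samples:
        return None
    t_end, x_end, y_end = samples[-1]
    # Only consider motion inside the time window that ends at the last sample.
    recent = [s for s in samples if t_end - s[0] <= window_ms]
    _, x0, y0 = recent[0]
    dx, dy = x_end - x0, y_end - y0
    if math.hypot(dx, dy) < min_distance:
        return None
    if abs(dx) >= abs(dy):
        return "right" if dx > 0 else "left"
    return "down" if dy > 0 else "up"


print(detect_flick([(0, 0, 0), (100, 20, 0), (150, 50, 2)]))   # fast move to the right
print(detect_flick([(0, 0, 0), (500, 25, 0), (1000, 50, 0)]))  # same distance, too slow
```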

Pinch : This can generate Resize, Rotate and Pan actions. These can be performed with any two fingers and combined or for some interfaces be measured separately. These should animate the object being manipulated directly. Both contacts being on the same object and within a threshold of proximity will ensure the gesture was intentional.
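
As an illustration of how two contacts can yield all three actions at once, the sketch below derives scale from the change in distance between the contacts, rotation from the change in angle of the line joining them, and pan from the movement of their midpoint. The function name and tuple-based point format are assumptions for the example.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def pinch_transform(a0: Point, b0: Point, a1: Point, b1: Point):
    """Return (scale, rotation_deg, pan) from start contacts a0, b0 to current a1, b1."""
    v0 = (b0[0] - a0[0], b0[1] - a0[1])  # vector between contacts at the start
    v1 = (b1[0] - a1[0], b1[1] - a1[1])  # vector between contacts now
    d0 = math.hypot(*v0)
    if d0 == 0:
        raise ValueError("start contacts must not coincide")
    scale = math.hypot(*v1) / d0
    rotation = math.degrees(math.atan2(v1[1], v1[0]) - math.atan2(v0[1], v0[0]))
    rotation = (rotation + 180.0) % 360.0 - 180.0  # wrap into [-180, 180)
    mid0 = ((a0[0] + b0[0]) / 2, (a0[1] + b0[1]) / 2)
    mid1 = ((a1[0] + b1[0]) / 2, (a1[1] + b1[1]) / 2)
    pan = (mid1[0] - mid0[0], mid1[1] - mid0[1])
    return scale, rotation, pan


# Contacts start 10 units apart horizontally and end 20 units apart vertically.
print(pinch_transform((0, 0), (10, 0), (0, 0), (0, 20)))
```

An interface that measures the actions separately could simply ignore the components it does not need, for example using only the scale value for a zoom-only view.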

Dwell / Right click : This is the process of holding the pointer down in one spot below the minimum movement threshold (e.g. 5 pixels in any direction). After a set time (say 500ms) a visual affordance must be displayed to indicate a dwell is being measured. This affordance should show the progress (say for 1000ms) until the dwell time is reached without triggering the movement threshold. Now the action or menu can be shown. The right click is used with a mouse and some pen systems to trigger the secondary menu.
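
The timing above can be sketched as a function of how long the pointer has been held and how far it has moved. This assumes the progress period starts after the initial delay, giving a total of 1500ms with the example values. The function name and the state labels are hypothetical.

```python
from typing import Tuple


def dwell_state(elapsed_ms: float, moved: float,
                affordance_delay_ms: float = 500.0,
                progress_ms: float = 1000.0,
                move_threshold: float = 5.0) -> Tuple[str, float]:
    """Return (state, progress) for a pointer held down for elapsed_ms."""
    if moved >= move_threshold:
        return ("cancelled", 0.0)   # movement threshold was triggered
    if elapsed_ms < affordance_delay_ms:
        return ("waiting", 0.0)     # affordance not yet shown
    progress = (elapsed_ms - affordance_delay_ms) / progress_ms
    if progress >= 1.0:
        return ("triggered", 1.0)   # show the action or menu
    return ("progress", progress)   # draw the affordance at this fill level


for t in (200, 1000, 1500):
    print(t, dwell_state(t, moved=1.0))
```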

Double Tap : This can be performed with a mouse, but is much harder with pen, touch or 3D input. To avoid latency it is recommended to raise the initial tap event immediately and then, if a second tap is detected within the threshold (say 300ms), to raise a double tap. This is best used as a shortcut where the first tap selects and the second executes the default action.
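
A small sketch of this approach, where the first tap is raised straight away and a second tap inside the threshold is promoted to a double tap. The function name and event labels are assumptions, and the sketch looks at timing only, ignoring how close together the two taps land.

```python
from typing import List, Tuple


def tap_events(tap_times_ms: List[float],
               double_tap_ms: float = 300.0) -> List[Tuple[float, str]]:
    """Turn a sequence of tap timestamps into tap and double_tap events."""
    events = []
    last_tap = None
    for t in tap_times_ms:
        if last_tap is not None and t - last_tap <= double_tap_ms:
            events.append((t, "double_tap"))
            last_tap = None            # a third quick tap starts a new sequence
        else:
            events.append((t, "tap"))  # raised immediately, with no waiting period
            last_tap = t
    return events


print(tap_events([0, 200, 1000]))
```

Because the first tap is never delayed, the selected item responds at once, and the double tap simply adds the default action on top of the selection.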

Multi User

One important concept is that of multiple users, not just multi touch. As UMAJIN runs as a simulation with an event driven core, multiple users can be interacting at the same time.

For example, in a photo app users could pick up, resize and rotate photos, with perhaps the ability to drop photos into a favourites folder. The challenge with multi user is to minimize or remove modal interfaces. In this scenario, if the first user selects the upload button for the favourites folder then this should trigger an asynchronous upload progress bar. This is then animated without stopping the second user resizing the photo they are looking at.
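
The non-modal behaviour in this scenario can be illustrated with two cooperative tasks sharing one event loop. This is a generic Python asyncio sketch, not UMAJIN code: the task names, the shared log and the zero-length sleeps standing in for real upload and input work are all assumptions.

```python
import asyncio
from typing import List, Tuple


async def upload_favourites(log: List[Tuple[str, float]], steps: int = 4) -> None:
    """First user's upload: reports progress without blocking anyone else."""
    for i in range(1, steps + 1):
        await asyncio.sleep(0)           # stand-in for a chunk of network I/O
        log.append(("upload_progress", i / steps))


async def resize_photo(log: List[Tuple[str, float]], steps: int = 4) -> None:
    """Second user's resize gesture, processed while the upload is running."""
    for i in range(1, steps + 1):
        await asyncio.sleep(0)           # stand-in for waiting on the next input event
        log.append(("resize", 1.0 + 0.1 * i))


async def main() -> List[Tuple[str, float]]:
    log: List[Tuple[str, float]] = []
    await asyncio.gather(upload_favourites(log), resize_photo(log))
    return log


event_log = asyncio.run(main())
print(event_log)
```

The two kinds of event end up interleaved in the log, which is the property that matters here: the progress bar advances while the resize continues.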


“Avoid unnecessary reporting. Don’t use dialogs to report normalcy. Avoid blank slates. Ask for forgiveness, not permission. Differentiate between command and configuration. Provide choices; don’t ask questions. Hide the ejector seat levers. Optimize for responsiveness; accommodate latency. Never bend your interface to fit a metaphor. Take things away until the design breaks, then put that last thing back in. Visually show what; textually tell which.”
Alan Cooper, “About Face”