Optimal Interface Part 1: Input

This article is posted in conjunction with Episode 93 of Pragmatic.

I’ve been fortunate in recent years to have tried the vast majority of consumer user interfaces and also the software running on each platform that’s widely regarded as best in class for each interface. I’ve written previously about going Back To The Mac and spoken about using a Microsoft Surface Pro and even tried going Phoneless with just an Apple Watch.

One aspect of my job has been user interface design, conceptualisation and controls. In this series of posts I’d like to explore inputs, outputs and devices in turn, looking at what has worked well and why I think that is, as well as what the next inflection points might be.

Part 1: Input

Input from a person to a device has to take a form that the person can physically produce, which limits it to one of the following mechanisms:

Sound
Touch
Movement
Neural

We shall exclude attempts to convey meaningful information via smell, by projecting a scent of some kind, since that’s not a trick most people can do, and likewise for taste.

Sound

The first popular device to take control inputs from sound was the Clapper: “Clap on, Clap off” to turn lights on and off. Spoken word has proven significantly more difficult, with many influencing factors: local accents, dialects, languages, speaking speeds, slurring, variable speech volume and, most difficult of all, context. The earliest effective consumer products came in the early 1990s from Dragon Dictate, which used an algorithmic approach that required training to improve the speed and accuracy of recognition. Algorithmic techniques ultimately plateaued until machine learning, utilising neural networks, finally started to improve accuracy through training on common language.

Context is more complex because in human conversation we infer much from previous sentences spanning minutes or even hours. Tracking context through speech input requires consistently high recognition accuracy and the ability to associate contexts over long periods of time. Speech recognition must also be consistently reliable and faster than other input methods or people will not use it. Sound commands are poorly suited to scenarios where discretion is advised, and to noisy environments where isolating a speaker is difficult even in human conversation, let alone for speech detection by software.

Despite improvements, the Apple Siri ‘feature’ remains inaccurate and generally slow to respond. Amazon Alexa, Google Assistant and Microsoft Cortana also offer varying degrees of accuracy, with heavier use of machine learning in the cloud providing the best results to date at the expense of personal privacy. As computational power, response time and accuracy improve, sound will become the preferred input method for entering long-form text in draft (once it keeps up with the average human speaking rate of about 150 words per minute), since without additional training on a physical keyboard it is faster and more convenient. Once these things improve it will also be the preferred method for short commands, such as turning home automation devices on or off, in scenarios where no physical device is immediately accessible.

Touch

Touch involves anything that a person can physically push, tap, slide across or turn, and encompasses everything from dials and mechanical sliders to keyboards and touch screens. Individual buttons are best for dedicated inputs, where each button represents a single command or a set of very similar commands; a keyboard is a common example of a grid of such buttons.

Broadly, touch can be grouped into either direct or indirect. Examples of direct touch include light pens and resistive and capacitive touch screens. Light pens had to be held, were tethered and slow, and weren’t very accurate. Resistive touchscreens still needed a stylus to be accurate; some people could use the edge of a fingernail, but the centre of a finger wasn’t very accurate. It was also not possible to detect more than a single touch point at a time. Capacitive touch had better finger accuracy and could detect multiple fingers simultaneously, which allowed for pinch and other multi-finger gestures. Although no stylus was needed, one was still recommended to achieve high levels of accuracy.

Indirect inputs include keyboards and cursor positioning devices such as mice, trackpads, trackballs and positioning sticks. Keyboards mimicked typewriter keyboards and have remained essentially unchanged from the first terminal computers through to personal computers. Apart from users’ preferences for particular key-switch mechanisms, little has changed in decades.

Cursor pointing devices allow for precise cursor positioning, including the ability to “nudge” a cursor, which is not possible on a touch interface without zooming.

Hence for precision pointing, indirect methods are still more accurate than a stylus due to “nudging”. However, precision pointing is generally not a strict requirement for most users in most applications. For most tasks, non-precision pointing therefore benefits from the simplicity of direct touch, which is faster and requires no training, making direct touch the most accessible method.

For bulk text input, physical keyboards remain the fastest method, although training is necessary to achieve this. Keyboards will remain the preferred bulk text entry method until speech recognition improves, noting that the fastest recorded English typing speed on a computer is 212 wpm, set in 2005 using a Dvorak simplified keyboard layout. The average typing speed is about 41 words per minute, hence speech recognition that’s any faster than this at a high degree of accuracy will be the preferred dictation method in most use cases.
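To put those rates side by side, here is a minimal Python sketch that estimates how long a draft takes to enter at each speed. The three rates are the figures quoted in this article; the 1,000-word draft length and the function name minutes_to_enter are assumptions chosen purely for illustration, and the calculation ignores recognition errors, pauses and revision time.

```python
# Input rates in words per minute, as quoted in this article.
RATES_WPM = {
    "average typist": 41,
    "average speaking rate": 150,
    "record typist (Dvorak, 2005)": 212,
}

# Assumed draft length, for illustration only.
DRAFT_WORDS = 1000


def minutes_to_enter(words: int, wpm: float) -> float:
    """Return the minutes needed to enter `words` at a steady `wpm`."""
    if wpm <= 0:
        raise ValueError("wpm must be positive")
    return words / wpm


if __name__ == "__main__":
    for label, wpm in RATES_WPM.items():
        minutes = minutes_to_enter(DRAFT_WORDS, wpm)
        print(f"{label}: {minutes:.1f} min")
```

On these assumptions a 1,000-word draft takes roughly 24 minutes for an average typist and under 7 minutes at speaking rate, which is why accurate dictation only has to beat the average typist, not the record holder, to win for most people.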

Movement

Movement requires no physical connection of the body to the input device and includes gestures of different parts of the body. The PlayStation Move ball is one example, where the user held a device that wasn’t tethered to the machine but directly tracked their movement. Other examples are Virtual Reality systems that use handheld controllers with gyroscopes and accelerometers to track the movement of hands and arms.

The most popular natural free-standing movement tracking device so far has been the Microsoft Kinect, which was released for both the PC and the Xbox. Its movement tracking had issues differentiating backgrounds and was thrown off by people walking past, in front of or behind the people it was tracking. Room size and other obstructions also created a challenge for many users: for movement tracking to function reliably, couches, chairs and tables needed to be moved or removed to create a workable space.

This form of movement tracking is useful for individuals or small groups of people in enclosed environments with no thoroughfare, though the acquisition time for precise positioning was still too slow even with the Xbox One Kinect 2, which was discontinued in 2017. The newest development kit for the next generation of Kinect is the Azure Kinect, which was announced in February 2019.

Current technology is still extremely inaccurate, easily confused and immature, with a limited set of standalone use cases. Extremely accurate natural free-standing position tracking is unlikely to be useful as a mass input device; however, in conjunction with speech recognition it could provide vital contextual information to improve command interpretation accuracy. It also has applications in noisy environments, where an individual is isolated in front of a device such as a television and wishes to change channels with a gesture without using a physical remote control.

Neural

Brain Computer Interfaces (BCIs) allow interaction through the measurement of brain activity, usually using Electro-Encephalography (EEG). EEGs use electrodes placed on the scalp and are cheaper and less intrusive than Functional MRI (fMRI), which tracks blood flow through different parts of the brain; whilst fMRI is more accurate, it is not straightforward.

In the mid-1990s the first neuroprosthetic devices for humans became available, but they took a great deal of concentration and the results were extremely difficult to reliably repeat. By concentrating intensely on a set thought it was possible to nudge a cursor on the screen in a certain direction, however this wasn’t very useful. In June 2004 Matthew Nagle received the first implant of Cyberkinetics’ BrainGate to overcome some of the effects of tetraplegia by stimulating the nervous system. In 2016 Elon Musk invested $27M USD in a company called Neuralink, which is developing a “neural lace” to interface the brain with a computer system.

It remains extremely dangerous to interface directly with the brain. However, it is necessary to explore if the technology is to become useful in future, since the amount of data we can reliably extract from sensors sitting on our scalp is very limited due to noise and signal loss through the skull. We therefore need implants that connect directly with neurones before we can get data in and out at a rate useful enough to overtake our conventional senses.

Guessing how far off that inflection point is at this moment is extremely difficult. That said, when it comes it will come very quickly: some people will decide to have chips implanted, and that will allow them to out-perform other people at certain tasks. Even once the technology becomes safer and more affordable, there will always be ‘unenhanced’ people who choose not to have implants, and mass adoption might still take a long time depending on the rewards versus the risks.

Despite many claims, no one really knows exactly how fast a human can think. Guesstimates, expressed in terms of speech, are somewhere between 1,000 and 3,000 words per minute, but this range is very broad. In terms of writing as a task, there’s the word-thinking rate, but when you’re writing something conventionally you will also be reading back, reviewing, revising and rewriting, as these are key parts of the creative process; otherwise what you end up with is most likely either gibberish or just not worth publishing.

Beyond that, there’s an assumption that descrambling our thoughts coherently is possible. More than likely some training will be necessary: in the same way that we currently have to rephrase our words for a machine to interpret a command, re-ordering our thinking might be required, initially at least, to get a usable result. On top of this, multi-lingual people may think words in a specific language or mix languages in their thinking, and a neural interface that could even begin to interpret that is a very long way off, most likely not in our lifetimes.

More in Part 2

Next we’ll look at outputs.

TechDistortion