Alexa vs. Siri
Brian: Who are you talking to these days? Siri? Alexa? Or both?
Quinn: Siri only…I unplugged our Alexa recently. It was a combination of constant false prompts (something my wife, Jennifer, says sounds like “Alexa”), and the thought of audio transcripts from our apartment being stored on a server somewhere.
Brian: Now I’m curious about what word Alexa is getting confused by, but let’s not get distracted. I also talk to Siri, but for me it’s more about seeing confirmation on a phone or watch screen that she understands me. I like seeing the results of my query, not just hearing them. Perhaps it’s because I’m an interface designer—it just makes me feel good.
Quinn: My interactions with Siri are more productive overall because they’re in the context of an app which, while virtual, is still more tangible. I can hold it in my hand and physically interact with it as part of the conversation.
Brian: It sounds like we’re both pro Siri over Alexa. Who do you think is going to win the home? I’m stubbornly holding on to cable, and the voice search on my Xfinity remote is incredibly accurate — it even picks up my voice at a whisper. I can’t remember the last time I actually used the keypad.
Quinn: Wait, what’s cable? Similar to your TV experience, I love talking to my Apple TV remote. The TV just gives you a solid, holistic interface. So many things we want to do on the TV are decoupled from complex finger gestures or geolocation. It’s a simpler interface overall, but small verbal interactions lead to big screen effects, which feels powerful.
Brian: Yeah, there’s something very powerful about your voice controlling data display. That’s why I think Siri (or Google) is going to win over Alexa. Alexa’s hardware is super-limited and doesn’t do browsing particularly well. She’s great for specific tasks like “What’s the temperature?” but she’s woefully unprepared to help me figure out what I want to watch or discover something new. I think the future of CUI will be some kind of hybrid interface between voice and touch, almost a seamless connection between the two.
Quinn: Right, collaboration between different types of UI when it makes sense, lone wolf at other times. Right now Alexa’s biggest limitation is that it’s in one place. Even in a 700 square foot Brooklyn apartment, she’s out of range a lot of the time. My iPhone is generally well within Siri range and, when it’s not, my wrist is almost always by my side, Apple Watch strapped on and waiting.
The Acceleration of Machine Learning
Quinn: Another factor that’s going to amp up CUI is the rapid acceleration in machine learning. The jump from today’s scripted interactions to Her-like conversations is only a matter of scale, dependent on computational power that is continuing its trajectory of increased capability at decreased cost.
Brian: Let’s talk about machine learning. I’m not a developer, but for the record, what the hell is it exactly? A prelude to Skynet?
Quinn: Nothing quite so ominous…I think. At a high level, machine learning just means a computer is doing things without being explicitly coded to do those things. The idea of teaching a computer how to understand any possible sentiment expressible by an English sentence is — let’s just say the scope of such a project would be large.
The same string of letters can have vastly different meanings depending on their context. But, if you have a few hundred million English sentences that have some connection to meaning or sentiment, then you could feed those into a big math factory and produce information that could allow a computer to understand new sentences and estimate what they mean.
Humans hoard data; that’s what they do. There are some amazing examples of leveraging information that would be impossible for humans to interpret, because of its vastness or complexity, to create practical, everyday benefits for humans.
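To make the "big math factory" idea concrete, here is a minimal sketch of one well-known approach, a naive Bayes classifier that learns word counts from labeled sentences and then estimates the sentiment of a new one. The six training sentences and the function names are hypothetical stand-ins for illustration; a production system would use far more data and, most likely, a different model.

```python
import math
from collections import Counter

# Hypothetical, tiny stand-in for "a few hundred million" labeled sentences.
TRAINING = [
    ("i love this movie", "positive"),
    ("what a great day", "positive"),
    ("i love the great outdoors", "positive"),
    ("i hate this weather", "negative"),
    ("what a terrible movie", "negative"),
    ("this day is terrible", "negative"),
]


def train(examples):
    """Count how often each word appears under each label."""
    word_counts = {}
    label_counts = Counter()
    for sentence, label in examples:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(sentence.split())
    return word_counts, label_counts


def classify(sentence, word_counts, label_counts):
    """Return the label with the highest naive Bayes log-probability."""
    vocab = {word for counts in word_counts.values() for word in counts}
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        score = math.log(label_counts[label] / total)
        denom = sum(counts.values()) + len(vocab)
        for word in sentence.split():
            # Add-one smoothing so unseen words do not zero out the score.
            score += math.log((counts[word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


word_counts, label_counts = train(TRAINING)
print(classify("i love this day", word_counts, label_counts))
print(classify("terrible weather", word_counts, label_counts))
```

Nobody wrote a rule saying "love" is positive; the counts from the labeled examples carry that information, which is the sense in which the computer is not explicitly coded for the task.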
Brian: Let’s say I ask Google to translate “I love the smell of napalm in the morning” into Spanish. Is that “thinking” happening on my device?
Quinn: Most of what is happening for that query is going on remotely in Google’s server plantations. It’s lightweight to send a string of text across a timezone or two where it may be routed through many different systems that classify the question and come up with the best guess of an answer. Making this work on a device is feasible, but we’re not to the point of having a mobile device ready to respond to all of that. If we knew it was constrained only to “convert this English sentence to Spanish” we’d be in the ballpark.
In general, a specific task, even one that seems amazingly complex like identifying objects in photographs, is very fast to run but incredibly time-consuming to train. After the investment of training is done, it might require on the order of milliseconds to analyze an image on current mobile hardware.
Brian: And that training is done with brute force by actual humans?
Quinn: Let’s say we want to train a computer to be able to tell if a photo contains a puppy. We would start with a really large group of photos and manually tag the ones that have puppies in them. We’d also create a black box of elegant mathematics that allows the computer to gradually form its own criteria for detecting puppies. And then we’d feed the photos into the black box.
As long as we give the computer enough good information in the first place, in the form of tagged photos, the next time we feed it photos, it will be able to guess which ones contain puppies. The larger the original group of photos, the better it will do.
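The tag-then-train loop described above can be sketched with a perceptron, one of the simplest possible "black boxes". This is only an illustration under loud assumptions: each photo is reduced to two made-up numeric features (ear floppiness and fur fluffiness), whereas real image classifiers learn from raw pixels with much larger models.

```python
# Hypothetical tagged photos: (ear_floppiness, fur_fluffiness), 1 = puppy.
TAGGED_PHOTOS = [
    ((0.9, 0.8), 1),
    ((0.8, 0.9), 1),
    ((0.7, 0.7), 1),
    ((0.1, 0.2), 0),
    ((0.2, 0.1), 0),
    ((0.3, 0.3), 0),
]


def predict(weights, bias, features):
    """Guess 1 (puppy) if the weighted sum of features is positive."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score > 0 else 0


def train(examples, epochs=20, learning_rate=0.1):
    """Nudge the weights after every wrong guess (the perceptron rule)."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for features, label in examples:
            error = label - predict(weights, bias, features)
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias += learning_rate * error
    return weights, bias


weights, bias = train(TAGGED_PHOTOS)

# New, untagged photos the model has never seen.
print(predict(weights, bias, (0.85, 0.75)))
print(predict(weights, bias, (0.15, 0.25)))
```

The criteria for "puppy" end up stored in `weights` and `bias`, numbers the program arrived at by correcting its own mistakes rather than rules a person typed in.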
Brian: What about Siri’s trick of finding a specific quote in a movie? Same process?
Quinn: Finding a quote in a movie is more of a coding sleight of hand. All of a movie’s dialog ships with it as closed captioning information, along with timing information for where each line falls in the video. Finding the quote is a text search, and the timing tells you where to jump.
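As a rough sketch of that sleight of hand, the snippet below searches a caption track for a quote and returns the timestamp to jump to. The caption text, its SubRip-style layout, and the function names are assumptions made for illustration; nothing here reflects how Siri or Apple TV actually implements the feature.

```python
import re

# Hypothetical caption track in SubRip (SRT) style; real formats vary.
CAPTIONS = """\
1
00:01:12,000 --> 00:01:15,500
Where did you park the car?

2
00:47:03,250 --> 00:47:06,000
I love the smell of napalm in the morning.
"""

# Captures the start time and the text of each caption cue.
CUE = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> \S+\n(.+?)(?:\n\n|\Z)",
    re.DOTALL,
)


def parse_cues(srt_text):
    """Return a list of (start_time_in_seconds, caption_text) pairs."""
    cues = []
    for hours, minutes, seconds, millis, text in CUE.findall(srt_text):
        start = (int(hours) * 3600 + int(minutes) * 60
                 + int(seconds) + int(millis) / 1000)
        cues.append((start, " ".join(text.split())))
    return cues


def normalize(text):
    """Lowercase and drop punctuation so matching is forgiving."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower())


def find_quote(cues, quote):
    """Return the start time of the first cue containing the quote."""
    needle = normalize(quote)
    for start, text in cues:
        if needle in normalize(text):
            return start
    return None


cues = parse_cues(CAPTIONS)
print(find_quote(cues, "smell of NAPALM"))
```

No learning is involved: the hard part, turning speech into text, was already done by whoever authored the captions.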
Quinn: There are two main ways that machine learning is enticing to me, depending on whether I want to be lazy or stupid. I could code a computer to solve a problem, but I’d need to write hundreds of millions of lines of code to do it. I’m going to call that lazy: I could do it, but nobody could afford it.
But if I’m tasked with identifying whether or not a single photo has a bird in it, I wouldn’t know where to begin. I’m too stupid to understand how to make a computer do that with traditional code. In the first case, machine learning is accomplishing something humanly possible without humans having to do it. In the second case, machine learning is doing something its creators cannot accomplish, teaching itself how to see birds.
Brian: “Seeing birds”…can we talk about what that means for designers? It feels like we’re fast approaching a place where navigation might no longer be necessary. Instead of tapping down a human-designed pathway, we’d have a brief “conversation” with the device either verbally or through a keypad and then presto, I see a subset of content that is hopefully close to what I’m looking for. Interfaces, assuming there’s a screen component, will need to be far more fluid and less hierarchical. The hamburger menu, love it or loathe it, is not long for this world.
Giving the user a sense of context in the system, that’s the real trick. The nice thing about good old-fashioned navigation is that it gives you a sense of the entirety of the system and where you are in it. CUI obscures both scale and context to a degree.
Quinn: What’s tucked behind that hamburger menu is actually more powerful and capable because of machine learning. Take NYC’s greatest challenge: finding an apartment. The variables, the speed of change, the human constraints, the sheer number of apartments…it’s very daunting. If we ask an app via tapping gestures or with our voice to find our next great place, what happens next has the greatest potential to delight or disappoint us. If machine learning can make that process easier, better, more comfortable, or certain, I’m all in. From the UI of it, though, it’s safe to say I’d never Amazon Prime myself a new apartment having never seen it.
Brian: I’m going to remind you of that next time you move.
The Fast and The Fuzzy
Brian: There will be a time, I’d imagine, when navigation exists only as a redundancy.
Quinn: That’s an interesting point. The ubiquity of search belies the underlying complexity of what happens when you hit return. I blame Google, because they do it so amazingly well. And it works so well because of a zillion coder hours of amazing effort.
Without a lot of effort, implementing search will give us hard edges, not the organic, messy fuzziness we’re used to as humans. We’ve come to expect uncannily accurate Google-like results, but we’re still used to home-grown search functions on web and mobile that give us literal, black-and-white results. Coming from a voice interface, those literal results will feel broken, since, until very recently, voice has been the sole domain of human-to-human interaction.
Brian: Back to the idea that Alexa is basically stuck in one place…that’s not much help, is it? I much prefer my watch as a voice activation tool because it’s with me almost everywhere I go. How long before we see an Alexa watch or brooch? Or a tiny drone that follows you around?
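The gap between hard-edged and fuzzy search shows up even in a toy example. The sketch below compares a literal substring search with approximate matching from Python's standard `difflib` module; the listing names and the misspelled query are hypothetical, and this is nowhere near what a large search engine does.

```python
import difflib

# Hypothetical searchable content for an apartment app.
LISTINGS = ["brooklyn loft", "manhattan studio", "queens duplex"]


def literal_search(query, items):
    """Hard edges: return only items that contain the exact query text."""
    return [item for item in items if query in item]


def fuzzy_search(query, items, cutoff=0.6):
    """Softer edges: return items that are merely similar to the query."""
    return difflib.get_close_matches(query, items, n=3, cutoff=cutoff)


query = "broklyn loft"  # a typo, or a slightly misheard voice query
print(literal_search(query, LISTINGS))
print(fuzzy_search(query, LISTINGS))
```

The literal version comes back empty for a single dropped letter, which is the kind of black-and-white result that feels broken when the query was spoken aloud.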
Quinn: 132.7 days. Give or take, of course. When you’re competing with something like an Apple Watch, the problem isn’t the hardware, which is relatively easy, but making an application development ecosystem. An ecosystem that will make designers and developers drop what they are doing and learn something new to develop applications for a relatively small user base.
Brian: Apple and Google are the obvious leaders there. Not sure Amazon will ever be able to close that gap, before the Drone Wars I mean. Do you suspect that people will have an inherently different tolerance level for conversing with machines trying to pass as humans than they do with actual humans in front of them?
Quinn: While people are more than capable of being jerks to other humans, there’s at least some level of humanity in interactions between two humans. If someone doesn’t understand us, we’re likely to cut them some slack, human to human. When a computer doesn’t understand us, I think we’re a little quicker to frustration. Until we get advanced enough on the technical side to beat humans in understanding and responding, it seems like knowing our limitations could help inform the design, copywriting, and failure handling of conversational UIs.