Live Text — Apple’s Best New Feature Explained!

Live Text is one of the absolute best new features coming to the iPhone with iOS 15, iPad with iPadOS 15, and Mac with macOS Monterey this fall. And, as of this week, it’ll also be coming to Intel Macs as well as the M1!

I’ll give you a personal example first. I had a family member message me from the UPS store the other day. They needed an address to send a package. I’d sent something to the same place a year ago, but never bothered to save the address. But I had taken a photo of the UPS label. So I searched for it, found it, tapped it, and was immediately able to swipe along, select all the text from the address, copy it, paste it into a reply, and send it back.

I’ve been doing that with screenshots all month as well, to the point where I can’t tell if I’m in Photos or in Safari anymore… until I try to tap the address bar…

At WWDC, developers were tweeting about grabbing code samples from the slides and pasting them right into projects as they were watching State of the Union or a session, live. Which blows my mind.

And those PDFs… those PDFs… where there’s no text layer, just an image burned in, the ones that were previously inaccessible on their own… yeah… just all the text that’s locked into all the images on all your Apple devices is now… unlocked. And not just clear type, but handwriting, signs, billboards, whiteboards… All text, always. Unlocked. Forever. At the OS level.

And because it’s Apple, and they have more silicon power per square nanometer than anyone else on the planet, including up to 16 neural engine cores in the most recent devices, they’re just like, f’-it, let’s do it all live.

Not process it on load, not batch it overnight, but do it all in real time, all the time, straight from the camera. So you don’t even have to take a photo to grab all the text in a scene, you can just point, tap, and select.

For the clear text, that’s not a huge challenge. OCR, or optical character recognition, has been a thing for ages. The less clear the text, though, the bigger the challenge. Start adding deformations like angles and perspective, and blur from depth of field, and the difficulty intensifies. Get into handwriting, and it moves to another plane of existence entirely.

But Apple had already been working on precisely that problem, starting with Scribble on the Apple Watch, which let you draw out characters to reply to text messages. And not in an old-school Palm Graffiti style where we humans had to adapt to the limitations of the machines, but in a completely human style where the machines had to learn to fully adapt to us.

The bigger leap forward though was Scribble coming to the Apple Pencil and iPad last year. With that version, Apple was taking anything you wrote, converting internally to text, and then making it selectable and actionable within iPadOS.

It’s trained through machine learning as well.

If you’re not familiar with that process, the best way I can describe machine learning is… think Tinder for bots. You feed them options and they swipe yes, no, no, no, yes, yes, yes, no, yes, no, hotdog!

It’s not like programming in the traditional sense. It’s more like training a pet. Which, when I first heard the process described to me back with the introduction of Face ID, was amazing… and terrifying.

Because you get into this whole thing of adversarial neural networks, where you have one… Batman-type hero algorithm trying to get better and better at the yes/no swiping and another, Joker-type villain algorithm trying to fool it. And they just battle away inside the machines, with no one really knowing what they’re doing anymore, just that they’re evolving and getting better and better at it. Just, Hunt the Dark Knight or the Killing Joke deep inside there, constantly. It’s so cool and so damn scary.

But, I digress. First for Scribble and now for Live Text, specifically, Apple fed the machine learning models a ton of handwriting samples. Trained them on as much as they could. Then they took those samples and deformed them. Angled them. Curved them. Skewed them. Broke them. And then they fed them again. And then deformed them again. And fed them again. Over and over again. Until the neural network could identify a wide enough variety of handwriting, accurately enough, for Apple to consider it baked enough to ship, at least for the beta. No doubt it’ll continue to improve over time.
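If you want a feel for what that deform-and-feed-again loop looks like, here’s a minimal sketch in Python. To be clear, this is not Apple’s pipeline. The function names (deform, augment), the deformation ranges, and the little three-point “L” sample are all made up for illustration. It just takes one handwriting stroke, stored as a list of (x, y) points, and spits out angled, skewed, and resized copies that could be added to a training set.

```python
import math
import random

def deform(stroke, angle_deg=0.0, skew=0.0, scale=1.0):
    """Skew, rotate, and scale a stroke (a list of (x, y) points) about the origin."""
    a = math.radians(angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    out = []
    for x, y in stroke:
        x_skewed = x + skew * y              # horizontal shear: slants the writing
        xr = cos_a * x_skewed - sin_a * y    # standard 2D rotation
        yr = sin_a * x_skewed + cos_a * y
        out.append((scale * xr, scale * yr))
    return out

def augment(stroke, copies, rng):
    """Return `copies` randomly deformed variants of one stroke."""
    variants = []
    for _ in range(copies):
        variants.append(deform(
            stroke,
            angle_deg=rng.uniform(-15, 15),  # assumed range, purely illustrative
            skew=rng.uniform(-0.3, 0.3),     # assumed range, purely illustrative
            scale=rng.uniform(0.8, 1.2),     # assumed range, purely illustrative
        ))
    return variants

# Hypothetical sample: a rough letter "L" drawn as three points.
sample = [(0.0, 2.0), (0.0, 0.0), (1.0, 0.0)]
rng = random.Random(15)  # fixed seed so the run is repeatable
training_set = [sample] + augment(sample, copies=4, rng=rng)
print(len(training_set))  # 5: the original plus four deformed copies
```

A real system would be working on images or far richer stroke data, and with many more kinds of deformation, but the idea is the same: one real sample becomes many slightly broken ones, and the model has to learn to read them all.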

And because Apple has these neural engines in so many devices now, they can run these models on-device. That means there’s very little impact on anything else the system may be doing at the time, because the neural engine takes the load off the CPU and GPU, and none of the text is ever sent to Apple’s servers or operated on in the cloud. Which is exactly the kind of privacy-by-design model Apple’s been bludgeoning the industry with as a competitive advantage for the last few years.

And, because the M1 basically brought Apple’s iPhone and iPad silicon to the Mac — the M1 is like an A14 on Hulk serum, or A14 is like an M1 jr., however you want to look at it — the whole thing just works on M1 Macs as well.

And while I’ll never say anything in engineering is free or trivial, the scalable architecture Apple’s been building out for the last many years does now mean it’s much easier to bring iOS and iPadOS features to the Mac, day and date. Something Apple has seriously struggled with in the past.

At least to the M1 and future M-series Macs. Which brings us back to the Intel announcement this week.

When Apple initially listed Live Text as M1 only, Intel owners got mad. Nobody likes to feel left out, and Mac owners in particular… they’ll cut you.

So, Apple’s gone back and spent some engineering time bringing the feature over to Intel as well. At least in a functional if not exact way.

What I mean by that is… Apple is typically… completely overzealous, like all Muad’Dib, about doing things in real time. Have silicon, will wicked flex.

But unlike M1, Intel Macs just don’t have neural engine cores. Even the ones with T2 chips, because those are basically A10 chips from just before Apple silicon went… bionic. And that real-time rule is especially strict for almost anything with the camera. Apple wants it done in real time so it feels like a real camera, not like a filter being applied after the fact.

But Macs don’t have the same kinds of camera systems as iPhones or iPads. So, Apple’s cutting that Gordian Knot by relaxing their real-time rule for Intel Macs, and pushing Live Text off to the GPU.

Because it’s not doing the camera part, and just operating on the text opportunistically, you’ll probably never notice a delay or any overhead on any other process on an Intel Mac. You’ll just get something almost indistinguishable from the M1.

Everybody wins, including the engineers pulled off whatever else they were doing to push this through in time for the latest beta.