M1 Pro / Max — Apple’s Intel-Crushing Silicon Power Explained!

2 Icestorm high-efficiency cores. Up to 8 Firestorm high-performance cores. Up to 32 G13 graphics cores. 16 Neural Engine cores. Up to 64 GB of unified memory and 400 GB/s of bandwidth to keep it all fed. A new display engine to drive not just one XDR display, but multiple XDR displays. A third Thunderbolt and USB controller for more I/O. A new media engine for super-fast, ultra-efficient H.264, HEVC, and ProRes encode and decode. Up to 57 billion transistors. And maybe, just maybe, our first glimpse at what’s coming next for the iMac Pro and full-on Mac Pro.

I’m Rene Ritchie, hit subscribe so you don’t miss the next video, and this… is the M1 Pro and M1 Max deep dive!

Scalable Architecture

X as in Extra

Apple’s been making ‘extra’ — as in extra but also as in just totally extra — versions of their custom chipsets almost since forever. Their first SoC, or system on a chip, was the A4 in the original iPad and iPhone 4, followed by the A5 in the iPad 2 and iPhone 4s. Now, an SoC just means most everything is integrated into the same die. So, instead of having a silicon… platter… with CPU over here, GPU over there, memory on the left, controllers on the right, you have a silicon sandwich with all the cores, all the features, all stacked together. There are a ton of advantages to this approach, which I’ll get to in a minute, but one of them is scalability. Not just generation over generation, as new architectures and processes are introduced, but even within the same generation as extra cores and extra features get added.

Enter the A5X

Where the OG A5 had dual ARM Cortex-A9 CPU cores, dual Imagination PowerVR SGX543MP2 GPU cores, and 512 MB of package-on-package RAM, the A5X kept the same CPU but escalated the GPU to a quad-core PowerVR SGX543MP4, doubled the width of the memory interface, and doubled the amount of RAM… to 1 GB… but shifted it off-package, which, I don’t know, maybe could be something again in the future…

Apple needed those extra GPU cores and that extra memory to power their first-ever tablet-sized Retina display in the iPad 3, aka The New iPad. It turned out to be only barely enough, though, and Apple ended up having to get the iPad 4 and A6X out just 6 months later. And let me know in the comments if you want to hear more about that whole story!

A6X was similar to A5X in keeping two CPU cores, though this time they weren’t ARM Cortex designs but Apple’s first custom Swift CPU cores. It also took the A6’s triple-core PowerVR SGX543MP3 GPU to a quad-core PowerVR SGX554MP4 GPU, and the memory to quad-channel.

So, A4, but no A4X. Then A5X and A6X, but no A7X. That’s right, Apple stuck with their first 64-bit chipset, the A7, in its origin forme, not altered forme, for the first iPad Air. The iPad Air 2, though, yeah, that got an A8X. Instead of dual Typhoon CPU cores, it had three. And instead of a quad-cluster customized PowerVR Series 6XT GPU, it had an octo-cluster one, again with external RAM. 2GB worth.

There was also an A9X and an A10X, the latter of which was part of the first generation to use Apple’s fused version of big.LITTLE, or performance and efficiency cores. Triple Zephyr e-cores and triple Hurricane p-cores, to be exact, along with 12 customized PowerVR GT7600 GPU cores.

No A11X, because by the time the next iPad Pro came around, Apple had fallen into less of an every-12-months and more of an every-18-months upgrade cadence. But yes, A12X. Which was the big one, because it most directly set the stage for everything that would come with M1.

Bargain Binning

A12 was Apple’s second-generation Bionic architecture which, unlike the A10’s paired Fusion architecture, could use any or all of its cores separately or together. In other words, multicore wasn’t just the e-cores or just the p-cores, it was the e-cores plus the p-cores. All the cores together, like Voltron.

4 Tempest e-cores and 2 Vortex p-cores, to be exact, along with 4 custom G11 graphics cores and 8 Neural Engine, or ANE, cores. Also, custom encode/decode blocks for H.264 and HEVC, which I’ll get to in a minute, because they become a much bigger deal with the M1 Pro and M1 Max as well. And Apple’s increasingly secret sauce: their performance and, soon, machine learning controllers.

The A12X kept the same number of e-cores but doubled the p-cores to 4, and almost doubled the GPU cores… almost. See, at the time, Apple announced 7 GPU cores on the A12X for the 2018 iPad Pro, but it turned out there were actually 8 cores on the die; Apple was only enabling 7. They didn’t start using all 8 until the A12Z, the second iteration of that SoC, for the 2020 iPad Pro.

The A12X also had 4GB of integrated RAM for most models, but 6GB for the top-tier 1 TB storage model. The A12Z, though, had 6GB for all storage tiers. And all of this, from binning to memory tiers, is something we’d start to see really play out with the M1… but especially with the M1 Pro and M1 Max.

Which, no surprise, because the A12Z also just so happens to have been the chipset they used for the Apple Silicon Mac dev kit — the iPad guts in the Mac mini case intended to help get apps ported over and ready for M1.

And yeah, there was no A13X. Even though Apple introduced the A13 for the iPhone 11 back in September of 2019, they were still perfectly happy to ship the A12Z for the iPad Pro and Dev Kit in 2020.

Same way there was no A14X… because it essentially became M1… and even though Apple introduced the A15 for the iPhone 13 back in September of this year, Apple was likewise still perfectly happy to ship the M1 Pro and M1 Max for the new MacBook Pros just one month later. And wow are they extra. Like triple X as in Extra.

M1 Pro & Max CPU

M1 is more than just an A14X with new branding, of course. It has specific silicon IP for the Mac. But that didn’t stop Apple from following up the A12Z iPad Pro with the M1 iPad Pro, because the architecture was, and is, so broadly similar between the generations. 4 e-cores, 4 p-cores, 8 GPU cores, but now 16 ANE cores instead of the 8 on the A12X and Z. And fabbed, or fabricated, on TSMC’s (Taiwan Semiconductor Manufacturing Company’s) 5 nanometer process, giving it even greater performance efficiency.

Instead of A12-generation Tempest and Vortex cores, M1 has A14-generation Icestorm and Firestorm cores for the CPU, which provided a really good balance between efficiency and performance for Apple’s initial wave of ultra-low-power Macs: the MacBook Air, 2-port MacBook Pro, Mac mini, and redesigned 24-inch iMac. A whole lineup, from iPad Pro to iMac non-Pro, ultra-long-lasting portables to ultra-low-thermal desktops. Talk about your scalable architecture.

But with M1 Pro and M1 Max, Apple wasn’t as concerned with ultra low power. What they needed to deliver was ultra high performance. So, instead of 4 e-cores, they dropped those down to just 2. Bigger batteries and adaptive refresh rate displays would offset any real differences there anyway. And then they bumped the p-cores up from 4 to 6 or 8 for the M1 Pro and a solid 8 for the M1 Max.

The 6 p-core version of the M1 Pro is a binned-down version, same as what Apple did with the A12X, and even the M1, on the GPU side. See, when monolithic chips like Apple’s SoCs come off the fab, especially on leading-edge process nodes like TSMC’s 5 nanometer, there can be defects, and some of the cores can be non-viable. If you just throw away every chip without a full set of perfectly functional cores, you end up with a lot of waste, which means low yield, low volume, and a high price per remaining unit. But by keeping the ones with only 7 out of 8 working GPU cores, or 6 out of 8 working p-cores, Apple throws away fewer chips, which means better yield, which keeps volume up and the price per unit down. Then Apple passes some of those savings on to people who are fine buying fewer cores if it costs them less money.
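To see why binning helps yield, here’s a rough sketch. It assumes, purely for illustration, that each of 8 cores on a die is independently defect-free with a hypothetical 95% probability, and it ignores defects elsewhere on the die. Real defect rates and Apple’s actual binning rules aren’t public. Under that simple binomial model, accepting dies with 7 or 6 working cores sharply raises the fraction of sellable chips.

```python
from math import comb


def prob_at_least(k_good: int, n_cores: int, p_core_ok: float) -> float:
    """Probability that at least k_good of n_cores work, assuming each core is
    independently defect-free with probability p_core_ok (binomial model)."""
    return sum(
        comb(n_cores, k) * p_core_ok**k * (1 - p_core_ok) ** (n_cores - k)
        for k in range(k_good, n_cores + 1)
    )


# Hypothetical per-core survival rate, chosen only for illustration.
P_CORE_OK = 0.95

full_only = prob_at_least(8, 8, P_CORE_OK)   # sell only perfect 8/8 dies
with_7_bin = prob_at_least(7, 8, P_CORE_OK)  # also sell 7/8 dies
with_6_bin = prob_at_least(6, 8, P_CORE_OK)  # also sell 6/8 dies

print(f"8/8 only:        {full_only:.1%}")
print(f"8/8 or 7/8:      {with_7_bin:.1%}")
print(f"8/8, 7/8 or 6/8: {with_6_bin:.1%}")
```

With these made-up numbers, only about two-thirds of dies are perfect, but accepting 7- or 6-core parts salvages most of the rest. The exact figures mean nothing here; the shape of the curve is the point.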

So, yes, both the M1 and the binned-down M1 Pro have 8 CPU cores in total. But where the M1’s 8 cores are 4 e-cores + 4 p-cores, the binned-down M1 Pro’s 8 cores are 2 e-cores + 6 p-cores.

In other words, instead of 4 Toyotas and 4 Ferraris, you’re getting 2 Toyotas and 6 Ferraris. Which is more Ferraris.

And then the regular M1 Pro and the M1 Max both have 10 cores total for the CPU. The sum of 2 e-cores and 8 p-cores, or 2 Toyotas and 8 Ferraris. Which is even more Ferraris.

Why 6 or 8 p-cores for the M1 Pro instead of 7 or 8, like the GPU cores on the regular M1? It might just come down to the realities of the fab, or it could have to do with the 8 p-cores actually being 2 clusters of 4 p-cores each. Each cluster has its own 12MB L2 cache, and each cluster can dynamically clock its cores independently, meaning a single active core in a cluster can go all the way up to 3.2GHz, two active cores cut down to 3.1GHz, and 3 or all 4 cores, down to 3GHz. Sacrificing a little serial speed for a lot of parallelism.
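Here’s a small sketch of that per-cluster clocking, using the approximate 3.2, 3.1, and 3.0 GHz figures above. The lookup table and the "core-GHz" aggregate are simplifications for illustration. Real clocks are managed dynamically by firmware, and GHz alone ignores IPC, cache, and memory effects.

```python
# Approximate p-core clock (GHz) for one cluster, keyed by how many of its
# 4 cores are active. Figures are the rounded ones quoted above.
CLUSTER_CLOCK_GHZ = {1: 3.2, 2: 3.1, 3: 3.0, 4: 3.0}


def cluster_clock(active_cores: int) -> float:
    """Clock shared by every active core in one cluster."""
    if active_cores not in CLUSTER_CLOCK_GHZ:
        raise ValueError("a p-cluster has 1 to 4 active cores")
    return CLUSTER_CLOCK_GHZ[active_cores]


def aggregate_core_ghz(active_per_cluster: list[int]) -> float:
    """Crude throughput proxy: active cores x clock, summed over clusters.
    Each cluster picks its clock independently; idle clusters contribute 0."""
    return sum(n * cluster_clock(n) for n in active_per_cluster if n > 0)


for layout in ([1, 0], [1, 1], [2, 2], [4, 4]):
    print(layout, f"{aggregate_core_ghz(layout):.1f} core-GHz")
```

Going from one core at 3.2 GHz to all 8 at 3.0 GHz gives up about 6% of per-core clock in exchange for 7.5× the aggregate core-GHz.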

The 2 e-cores are clocked at 2GHz, but get the same 4MB of L2 cache that the M1 has for its 4 e-cores. On top of all that, where the M1 has 16MB of system level cache, or SLC, the M1 Pro has 24MB and the M1 Max, 48MB.

So, each individual e-core and p-core is the same, meaning any single-core task will perform essentially the same on the M1, M1 Pro, or M1 Max. Like driving any one individual Toyota or Ferrari. But there are more p-cores even on the binned-down M1 Pro, and more still on the regular M1 Pro and the M1 Max, meaning any multi-core task will run just that much faster. Because so damn many Ferraris. And that’s not even counting the improved memory system, which I’ll get to in a minute.

And that’s the first way Apple’s M1 Pro and M1 Max feel so fast. Just the overall speed of the cores. Everything gets done faster.
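One rough way to model the single-core versus multi-core point is Amdahl’s law. That’s a general textbook model, not anything Apple has published. This sketch assumes, hypothetically, that an e-core delivers 30% of a p-core’s throughput, that the serial part of a job runs on one p-core, and that the parallel part spreads perfectly over every core.

```python
def relative_runtime(parallel_fraction: float, p_cores: int, e_cores: int,
                     e_core_speed: float = 0.3) -> float:
    """Amdahl-style runtime, normalized so one p-core running the whole job takes 1.0.
    e_core_speed is a HYPOTHETICAL e-core/p-core throughput ratio."""
    capacity = p_cores + e_core_speed * e_cores  # parallel capacity in "p-core units"
    return (1 - parallel_fraction) + parallel_fraction / capacity


CHIPS = {
    "M1 (4P+4E)": (4, 4),
    "M1 Pro binned (6P+2E)": (6, 2),
    "M1 Pro/Max (8P+2E)": (8, 2),
}

for f in (0.0, 0.9):
    for name, (p, e) in CHIPS.items():
        print(f"parallel={f:.0%}  {name:24s} runtime={relative_runtime(f, p, e):.3f}")
```

A fully serial job, with a parallel fraction of 0, takes the same time on all three configurations. A 90%-parallel job gets faster with each added p-core.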

But because they’re all still Apple cores, not actual Ferraris, and those cores have to fit in the tiny thermal envelopes of iPhones and the relatively small thermal envelopes of iPads, even the performance cores are still wildly efficient. Which is just the starkest of contrasts to the previous Intel chipsets, which just… chugged power rather than sipped it, and hit thermal max pretty much at startup, only to ramp up and down… incessantly, constantly, thereafter.

At just 30 watts, fully fired, inside the relatively roomy chassis of the MacBook Pro, the M1 Pro and M1 Max CPUs can sustain peak performance… pretty much indefinitely.

And if you’re worried about battery life, you can turn on the new Low Power Mode in macOS to maximize efficiency. Conversely, on the 16-inch MacBook Pro with M1 Max, because of the even bigger thermal envelope, you can turn on High Power Mode. That lets the fans and chips loose, so you maximize performance.

It’s a cool idea made possible precisely because M1 architecture is so cool.

M1 Pro & M1 Max GPU

Carrying on the theme, where M1 had 7 or 8 slightly tweaked A14 generation G13 GPU cores, M1 Pro has 14 or 16 of those GPU cores and M1 Max… 24 or 32. And that’s just… such a ridiculously massive escalation. To help put it in context, M1 had 16 billion transistors. M1 Pro has 33.7 billion and M1 Max… a brain-blowing 57 billion, with all those GPU cores being a significant part of that budget.

But Apple’s always leaned heavily on the GPU for everything from interface acceleration and the literal Core Graphics and Core Animation frameworks to things like the old OpenCL and the new Metal APIs.

And doing it this way, integrated rather than discrete in a laptop, turns out to be more than just an implementation detail. Especially when you’re talking SoC sandwiches rather than old-fashioned board platters.

Because Apple is feeding the CPUs with 16 to 32 GB of LPDDR5 memory on the M1 Pro and a whopping 32 to 64 GB on the M1 Max. Which, yeah, sure, isn’t anything new or novel for a MacBook Pro CPU, but because of the SoC architecture, that RAM isn’t just for the CPU. It’s a massive, unified memory pool that’s also available to all the other compute engines, including the GPU. Compare that to the traditional board architecture, where a laptop GPU might have 8 GB of dedicated VRAM if you’re lucky, 16 at the very highest end. Here the GPU can draw on a shared pool of up to 64 GB. Which is just unheard of on a laptop.

And to keep the GPU fed, Apple’s opened up the memory bandwidth. All the way up. M1 does roughly 70 GB/s. M1 Pro does 200 GB/s, and M1 Max… a jaw-dropping 400 GB/s. And you guessed it, because of the unified architecture, the CPU and other compute engines also get access to all that bandwidth, which is also unheard of. Just… unheard-of things all the way down.
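As a back-of-the-envelope illustration of those bandwidth figures, this sketch computes the best-case time to read a hypothetical 8 GB working set once at each chip’s peak bandwidth. Real sustained bandwidth is lower than peak, and the CPU, GPU, and other engines all share it.

```python
# Peak memory bandwidth in GB/s as quoted above (M1's figure is rounded).
PEAK_BANDWIDTH_GB_PER_S = {"M1": 70, "M1 Pro": 200, "M1 Max": 400}

# Hypothetical working set, e.g. textures and geometry for a big scene.
WORKING_SET_GB = 8


def sweep_time_ms(working_set_gb: float, bandwidth_gb_per_s: float) -> float:
    """Best-case time to read the whole working set once at peak bandwidth."""
    return working_set_gb / bandwidth_gb_per_s * 1000


for chip, bandwidth in PEAK_BANDWIDTH_GB_PER_S.items():
    print(f"{chip:7s} {sweep_time_ms(WORKING_SET_GB, bandwidth):6.1f} ms per full pass")
```

At peak, M1 Max could sweep that 8 GB in about 20 ms, versus roughly 114 ms on M1.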

It’s the second way M1 Pro and M1 Max feel so damn fast — the instant responsiveness afforded by that unified memory system and overall architecture. It makes the Mac feel as utterly instant as the iPad, even more utterly… instant-er now.

Also, where a company like Nvidia essentially abstracts the computer away into an interface for their CUDA cores, Apple’s Metal frameworks abstract away the GPU instead. So Metal code written for previous Intel or AMD graphics will generally just work on M1, M1 Pro, and M1 Max GPUs, and because Apple’s GPUs are so damn good, chances are it’ll run better. Massively better.

And even though M1, M1 Pro, and M1 Max vary so much in capability, the scalable architecture means they present as very, very similar targets to developers. I mean, Apple had to do a ton of work on the fabric that brings together and binds all these cores and all this RAM, but anything already written for M1 will just fly on M1 Pro and… go full-on orbital on M1 Max.

And again, because these GPUs were designed for performance through efficiency, and have to scale from the iPhone 12 to iPad Air to iPad Pro to MacBook Air to MacBook Pro to iMac, they still only sip power.

Even firing CPU and GPU and… basically everything, the M1 Max flat out uses slightly less power than the 100 watt baseline of an Intel Alder Lake CPU, which can also reach over 300 watts when overclocked. That’s as much as a giant, helicarrier-looking Nvidia Ampere card.

Put those two things together, and even in a desktop, where Intel + Nvidia would require near-cryogenic levels of cooling, Apple could easily throw multiple M1 Max dies into even smaller, thinner enclosures and still offer ridiculous levels of performance.

And you better believe I already have a video up on just exactly that, linked in the description below the like button.

M1 Pro & Max Media Engines

I’ll get to the media engines in a supremely hot second, but in addition to the GPU, the M1 Pro and M1 Max have a third USB and a third Thunderbolt controller, which not only lets them power more ports than the original M1, but more displays. Up to two 6K displays with the Pro and three with the Max, in addition to a 4K TV over HDMI.

That’s HDMI 2.0, not the much higher-bandwidth HDMI 2.1. But that’s because when these chips were being specced out a couple of years ago, HDMI 2.1 was even less of a thing than it is now, and Apple figured that I/O bandwidth should go to a third Thunderbolt port for more Pro-centric displays, rather than to faster HDMI for then bleeding-edge TVs. Same with the SDXC card slot. Feel utterly free to quibble about that in the comments, but they’ll eventually just amp up the I/O in a future generation.
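To put the HDMI 2.0 versus 2.1 difference in numbers, this sketch compares the uncompressed active-pixel data rate of a 4K signal against each spec’s effective link payload. That payload is 18 Gbit/s TMDS with 8b/10b coding for HDMI 2.0, and 48 Gbit/s FRL with 16b/18b coding for HDMI 2.1. The sketch assumes 8-bit RGB with no compression or chroma subsampling, and it ignores blanking intervals. So it’s a lower bound, not a real link-budget calculation.

```python
# Effective link payload after line coding, in Gbit/s.
HDMI_PAYLOAD_GBIT_S = {
    "HDMI 2.0": 18 * 8 / 10,   # 18 Gbit/s TMDS, 8b/10b coding -> 14.4
    "HDMI 2.1": 48 * 16 / 18,  # 48 Gbit/s FRL, 16b/18b coding -> ~42.7
}


def active_video_gbit_s(width: int, height: int, fps: int, bits_per_pixel: int = 24) -> float:
    """Uncompressed active-pixel data rate; blanking would add more on the wire."""
    return width * height * fps * bits_per_pixel / 1e9


for fps in (60, 120):
    needed = active_video_gbit_s(3840, 2160, fps)
    fits = [name for name, cap in HDMI_PAYLOAD_GBIT_S.items() if needed <= cap]
    print(f"4K{fps}: {needed:.1f} Gbit/s needed; fits on {fits}")
```

Under those assumptions, 4K at 60 Hz fits within HDMI 2.0, while 4K at 120 Hz needs HDMI 2.1.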

Now, those media engines.

Apple’s been adding custom encode and decode blocks to their silicon for years. And, honestly, hardware acceleration for video playback isn’t at all uncommon. Video transcoding has been a little more hit and miss, but not by much. And over those years, Apple has added support for H.264, the original 1080p standard, and H.265, aka HEVC, the 4K standard. Also Google’s alternative codecs, including the current VP9.

Apple even switched from the original T1 chipset in the Mac, which was a repurposed S2 system-in-package from the Apple Watch, to the T2 chipset, which was a repurposed A10 Fusion. That was in part because Intel failed to deliver H.265 encoding in a timely fashion, and Apple’s older iPhone chip could just do it faster and way more efficiently than leaving it to the CPU or offloading it completely to the GPU.

That’s why Apple Silicon Macs don’t have T2 or T-anything chips anymore. Everything that Apple had to work around Intel to provide is already built into M1. That includes the Secure Enclave for Touch ID, and now the Neural Engine for machine learning acceleration, all the custom controllers, and yeah, the media engines. Because where T2 was an A10… M1 is an A14. And that’s how Apple silicon and SoCs work.

So, M1 has those A14 media engines for H.264 and HEVC, among other things. But what M1 Pro and M1 Max add are ProRes video engines.

Which is… not exactly a first. When Apple introduced the current Intel Mac Pro back in 2019, they introduced a reprogrammable FPGA card along with it. Branded Afterburner, it was a ProRes and ProRes RAW accelerator that could handle up to 12 streams of 4K or 3 streams of 8K.

Then, just a couple of months ago, Apple introduced the A15 Bionic for the iPhone 13, which in the iPhone 13 Pro includes an extra G14 GPU core and… a ProRes media engine. That’s what lets the iPhone 13 Pro shoot ProRes 422 HQ, along with a new storage controller that can write those massive 6GB-per-minute files to the SSD without skipping a frame.
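For a sense of what "6GB per minute" demands from storage, this sketch converts that rate into a sustained write speed. It also estimates recording time for a hypothetical 100 GB of free space, using decimal units (1 GB = 1,000 MB).

```python
def sustained_write_mb_s(gb_per_minute: float) -> float:
    """Convert a recording rate in GB/min to MB/s (decimal: 1 GB = 1000 MB)."""
    return gb_per_minute * 1000 / 60


def minutes_of_footage(free_space_gb: float, gb_per_minute: float) -> float:
    """How long you can record before filling the given free space."""
    return free_space_gb / gb_per_minute


PRORES_GB_PER_MIN = 6   # figure quoted above
FREE_SPACE_GB = 100     # hypothetical free space on the phone

print(f"{sustained_write_mb_s(PRORES_GB_PER_MIN):.0f} MB/s sustained write")
print(f"{minutes_of_footage(FREE_SPACE_GB, PRORES_GB_PER_MIN):.1f} minutes per {FREE_SPACE_GB} GB")
```

That works out to about 100 MB/s of sustained writes, or under 17 minutes of footage per 100 GB.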

Now, Apple’s also brought those ProRes engines to the M1 Pro and M1 Max, which is super interesting for a couple of reasons.

First, because it means Apple isn’t restricting features to specific IP generations. In other words, A15 generation ProRes Engines can show up on A14 generation M1 Pro and M1 Max chipsets. Apple cares less about abstract numerical branding sequences and more about delivering the capabilities they need to deliver in the most economical, efficient, and performant way possible.

Or as the silicon team says, their one job is to run iOS and macOS and apps faster than anything else on the planet, constrained only by time and the thermal envelope of the device, and the rest is all just implementation details.

Second, because moving them off an add-in card like Afterburner and onto the SoC is, again, like moving them off the platter and into the sandwich, so they have the same immediate access to that huge pool of unified memory and bandwidth. Which just makes them even faster.

Third, because Apple is putting such a focus on video capture and production this year. These new engines let you not just capture ProRes on your iPhone 13 Pro, but edit it with jaw-dropping speed and efficiency on your MacBook Pro.

Forget 12 streams of 4K on Afterburner, M1 Max can handle 30. And just sit down with your 3 streams of 8K. M1 Max can handle 7.

That’s thanks to M1 Max having not just one ProRes encode and decode engine like M1 Pro, but two. And two H.264 and HEVC video encode engines as well.
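The 4K and 8K stream counts line up once you count pixels. This sketch assumes UHD resolutions, 3840×2160 for 4K and 7680×4320 for 8K. It also assumes the same frame rate and ProRes flavor for every stream, which the figures above don’t specify.

```python
# UHD resolutions assumed: "4K" = 3840x2160, "8K" = 7680x4320.
PIXELS_4K = 3840 * 2160
PIXELS_8K = 7680 * 4320


def total_pixels(streams: int, pixels_per_frame: int) -> int:
    """Pixels per frame across all simultaneous streams (same frame rate assumed)."""
    return streams * pixels_per_frame


afterburner_4k, afterburner_8k = total_pixels(12, PIXELS_4K), total_pixels(3, PIXELS_8K)
m1_max_4k, m1_max_8k = total_pixels(30, PIXELS_4K), total_pixels(7, PIXELS_8K)

print(PIXELS_8K // PIXELS_4K)            # an 8K frame has 4x the pixels of a 4K frame
print(afterburner_4k == afterburner_8k)  # 12 x 4K is the same pixel load as 3 x 8K
print(f"{m1_max_4k / m1_max_8k:.3f}")    # 30 x 4K vs 7 x 8K
```

An 8K frame has exactly four times the pixels of a 4K frame. So Afterburner’s 12 × 4K and 3 × 8K are the same pixel load, and M1 Max’s 30 × 4K and 7 × 8K are within about 7% of each other.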

And sure, Apple could have kept doing ProRes on the CPU, like they did with Intel Macs in the past, but moving it to dedicated silicon meant they could do it faster, with less power draw, and in a way that leaves the CPU free for other tasks.

And that’s really important. Because, prior to M1 Pro and M1 Max, when you went to render ProRes, it could thrash the CPUs, meaning anything else you tried to do at the same time was maple syrup on snow slow — I’m Canadian, you know what I mean — and made the render slower as well. Like almost untenable. Now though, you can hit render, and only the ProRes engines get thrashed. You can keep working away on the CPU as if nothing else is happening. Almost like you have a second Mac ready and waiting for you while the first one’s off exporting your video.

And to see why they’re willing to spend their transistor budget like this, I have a whole video up with Apple’s VP of Silicon and VP of Mac Product Marketing where they explain just exactly why. Link in the description below the like button.

But it’s the third way M1 Pro and M1 Max just devastate on speed. Yes, it’s the pure performance of the cores, and yes it’s the utter responsiveness of all that unified memory and bandwidth, but it’s also those off-core features that are essentially giving us multiple parallel pro workflow engines in one.