
Vadim Yuryev

@VadimYuryev

Apr 13, 2022 • 25 tweets • 6 min read • Read on X


Exclusive: Apple's M1 family of chips comes with a design limitation that was overlooked by Apple engineers when they started working on the chips 5-7 years ago.

The bottleneck is the 32MB TLB. Thank you to @hishnash for the help in explaining how it works.

Let me explain.. 🧵

Problem: Apple shows that the M1 Ultra GPU can use up to 105W of power. However, the highest we could ever get it to reach was around 86W.

No, the Mac Studio's cooling wasn't the problem: the GPU stayed cool at around 55-58°C, whereas in the past Apple allowed temperatures of up to 100°C.

This makes it pretty clear that the Mac Studio cooling system is OVERKILL in most apps, which means that there was a disconnect between Apple's Mac Studio cooling system engineers and the M1 Ultra chip designers/engineers. Something has gone terribly wrong in terms of chip perf.

Culprit: Each cluster of GPU cores within an M1/M1 Pro/M1 Max/M1 Ultra chip comes with a 32MB TLB, or Translation Lookaside Buffer, which is a memory cache that stores recent translations of virtual memory addresses to physical memory addresses and is used to reduce memory access time.
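To make the idea of a TLB concrete, here is a minimal Python sketch of a translation cache with least-recently-used eviction. It is a toy model, not a description of Apple's hardware: the class name `ToyTLB`, the 16KB page size, the two-entry capacity, and the page table contents are all assumptions chosen for illustration.

```python
from collections import OrderedDict

PAGE_SIZE = 16 * 1024  # assumed page size in bytes (illustrative only)


class ToyTLB:
    """A tiny LRU cache of virtual-page -> physical-page translations."""

    def __init__(self, capacity, page_table):
        self.capacity = capacity
        self.page_table = page_table  # stands in for the full page table in memory
        self.entries = OrderedDict()  # cached translations, least recently used first
        self.hits = 0
        self.misses = 0

    def translate(self, virtual_address):
        page, offset = divmod(virtual_address, PAGE_SIZE)
        if page in self.entries:
            self.hits += 1
            self.entries.move_to_end(page)  # mark as most recently used
        else:
            self.misses += 1  # slow path: look up the page table
            self.entries[page] = self.page_table[page]
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # evict the least recently used entry
        return self.entries[page] * PAGE_SIZE + offset


# Hypothetical page table: virtual page N maps to physical page N + 100.
page_table = {page: page + 100 for page in range(8)}
tlb = ToyTLB(capacity=2, page_table=page_table)
for address in (0, 64, PAGE_SIZE, 2 * PAGE_SIZE, 128):
    tlb.translate(address)
print("hits:", tlb.hits, "misses:", tlb.misses)  # hits: 1 misses: 4
```

The second access lands on an already-cached page and is a hit. The last access misses because its page was evicted to make room for newer translations.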

Hishnash: "If an application has not been optimized for the M1 GPU architecture's tile memory, (not just Metal optimized) then every read/write needs to go all the way out to system memory. If the GPU compute task is issuing MANY little reads, then this will saturate the TLB.

The issue is if GPU data hits the TLB and the page table being read/written to is not loaded, then that entire thread group on the GPU needs to pause while the page table is loaded into the TLB. If your application is using MANY reads/writes per second, this results in...

...a lot of STALLED GPU thread groups. Unlike a CPU, when a GPU is waiting for data, it can't just switch to work on something else. So the GPU sits there and waits for the TLB buffer to clear in order to get more work to process." This is why we only saw 86W peak GPU usage...

..in an app that was considered to be decently optimized. However, apps that CLAIM to support Apple Silicon but have NOT been rewritten to take advantage of Apple's TBDR tile memory system will be severely limited by the 32MB TLB if there are many reads/writes.
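The effect of "MANY little reads" on a translation cache can be sketched with a simple miss counter. This is a toy Python model under stated assumptions (a 4-entry LRU cache, 16KB pages, 64 pages with 8 small reads each); the function `count_tlb_misses` and all of the numbers are hypothetical and are not measurements of any M1 chip.

```python
from collections import OrderedDict


def count_tlb_misses(addresses, tlb_entries=4, page_size=16 * 1024):
    """Count misses in a toy LRU translation cache for a sequence of reads."""
    cached = OrderedDict()
    misses = 0
    for address in addresses:
        page = address // page_size
        if page in cached:
            cached.move_to_end(page)  # hit: refresh recency
        else:
            misses += 1  # miss: in the thread's description, this is where work stalls
            cached[page] = True
            if len(cached) > tlb_entries:
                cached.popitem(last=False)
    return misses


page_size = 16 * 1024
num_pages = 64
reads_per_page = 8

# Grouped: finish every read in one page before moving on to the next page.
grouped = [p * page_size + i * 4 for p in range(num_pages) for i in range(reads_per_page)]
# Scattered: each small read touches a different page than the one before it.
scattered = [p * page_size + i * 4 for i in range(reads_per_page) for p in range(num_pages)]

grouped_misses = count_tlb_misses(grouped)
scattered_misses = count_tlb_misses(scattered)
print(grouped_misses, scattered_misses)  # 64 512
```

Both patterns issue the same 512 reads to the same addresses. Only the order differs, yet in this model the scattered order misses on every single read.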

The problem is that ALMOST ALL apps out there haven't been optimized for Apple's TBDR tile memory system. Many software developers simply get it to work using the traditional TBIR model and call it good to go, being unaware of the 32MB TLB limitation that bottlenecks performance.

Hishnash: "What apps should be doing is loading as much data as possible into the tile mem and flushing it out in large chunks when needed. I bet a lot of the writes (over 95%) are for temporary values that could've been stored in tile mem and never needed to be written at all."
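The "keep temporaries in tile memory and flush in large chunks" idea can be illustrated by counting writes in a toy Python model. This is not Metal code and does not model real tile memory: the two functions, the per-element operation, and the tile size are assumptions made up for the sketch.

```python
def run_passes_naive(values, passes):
    """Write every intermediate result back to 'system memory'."""
    system_memory = list(values)
    writes = 0
    for _ in range(passes):
        for i, value in enumerate(system_memory):
            system_memory[i] = value * 0.5 + 1.0  # placeholder per-element work
            writes += 1
    return system_memory, writes


def run_passes_tiled(values, passes, tile_size=4):
    """Keep intermediates in a local 'tile' and write each element out once."""
    system_memory = list(values)
    writes = 0
    for start in range(0, len(system_memory), tile_size):
        tile = system_memory[start:start + tile_size]  # one bulk load per tile
        for _ in range(passes):
            tile = [value * 0.5 + 1.0 for value in tile]  # temporaries stay local
        system_memory[start:start + tile_size] = tile  # one bulk flush per tile
        writes += len(tile)
    return system_memory, writes


data = list(range(16))
naive_result, naive_writes = run_passes_naive(data, passes=20)
tiled_result, tiled_writes = run_passes_tiled(data, passes=20)
print(naive_writes, tiled_writes)  # 320 16
print(f"writes avoided: {1 - tiled_writes / naive_writes:.0%}")  # writes avoided: 95%
```

Both versions produce the same final values. The 95% figure here is just a consequence of choosing 20 passes in this toy setup and is not evidence for the estimate in the quote.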

Hishnash: "I expect that the people building the M1 family of chips didn't expect applications to be running on it that are not TBDR optimized. So they thought 32MB would be enough."
WRONG. Most apps aren't optimized for tile memory, even if they claim to support Apple Silicon.

Keep in mind that since Apple started engineering the M1 family 5-7 years ago, reliance on GPU performance has skyrocketed, so the chip designers probably didn't think there would be so many reads/writes going through the 32MB TLB.

What does this mean? The M1 family of chips, including the M1 Ultra, has a major limitation that can't be fixed unless apps are properly optimized. Here's the problem.
Hishnash: "The effort needed to optimize for tile memory is MASSIVE. It requires going all the way back to the..

...drawing board, re-considering everything, like the concept that there is a local on-die memory pool you can read/write from with very very low perf impact is unthinkable in the current desktop GPU space. It’s a matter of a complete rewrite at a concept/algorithmic level."

Why is this such a big problem for M1 Ultra?
With the M1 and M1 Pro chips, there wasn't enough GPU performance to hit that 32MB TLB limit. However, the M1 Max is where you see GPU scaling fall off a cliff due to the TLB, especially the 32-core GPU model.

This problem scales linearly, so if, for example, 26 cores is the sweet spot for the M1 Max, with the remaining 6 cores being bottlenecked by the TLB, the M1 Ultra will have 12 GPU cores bottlenecked because it features two 32-core M1 Max dies. No wonder it scales poorly.
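The arithmetic behind that example can be written out in a few lines of Python. The 26-core sweet spot is the thread's own hypothetical figure, and the helper name `bottlenecked_cores` is made up for this sketch.

```python
def bottlenecked_cores(cores_per_die, sweet_spot_per_die, dies):
    """Cores beyond the per-die sweet spot, summed over all dies."""
    return (cores_per_die - sweet_spot_per_die) * dies


m1_max = bottlenecked_cores(cores_per_die=32, sweet_spot_per_die=26, dies=1)
m1_ultra = bottlenecked_cores(cores_per_die=32, sweet_spot_per_die=26, dies=2)
print(m1_max, m1_ultra)  # 6 12
print(f"{m1_ultra / (32 * 2):.2%} of the M1 Ultra's 64 GPU cores")  # 18.75% of the M1 Ultra's 64 GPU cores
```

Doubling the dies doubles the number of affected cores, but the affected share stays the same at 6 of 32 per die.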


The solution from @hishnash: "Increasing the TLB will help a lot for applications that are not optimized. This is important because many apps will NEVER be optimized, and even fewer games." This is why gaming performance is so poor on M1 Ultra, apart from the Rosetta bottleneck.

Hishnash: "For game engines that are not TBDR aware/optimized, they might be currently bottlenecked on reads.. and depending on the post-processing effects, might have some large bottlenecks on writes if they're not using tile memory and tile compute shaders where possible."

The reason World of Warcraft runs so well and compares well to the RTX 3080/3090 is that APPLE helped them optimize the game PROPERLY to take advantage of the new TBDR tile-based architecture. (The WoW Metal update was released one week after the M1 event, which suggests they got help from Apple.)

The solution from our source: Future M-chip families (Hopefully and probably M2) will see a big increase in the TLB to solve this problem since developers are likely to be slow in optimizing apps. Apple will likely release white papers at WWDC on how to optimize apps properly.

This means that the M2 Ultra will see a HUGE boost in GPU performance over the M1 Ultra if the 32MB TLB bottleneck is removed. And that performance boost will be on top of higher clock speeds and potentially higher GPU core counts.

The only hope for the M1 Ultra is that developers finally decide to completely rethink and rewrite their apps to support the TBDR tile-based memory architecture. (Good luck)
Oh, and by the way, expect hardware ray-tracing support on future M-chip families. (Hopefully M2 Pro+)


Once again, thanks to @hishnash for all of the help in figuring all of this out.
You can check out our YouTube channel Max Tech here: youtube.com/c/MaxTechOffic…
Or buy an M1 Ultra T-shirt to support us (Use promo code M1Ultra for 20% off all of our merch): max-tech-store.creator-spring.com/listing/apple-…

What does this mean?

Not sure if this helps explain or add to the conversation but I'm gonna add this comment from YouTube.


