SwiftOnSecurity
computer security person. former helpdesk.

Dec 13, 2023, 28 tweets

Today at work there was an 11hr outage bridge call over a small but important area of the Windows network having application errors. Not debilitating but would be eventually.

Here is how I helped solve it after being asked to take a look, and how I approach problems like this. 🧵

I’m an IT Generalist who started in Helpdesk and system engineering. I now work in Security, using those skills to push initiatives forward. Often, to troubleshoot complaints and impediments these projects encounter. I talk with lots of teams, so I’m periodically asked to consult.

Today’s issue was Microsoft Edge launching but all functionality not working except the basic chrome. The settings tab would open but also show an error. F12 dev tools would not open. Strangely, an IE Mode tab would work.

I am presented this problem. I start to dig in.

This problem occurred progressively to 100 machines seemingly all in the same OU. Per site staff, it was not strictly Edge version-dependent. Interesting data-point. Heterogeneity in a homogeneous managed network. Take note.

No changes happened. Moving machines to new OU no help.

Site staff remote into an impacted computer to share screen with me. Something I always do is get hands-on to get a feeling for the machine. Vibes are important.

Start Menu doesn’t work? Staff says it’s restricted. I right-click to get admin PowerShell prompt. Auth with LAPS.

(They showed me issue, described earlier)
I launch compmgmt.msc. Say I’m just looking around. Weird crashes. ShellExperienceHost.exe.
Start Menu is not restricted. It’s broken somehow. “How long like this?” Years.
But that’s not the problem I’m called for.

Maybe. I’m suspicious.

Edge is failing in an extremely unusual way it’s not even triggering error handling for. Unusual. Nothing other staff had tried or guessed at with multiple departments giving input has assisted.

So this has to be really freaking weird. Maybe.

I try DISM/SFC. Nothing, as expected.

I look at the Group Policy applied to this OU. Lots of super old stuff obviously ported-forward across maybe 20 years. Nothing immediately obviously linked, but the problem seems to be OU-dependent.

No GPO edits last year in structure. Dates not absolute but reliable for now.

Site staff can replicate issue. Image machine, install LOB apps, works fine. Put it in OU. It breaks. But nothing has changed, probably?

I ask SCCM/others etc about pushes. Absolutely nothing. This network area is tagged as a critical change control zone.

Broken anyway. Exotic.

Seeking variables I disable all non-Microsoft services. Inspect all startup/login scripts. No changes or help.

Start Menu not working bugs me. Say it’s normal but it’s not. Could this be a clue I can key off of? Normal diagnostics to fix it don’t help at all. I explain to call:

You have two very unusual issues on a machine. I do not care one of them is old. Something with a Windows subsystem is potentially a fault as a root cause here. We are going to address Start Menu. Not Edge. There is lots more advice out there on a broken Start Menu as a symptom, so resolutions are easier to find anyway.

I ask tech to transfer ProcMon and AutoRuns to system via USB, which they can access logged in with personal service account bypassing USB control. These machines cannot access most of network like SMB shares. Very unusual for changes to penetrate them. They are really obtuse.

Go through AutoRuns, nothing notable.
I launch ProcMon and immediately try Start Menu, then pause capture. I look for what happens to ShellExperienceHost RIGHT before WerFault.exe is called. This is not reliable for issue identification but can give hints.
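A rough sketch of doing the same "what failed right before WerFault" pass offline. It assumes the capture was saved from ProcMon as CSV with the default columns (Time of Day, Process Name, PID, Operation, Path, Result, Detail). The sample rows, the ACCESS DENIED result, and the helper name `failures_before_crash` are made up for illustration; they are not from the real capture.

```python
import csv
import io

# Hypothetical rows in ProcMon's default CSV export layout (not real capture data).
SAMPLE = r'''
"Time of Day","Process Name","PID","Operation","Path","Result","Detail"
"9:01:02.1000000 AM","explorer.exe","1200","RegQueryValue","HKCU\Software\Example","SUCCESS",""
"9:01:02.2000000 AM","ShellExperienceHost.exe","4242","RegOpenKey","HKLM\Software\Microsoft\OLE","ACCESS DENIED","Desired Access: Read"
"9:01:02.3000000 AM","ShellExperienceHost.exe","4242","Process Create","C:\Windows\System32\WerFault.exe","SUCCESS",""
"9:01:02.4000000 AM","ShellExperienceHost.exe","4242","RegOpenKey","HKLM\Software\Other","NAME NOT FOUND",""
'''.strip()


def failures_before_crash(csv_text, process, crash_image="WerFault.exe"):
    """Collect non-SUCCESS events for `process` up to the first launch of `crash_image`."""
    failures = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Stop at the first point where the crash reporter is started.
        if (row["Operation"] == "Process Create"
                and row["Path"].lower().endswith(crash_image.lower())):
            break
        if row["Process Name"].lower() != process.lower():
            continue
        if row["Result"] != "SUCCESS":
            failures.append((row["Operation"], row["Path"], row["Result"]))
    return failures


for event in failures_before_crash(SAMPLE, "ShellExperienceHost.exe"):
    print(event)
```

Same caveat as above: the last failure before a crash is a hint, not proof. When reading a real export from disk, opening it with `encoding="utf-8-sig"` is a safe default in case the file starts with a BOM.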

Failed registry reads?

The process is unable to read (for example) HKLM\Software\Microsoft\OLE. Really basic eternal low-level stuff literally -anyone- should be able to access.

ProcMon results have lots of nuance in what you think they say, but I check regedit. Fine. Check effective access. Fine.

Is this red herring? Literally the _last_ thing that occurs before process crash, not a cleanup routine. This could be important.

I go back to check my assumptions and paths of causality. It _has_ to be a GPO. Got more troubleshooting results that confirm it’s OU. But what?

This stage of troubleshooting requires synthesizing several areas of windows internals knowledge.

1.) The GPO DOES manage Registry permissions to HKLM\Software. But it has admins, users, system, etc all as expected. Should be zero limitations

2.) Managing registry ACLs really sucks, and most do it through a crappy ancient editor. The goal of some IT tech 15+ years ago was to grant FULL CONTROL rights to the registry keys of an LOB app in HKLM

3.) This app is likely from XP era when apps assumed they were always admin

4.) Registry ACL edits in Group Policy are “tattoo” operations. They do not get reverted when you move the machine out of an OU. This would explain issue persistence even moving out of OU. This and the fact being in OU once ever breaks Start Menu forever, aligns.

5.) The fact this tattoo operation is happening in the same OU as Start Menu and Edge breaking forever is likely not coincidence.

6.) Windows 8 introduced AppX subsystem to isolate “modern” Windows applications from running literally as the user principal.

7.) Start Menu is NOT accessing the Registry as the user! It accesses it under the “ALL APPLICATION PACKAGES” principal.

8.) The GPO is overwriting the HKLM\Software ACL with principals from years ago probably Win7, BEFORE this existed!

So we have theory on Start Menu:

IT tech 15 years ago hard-coded an ACL in Group Policy which does not include modern Windows principals.
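A minimal sketch of how you could test this theory against an ACL dumped as SDDL. The two SDDL strings are hypothetical examples, not the real GPO or machine ACL. What is not an assumption: "ALL APPLICATION PACKAGES" is SID S-1-15-2-1, SDDL alias `AC`, and the trustee is the sixth field of each ACE.

```python
import re

# SDDL alias and SID string for the "ALL APPLICATION PACKAGES" principal.
ALL_APP_PACKAGES = {"AC", "S-1-15-2-1"}

ACE_PATTERN = re.compile(r"\(([^()]*)\)")


def dacl_aces(sddl):
    """Return (ace_type, trustee) for every ACE in the DACL part of an SDDL string."""
    _, _, dacl = sddl.partition("D:")   # skip owner/group
    dacl, _, _ = dacl.partition("S:")   # drop the SACL if one follows
    aces = []
    for ace in ACE_PATTERN.findall(dacl):
        fields = ace.split(";")
        if len(fields) >= 6:
            aces.append((fields[0], fields[5]))
    return aces


def allows_app_packages(sddl):
    """True if some allow ACE names ALL APPLICATION PACKAGES as trustee."""
    return any(ace_type == "A" and trustee in ALL_APP_PACKAGES
               for ace_type, trustee in dacl_aces(sddl))


# Hypothetical "pre-Windows 8" style ACL: admins, SYSTEM, users. No AppX principal.
LEGACY = "O:BAG:SYD:PAI(A;CI;KA;;;BA)(A;CI;KA;;;SY)(A;CI;KR;;;BU)"
# Same ACL with a read ACE for ALL APPLICATION PACKAGES appended.
MODERN = LEGACY + "(A;CI;KR;;;AC)"

print(allows_app_packages(LEGACY))
print(allows_app_packages(MODERN))
```

On a live machine the SDDL for the key can be pulled in PowerShell with `(Get-Acl 'HKLM:\SOFTWARE').Sddl` and pasted in. This only checks that an allow ACE exists, not which rights it carries.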

This breaks Start Menu. Site staff thought this was a security limitation. It’s not. It was a technical error. I was the first to question it?

But..

The problem is with Edge. This cannot be related. It’s been like this for years. It’s not related to Edge version.

BUT there is ONE variable you are not considering.

Some modern software is not controlled by version. It gets feature testing flags iteratively rolled out. How?

This is controlled in Edge by the Experimentation Service policy. I presume this is the cause, but didn’t prove it tonight.
Microsoft is probably testing some security hardening setting that leverages AppX isolation?



admx.help/?Category=Edge…

Back to thread and chosen resolution

Again, managing registry ACLs sucks. It’s possible but for critical devices and without 3rd party tools like SetACL I needed a proven solution.

So use Group Policy to set “HKLM\Software” ACL to grant “ALL APPLICATION PACKAGES” query, enumerate, notify, and read ACL inheritably.
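For reference, a small sketch of what those four rights add up to. The constants are the standard winnt.h registry access rights. The helper `allow_ace` and the resulting SDDL string are only an illustration of the same grant, not an export of the actual GPO setting.

```python
# Registry access-right constants as defined in winnt.h.
KEY_QUERY_VALUE = 0x0001         # "query"
KEY_ENUMERATE_SUB_KEYS = 0x0008  # "enumerate"
KEY_NOTIFY = 0x0010              # "notify"
READ_CONTROL = 0x00020000        # "read ACL"
KEY_READ = 0x20019               # the stock read-only mask for registry keys

granted = KEY_QUERY_VALUE | KEY_ENUMERATE_SUB_KEYS | KEY_NOTIFY | READ_CONTROL


def allow_ace(trustee, mask, inherit=True):
    """Build an SDDL allow ACE; CI (container inherit) makes it apply to subkeys."""
    flags = "CI" if inherit else ""
    rights = "KR" if mask == KEY_READ else hex(mask)
    return f"(A;{flags};{rights};;;{trustee})"


print(hex(granted))
print(allow_ace("AC", granted))  # "AC" is the SDDL alias for ALL APPLICATION PACKAGES
```

So the grant is plain read access, nothing more. That matters for a countermeasure on critical machines: it restores what a default Windows ACL already gives this principal rather than opening anything new.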

We do this in a test OU. Move broken machine in. gpupdate /force and reboot.

It comes up.

⚠️🚨⚠️🚨⚠️

The Start Menu works.

SO DOES EDGE. The line-of-business application server page is loaded instantly for <redacted>.

Holy shit. Found the cause and a countermeasure. ~45min.

This is NOT end. It has to be rolled out under change control iteratively in batches.

AND it’s not clean. This is a countermeasure. ANY hardcoding of ACLs in a machine is a terrible fucking idea. As we have proven today. Machines are stained.

When done this GPO gets replaced.

Aftermath:

This is not an unknown problem. In further research, Microsoft is aware of the legacy a bad Registry management wizard from 20 years ago has had on the machines of today.

Permissions hardcoded in Group Policy in a different era have ongoing detriment.

learn.microsoft.com/en-us/troubles…

NOTE: On consult with Microsoft staff about the experiments policy, that may not be a factor here. It was NOT proven. Tech was already on a 12hr day.
We know of this policy and do NOT use it on employee machines on purpose, so Microsoft can see small impacts. Do not use blindly.
