
L_ N___

12 Jan, 38 tweets, 6 min read

trying to reproduce a system issue that presents itself in a complex ETL when run at full scale on a given system is tough.
but it’s a task i can do sometimes, and want to be able to do more often.

it’s an important task bc attaching a debugger to production is risky. and extended events is of limited use when some system symptoms are known but query/worker/memory condition contributors aren’t.

the rules of the game: i almost never get to work with the production system itself. i can supply some low-impact information gathering/logging tools.
probably won’t be able to work with production data in a nonprod database, either.

so i have to profile the queries and data well enough (or at least the queries and query behavior) along with system behavior such that I can try to recreate system behavior on another system.

but there are always outstanding questions. did i toss out a workload component important to the repro bc i dismissed it? is the scenario threshold-related, and my data/system utilization/query concurrency doesn’t meet the threshold?

maybe something like sublatch/super latch is involved, requiring a system with 16 or more vcpus and enough activity/contention to result in promotion?

maybe large memory pages are involved on a system using the locked pages memory model? in that case the repro has to have enough total RAM to get some large pages, and the large page amount depends on the memory state at startup.

maybe the condition is only triggered when certain tasks/workers intersect on the same scheduler/vcpu?

maybe such-n-such a query needs to be run in the default resource pool, or the default workload group - to bring out the behavior.

did i match trace flags and #sqlserver version to the original context?
what about database scoped configuration, transaction isolation level, session SET options?
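a checklist like this is easier to hold to if the settings are captured as key/value pairs on both systems and compared mechanically. a minimal sketch in Python, assuming the values have already been collected somehow; the setting names and values below are made up for illustration, not pulled from any real system:

```python
from typing import Any

MISSING = "<missing>"  # marker for a setting captured on only one side


def diff_settings(prod: dict[str, Any], repro: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return {setting: (production_value, repro_value)} for every setting that differs."""
    diffs = {}
    for key in sorted(set(prod) | set(repro)):
        p = prod.get(key, MISSING)
        r = repro.get(key, MISSING)
        if p != r:
            diffs[key] = (p, r)
    return diffs


# hypothetical captured settings (names and values are illustrative only)
prod = {
    "trace_flags": (3226, 8048),
    "isolation_level": "READ COMMITTED",
    "auto_update_stats_async": True,
    "network_packet_size": 8192,
}
repro = {
    "trace_flags": (3226,),
    "isolation_level": "READ COMMITTED",
    "auto_update_stats_async": False,
    "max_worker_threads": 0,
}

mismatches = diff_settings(prod, repro)
for name, (p, r) in mismatches.items():
    print(f"{name}: production={p!r} repro={r!r}")
```

a setting that was captured on only one side shows up as `<missing>`, which is a prompt to go collect it rather than assume it matches.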

Stats auto-update and auto-update async match the original context?
are stats in the same stale/fresh state?

make sure to consider warm/cold cache state for the workload in the original context.

batch mode queries in original? getting that in the repro?
what about adaptive joins, UDF inlining?
Accelerated database recovery and recovery model?
vNUMA and auto-softNUMA?

Any hekaton use in production or repro?
what about dbcc checkdb/checktable or explicit snapshot databases? tempdb in-memory metadata? Query Store?

Did a #sqlserver backup intersect with the production workload?
Is cdc enabled?
What about extended events sessions enabled in production?

Does the production #sqlserver error log indicate anything else that might be important for the repro? Like 15 second IOs, or latch timeouts, or cache flushes, or top level block allocations. or changes to max server memory (MSM) or resource governor. or transaction rollbacks.

this list is more for me than anyone else LoL

Does the windows log give me any clues? Application level errors - or application level logged successes?

Oh, yeah - make sure memory model (conventional, locked pages, large pages) matches between production and repro.

deleted rows and delta stores matter for columnstore.
page fill (for the sake of page splits) and sparseness (for the sake of IAM chain length) may both matter.
also presence of row_overflow_data and LOB_data allocation units in relevant HoBTs.

double check max workers in production. and default packet size. check packet size for production workload - may have been overridden from default.

when the repro system and workload seem to have all the right ingredients, but the sought-after condition is still elusive, there’s no telling where the last stone for the arch might be hidden.


A couple of cases i can never forget.
1) worked with @JoeObbish when he found partition switches can take a long time if the column_id values for the same-name columns across the partitions aren’t the same. (misalignment can come from adding/dropping columns in either).
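the column_id alignment check is simple to sketch once each table's column name to column_id mapping is in hand (in SQL Server, `sys.columns` exposes both `name` and `column_id`). a rough Python version, assuming the mappings were already fetched; the table contents below are hypothetical:

```python
def column_id_mismatches(source: dict[str, int], target: dict[str, int]) -> dict[str, tuple[int, int]]:
    """For columns present in both tables, return {name: (source_id, target_id)} where the ids differ."""
    shared = sorted(source.keys() & target.keys())
    return {
        name: (source[name], target[name])
        for name in shared
        if source[name] != target[name]
    }


# hypothetical: the staging table once had a column with column_id 3 that was
# dropped, and "region" was added later, so its ids drifted from the target's
staging = {"id": 1, "load_date": 2, "amount": 4, "region": 5}
fact = {"id": 1, "load_date": 2, "amount": 3, "region": 4}

misaligned = column_id_mismatches(staging, fact)
for name, (src_id, tgt_id) in misaligned.items():
    print(f"{name}: source column_id={src_id}, target column_id={tgt_id}")
```

same names, same types, same order when you look at the tables, and the ids still disagree. that's the kind of minute metadata detail this case is about.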


2) @ExadataDBA and i worked on a confounding #oracleDB issue years ago. A RAC system, HP servers with a Juniper switch. sometimes an ETL query would be almost silent for hours while the system was largely idle, then suddenly spring to life and complete.

Database block size was 32kb iirc.
The secret involved the way RAC works, SGA shared cache states, and a Juniper switch bug.

a table fairly popular in the ETL workload was partially cached in SGA on node B; the prone-to-long-silence query was running on node A.
RAC uses a “cache fusion” algorithm - a map of cached database blocks and the node they are cached on is shared across all nodes.

when node A wants a database block for SGA that is cached on node B, it grabs it from node B rather than from shared storage. blocks to be cached on SGA are only returned from storage if they’re not cached by any node.

Can’t request *part* of a database block from another RAC node (or from storage) even if only 1 row is needed. Gotta get the whole database block. a 32kb database block is going to require a lot of network packets to be sent.
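to put a rough number on "a lot of network packets": a quick back-of-envelope in Python. the payload sizes are assumptions (a 1500-byte MTU minus 40 bytes of IPv4 + TCP headers, and a 9000-byte jumbo frame MTU minus the same 40); the thread doesn't say what MTU or interconnect protocol the system actually used.

```python
def packets_needed(block_bytes: int, payload_bytes: int) -> int:
    """Smallest number of packets whose combined payload covers the block (integer ceiling)."""
    return (block_bytes + payload_bytes - 1) // payload_bytes


BLOCK_32K = 32 * 1024
BLOCK_8K = 8 * 1024

# assumed payload sizes, not values from the original system
STANDARD_PAYLOAD = 1500 - 40   # standard MTU minus IPv4 + TCP headers
JUMBO_PAYLOAD = 9000 - 40      # jumbo frame MTU minus IPv4 + TCP headers

standard_32k = packets_needed(BLOCK_32K, STANDARD_PAYLOAD)
jumbo_32k = packets_needed(BLOCK_32K, JUMBO_PAYLOAD)
standard_8k = packets_needed(BLOCK_8K, STANDARD_PAYLOAD)

print(f"32kb block, standard frames: {standard_32k} packets")
print(f"32kb block, jumbo frames:    {jumbo_32k} packets")
print(f"8kb block, standard frames:  {standard_8k} packets")
```

the point isn't the exact count. a bigger block means more packets per transfer, and every one of them has to arrive intact before the block is usable.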

And in this case, that meant the 32kb database block was sent from node B to node A as a collection of network packets through the Juniper switch.

so this is what happened: query running on RAC node A needs a 32kb block from RAC node B, since it’s cached there. node A successfully receives about 24kb of the 32kb database block. But in that last 8 kb… a network packet fails checksum at the Juniper switch and is discarded.

TCP has a backoff algorithm for retransmit requests. i believe RAC also has a backoff for cache fusion retransmit requests. Basically for hours, node A never gets more than 24 kb of that 32kb database block.

Whether RAC backs off or not, there is a RAC retransmit request/retry algorithm. it’s vital to how this query was ever able to actually finish.

that was the big question that always vexed me before this was fixed: obviously something went way wrong. And stayed bad for hours. How did the query suddenly wake up?

https://twitter.com/sql_handle/status/1481324036038791177

and the answer is:
activity on node B eventually evicted the 32kb database block from SGA on node B. once RAC retry kicked in on node A, since the 32 kb block was no longer in SGA cache anywhere, it got the block from storage over fibre channel, bypassing the Juniper switch.
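the whole sequence can be sketched as a toy simulation in Python. every number here is an assumption for illustration: the doubling backoff, the 600 second cap, and the 3 hour eviction time are invented, and this models neither TCP's nor RAC's real retry timers. it only shows the shape of the behavior: every retry while node B still caches the block goes through the switch and fails, and the first retry after eviction goes to storage and succeeds.

```python
def simulate_block_fetch(evicted_at_s: float,
                         first_retry_s: float = 1.0,
                         max_backoff_s: float = 600.0) -> tuple[int, float, str]:
    """Return (attempts, completion_time_s, source) for node A's block request.

    While the block is cached on node B (before evicted_at_s), each attempt is
    routed through the switch and is treated as failing. Once the block is no
    longer cached anywhere, the next attempt reads it from storage and succeeds.
    """
    t = 0.0
    backoff = first_retry_s
    attempts = 0
    while True:
        attempts += 1
        if t >= evicted_at_s:
            # block is not in any node's cache, so it comes from storage
            return attempts, t, "storage"
        # block still cached on node B: transfer through the switch fails,
        # wait out the current backoff, then double it up to the cap
        t += backoff
        backoff = min(backoff * 2, max_backoff_s)


three_hours = 3 * 60 * 60
attempts, done_at, source = simulate_block_fetch(evicted_at_s=three_hours)
print(f"{attempts} attempts, finished at {done_at:.0f}s via {source}")
```

in this toy model the query can't finish before the eviction no matter how many retries it makes, and it finishes on the first retry afterward. that matches the "silent for hours, then suddenly springs to life" symptom.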


When did all the pieces finally fit together for us to understand this?
When an update to the Juniper switch made the mysterious issue disappear 🤣

ok. so for 1) check even minute details about schema and metadata
for 2) check details about data transmission and intervention points: file system minifilter drivers, Windows Filtering Platform stuff like antivirus, firewall, and any dlls that load themselves along w/#sqlserver

dlls that load with sqlserver are almost always outside my scope; not to say they're not important, but by the time we get there a Microsoft ticket is open and they’ll likely call out the stuff from a memory stack dump :-)

can’t believe i made it this far in the thread without a single reply from a Twitter spam bôt.
