trying to reproduce a system issue that presents itself in a complex ETL when run at full scale on a given system is tough.
but it’s a task i can do sometimes, and want to be able to do more often.
it’s an important task bc attaching a debugger to production is risky. and extended events is of limited use when some system symptoms are known but query/worker/memory condition contributors aren’t.
the rules of the game: i almost never get to work with the production system itself. can supply some low-impact information gather/logging tools.
probably won’t be able to work with production data in nonprofit database, either.
so i have to profile the queries and data well enough (or at least the queries and query behavior) along with system behavior such that I can try to recreate system behavior on another system.
but there’s always outstanding questions. did i toss out a workload component important to the repro bc i dismissed it? is the scenario threshold related and my data/system utilization/query concurrency doesn’t meet the threshold?
maybe something like sublatch/super latch is involved, requiring a system with 16 or more vcpus and enough activity/contention to result in promotion?
maybe large memory pages are involved in a locked pages memory model system, in which case have to have enough total RAM to get some large pages and large page amount depends on runtime memory state at startup?
maybe the condition is only triggered when certain tasks/workers intersect on the same scheduler/vcpu?
maybe such-n-such a query needs to be run in the default resource pool, or the default workload group - to bring out the behavior.
did i match trace flags and #sqlserver version to the original context?
what about database scoped configuration, transaction isolation level, session SET options?
Stats auto-update and auto-update async match the original context?
are stats in the same stale/fresh state?
make sure to consider warm/cold cache state for the workload in the original context.
batch mode queries in original? getting that in the repro?
what about adaptive joins, UDF inlining?
Accelerated database recovery and recovery model?
vNUMA and auto-softNUMA?
Any hekaton use in production or repro?
what about dbcc checkdb/checktable or explicit snapshot databases? tempdb in-memory metadata? Query Store?
Did a #sqlssrver backup intersect with the production workload?
Is cdc enabled?
What about extended events sessions enabled in production?
Does the production #sqlserver error log indicate anything else that might be important for the repro? Like 15 second IOs, or latch timeouts, or cache flushed, or top level block allocations. or changes to MSM or resource governor. transaction rollbacks.
this list is more for me than anyone else LoL
Does the windows log give me any clues? Application level errors - or application level logged successes?
Oh, yeah - make sure memory model (conventional, locked pages, large pages) matches between production and repro.
deleted rows and delta stores matter for columnstore.
page fill (for the sake of page splits) and sparseness (for the sake of IAM chain length) may both matter.
also presence of row_overflow_data and LOB_data allocation units in relevant HoBTs.
double check max workers in production. and default packet size. check packet size for production workload - may have been overridden from default.
when the repro system and workload seem to have all the right ingredients, but the sought-after condition is still elusive, there’s no telling where the last stone for the arch might be hidden.
A couple of cases i can never forget.
1) worked with @JoeObbish when he found partition switches can take a long time if the column_id values for the same-name columns across the partitions aren’t the same. (misalignment can come from adding/dropping columns in either).
2) @ExadataDBA and i worked on a confounding #oracleDB issue years ago. A RAC system, HP servers with a Juniper switch. sometimes an ETL query would be almost silent for hours while the system was largely idle, then suddenly spring to life and complete.
Database block size was 32kb iirc.
The secret involved the way RAC works, SGA shared cache states, and a Juniper switch bug.
a table fairly popular in the ETL workload was partially cached in SGA on node B; the prone-to-long-silence query was running on node A.
RAC uses a “cache fusion” algorithm - a map of cached database blocks and the node they are cached on is shared across all nodes.
when node A wants a database block for SGA that is cached on node B, it grabs it from node B rather than from shared storage. blocks to be cached on SGA are only returned from storage if they’re not cached by any node.
Can’t request *part* of a database block from another RAC node(or from storage) even if only 1 row is needed. Gotta get the whole database block. a 32kb database block is going to require a lot of network packets to be sent.
And in this case, that meant the 32kb database block sent from node B to node A as a collection of network packets through the Juniper switch.
so this is what happened: query running on RAC node A needs 32kb block ax from rac node B, since it’s cached. node A successfully receives about 24kb of the 32kb database block. But in that last 8 kb… a network packet fails checksum at the Juniper switch and is discarded.
TCP has a backoff algorithm for retransmit requests. i believe RAC also has a backoff for cache fusion retransmit requests. Basically for hours, node A never gets more than 24 kb of that 32kb database block.
Whether RAC backs off or not, there is a RAC retransmit request/retry algorithm. it’s vital to how this query was ever able to actually finish.
that was the big question that always vexed me before this was fixed: obviously something went way wrong. And stayed bad for hours. How did the query suddenly wake up?
and the answer is:
activity on node B eventually evicted the 32kb database block from SGA on node B. once RAC retry kicked in on node A, since the 32 kb block was no longer in SGA cache anywhere, it got the block from storage over fibre channel, bypassing the Juniper switch.
When did all the pieces finally fit together for us to understand this?
When an update to the Juniper switch made the mysterious issue disappear 🤣
ok. so for 1) check even minute details about schema and metadata
for 2) check details about data transmission and intervention points: filesystem filter minidrivers, windows filter platform stuff like antivirus, firewall, and any dlls that load themselves along w/#sqlserver
dlls that load with sqlserver is almost always outside my scope; not to say not important but by the time we get there a Microsoft ticket is open and they’ll likely call out the stuff from a memory stack dump :-)
can’t believe i made it this far in the thread without a single reply from a Twitter spam bôt.

• • •

Missing some Tweet in this thread? You can try to force a refresh
 

Keep Current with L_ N___

L_  N___ Profile picture

Stay in touch and get notified when new unrolls are available from this author!

Read all threads

This Thread may be Removed Anytime!

PDF

Twitter may remove this content at anytime! Save it as PDF for later use!

Try unrolling a thread yourself!

how to unroll video
  1. Follow @ThreadReaderApp to mention us!

  2. From a Twitter thread mention us with a keyword "unroll"
@threadreaderapp unroll

Practice here first or read more on our help page!

More from @sqL_handLe

12 Jan
i still can’t separate Bill Cosby from Cliff Huxtable.
doesn’t just keep me from watching the Cosby Show; keeps me from referencing some great scenes from the show i think many people my age know well.
that’s a shame, especially considering all the other talent & people involved.
When Denise made Theo that “cool” shirt?! 🤣
Even when Theo dumps his heart out(although not fully on the up-and-up) to his dad, saying “i’m not like you and i just wish you’d accept me for who i am!?”
that was good stuff.
Read 4 tweets
12 Jan
did i buy AppleCare for my 10 yro’s iPhone or not?
the receipt search begins.
i wonder if my 10 yro’s phone will tell me if it’s covered by applecare?
rear cameras, flash, flashlight no longer work :-(
Read 5 tweets
12 Jan
the idea for today is to cause #sqlserver [total server memory] to exceed [target server memory] by several gb with a rowstore-only workload and *then* read columnstore segments into buffer pool (and column store object pool).
while there are no columnstore related pages in bpool and columnstore object pool does not exist, the impossibly-high [target pages] value means almost no SQLOS free memory on the system.
initializing column store object pool brings [target pages] to a value well under [target server memory (kb)] - though, note [target pages] is measured in 8kb pages.
Read 5 tweets
12 Jan
meaningful use of a 3d graph (rather than stacked/layered/mixed line + area 2d graph) would be unlocking a whole new achievement for me.
my imagination is so thoroughly 2d-based. i can imagine 2d objects in motion: waves, gears, spinning plates. But 3d is a problem for me.
my 17 yro daughter took a wood shop class maybe two years ago. she made a small puzzle cube - four, maybe five pieces.
took it apart in front of me. Put it together. Took it apart and gave me the pieces.
i tried for 30 minutes and couldn’t make a cube again 🤣
Read 5 tweets
12 Jan
Y’all. in your 3 am glitter adventures, please only enter when invited; and don’t break any windows. Be like glitter vampyres, k?

but… look at those smiles. i kinda wanna hear their story.
~~
Women arrested after throwing glitter at man
2022 January 11
fox13news.com/news/clearwate…
i want to know if the coppers followed a glitter trail to their still-warm car. (according to the police the car was still warm, and glitter was found in the car.)
i also wanna know if those mugshots were from this incident or priors 🤣
What i’m sayin’ is, as long as my kids follow the rules (no B&E, no substantial battery, glitter can’t still be in the container when they throw it), if they get arrested for a glittering… i’ll be good for bail/bond with a quickness.
Read 6 tweets
12 Jan
when do the stories people believe about a work of art, regardless of artist intent or history of the piece, become a part of the work of art?
i don’t know.
working with a database engine, i’m an absolute stickler for “design intent” as revealed by documentation and historical development.
i have a particular view of RDBMS cpu scheduling and memory management: unless documented otherwise (in detail with boundaries), if an user query workload with no external calls makes the system unresponsive due to cpu or memory saturation, it’s a bug.
Read 7 tweets

Did Thread Reader help you today?

Support us! We are indie developers!


This site is made by just two indie developers on a laptop doing marketing, support and development! Read more about the story.

Become a Premium Member ($3/month or $30/year) and get exclusive features!

Become Premium

Too expensive? Make a small donation by buying us coffee ($5) or help with server cost ($10)

Donate via Paypal

Or Donate anonymously using crypto!

Ethereum

0xfe58350B80634f60Fa6Dc149a72b4DFbc17D341E copy

Bitcoin

3ATGMxNzCUFzxpMCHL5sWSt4DVtS8UqXpi copy

Thank you for your support!

Follow Us on Twitter!

:(