Lars Albertsson @lalleal, 9 tweets, 9 min read

@_J_sinclair @HoloMarkeD @GCPcloud I have helped many companies on this journey. Some unsolicited advice:

1. Strive to get from a push-based workflow (fill the lake) to a pull-based one. I.e., select use cases of business value and ingest the data they need into the lake.

@_J_sinclair @HoloMarkeD @GCPcloud 2. Take use cases all the way and show value before embarking on new use cases.

3. Implement only what the use cases need, but first paint a clear picture of the long-term goal. Each step should take you towards this goal.

@_J_sinclair @HoloMarkeD @GCPcloud 4. Only security and privacy need proactive implementation. The rest can wait.

5. There is a tradeoff between data speed and innovation speed. Use the slowest form of integration your use cases can tolerate. Batch >> streaming >> microservices. Gravitate towards batch.

@_J_sinclair @HoloMarkeD @GCPcloud 6. Use simple technology and don't over-engineer. New, shiny things make very little difference.

7. The only "new" technology that you need is a workflow orchestrator. They are simple, and glue your fragile components together into a robust system. Use #Luigi or @ApacheAirflow.
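
To make the orchestrator idea concrete, here is a toy sketch in plain Python. It is not Luigi's or Airflow's API. The `Task` class, the `build` function, and the file names are all made up for illustration. The assumption is that a task counts as done when its output file exists, so a rerun after a failure only redoes the missing steps.

```python
import os
import tempfile


class Task:
    """Hypothetical unit of work that is complete when its output file exists."""

    def __init__(self, name, output_path, requires=(), run=None):
        self.name = name
        self.output_path = output_path
        self.requires = list(requires)
        self._run = run

    def complete(self):
        return os.path.exists(self.output_path)

    def run(self):
        # Write to a temporary file and rename it, so a crash never leaves
        # a half-written output that looks complete.
        tmp_path = self.output_path + ".tmp"
        with open(tmp_path, "w") as f:
            f.write(self._run([t.output_path for t in self.requires]))
        os.replace(tmp_path, self.output_path)


def build(task, log):
    """Run upstream tasks first and skip anything that is already complete."""
    if task.complete():
        return
    for upstream in task.requires:
        build(upstream, log)
    task.run()
    log.append(task.name)


def sum_lines(input_paths):
    # Adds up the integers found in the first input file.
    with open(input_paths[0]) as f:
        return str(sum(int(line) for line in f))


workdir = tempfile.mkdtemp()
raw = Task("raw", os.path.join(workdir, "raw.txt"), run=lambda inputs: "3\n4\n")
total = Task("total", os.path.join(workdir, "total.txt"), requires=[raw], run=sum_lines)

log = []
build(total, log)
build(total, log)  # second call does nothing: the output already exists
```

A real orchestrator adds scheduling, retries, and monitoring on top of this dependency-and-output model.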

@_J_sinclair @HoloMarkeD @GCPcloud @ApacheAirflow 8. Align your teams along the use cases, not by functionality. During the first year or two, each team should be able to deliver use cases from raw data to business value.

9. Make cross-functional teams, with a sufficient combination of skills to be autonomous.

@_J_sinclair @HoloMarkeD @GCPcloud @ApacheAirflow 10. As a counterweight to the entropy caused by autonomy, make conscious decisions on architecture and technology selections.

11. Constantly fight against entropy and limit heterogeneity and degrees of freedom.

12. Avoid components that cannot be managed through source code.

@_J_sinclair @HoloMarkeD @GCPcloud @ApacheAirflow 13. Keep your datasets immutable. Never change a dataset unless a bug made it incorrect.

14. Collect and store raw data first, without processing it.

15. For collected personal data, split the PII out and keep a link to it, so that you are prepared for deletion.
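
One way to read tip 15, as a minimal sketch: the PII fields go into a separate store, and the record kept in the lake carries only an opaque reference. The names `split_pii`, `forget`, `pii_ref`, and the sample record are hypothetical, and a plain dict stands in for whatever store you would actually use. Deleting a person's data then means dropping the store entry, while the lake dataset stays immutable.

```python
import uuid


def split_pii(record, pii_fields, pii_store):
    """Return a copy of the record with PII moved out behind an opaque link."""
    link = str(uuid.uuid4())
    pii_store[link] = {k: record[k] for k in pii_fields if k in record}
    stripped = {k: v for k, v in record.items() if k not in pii_fields}
    stripped["pii_ref"] = link  # hypothetical name for the link field
    return stripped


def forget(link, pii_store):
    """Delete the PII; records in the lake keep a link that no longer resolves."""
    pii_store.pop(link, None)


pii_store = {}  # assumption: stands in for a separate, mutable PII store
record = {"email": "ann@example.com", "country": "SE", "clicks": 12}

lake_record = split_pii(record, {"email"}, pii_store)
stored_email = pii_store[lake_record["pii_ref"]]["email"]

forget(lake_record["pii_ref"], pii_store)
```

This sketch creates one link per record. Linking per person instead would let a single deletion cover all of that person's records.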

@_J_sinclair @HoloMarkeD @GCPcloud @ApacheAirflow 16. Separate integration from computation. Computation should be performed without awareness of coordinates, and be easily testable. Data transfer and integration should be a separate step. E.g. for egress, first save results to a lake dataset, then copy them to a destination database.
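
A small sketch of the egress example, under some assumptions: "coordinates" is read here as locations such as paths and connection details, a JSON file stands in for the lake dataset, and an in-memory SQLite database stands in for the destination. The function and table names are invented for illustration.

```python
import json
import os
import sqlite3
import tempfile


def daily_totals(orders):
    """Pure computation: no paths, no connections, trivial to unit test."""
    totals = {}
    for order in orders:
        totals[order["day"]] = totals.get(order["day"], 0) + order["amount"]
    return totals


def save_to_lake(totals, path):
    """Step 1 of egress: persist the result as a lake dataset."""
    with open(path, "w") as f:
        json.dump(totals, f, sort_keys=True)


def copy_to_database(path, conn):
    """Step 2 of egress: a separate job copies the dataset to the database."""
    with open(path) as f:
        totals = json.load(f)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS daily_totals (day TEXT PRIMARY KEY, amount INTEGER)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO daily_totals VALUES (?, ?)", sorted(totals.items())
    )
    conn.commit()


orders = [
    {"day": "mon", "amount": 5},
    {"day": "tue", "amount": 2},
    {"day": "mon", "amount": 1},
]
totals = daily_totals(orders)

lake_path = os.path.join(tempfile.mkdtemp(), "daily_totals.json")
save_to_lake(totals, lake_path)

conn = sqlite3.connect(":memory:")
copy_to_database(lake_path, conn)
rows = conn.execute("SELECT day, amount FROM daily_totals ORDER BY day").fetchall()
```

Because the copy step reads only from the lake dataset, it can be retried or pointed at another destination without rerunning the computation.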

@_J_sinclair @HoloMarkeD @GCPcloud @ApacheAirflow 17. When collecting events, avoid processing, improving quality, or reordering. Collect the raw events and partition by arrival time, then curate events in batch (or stream) processes. Curation needs depend on downstream use, and you should avoid making a global decision.
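
As a rough sketch of tip 17: the collector below appends each payload as received into a directory named after its arrival time, and a separate batch function does the curation that one downstream use needs. The directory layout, the `ts` field, and the function names are assumptions made for the example.

```python
import json
import os
import tempfile
from datetime import datetime, timezone


def partition_path(root, arrival_time):
    """Hypothetical layout: one file per arrival hour."""
    return os.path.join(
        root,
        arrival_time.strftime("date=%Y-%m-%d"),
        arrival_time.strftime("hour=%H"),
        "events.jsonl",
    )


def collect(root, raw_event, arrival_time):
    """Append the payload as received: no parsing, no fixing, no reordering."""
    path = partition_path(root, arrival_time)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "a") as f:
        f.write(raw_event.rstrip("\n") + "\n")
    return path


def curate(path):
    """One consumer's curation: drop malformed lines, then sort by event time."""
    events = []
    with open(path) as f:
        for line in f:
            try:
                events.append(json.loads(line))
            except json.JSONDecodeError:
                continue  # other consumers may prefer to count or keep these
    return sorted(events, key=lambda e: e["ts"])


root = tempfile.mkdtemp()
arrival = datetime(2024, 1, 1, 10, 30, tzinfo=timezone.utc)  # illustrative timestamp

for payload in ['{"ts": 2, "v": "b"}', "not json", '{"ts": 1, "v": "a"}']:
    path = collect(root, payload, arrival)

curated = curate(path)
```

The raw partition still holds all three lines, including the malformed one, so a different downstream job can apply its own curation rules later.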
