✨🧠 The ecosystem that has grown up around @TensorFlow in the last few years blows my mind. There's just so much functionality, compared to some of the other, newer frameworks.
👉Consider this an ever-expanding thread for me to take notes + wrap my brain around products. Ready?
It's no secret that I 💕 #TFX and all of its tooling for deploying machine learning models into production. If you care about keeping your models up-to-date and monitoring them, you should check out the product + its paper.
If you want to train your model on a small data set, or improve generalization, you'll need to use something called transfer learning. #TFHub modules make it easy—and are available in an #OSS marketplace: tfhub.dev.
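A minimal sketch of what that looks like (module handle is a real tfhub.dev MobileNet V2 feature vector; the 5-class head is a made-up example):

```python
import tensorflow as tf
import tensorflow_hub as hub

# Reuse a pre-trained image feature extractor, freeze its weights, and
# train only a small classification head on your own (small!) dataset.
model = tf.keras.Sequential([
    hub.KerasLayer(
        "https://tfhub.dev/google/imagenet/mobilenet_v2_100_224/feature_vector/4",
        trainable=False),  # freeze the pre-trained weights
    tf.keras.layers.Dense(5, activation="softmax"),  # hypothetical 5 classes
])
model.build([None, 224, 224, 3])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```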
How can you automatically ensure that the data being used to retrain your model is of the same format, source, naming conventions, etc., as the data that was used to train your model initially?
In a similar vein, you'll probably want to automatically preprocess the data you use to retrain: normalizing specific features, converting strings to numeric values, etc. Transform does this for single examples + batches.
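A tiny sketch of a Transform preprocessing_fn (feature names here are hypothetical):

```python
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    # tft computes full-pass statistics (means, vocabularies, ...) over the
    # dataset, then applies them consistently at training *and* serving time.
    return {
        "age_normalized": tft.scale_to_z_score(inputs["age"]),
        "city_id": tft.compute_and_apply_vocabulary(inputs["city"]),
    }
```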
⚖️ My favorite use case for @TensorFlow Model Analysis is checking for potential ethical issues in my model's input data or in its inferences. You can interrogate data to ensure that no groups are being negatively impacted.
Serving makes it easy to deploy new algorithms + experiments, but keep the same server architecture+APIs. It works out of the box with @TensorFlow and can support other models, as well.
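For instance, assuming a model named my_model served locally on the default REST port, a prediction request is just JSON over HTTP:

```python
import json
import requests

# Assumes: tensorflow_model_server --rest_api_port=8501 --model_name=my_model ...
payload = {"instances": [[1.0, 2.0, 5.0]]}  # shape must match your model's input
resp = requests.post(
    "http://localhost:8501/v1/models/my_model:predict",
    data=json.dumps(payload))
print(resp.json()["predictions"])
```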
A ridiculously cool visualization tool that comes out-of-the-box with @TensorFlow. #TensorBoard visualizes logs collected as your model runs, and has dashboards for scalars, histograms, distributions, graphs, images, audio, and more.
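Getting logs into #TensorBoard from Keras is a single callback (toy model + random data below, just so there's something to plot):

```python
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.compile(optimizer="adam", loss="mse")
x, y = np.random.rand(128, 4), np.random.rand(128, 1)

# Writes scalars + histograms under ./logs; view with: tensorboard --logdir logs
cb = tf.keras.callbacks.TensorBoard(log_dir="logs/run1", histogram_freq=1)
model.fit(x, y, epochs=3, callbacks=[cb])
```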
🤳Allows you to deploy models on mobile + embedded devices. If you've seen the nifty @Android apps that detect diseases on plant leaves, or tiny @Raspberry_Pi-equipped robots with #AI skills, they're probably using #TFLite.
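Converting a trained Keras model into a .tflite flatbuffer for those devices is a few lines (toy model below; the quantization flag is optional):

```python
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # e.g. weight quantization
with open("model.tflite", "wb") as f:
    f.write(converter.convert())
```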
This is a #JavaScript library for training and deploying ML models in the browser and on Node.js. If you've used and loved @TensorFlow Playground, or the #GAN playground, #tfjs is behind both of 'em.
Swift for @TensorFlow catches type errors and shape mismatches before running your code, and has Automatic Differentiation built in. It gives you eager execution, and *much* better usability.
#Keras is now embedded within @TensorFlow as tf.keras, which means that if you don't want to poke around in low-level weeds, you can still implement graphs + build models with the user-friendliness of a high-level API. 😊
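A minimal tf.keras sketch, to show just how friendly the high-level path is:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```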
#Tensor2Tensor is an #OSS library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research. It also offers a high-level guide for when to deploy those models, and why.
XLA is a domain-specific compiler for linear algebra that optimizes @TensorFlow computations. The results are improvements in speed, memory use, portability on server + mobile platforms.
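In TF 2.x, opting a function into XLA is a one-flag change (earlier releases spelled the flag experimental_compile):

```python
import tensorflow as tf

@tf.function(jit_compile=True)  # XLA can fuse the matmul + bias + relu
def dense_layer(x, w, b):
    return tf.nn.relu(tf.matmul(x, w) + b)
```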
Small ASICs that provide high-performance machine learning inferencing for low-power #IoT devices. For example: Edge TPUs can execute state-of-the-art mobile vision models such as MobileNet V2 at 100+ fps, in a power-efficient manner.
You can map 8-button input to a full 88-key piano; automatically create melodic accompaniments; use machine learning to display visuals for music; transcribe tunes; generate new sounds; more.
This also doesn't get talked about *nearly* enough.
Seedbank is an ever-expanding collection of interactive machine learning examples that you can use, modify, experiment with, and grow to meet your needs+use case. research.google.com/seedbank/
😀Okay, so they're not specific to @TensorFlow - but this is such a wonderful tool that I'd be remiss not to mention it! Interactive #Python notebooks, free to use - and you can toggle between CPU/GPU/TPU or local/remote backends!
#DeepLearning is great, but, as a data scientist, you'll probably want to encode domain-specific knowledge to inform your models: Monte Carlo methods, variational inference, Bayesian techniques, vector-quantized autoencoders, and more.
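A tiny taste of that flavor of modeling, using TensorFlow Probability's distributions:

```python
import tensorflow_probability as tfp

tfd = tfp.distributions

prior = tfd.Normal(loc=0.0, scale=1.0)  # encode a belief, not just weights
samples = prior.sample(1000)            # Monte Carlo draws
log_density = prior.log_prob(0.5)       # exact log-density at a point
```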
There's also this crazypants **huge** collection of models that have been open-sourced by @GoogleAI and the @TensorFlow community, including samples and code snippets. Everything from boosted trees to neural program synthesis. 😳
Nucleus is a library of Python and C++ code designed to make it easy to read, write and analyze data in common genomics file formats like SAM or VCF. It also offers painless integration with @TensorFlow / tfrecords.
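A hedged sketch of reading a VCF (path is hypothetical; API pattern as I remember it from the Nucleus README):

```python
from nucleus.io import vcf

with vcf.VcfReader("/path/to/variants.vcf") as reader:
    for variant in reader:
        # Variants arrive as protos with genomic coordinates + alleles.
        print(variant.reference_name, variant.start, variant.reference_bases)
```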
A cluster of 1,000 @GoogleCloud TPUs that provides the machine learning research community with a total of 180 petaflops of raw compute power — at no charge, free, $0 — to support the next wave of breakthroughs.
👫 Not a specific product, but vital for a healthy ecosystem.
@GoogleAI's new focus on community - spearheaded by @Edd - features mailing lists, a social media presence, special interest groups, & direct input for new / changing features in TensorFlow.
Did you know all of our docs have been placed on @GitHub? Contributions and suggestions from the community are welcome! Go ping @billylamberta et al for how to get started. 😊
@fly_upside_down, @rstudio, & @fchollet have created an R interface for developers. It uses high-level #Keras + Estimator APIs; and gives more control when you need to tweak networks at a lower level.
Algorithms for adaptively learning the structure of deep neural networks and optimizing their weights. If you want to learn more about automated machine learning internals, its tutorials are a great place to start!
Interpretability—being able to explain why DNNs make the decisions they do—is *vital* for ethical machine learning and for the application of deep learning to high-consequence use cases.
In a similar vein: most interpretability methods show importance weights for each input feature (e.g., pixels). TCAV instead shows the importance of high-level concepts (e.g., color, gender, race), which is how humans communicate.
⚖️ PS: if you're interested in ensuring your algorithms are behaving in an ethical manner, I highly recommend taking @GoogleAI's 70min fairness in machine learning crash course:
If your models are only as good as their input data, bad actors can strike by manipulating or contaminating it. Enter cleverhans, @goodfellow_ian's library for benchmarking vulnerability to adversarial attacks!
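A minimal FGSM sketch (module path as in cleverhans 4.x; the toy model stands in for your real classifier):

```python
import numpy as np
import tensorflow as tf
from cleverhans.tf2.attacks.fast_gradient_method import fast_gradient_method

# Stand-in logits model; swap in your trained classifier.
model = tf.keras.Sequential([tf.keras.layers.Dense(10, input_shape=(8,))])
x = tf.random.normal([16, 8])

# Perturb each input by at most eps (L-infinity) to maximize the model's loss.
x_adv = fast_gradient_method(model, x, eps=0.1, norm=np.inf)
```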
🎼Want to be sure to mention this #Magenta project:
You've heard of OCR—automatically detecting alphanumeric characters in images. This is the same concept applied to sheet music: notes are automatically transcribed into a structured format (MusicXML) 🎶
I mentioned #rstats support, and want to make sure to mention these other community-driven projects, as well. (TensorFlowSharp was created by @migueldeicaza! 😊)
📈 Note: the package is pip-installable, and can be used for many kinds of data quality checks - even outside of @TensorFlow machine learning experiments.
Two common use cases within ML pipelines: (1) validation of continuously arriving data; (2) training/serving skew detection.
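A sketch of both use cases with TensorFlow Data Validation, which matches the description above (file names hypothetical):

```python
import tensorflow_data_validation as tfdv

# (1) Learn what "good" data looks like from the original training set.
train_stats = tfdv.generate_statistics_from_csv("train.csv")
schema = tfdv.infer_schema(train_stats)

# (2) Validate a newly arriving batch against that schema to surface
# missing features, type changes, drift, or training/serving skew.
new_stats = tfdv.generate_statistics_from_csv("new_batch.csv")
anomalies = tfdv.validate_statistics(new_stats, schema)
tfdv.display_anomalies(anomalies)
```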
✨📊 The free #dataviz tool you see displayed here is called Facets, and was created by People & AI Research (PAIR).
💡Its motivation is to help machine learning and data science practitioners build better models by understanding patterns in their data.
✨🥁 If you have a hankering to experiment with @TensorFlowJS in a friendly setting, try @codepen! Fork interesting examples, riff on their HTML / CSS / #JavaScript, reshare.
If you need batch splitting (data-parallel training), you probably don't need Mesh; but if you do intense distributed deep learning (ex: 5 billion parameters; a large # of activations that can't fit on one device; etc.), you should check it out!
👋 Inspired by recent conversations with friends, and based on a long history of automating away every job I've ever had (from data processing to PM work):
Am sharing a few ways that I'm using Gemini 1.5 and 2M+ tokens of context in @GoogleAIStudio to automate the boring parts of DevRel and UXR!
Reminder that you can stuff quite a bit into 2M+ tokens (hours of video, years of emails, full codebases, etc.) and that, over time, we expect 2M tokens ➡️ infinity, cost ➡️ $0, latency ➡️ near instant.
(1) Uploading a dated codebase (in this case, Flax 0.7.5), and a newer version of the codebase (Flax 0.8.5), then analyzing changes.
You can generate documentation changes based on the differences in code; blog posts or release notes describing the code changes; and - a favorite - update old tutorials based on the new versions of the APIs.
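Roughly, via the google-generativeai Python SDK (file names are hypothetical; the exact model string may differ):

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# Upload both snapshots of the codebase, then ask for a structured diff.
old = genai.upload_file("flax-0.7.5-src.txt")
new = genai.upload_file("flax-0.8.5-src.txt")

model = genai.GenerativeModel("gemini-1.5-pro")
resp = model.generate_content(
    [old, new,
     "List the API changes between these versions, then draft release notes."])
print(resp.text)
```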
(2) Analyzing product feedback at scale by scraping @GitHub and @Gitlab issues, conversations in @Discord and @Discourse forums, social media chatter, etc.
In this example, I scraped a whole bunch of user feedback about the OSS vector database, Chroma, and compared it to feedback on a competitor's tool (Qdrant).
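Pulling the raw GitHub half of that takes a few lines against the REST API (add an auth token for higher rate limits):

```python
import requests

def fetch_issues(repo: str, state: str = "all"):
    url = f"https://api.github.com/repos/{repo}/issues"
    issues = requests.get(url, params={"state": state, "per_page": 100}).json()
    # The issues endpoint also returns PRs; filter those out.
    return [f"{i['title']}\n{i.get('body') or ''}"
            for i in issues if "pull_request" not in i]

chroma_feedback = fetch_issues("chroma-core/chroma")
qdrant_feedback = fetch_issues("qdrant/qdrant")
```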
✨🤔 Wondering how far a person can get with "make this code faster", "make this code more readable and reusable", "refactor this code to be more concise" in the prompt.
👇🏻Am also impressed Bard deduced that I was attempting to implement a multiplication table!
✨👩‍💻 Jazzed to imagine a future where we all have friendly, competent technical assistants that cheerfully answer n00b questions about chemistry, physics, math, and programming.
📝 Citing sources would be a strong next step, just as we cite potentially recited code in snippets!
✨Bard even recommends unordered maps instead of ordered maps in C++!
📊 Is anyone else *super* dissatisfied with the tech industry's preferred/tracked open-source metrics?
@github stars; pip install or download counts; @-mentions or tags on social media: all of these stats can, and will, be gamed. We can do much better!
👇🏻Here are some ideas:
@github (1) Projects listing a particular repo as a dependency.
This can be easily tracked via GitHub's dependency graph, or by scraping which Dockerfiles, conda environment YAMLs, etc. reference a library or framework.
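As a crude stand-in for the dependency graph, here's the scraping flavor of that idea (the mirrored-repos directory layout is hypothetical):

```python
import pathlib

MANIFESTS = {"requirements.txt", "Dockerfile", "environment.yml", "pyproject.toml"}

def count_dependents(mirror_root: str, library: str) -> int:
    # Count mirrored repos whose manifest files mention the library.
    repos = set()
    root = pathlib.Path(mirror_root)
    for path in root.rglob("*"):
        if (path.is_file() and path.name in MANIFESTS
                and library in path.read_text(errors="ignore")):
            repos.add(path.parts[len(root.parts)])  # top-level repo dir name
    return len(repos)

print(count_dependents("./mirrored_repos", "tensorflow"))
```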
(2) "Bus factor" of a particular open-source project.
Bus factors measure how resilient a project is to sudden engineering turnover, and are a solid way to understand the health of an open-source project.
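A rough git-log sketch, using one common definition (the smallest set of authors covering >50% of commits):

```python
import subprocess
from collections import Counter

def bus_factor(repo_path: str) -> int:
    # Count commits per author email across the repo's history.
    emails = subprocess.run(
        ["git", "-C", repo_path, "log", "--format=%ae"],
        capture_output=True, text=True, check=True).stdout.splitlines()
    counts = Counter(emails)
    total, covered, authors = sum(counts.values()), 0, 0
    for _, n in counts.most_common():
        covered += n
        authors += 1
        if covered > total / 2:
            break
    return authors
```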
🤖 Reinforcement learning in production is a very nascent space, but a fast-growing and multi-faceted one (everything from game dev to operations research)!
👇To showcase this, am compiling a list of projects that are using @raydistributed and RLlib to enable their experiments:
Everything from multi-agent reinforcement learning; to game balancing and boss optimization; and (even sometimes outside of the realm of RL, but still powered by Ray): in-app game recommendations.
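Spinning up one of those experiments is pleasantly small (Ray 2.x API; older releases used rllib.agents.* instead):

```python
from ray.rllib.algorithms.ppo import PPOConfig

algo = (
    PPOConfig()
    .environment("CartPole-v1")       # swap in your game / ops-research env
    .rollouts(num_rollout_workers=2)  # parallel experience collection via Ray
    .build()
)
for _ in range(3):
    print(algo.train()["episode_reward_mean"])
```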
⚡️ This scenario is very near to my heart. Did you know that you can optimize electricity use in a plant or a home; model thermal grids; and manage energy resources efficiently using RL models?
The longer I work on open-source ML tools, the more convinced I become of the case for decoupling libraries.
Crafting simple, delightful, and composable user-facing APIs is *endlessly difficult*; you shouldn't also have to have a PhD in distributed systems in order to make those APIs scale.
Library authors should be able to focus on building concise, extensible features that help domain experts go from having an idea to realizing it as quickly as possible.
Asking those authors to worry about hardware, or data / model parallelism, is unreasonable.
And having to communicate to a user that (as an example) an image preprocessing feature that worked in one framework won't work in another, and that they'll have to hunt down an identical transformation in the context of the new framework, is a miserable experience for everyone involved.