1️⃣Defining the problem
2️⃣Labeled data
3️⃣Cost of being wrong
4️⃣Model decay
5️⃣Adversarial actors
6️⃣Biased feedback loops
7️⃣Data access
8️⃣Model interpretability
9️⃣Model instability
You can't accurately mitigate malicious activity unless you can define malicious activity.
At some point, all machine learning methods need a sizable amount of labeled data. This relies on 1️⃣ and on highly trained human labelers.
Most ML libraries default to assuming false positives and false negatives have the same cost. This is rarely true in security problems, and accurately quantifying the asymmetry is difficult.
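A minimal sketch of one way to encode that asymmetry, using scikit-learn's class_weight. The weights, data, and 10x cost ratio below are illustrative assumptions, not recommended values.

```python
# Sketch: encoding asymmetric error costs via class weights in scikit-learn.
# The synthetic data and the 10x weight are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Default: a false positive and a false negative cost the same.
baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Security-flavored: pretend a missed malicious sample (false negative)
# costs ~10x a false alarm, so upweight the positive (malicious) class.
cost_aware = LogisticRegression(max_iter=1000,
                                class_weight={0: 1, 1: 10}).fit(X_train, y_train)
```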
Data generated from cyber-systems (especially those being influenced by adversarial actors) isn't stationary. Models that work now will eventually stop working. (PS: online learning is really hard.)
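A hedged sketch of one way to notice decay: compare this week's score distribution against a reference window with a two-sample test. The window sizes, distributions, and 0.01 threshold are arbitrary assumptions.

```python
# Sketch: flagging distribution drift with a two-sample KS test.
# The synthetic score distributions and 0.01 threshold are assumptions.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
reference_scores = rng.beta(2, 8, size=5000)   # scores from when the model was trained
current_scores = rng.beta(2, 5, size=5000)     # scores observed this week

stat, p_value = ks_2samp(reference_scores, current_scores)
if p_value < 0.01:
    print(f"Score distribution has drifted (KS={stat:.3f}); consider retraining.")
```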
Bad actors have an incentive to bypass your models. Repeated access to model predictions allows these actors to learn about your models.
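A toy illustration of why that matters: with enough prediction queries, an attacker can perturb a sample until it slips past the classifier. The model, the greedy perturbation scheme, and the query budget are all illustrative assumptions, not a real attack.

```python
# Sketch: how repeated query access lets an attacker probe a model.
# Everything here (model, perturbation, budget) is an illustrative assumption.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)  # black box from the attacker's view

rng = np.random.default_rng(1)
sample = X[y == 1][0].copy()        # a "malicious" input the attacker controls
for round_ in range(500):
    if model.predict(sample.reshape(1, -1))[0] == 0:
        print(f"Evaded after {round_} rounds of probing.")
        break
    # Greedily nudge one feature; keep the change if the malicious score drops.
    candidate = sample.copy()
    candidate[rng.integers(len(sample))] += rng.normal(scale=0.5)
    if (model.predict_proba(candidate.reshape(1, -1))[0, 1]
            < model.predict_proba(sample.reshape(1, -1))[0, 1]):
        sample = candidate
```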
Mitigating malicious activity changes the nature of your data and makes it difficult to retrain. (Read this! ai.google/research/pubs/…)
By the very nature of security, data (essential for training) will be difficult to access. (Even before GDPR.)
Customers, regulators, analysts, and CEOs may demand explainability. Many models are hard to explain. This may restrict your space of features and models.
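A small sketch of the trade-off: a linear model's coefficients can be read directly as per-feature evidence, which is part of why interpretability pressure narrows model choice. The feature names are made-up placeholders.

```python
# Sketch: reading a linear model's coefficients as a crude explanation.
# Feature names and data are made-up placeholders.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

feature_names = ["failed_logins", "bytes_out", "new_process_count", "off_hours_activity"]
X, y = make_classification(n_samples=500, n_features=4, n_informative=4,
                           n_redundant=0, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X, y)
for name, coef in zip(feature_names, model.coef_[0]):
    print(f"{name:>20}: {coef:+.2f}")  # sign/magnitude = per-feature evidence
# A deep net or large ensemble might score better, but offers no comparably
# direct explanation to show an analyst or regulator.
```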
Updating (retraining) machine learning models can introduce “regressions” in the classifications. How do you update your model without negative consequences (e.g. non-malicious things suddenly being labeled as malicious)?
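One hedged sketch of a pre-deployment check: compare old and new model predictions on a known-benign holdout set and block the rollout if too many new false alarms appear. The models, data, and 0.1% regression budget are illustrative assumptions.

```python
# Sketch: a "regression test" between model versions on a benign golden set.
# Models, synthetic data, and the 0.1% budget are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=5000, random_state=0)
X_benign_holdout = X[y == 0][:1000]              # known-benign "golden set"

old_model = RandomForestClassifier(random_state=0).fit(X[:2500], y[:2500])
new_model = RandomForestClassifier(random_state=1).fit(X, y)   # retrained candidate

old_pred = old_model.predict(X_benign_holdout)
new_pred = new_model.predict(X_benign_holdout)

# Benign items the old model passed but the new model now flags as malicious.
regressions = np.sum((old_pred == 0) & (new_pred == 1))
if regressions > 0.001 * len(X_benign_holdout):
    print(f"Blocked rollout: {regressions} new false positives on the golden set.")
else:
    print("Rollout OK within the regression budget.")
```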