Machine Learning Interpretability — Shapley Values with PySpark

Mar 20, 2021

Interpreting Isolation Forest's predictions, and more

The problem: how to interpret Isolation Forest’s predictions

More specifically, how to tell which features contribute most to the predictions. Since Isolation Forest is not a typical Decision Tree (see Isolation Forest characteristics here), after some research I ended up with three possible solutions:

1) Train another, similar algorithm on the same dataset, one that has feature importance implemented and is more easily interpretable, like Random Forest.

2) Reconstruct the trees, as a graph for example. The most important features should be the ones on the shortest paths of the trees. This follows from how Isolation Forest works: anomalies are few and distinct, so they are easier to single out and are isolated in fewer splits, which puts them on the shorter paths.

3) Estimate the Shapley values: each feature's marginal contribution to the prediction, which is a more standard way of ranking features by importance.

Options 1 and 2 did not seem like the best solutions to the problem, mainly because of the difficulty of picking a suitable stand-in algorithm: Random Forest, for example, operates differently than Isolation Forest, so it would not be wise to rely on its feature importance output. Additionally, option 2 does not generalize, and it would amount to redoing the algorithm's work.

For option 3, while there are a few libraries, like shap and shparkley, I had many issues using them with spark IForest, regarding both usability and performance.

The solution was to implement Shapley value estimation in PySpark, based on the Shapley calculation algorithm described below.

The implementation takes a trained PySpark model, the Spark DataFrame with the features, the row to examine, the feature names, the features column name, and the name of the column to examine, e.g. prediction. The output is a ranking of the features, from the most to the least influential.
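
Below is a minimal sketch of what a Monte Carlo estimation with this interface can look like. The function name, the n_samples parameter and the permutation-sampling details are illustrative assumptions rather than the exact implementation from the full article; only the inputs and the output mirror the description above.

```python
# Hypothetical sketch: Monte Carlo estimation of Shapley values for one row,
# using only standard PySpark APIs. Not the article's exact implementation.
import random

from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession


def estimate_shapley_values(model, df, row_of_interest, feature_names,
                            features_col="features",
                            column_to_examine="prediction", n_samples=100):
    """Return the features ranked by their estimated marginal contribution."""
    spark = SparkSession.builder.getOrCreate()

    # Background rows supply the "feature absent" values for each sample.
    background = [r[features_col].toArray().tolist()
                  for r in df.select(features_col).rdd.takeSample(True, n_samples)]
    x = row_of_interest[features_col].toArray().tolist()
    n = len(feature_names)
    contributions = {}

    for j, name in enumerate(feature_names):
        rows = []
        for k, z in enumerate(background):
            perm = list(range(n))
            random.shuffle(perm)
            pos = {f: p for p, f in enumerate(perm)}
            # x_plus takes feature j (and everything before it in the random
            # permutation) from the row of interest; x_minus takes j from z.
            x_plus = [x[i] if pos[i] <= pos[j] else z[i] for i in range(n)]
            x_minus = [x[i] if pos[i] < pos[j] else z[i] for i in range(n)]
            rows.append((2 * k, Vectors.dense(x_plus)))
            rows.append((2 * k + 1, Vectors.dense(x_minus)))

        sample_df = spark.createDataFrame(rows, ["pair_id", features_col])
        preds = [r[column_to_examine]
                 for r in model.transform(sample_df)
                                .select("pair_id", column_to_examine)
                                .orderBy("pair_id")
                                .collect()]
        # Average the f(x_plus) - f(x_minus) differences for feature j.
        diffs = [preds[i] - preds[i + 1] for i in range(0, len(preds), 2)]
        contributions[name] = sum(diffs) / len(diffs)

    # Rank from the most to the least influential feature.
    return sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
```

More samples give a more stable estimate at the cost of more model.transform calls; in practice the pair rows for all features can also be batched into a single DataFrame to cut down on Spark overhead.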


What are Shapley values?

The Shapley value provides a principled way to explain the predictions of nonlinear models common in the field of machine learning. By interpreting a model trained on a set of features as a value function on a coalition of players, Shapley values provide a natural way to compute which features contribute to a prediction.[14] This unifies several other methods including Locally Interpretable Model-Agnostic Explanations (LIME),[15] DeepLIFT,[16] and Layer-Wise Relevance Propagation.[17]

source: https://en.wikipedia.org/wiki/Shapley_value#In_machine_learning
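
For reference, the Shapley value of a feature i is its average marginal contribution over all subsets S of the remaining features N \ {i}, with the model's prediction acting as the value function v:

```latex
\phi_i(v) = \sum_{S \subseteq N \setminus \{i\}}
            \frac{|S|!\,(|N| - |S| - 1)!}{|N|!}
            \bigl( v(S \cup \{i\}) - v(S) \bigr)
```

Computing this exactly requires evaluating the model on every subset of features, which is why it is usually estimated by sampling, as in the implementation described here.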

Or, as the Python shap package states:

A game theoretic approach to explain the output of any machine learning model.

Read the rest of the story here: https://karanasou.medium.com/machine-learning-interpretability-shapley-values-with-pyspark-16ffd87227e3
