Showing the path
This graph is from the blog Diving into Data. It uses a simple decision tree to predict house prices in Boston (yes, the classic Boston housing dataset). The tree follows its nodes from the root down to a leaf, and each feature along the path contributes to the output. The prediction starts from 22.60, which is the mean of the training set, also called the bias. Starting from the bias and following the red path, the value decreases to 19.96 because of 'RM <= 6.94'. After that, 'LSTAT <= 14.40' moves the prediction to 14.91. The process goes on like this until the prediction becomes 18.11. Other paths work the same way. This gives us a glimpse inside the black box.
The process explained above can be expressed as follows:
f(x) = c_full + ∑_{k=1}^{K} contrib(x, k)
K is the number of features, c_full is the value at the root of the tree (here, 22.60), and contrib(x, k) is the contribution of the kth feature in the feature vector x. It looks like linear regression, where the other features are held fixed while one feature moves. In a decision tree, however, the features are all interconnected, which is why the contributions vary from path to path.
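The decomposition above can be computed directly from scikit-learn's tree internals. Here is a minimal sketch on synthetic stand-in data (the original example used the Boston housing set): walk the decision path and credit each change in node value to the feature that was split on.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# synthetic stand-in data (the original example used the Boston housing set)
X, y = make_regression(n_samples=200, n_features=4, random_state=0)
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

def decompose(dt, x):
    """Walk the decision path, crediting each change in node value
    to the feature that was split on."""
    t = dt.tree_
    x = np.asarray(x, dtype=np.float32)      # sklearn compares features in float32
    node = 0
    bias = t.value[0][0][0]                  # root value = mean of the training targets
    contribs = np.zeros(len(x))
    while t.children_left[node] != -1:       # -1 marks a leaf
        feat = t.feature[node]
        child = (t.children_left[node] if x[feat] <= t.threshold[node]
                 else t.children_right[node])
        contribs[feat] += t.value[child][0][0] - t.value[node][0][0]
        node = child
    return bias, contribs

x = X[0]
bias, contribs = decompose(tree, x)
print(bias + contribs.sum(), tree.predict(x.reshape(1, -1))[0])  # identical
```

Because the per-node value changes telescope along the path, bias plus the summed contributions reproduces the tree's prediction exactly.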
A random forest is an ensemble of decision trees, and its final prediction is the average of the individual trees' predictions. So the formula becomes:
F(x) = (1/J) ∑_{j=1}^{J} c_{j,full} + ∑_{k=1}^{K} ( (1/J) ∑_{j=1}^{J} contrib_j(x, k) )
where J is the number of trees in the forest. According to the equation, a random forest prediction is just the average of the trees' bias terms plus the average contribution of each feature. In the case of classification, the prediction is made by majority vote instead. Random forests give good prediction scores because they decrease variance without increasing bias. The trees are also largely independent, since the forest uses bagging (repeatedly drawing random samples with replacement). An in-depth explanation of the inside of random forests will be dealt with in future posts.
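The forest-level formula can be checked the same way: decompose every tree, then average the root values and the per-feature contributions. A sketch on synthetic data (not the author's code), assuming scikit-learn's standard tree attributes:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# synthetic stand-in data; the technique is what matters, not the numbers
X, y = make_regression(n_samples=300, n_features=5, random_state=0)
rf = RandomForestRegressor(n_estimators=30, random_state=0).fit(X, y)

def tree_bias_contribs(dt, x):
    """Decompose one tree's prediction into (root value, per-feature contributions)."""
    t = dt.tree_
    x = np.asarray(x, dtype=np.float32)      # sklearn compares features in float32
    node, contribs = 0, np.zeros(len(x))
    while t.children_left[node] != -1:       # -1 marks a leaf
        feat = t.feature[node]
        child = (t.children_left[node] if x[feat] <= t.threshold[node]
                 else t.children_right[node])
        contribs[feat] += t.value[child][0][0] - t.value[node][0][0]
        node = child
    return t.value[0][0][0], contribs

x = X[0]
pairs = [tree_bias_contribs(dt, x) for dt in rf.estimators_]
bias = np.mean([b for b, _ in pairs])              # (1/J) * sum of root values
contribs = np.mean([c for _, c in pairs], axis=0)  # per-feature averages over trees
print(bias + contribs.sum(), rf.predict(x.reshape(1, -1))[0])  # the two agree
```

Note that with bagging each tree's root value is the mean of its own bootstrap sample, which is why the bias is itself an average over trees.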
Interpreting Random Forest using scikit-learn
Ensemble models such as random forests are too complicated for people to interpret directly, and are therefore treated as black boxes. Using the treeinterpreter package, which works with scikit-learn models, we can see local predictions and how the features contribute to those predictions.
The sample dataset is from Kaggle, and it is about predicting graduate school admissions. The features are 'GRE Score', 'TOEFL Score', 'University Rating', 'SOP', 'LOR ', 'CGPA', and 'Research'; the target is 'Chance of Admit'.
The dataset has 500 rows, and I fed only 400 of them into the random forest. When the forest then predicted the 'Chance of Admit' of persons 410 and 422, the numbers were 0.453 and 0.793; their actual scores were 0.54 and 0.73 respectively. There is a slight difference, but it is close enough. The problem is: how do you know which features contributed to each prediction? TOEFL Score might have an influence, but what about CGPA? Also, students with a high CGPA might have higher GRE scores, since both stem from the same intellectual skills. We do not know the actual impact of each feature on the outcome, or even whether it has an impact at all. We could pick one tree and see what contributed to its output, but there are thousands of trees, and we cannot simply digest all that information and combine it into a single conclusion. This is why a random forest counts as a black box.
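A rough sketch of this train-then-predict workflow is below. The DataFrame is a synthetic stand-in that only borrows the Kaggle column names (the values and the simple target formula are made up), and the model settings are assumptions, not the author's:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500
# synthetic stand-in with the Kaggle admissions column names (not the real data)
df = pd.DataFrame({
    'GRE Score': rng.integers(290, 341, n),
    'TOEFL Score': rng.integers(92, 121, n),
    'University Rating': rng.integers(1, 6, n),
    'SOP': rng.integers(2, 11, n) / 2.0,
    'LOR': rng.integers(2, 11, n) / 2.0,
    'CGPA': rng.uniform(6.8, 9.9, n).round(2),
    'Research': rng.integers(0, 2, n),
})
# a made-up target loosely tied to CGPA and GRE, clipped to [0, 1]
df['Chance of Admit'] = (0.08 * df['CGPA'] + 0.001 * df['GRE Score'] - 0.3).clip(0, 1)

features = [c for c in df.columns if c != 'Chance of Admit']
train = df.iloc[:400]                                  # first 400 rows for training
rf = RandomForestRegressor(n_estimators=200, random_state=0)
rf.fit(train[features], train['Chance of Admit'])

# predict held-out rows, e.g. persons 410 and 422
preds = rf.predict(df.loc[[410, 422], features])
print(preds)
```

With the real CSV you would load it with pd.read_csv instead of generating the DataFrame.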
Using treeinterpreter, we can observe the bias and the feature contributions.
Person 410:
Bias(trainset mean) is: 0.72818
Feature contributions:
CGPA -0.21
GRE Score -0.05
TOEFL Score -0.03
SOP -0.01
Research 0.0
University Rating 0.01
LOR 0.02

Person 422:
Bias(trainset mean) is: 0.72818
Feature contributions:
SOP -0.01
LOR -0.01
University Rating 0.0
TOEFL Score 0.0
Research 0.01
GRE Score 0.02
CGPA 0.06

The bias for the whole training dataset is 0.72818. The prediction starts there, and for person 410 CGPA had the most impact on admission. The sum of the feature contributions is about -0.2751775. As mentioned before, the prediction is the sum of the bias and the feature contributions. Here, the prediction was 0.453 because 0.72818 (bias) - 0.2751775 (sum of contributions) ≈ 0.453.
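The arithmetic can be checked in one line (the two numbers are the bias and the net signed contribution reported above):

```python
bias = 0.72818                # training-set mean of 'Chance of Admit'
contrib_sum = -0.2751775      # net (signed) feature contributions for person 410
print(round(bias + contrib_sum, 3))  # 0.453
```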
Person 422's prediction differs little from the bias. It could be that this person's feature values are close to the mean of each feature. Because they are similar, the trees do not have much information to gain, and the prediction does not diverge much from the bias.
Feature            Overall average   Person 410   Person 422
GRE Score          316.47200         301          322
TOEFL Score        107.19200         96           112
Univ Rating        3.11400           1            4
SOP                3.37400           3.0          3.5
LOR                3.48400           4.0          2.5
CGPA               8.57644           7.56         9.02
Research           0.56000           0            1
Chance of Admit    0.72174           0.54         0.73

Looking at the raw feature values alone, it is hard to tell where the difference between the two predictions comes from. With feature contributions, however, we can see that CGPA has a big impact on the outcome. Knowing the weight of each feature on the outcome can be a powerful tool in business: a consulting firm that helps students get into grad school, for example, could focus more on CGPA than on the other features.
Looking into local instances can be helpful, but it also has its limitations. As I mentioned before with LIME, it is hard to generalize from individual instances. Seeing many instances might help, but it is hard to keep track of all that information, and thus almost impossible to generalize.
For the code, visit my GitHub.
The code, as well as the content, is adapted from Diving into Data.