We have eight shapes in a bag that are either triangles, squares, or circles. Their color is either red or green.
draw_obj(example_obj);
display_dfs(example, 'Dataset', proba, 'Joint and marginal distributions')
|   | A     | B        |
|---|-------|----------|
| 0 | green | circle   |
| 1 | red   | circle   |
| 2 | green | triangle |
| 3 | red   | triangle |
| 4 | green | circle   |
| 5 | red   | square   |
| 6 | red   | triangle |
| 7 | red   | circle   |

| Ai \ Bi | circle | square | triangle | P(Ai) |
|---------|--------|--------|----------|-------|
| green   | 0.25   | 0      | 0.125    | 0.375 |
| red     | 0.25   | 0.125  | 0.25     | 0.625 |
| P(Bi)   | 0.5    | 0.125  | 0.375    | 1     |
In the second table, the six cells in the top-left block show the joint probability distribution $P(A_i, B_i)$. The final column and the final row show the marginal probability distributions of A and B, respectively.
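The display helpers used in this notebook (display_dfs and friends) are defined outside this excerpt. As a minimal sketch, a joint/marginal table like the one above can be reproduced directly with pandas; how the notebook actually builds example and proba is an assumption here:

```python
import pandas as pd

# The eight observations from the dataset table above.
example = pd.DataFrame({
    "A": ["green", "red", "green", "red", "green", "red", "red", "red"],
    "B": ["circle", "circle", "triangle", "triangle",
          "circle", "square", "triangle", "circle"],
})

# Joint distribution P(A_i, B_i) as relative frequencies, with marginals appended.
proba = pd.crosstab(example["A"], example["B"], normalize="all", margins=True)
# pandas labels both marginals "All"; rename them to match the table above.
proba = proba.rename(columns={"All": "P(Ai)"}, index={"All": "P(Bi)"})
print(proba)
```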
Recall from the lecture:
Let A denote an event and let P(A) denote the occurrence probability of A. Then the entropy (self-information, information content) of A is defined as $-\log_2(P(A))$.
Let A be an experiment with the exclusive outcomes (events) $A_1, \ldots, A_k$. Then the mean information content of A, denoted as H(A), is called Shannon entropy or entropy of experiment A and is defined as follows:

$$H(A) = -\sum_{i=1}^{k} P(A_i) \cdot \log_2(P(A_i))$$
In other words, the Shannon entropy gives us a measure of the expected degree of "surprise" from learning the outcome of A.
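The entropy_calculation helper used in the next cell is not shown in this excerpt. The following is a minimal sketch of how the Shannon entropy of a single column could be computed (the function name shannon_entropy is ours, not the notebook's):

```python
import numpy as np
import pandas as pd

def shannon_entropy(series: pd.Series) -> float:
    """Shannon entropy in bits of a discrete variable given as a pandas Series."""
    p = series.value_counts(normalize=True)   # P(A_i) for every observed value
    return float(-(p * np.log2(p)).sum())     # -sum_i P(A_i) * log2(P(A_i))

# With the `example` DataFrame from above:
# shannon_entropy(example["A"])  # ~0.954 bits
# shannon_entropy(example["B"])  # ~1.406 bits
```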
display_dfs(example, 'Dataset', proba, 'Joint and marginal distributions'); entropy_calculation(example, 'A')
|   | A     | B        |
|---|-------|----------|
| 0 | green | circle   |
| 1 | red   | circle   |
| 2 | green | triangle |
| 3 | red   | triangle |
| 4 | green | circle   |
| 5 | red   | square   |
| 6 | red   | triangle |
| 7 | red   | circle   |

| Ai \ Bi | circle | square | triangle | P(Ai) |
|---------|--------|--------|----------|-------|
| green   | 0.25   | 0      | 0.125    | 0.375 |
| red     | 0.25   | 0.125  | 0.25     | 0.625 |
| P(Bi)   | 0.5    | 0.125  | 0.375    | 1     |
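Working the definition through by hand for variable A, using P(green) = 0.375 and P(red) = 0.625 from the table above:

$$H(A) = -(0.375 \cdot \log_2 0.375 + 0.625 \cdot \log_2 0.625) \approx 0.5306 + 0.4238 \approx 0.954$$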
For a binary variable, the entropy reaches a maximum of 1.0 when both outcomes are equally probable.
In other words, an entropy of 1 bit corresponds to the information content of a (fair) coin toss.
binary_entropy()
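The binary_entropy() helper is also defined outside this excerpt and presumably plots this curve. A minimal sketch of what such a plot could look like, assuming matplotlib is available:

```python
import numpy as np
import matplotlib.pyplot as plt

# Entropy of a binary variable with P(first outcome) = p and P(second outcome) = 1 - p.
p = np.linspace(0.001, 0.999, 500)
H = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

plt.plot(p, H)
plt.xlabel("p")
plt.ylabel("entropy in bits")
plt.title("Binary entropy: maximum of 1 bit at p = 0.5")
plt.show()
```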
display_dfs(example, 'Dataset', proba, 'Joint and marginal probabilities', '\n'); entropy_calculation(example, 'A', 'B')
|   | A     | B        |
|---|-------|----------|
| 0 | green | circle   |
| 1 | red   | circle   |
| 2 | green | triangle |
| 3 | red   | triangle |
| 4 | green | circle   |
| 5 | red   | square   |
| 6 | red   | triangle |
| 7 | red   | circle   |

| Ai \ Bi | circle | square | triangle | P(Ai) |
|---------|--------|--------|----------|-------|
| green   | 0.25   | 0      | 0.125    | 0.375 |
| red     | 0.25   | 0.125  | 0.25     | 0.625 |
| P(Bi)   | 0.5    | 0.125  | 0.375    | 1     |
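Analogously for variable B, using P(circle) = 0.5, P(square) = 0.125, and P(triangle) = 0.375:

$$H(B) = -(0.5 \cdot \log_2 0.5 + 0.125 \cdot \log_2 0.125 + 0.375 \cdot \log_2 0.375) \approx 0.5 + 0.375 + 0.5306 \approx 1.406$$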
We now want to answer the following question: how much uncertainty about variable A remains after we learn the value of variable B?
Recall from the slides:
Let A be an experiment with the exclusive outcomes (events) $A_1, \ldots, A_k$, and let B be another experiment with the outcomes $B_1, \ldots, B_s$. Then the conditional entropy of the combined experiment $(A \mid B)$ is defined as follows:

$$H(A \mid B) = \sum_{j=1}^{s} P(B_j) \cdot H(A \mid B_j)$$

where

$$H(A \mid B_j) = -\sum_{i=1}^{k} P(A_i \mid B_j) \cdot \log_2(P(A_i \mid B_j))$$
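The notebook's own calc_cond_ent helper is used further below but not shown here. As a minimal sketch, the two formulas above could be implemented with pandas like this (the function name conditional_entropy is ours):

```python
import numpy as np
import pandas as pd

def conditional_entropy(df: pd.DataFrame, target: str, given: str) -> float:
    """H(target | given) in bits, estimated from relative frequencies in df."""
    h = 0.0
    p_given = df[given].value_counts(normalize=True)            # P(B_j)
    for value, p_bj in p_given.items():
        subset = df.loc[df[given] == value, target]
        p_cond = subset.value_counts(normalize=True)            # P(A_i | B_j)
        h_given_bj = float(-(p_cond * np.log2(p_cond)).sum())   # H(A | B_j)
        h += p_bj * h_given_bj                                  # weight by P(B_j)
    return h

# With the `example` DataFrame from above:
# conditional_entropy(example, "A", "B")  # ~0.844 bits
```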
display_dfs(example, 'Dataset', proba, 'Joint and marginal distributions', entropies, 'Entropies', '\n', cp, 'Conditional probabilities $P(A_i | B_i)$', ch, 'Specific Conditional entropies');
|   | A     | B        |
|---|-------|----------|
| 0 | green | circle   |
| 1 | red   | circle   |
| 2 | green | triangle |
| 3 | red   | triangle |
| 4 | green | circle   |
| 5 | red   | square   |
| 6 | red   | triangle |
| 7 | red   | circle   |

| Ai \ Bi | circle | square | triangle | P(Ai) |
|---------|--------|--------|----------|-------|
| green   | 0.25   | 0      | 0.125    | 0.375 |
| red     | 0.25   | 0.125  | 0.25     | 0.625 |
| P(Bi)   | 0.5    | 0.125  | 0.375    | 1     |

| variable | entropy |
|----------|---------|
| H(A)     | 0.954   |
| H(B)     | 1.406   |

| Ai \ Bi | circle | square | triangle |
|---------|--------|--------|----------|
| green   | 0.5    | 0      | 0.333    |
| red     | 0.5    | 1      | 0.667    |

| Bi       | H(A\|Bi) |
|----------|----------|
| circle   | 1        |
| square   | 0        |
| triangle | 0.918    |
The expression $H(A \mid B_i)$ reads as "the entropy of variable A among only those records that have $B = B_i$."
In our example, if we know that B = 'circle', the value of A is exactly as uncertain as a coin toss; this is not the case for the other possible values of B.
The expression $H(A \mid B)$ is the average of $H(A \mid B_i)$ over all possible values of B, each weighted by its probability.
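Plugging in the values from the tables above:

$$H(A \mid B) = 0.5 \cdot 1 + 0.125 \cdot 0 + 0.375 \cdot 0.918 \approx 0.844$$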
display_dfs(proba, 'Joint and marginal distributions', cp, 'Conditional probabilities $P(A_i | B_i)$', ch, 'Specific Conditional entropies');
calc_cond_ent(ch, proba, "A", "B")
| Ai \ Bi | circle | square | triangle | P(Ai) |
|---------|--------|--------|----------|-------|
| green   | 0.25   | 0      | 0.125    | 0.375 |
| red     | 0.25   | 0.125  | 0.25     | 0.625 |
| P(Bi)   | 0.5    | 0.125  | 0.375    | 1     |

| Ai \ Bi | circle | square | triangle |
|---------|--------|--------|----------|
| green   | 0.5    | 0      | 0.333    |
| red     | 0.5    | 1      | 0.667    |

| Bi       | H(A\|Bi) |
|----------|----------|
| circle   | 1        |
| square   | 0        |
| triangle | 0.918    |
display_dfs(entropies, 'Entropies');
calc_cond_ent(ch, proba, "A", "B")
| variable | entropy |
|----------|---------|
| H(A)     | 0.954   |
| H(B)     | 1.406   |
The information gain tells us how much we learn about variable A by knowing the value of variable B, expressed as the reduction in entropy:
$$H(A) - H(A \mid B) = 0.954 - 0.844 = 0.11$$
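Using the shannon_entropy and conditional_entropy sketches from above, the information gain could be computed like this (the function name information_gain is ours):

```python
def information_gain(df, target: str, given: str) -> float:
    """Reduction in entropy of `target` from learning `given`: H(target) - H(target | given)."""
    # Relies on shannon_entropy and conditional_entropy from the sketches above.
    return shannon_entropy(df[target]) - conditional_entropy(df, target, given)

# With the `example` DataFrame from above:
# information_gain(example, "A", "B")  # ~0.11 bits
```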