Calculating Entropy for Data Mining
Entropy and Optimal Encoding
To demonstrate the link between entropy and optimal encoding (optimal
compression), we will use Huffman
coding to develop a binary encoding to use to represent our five page states:
HomePage (h), NewsPage (n), ProductPage (p), ServicesPage (s), and AboutPage
(a). The Huffman algorithm tries to find the smallest code to represent the
most frequently occurring item. The Huffman encodings also have the property
of being prefix codes. You can send them in a continuous stream of bytes
and any recipient supplied with a dictionary (as seen in the character code
table below) can decode them uniquely.
The string to compress is hnpshnhahphshnp. The character
frequency table is:
| Character |
h |
p |
n |
s |
a |
| Frequency |
6 |
3 |
3 |
2 |
1 |
The Huffman compression algorithm generates the following character code
table:
| Character |
h |
p |
n |
s |
a |
| Code |
0 |
10 |
111 |
1101 |
1100 |
Using the character frequency table and character code table, we can compute
the average number of bits used to encode the sequence of 15 characters (for
example, an individual user clickstream):
E(L) = ((6x1)+(3x2)+(3x3)+(2x4)+(2x4))/15
= (6+6+9+8+8)/15
= 37/15
= 2.4666 bits per symbol
The expected length of the symbol encodings using the Huffman procedure is
greater than the Shannon entropy, which represents the theoretical lower bound
on what is achievable. While Shannon entropy gives us a lower bound on the
average number of bits required to encode each symbol, we might also want to
calculate a maximum entropy score that assumes that the probability of each
signal state is the same.
Maximum Entropy
Comparing the Shannon entropy score to the maximum entropy score allows us
to quickly gauge whether some states are more probable than others. If these
two scores are close, this tells us that each state is represented equally
often in the distribution. If they are far apart, then some signal states are
represented much more often than other signal states.
Fortunately, the maximum entropy score is extremely easy to calculate:
Hmax(P) = -N * 1/N * log(1/N) = log(N)
The maximum entropy score is simply the log of the number of possible signal
states, N. If our database column has 100 values but only ten of
these values are unique, then we would take the log of these ten possible signal
states to compute the maximum entropy value. Using ten instead of 100 to
represent N makes sense when you consider that entropy is a
measure defined over a probability distribution and not the total number of
sample values.
Scaling Up
The problem with our current version of the entropy class is that it will
not scale. Try feeding this class 100,000 data points and watch it blow up. We
can make a more scalable version of this class by making it database-aware and
using the database engine to compute intermediate results (notably the token
frequencies). We can also make our class more scalable by adding code that will
allow us to limit the result set by month or by some other criterion that
reduces the amount of data used in the calculation of the entropy. The
TableEntropy.php class below incorporates these features. Notice in
particular the SQL used to compute token frequencies.
<?php
/**
* Computes the entropy of a database column.
*/
class TableEntropy {
var $table = "";
var $column = "";
var $select = "";
var $where = "";
var $group_by = "";
var $num_events = 0;
var $token_freqs = array();
var $token_probs = array();
var $num_tokens = 0;
var $bits = 0.0;
var $maxent = 0.0;
var $ratio = 0.0;
function setTable($table) {
$this->table = $table;
}
function setColumn($column) {
$this->column = $column;
}
function setSelect($sql) {
$this->select = $sql;
}
function setWhere($sql) {
$this->where = $sql;
}
function setGroupBy($sql) {
$this->group_by = $sql;
}
function getTokenFrequencies() {
global $db;
$sql = " SELECT $this->column, count($this->column) as freq ";
$sql .= " FROM ". $this->table;
if ($this->where != "")
$sql .= " WHERE ". $this->where;
if ($this->group_by != "")
$sql .= " GROUP BY ". $this->group_by;
else
$sql .= " GROUP BY ". $this->column;
$this->token_freqs = $db->getCol($sql, "freq");
}
function getEntropy() {
global $db;
$this->getTokenFrequencies();
$this->num_events = array_sum($this->token_freqs);
$this->num_tokens = count($this->token_freqs);
foreach ($this->token_freqs as $token => $freq) {
$this->token_probs[$token] = $freq / $this->num_events;
$entropy += $this->token_probs[$token] * log($this->token_probs[$token], 2);
}
$this->bits = -1.0 * $entropy;
$this->maxent = log($this->num_tokens, 2);
$this->ratio = $this->bits / $this->maxent;
}
}
?>
Monthly IP Traffic Report
Here is some code that computes the total number of visitors and hits for the
www.phpmath.com site, broken down by month. These totals come from an analysis
of the IP address column in a MySQL database table that logs standard access
log information.
<?php
/**
* Script the reports monthly ip traffic.
*/
require_once "config.php";
require_once PHPMATH . "/IT/TableEntropy.php";
require_once PHPMATH . "/IT/ArrayMath.php";
$e = new TableEntropy;
$e->setTable("Webstats");
$e->setColumn("ip");
?>
<html>
<head>
<title>Monthly IP Traffic Report</title>
</head>
<body>
<i>Monthly IP Traffic Report: Hits, Visitors, Entropy</i>
<table border='1' cellpadding='5' cellspacing='0'>
<tr bgcolor='ffffcc'>
<td><b>Period</b></td><td><b>Hits</b></td>
<td><b>Visitors</b></td><td><b>Entropy</b></td>
<td><b>MaxEnt</b></td><td><b>Ratio</b></td>
</tr>
<?php
for($year=2004; $year<=2004; $year++) {
for($month=1; $month<=12; $month++) {
$start = date("Y-m-d", mktime(0,0,0,$month,1,$year));
$end = date("Y-m-d", mktime(0,0,0,$month+1,0,$year));
$clause = " received >= '$start' AND received <= '$end'";
$e->setWhere($clause);
$e->getEntropy();
echo "<tr>";
echo " <td><i>$start to $end</i></td>";
echo " <td align='right'>". $e->num_events ."</td>";
echo " <td align='right'>". $e->num_tokens ."</td>";
echo " <td align='right'>". sprintf("%01.2f", $e->bits) ."</td>";
echo " <td align='right'>". sprintf("%01.2f", $e->maxent) ."</td>";
echo " <td align='right'>". sprintf("%01.2f", $e->ratio) ."</td>";
echo "</tr>";
if($e->num_events != 0) {
$hits[] = $e->num_events;
$visits[] = $e->num_tokens;
$entropy[] = $e->bits;
}
if($year == date("Y"))
if($month == date("m"))
break(2);
}
}
?>
</table>
<?php
$am = new ArrayMath;
echo "correlation(hits, visits): ".$am->correlation($hits,$visits)."<br />";
echo "correlation(hits, entropy): ".$am->correlation($hits,$entropy)."<br />";
echo "correlation(visits, entropy): ".$am->correlation($visits,$entropy);
?>
</body>
</html>
The output generated by this code looks like this:
Monthly IP Traffic Report: Hits, Visitors, Entropy
| Period |
Hits |
Visitors |
Entropy |
MaxEnt |
Ratio |
| 2004-01-01 to 2004-01-31 |
4663 |
735 |
7.20 |
9.52 |
0.76 |
| 2004-02-01 to 2004-02-29 |
3785 |
631 |
6.99 |
9.30 |
0.75 |
| 2004-03-01 to 2004-03-31 |
7293 |
1291 |
8.62 |
10.33 |
0.83 |
| 2004-04-01 to 2004-04-30 |
7910 |
966 |
6.47 |
9.92 |
0.65 |
| 2004-05-01 to 2004-05-31 |
6502 |
1276 |
8.34 |
10.32 |
0.81 |
| 2004-06-01 to 2004-06-30 |
4597 |
837 |
7.57 |
9.71 |
0.78 |
| 2004-07-01 to 2004-07-31 |
4025 |
837 |
8.04 |
9.71 |
0.83 |
| 2004-08-01 to 2004-08-31 |
3400 |
871 |
8.23 |
9.77 |
0.84 |
| 2004-09-01 to 2004-09-30 |
4994 |
923 |
7.63 |
9.85 |
0.77 |
| 2004-10-01 to 2004-10-31 |
5583 |
984 |
8.02 |
9.94 |
0.81 |
| 2004-11-01 to 2004-11-30 |
4389 |
739 |
7.11 |
9.53 |
0.75 |
correlation(hits, visits): 0.74270326185869
correlation(hits, entropy): 0.0050558270393892
correlation(visits, entropy): 0.6465336491674
The entropy scores vary between seven and eight bits. Seven bits can
represent 27 or 128 states. Eight bits can represent 28
or 256 states. Equating bit scores to the number of representable states is a
good way to acquire a sense of the measurement scale properties of the entropy
score. The difference between one and two bits of information is not the same
as the difference between seven and eight bits.
Monthly Page Traffic Report
With minor changes to the monthly_ip_traffic.php script, we can
create a monthly_page_traffic.php script that reports monthly page
traffic:
Monthly Page Traffic Report: Hits, Pages, Entropy
| Period |
Hits |
Pages |
Entropy |
MaxEnt |
Ratio |
| 2004-01-01 to 2004-01-31 |
4663 |
15 |
2.09 |
3.91 |
0.54 |
| 2004-02-01 to 2004-02-29 |
3785 |
13 |
2.11 |
3.70 |
0.57 |
| 2004-03-01 to 2004-03-31 |
7293 |
14 |
1.98 |
3.81 |
0.52 |
| 2004-04-01 to 2004-04-30 |
7910 |
14 |
1.80 |
3.81 |
0.47 |
| 2004-05-01 to 2004-05-31 |
6502 |
16 |
1.82 |
4.00 |
0.46 |
| 2004-06-01 to 2004-06-30 |
4597 |
14 |
1.46 |
3.81 |
0.38 |
| 2004-07-01 to 2004-07-31 |
4025 |
15 |
1.74 |
3.91 |
0.44 |
| 2004-08-01 to 2004-08-31 |
3400 |
13 |
1.76 |
3.70 |
0.47 |
| 2004-09-01 to 2004-09-30 |
4994 |
13 |
1.54 |
3.70 |
0.42 |
| 2004-10-01 to 2004-10-31 |
5583 |
14 |
1.78 |
3.81 |
0.47 |
| 2004-11-01 to 2004-11-30 |
4390 |
14 |
1.74 |
3.81 |
0.46 |
correlation(hits, pages): 0.31172637208172
correlation(hits, entropy): 0.085063761760323
correlation(pages, entropy): 0.14280487789004
The entropy of the page column is quite a bit lower than the entropy of the
ip column, which makes sense when you consider that there are many
more possible site visitor IPs than site pages (about 15) and that each visitor IP
has a relatively low probability of occurrence compared to the probability of
occurrence of a site page. Entropy is low when there are only a few possible
states the system can be in and when some of the states have high individual
probability of occurrence (p(home)).
Is Entropy a Measure of Variance?
Statisticians can make the case that the entropy of a discrete random
variable is a measure of the discrete variable's variance:
In short, H is a "variance-like" index that can be computed for
any discrete distribution, even though the event classes are purely
qualitative ... their main contribution to statistical methods lies, however, in
the possibility of extending the familiar notions having to do with variance
and factors accounting for variance to qualitative situations. ~ Winkler
& Hays (1975) Statistics: Probability, Inference and Decision. Holt,
Rinehart and Winston, p. 840.
While it is useful to think of entropy as a "variance-like" index, be careful
to note that entropy does differ significantly from the quantitive notion of
variance:
Despite the fact that both the entropy and the variance of a
probability distribution measure in some sense the uncertainty of the value of
a random variable having that distribution, the entropy has an interpretation
different from that of the variance. The entropy is defined only by the
probabilities of the possible values of the random variable and not by the
values themselves. On the other hand, the variance depend on these values.
Thus if Y1 takes the values 1 and 2 each with probability 0.5 and
Y2 takes the values 1 and 1000 each with probability 0.5, the
distribution of Y1 and Y2 have equal entropies but quite
different variances. ~ Ewens & Grant (2001) Statistical Methods in
Bioinformatics. Springer, p. 44.
With these thoughts in mind, let's examine a report that computes the
variance and entropy of the ip column on a per-month basis so that
we can compare them. I computed the variance of the ip column by
first counting the daily number of hits (count(ip)) and visitors
(count(DISTINCT ip)). I computed the monthly variance in hits and
visits based on daily hit and visit totals for a particular month. I also fed
the same series of IPs into the entropy class in per-month blocks to base
variance and entropy scores on the same input data. Actually, instead of using
and reporting the raw variance score for hits and visits, I use and report the
standard deviation score for hits and visits because it is easier to interpret
what the variance estimates mean when reported as standard deviations. Web
traffic reports often don't report variance scores, so this script is useful if
you want to begin understanding the extent of variability in your web traffic
data and to start co-relating these variance scores with other concurrent
events that might account for them, such as publication dates of articles.
Monthly Variance and Entropy
| Period |
Hits |
Visitors |
Entropy |
MaxEnt |
Ratio |
| 2004-01-01 to 2004-01-31 |
77.20 |
10.08 |
7.20 |
9.52 |
0.76 |
| 2004-02-01 to 2004-02-29 |
98.30 |
8.39 |
6.99 |
9.30 |
0.75 |
| 2004-03-01 to 2004-03-31 |
169.29 |
30.83 |
8.62 |
10.33 |
0.83 |
| 2004-04-01 to 2004-04-30 |
205.14 |
12.58 |
6.47 |
9.92 |
0.65 |
| 2004-05-01 to 2004-05-31 |
138.74 |
24.59 |
8.34 |
10.32 |
0.81 |
| 2004-06-01 to 2004-06-30 |
82.71 |
9.52 |
7.57 |
9.71 |
0.78 |
| 2004-07-01 to 2004-07-31 |
58.47 |
9.25 |
8.04 |
9.71 |
0.83 |
| 2004-08-01 to 2004-08-31 |
56.95 |
10.86 |
8.23 |
9.77 |
0.84 |
| 2004-09-01 to 2004-09-30 |
85.40 |
18.01 |
7.63 |
9.85 |
0.77 |
| 2004-10-01 to 2004-10-31 |
136.34 |
15.63 |
8.02 |
9.94 |
0.81 |
| 2004-11-01 to 2004-11-30 |
168.13 |
14.33 |
7.55 |
10.07 |
0.75 |
correlation(hits, visits): 0.50421542981268
correlation(hits, entropy): -0.18684271641288
correlation(visits, entropy): 0.60620764775012
The correlation between the Visitors column and the Entropy column is
approximately 0.6, which is a decent correlation, but not so large so as to
tempt me to equate the measurement of IP variance with the measurement of IP
entropy. From this result you might (again) conclude that variance and entropy
are similiar--but not identical--constructs.
When I created this report, I included the current month in the calculation
of the monthly variance in hits and visits as well as the monthly entropy. I
noticed that inclusion of current month statistics changed my reported
correlations quite a bit. I advise against including the current month
statistics because they have less data than other months, and because I haven't
fully explored how radically the entropy measure might vary as new data comes
into it. I encourage you to explore the small sample behavior of the entropy
score further when you have signal probability distributions with shapes
ranging from highly skewed to flat (uniform probability density).
Data Preparation
The analyses reported here use data points that I arguably should have
eliminated or tagged before conducting our entropic analysis of site traffic
patterns. In particular, it would have been desirable to have removed or tagged
spider traffic and site administrator traffic before doing these analyses.
We can supply SQL queries to the TableEntropy.php class that
filters out records by IP address:
<?php
// snippet showing how to filter spider and administrator traffic
...snip...
$time_constraint = " received >= '$start' AND received <= '$end'";
$ip_constraint = " ip NOT IN $banned_ips"
$clause = $time_contraint ." AND ". $ip_constraint;
$e->setWhere($clause);
$e->getEntropy();
...snip...
?>
The IT Package does not address how you compute the list of
banned IPs because it involves many context-specific details about how you log
and prepare your web access log data for analysis. You should, however,
consider adding your own method to the TableEntropy.php class that
computes the $banned_ips to use in the $ip_constraint
of your WHERE clause (for example, $banned_ips =
$e->getBannedIPs()).
Conclusions
This article demonstrated that information entropy is not an esoteric
concept, and indeed might play a practical role in summarizing the amount of
structure contained in database columns.
Our exploration of entropy began at the most basic level: learning how to
compute the entropy score for one discrete random variable. Having mastered
univariate entropy, we can move on to bivariate entropy. Bivariate entropy
involves computing the joint and conditional entropy scores between two random
variables, such as the ip and self columns in the
Webstats table.
Like all entropic analysis, one of the main goals in computing bivariate
entropy scores will be to determine the possibility of reducing uncertainty.
Instead of looking within the sequence of database column values for ways of
reducing entropy (computing bigram and trigram distributions for clickstream
sequences), we will instead be looking across our columns to see how we might
use one data column to reduce our uncertainty about another data column and
vice versa.
Resources
- C. E. Shannon (1948) A mathematical
theory of communication. Bell System Technical Journal, vol. 27, pp.
379-423 and 623-656.
- Roberto Togneri and
Christopher J deSilva (2003) Fundamentals of Information Theory and Coding
Design, Chapman & Hall/CRC.
- Thomas M. Cover and Joy A. Thomas (1991) Elements of
Information Theory. John Wiley and Sons.
- Paul DuBois (2002) MySQL Cookbook.
O'Reilly.
- Peter Grunwald and Paul Vitanyi (Oct. 2004) Shannon Information
and Kolgmorov Complexity (PDF).
- Robert W. Lucky
(1989). Silicon Dreams: Information, Man and Machine. New York, St. Martin's
Press.
- David MacKay (2003) Information
Theory, Inference, and Learning Algorithms. Cambridge University
Press.
- David Mertz (2001) "A data
compression primer." IBM developWorks
- Markus Nix has developed a Text_Huffman package.
- Patrick Delin (2001) "Generating Web Page
Statistics using PHP." Zend Technologies Ltd.
- Patrick Delin (2001) "Generating Advanced
Web Page Statistics using PHP." Zend Technologies Ltd.
- Brian McConnell (2001) "Anticryptography: The Next
Frontier in Computer Science." O'Reilly publishing.
- PHPmath.com will host further updates to the IT Package.
Paul Meagher
is a cognitive scientist whose graduate studies focused on mathematical problem solving.
Return to the PHP DevCenter.