Notes on the big-data census exploration

script file: census.demo.ssc

Uses the big-data routines to read the file P0008.by5.1.csv, then uses K-means
clustering to look for patterns.

Table P8 tallies population by sex and by age. There are 18 age bins per sex, so
36 bins in all.

Raw data come in as counts of people per bin for each ZCTA region.

Pre-processing takes these steps.

1. A global average is formed and graphed.  One would normally sum, but the
big-data column operations don't include sum, so the average is used instead; the
two differ only by a constant factor (the number of rows).  A custom horizontal
bar plot is shown for the population as a whole.

2. Each row (ZCTA) is normalized so that it contains fractions rather than
absolute counts (the resulting columns have "N" at the end). This way, areas with
different populations can be compared.

3. Each row is then compared to the "standard" population, to normalize for
overall demographics. The resulting variables have an "Nz" designation.  These
numbers are currently off by a (large) constant, but the shapes will be correct.
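
The three pre-processing steps can be sketched in plain Python. This is only an
illustration, not the code in census.demo.ssc: the toy counts, the function names
and the choice of a ratio for the step 3 comparison are all assumptions made here.

```python
# Hypothetical toy data: 3 regions (rows) x 4 bins (columns) of head counts.
# The real table has 36 bins per ZCTA.
counts = [
    [10, 20, 30, 40],
    [5, 5, 5, 5],
    [0, 50, 50, 100],
]

def column_means(rows):
    """Step 1: average each column over all rows (stands in for the sum)."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def to_fractions(values):
    """Rescale a list of counts so that it sums to 1 (all zeros stay zero)."""
    total = sum(values)
    return [v / total if total else 0.0 for v in values]

def preprocess(rows):
    # Overall profile; averaging instead of summing gives the same fractions.
    standard = to_fractions(column_means(rows))
    # Step 2: per-row fractions (the "N" columns).
    fractions = [to_fractions(r) for r in rows]
    # Step 3 (assumed here to be a ratio): each fraction relative to the
    # standard profile (the "Nz" columns).
    relative = [
        [f / s if s else 0.0 for f, s in zip(row, standard)]
        for row in fractions
    ]
    return standard, fractions, relative

standard, n_cols, nz_cols = preprocess(counts)
```

In this sketch a value above 1 in nz_cols means the bin is over-represented in
that region compared with the population as a whole.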


# Processing - here's where you can see the big-data (bd) operations:

Clustering ----------
A large number (50) of clusters is needed to tease out
interesting patterns.  There are a lot of "junk" clusters found in the process.
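
For readers who have not met K-means, here is a minimal plain-Python sketch of
the standard iteration (assign each row to its nearest center, then move each
center to the mean of its rows). It is not the big-data clustering routine the
script calls; the function names and the toy points are made up for illustration.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=20, seed=0):
    """Basic K-means; returns (labels, centers)."""
    rng = random.Random(seed)
    # Start from k distinct rows picked at random.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest center for every row.
        for i, p in enumerate(points):
            labels[i] = min(range(k), key=lambda c: sq_dist(p, centers[c]))
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its old center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Hypothetical toy data: two obvious groups in two dimensions.
toy = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centers = kmeans(toy, 2)
```

In the census case each point would be one ZCTA's vector of 36 normalized values
and k would be 50.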

Two separate aggregations are then done: one to count and one to tally.
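
One plausible reading of the aggregation, sketched in plain Python: count the
rows that fall in each cluster, and average the rows of each cluster to get its
profile. What exactly the script tallies is not spelled out in these notes, so
the per-cluster mean and all names below are assumptions.

```python
from collections import defaultdict

def aggregate(rows, labels):
    """Return (row count per cluster, mean row per cluster)."""
    counts = defaultdict(int)
    sums = {}
    for row, lab in zip(rows, labels):
        counts[lab] += 1
        if lab not in sums:
            sums[lab] = [0.0] * len(row)
        sums[lab] = [s + v for s, v in zip(sums[lab], row)]
    means = {lab: [s / counts[lab] for s in total] for lab, total in sums.items()}
    return dict(counts), means

# Hypothetical toy input: three rows, already assigned to clusters 0 and 1.
rows = [[0.2, 0.8], [0.4, 0.6], [0.9, 0.1]]
labels = [0, 0, 1]
counts, means = aggregate(rows, labels)
```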

Plotting ----------
This example includes a custom barplot routine.  The "N=" title tells how many
rows (ZCTAs) are in each cluster.
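
As a rough stand-in for the custom barplot, the sketch below draws a horizontal
bar chart as text with an "N=" title line. It is not the routine in the script;
the function name, the bin labels and the layout are invented for illustration.

```python
def text_barplot(values, bin_names, n_rows, width=40):
    """Return the lines of a text bar chart, one bar per bin."""
    top = max(values) if values and max(values) > 0 else 1.0
    label_width = max((len(name) for name in bin_names), default=0)
    lines = ["N=%d" % n_rows]  # title: number of rows in the cluster
    for name, v in zip(bin_names, values):
        bar = "#" * int(round(width * v / top))  # longest bar spans `width`
        lines.append("%s | %s" % (name.rjust(label_width), bar))
    return lines

# Hypothetical cluster profile with three age bins.
lines = text_barplot([0.1, 0.2, 0.4], ["0-4", "5-9", "10-14"], n_rows=123, width=20)
print("\n".join(lines))
```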

