I don’t really have any sections on my website related to MySQL, so I figured I’d just throw together a quick post about something I found incredibly helpful. I’m sure others will find it nice too.
So in my years being trained on data architecture and database design, I learned to "normalize, normalize, normalize". Unfortunately, normalization doesn’t lend itself to very simple statistical calculations–such as those used to generate graphs or a statistical scorecard.
In the case of Angel, I’ve normalized the hell out of the data. Now it comes down to needing to report on that data. At first, my reporting attempts were terrible–tons of nested loops, many queries, etc. Well, when I spoke to a colleague of mine at work, he said that if I was running on an SAS Mainframe, the work would already be done for me using a function called "PROC TABULATE". He sort of dated himself by saying that.
Then I broke the news to him: I’m not using SAS, or any other mainframe technology. He replied by suggesting I look into some sort of MySQL equivalent to that function. So I started hunting. I came across this very nice article: http://dev.mysql.com/tech-resources/articles/wizard/print_version.html
That article shows, very effectively, how to create a nice "statistical" report using cross-tabulation techniques with normalized source data. The queries are sort of monstrous if you have a large number of fields to report on, but at that point, you’d be using code to generate the parameters for the SELECT statement.
Here’s my example:
My source data looks like the following–please note that some results are truncated because the result set would be WAY too large for a blog:
mysql> select uid, Email from players;
+-----+--------------------------+
| uid | Email |
+-----+--------------------------+
| 10 | [email protected] |
| 4 | [email protected] |
| 1 | [email protected] |
| 9 | [email protected] |
| 11 | [email protected] |
| 5 | [email protected] |
| 8 | [email protected] |
| 12 | [email protected] |
| 6 | [email protected] |
| 7 | [email protected] |
+-----+--------------------------+
mysql> select * from gamedata LIMIT 10;
+----------+---------+------+
| playerid | fieldid | data |
+----------+---------+------+
| 1 | 3 | 60 |
| 1 | 4 | 1 |
| 1 | 5 | 1 |
| 1 | 6 | 286 |
| 1 | 7 | 1 |
| 1 | 8 | 13 |
| 1 | 9 | 1 |
| 1 | 10 | 1 |
| 1 | 11 | 261 |
| 1 | 12 | 13 |
+----------+---------+------+
mysql> select FieldID, FriendlyName, GameID, FieldTypeID FROM gamefields;
+---------+---------------------------------+--------+-------------+
| FieldID | FriendlyName | GameID | FieldTypeID |
+---------+---------------------------------+--------+-------------+
| 1 | Play Time | 1 | 2 |
| 2 | Last Character | 1 | 1 |
| 3 | Total Eggs | 4 | 2 |
| 4 | Total Playtime | 4 | 2 |
| 5 | Total King Of The Hill Captures | 4 | 2 |
| 6 | Total Kills | 4 | 2 |
| 7 | Total Deaths | 4 | 2 |
| 8 | Engineer Playtime | 4 | 2 |
| 10 | Mongo Playtime | 4 | 2 |
| 11 | Grunt Playtime | 4 | 2 |
| 12 | Gladiator Playtime | 4 | 2 |
| 13 | Total Powerup Uses | 4 | 2 |
| 14 | Total Powerups Received | 4 | 2 |
+---------+---------------------------------+--------+-------------+
mysql> select typeid, typename from field_types;
+--------+------------------+
| typeid | typename |
+--------+------------------+
| 1 | Text |
| 2 | Numeric |
| 3 | Game Coordinates |
| 4 | Boolean |
| 5 | Physical Address |
| 6 | IP Address |
+--------+------------------+
What I want to know from that data is how many players have field values within specific ranges–to generate some input data for a chart output. The code I’m using does the following math (from a separate query):
$StandardDeviation = floor( ($High - $Low)/NUM_DEVIATION_RANGES );
for $i = 0 to NUM_DEVIATION_RANGES
$Min = $Low + ($StandardDeviation * $i)
$Max = ($Min + $StandardDeviation) – 1;
$Query .= “SUM( IF( CAST(gamedata.Data AS UNSIGNED) >= $Min AND CAST(gamedata.Data AS UNSIGNED) <= $Max, 1, 0) ) AS ‘$Min-$Max’,”;
end for
The output of the query from step 4 is as follows:
Now that query runs once per number of fields requested. The above query was just a single query for a single field. Note how the code generated the SELECT statement parameters (min and max) and set them as the column header in the result?
The result of that query looks like this:
+-------+--------+---------+---------+---------+---------+---------+---------+---------+---------+ | 60-87 | 88-115 | 116-143 | 144-171 | 172-199 | 200-227 | 228-255 | 256-283 | 284-311 | 312-339 | +-------+--------+---------+---------+---------+---------+---------+---------+---------+---------+ | 2 | 0 | 2 | 0 | 1 | 0 | 2 | 0 | 1 | 0 | +-------+--------+---------+---------+---------+---------+---------+---------+---------+---------+
So the output is exactly what I need for my chart. The result fields represent the X axis of the chart, and the result values represent the Y axis of the chart, all from a single query against normalized data using cross-tabulation.
I urge you to click on the link I mentioned near the top of the article if you want to learn how to do this. My example is fairly specific to my situation. That article showed me exactly what I needed to do, but I needed to apply it to my data and my reports. You’ll notice that I’ve used CAST’s in my query. The reason being that in order to accept any type of data, I’ve used the ‘TEXT’ type on my game data fields–that’s why I also have a type table and a type ID in the field types. I then query what types of fields is numeric, and only perform my calculations against them.
Doing cross-tabluation/pivot tables isn’t too bad once you know what you’re doing. I had a tough time wrapping my head around it, but after a few hours of tinkering, I finally got what I was looking for. Another thing to note is that I’m summing up 1′s and 0′s, basically a tally. If you want to sum up your real data, don’t use 1′s and 0′s; use your actual data field.
Comments are closed.