data analytics. data mining. business intelligence (sometimes an oxymoron). data crunching. it’s all, more or less, the same thing: a meaningful look at – and into – your vast amount of collected data. let’s stick with data analytics for now and call it DA for short.
DA on any type of data – traffic, airlines, crime, world hunger – can be a daunting task. you may have billions of data points (e.g. meters, sensors, RTUs), each one providing data elements in the 10’s–100’s range. you may get that data only once per day, but then again, it may come at 60-minute intervals, or 30, 15, 5. SCADA folks are used to data elements every 2-4 seconds, and i’ve worked with n-ary data streams coming in at 60 frames per second (that’s more than most movie streams @ 1080p!).
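to make that volume concrete, here’s a back-of-the-envelope estimate in python. the fleet size, element count, and interval below are made-up numbers for illustration, not a benchmark:

```python
# back-of-the-envelope volume estimate for a hypothetical metering fleet
meters = 1_000_000          # endpoints (meters, sensors, RTUs)
elements_per_meter = 20     # data elements each endpoint reports
intervals_per_day = 96      # one reading every 15 minutes

readings_per_day = meters * elements_per_meter * intervals_per_day
print(f"{readings_per_day:,} readings/day")  # 1,920,000,000 readings/day
```

at a 15-minute cadence, even modest per-endpoint payloads land you in the billions of rows per day – the “Big Data” territory the next paragraph talks about.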
the volume of data received and stored is staggering; that’s why it’s sometimes called Big Data. the challenges of what to do with it can also be staggering:
- network and communication architectures – because your mom’s ethernet platform ain’t gonna cut it.
- database storage – i should take a poll on this sometime because i’ll wager 90% of data architectures, in-place or being considered today, are wrong.
- server architectures – this one is more often right but it’s here because you’ll need a lot of them and they cost money.
- business use case scenarios – if you haven’t yet spent time defining these… and you didn’t begin here in the first place… you’re already doomed.
- privacy and security – who owns the data, is it private or public, can it reveal a little too much about any one customer, is there some oversight on data?
a few years ago i designed a data model for intel’s foray into the home energy management space. it was a robust and comprehensive data architecture: not just energy data, but data from sensors and actuators covering pool/spa water chemistry, intrusion alarming, door/window closure states, washer/dryer cycle states, electric vehicle charging, HVAC, and much, much more. everything you could imagine. the problem was this: simulated analytics on the data collected from such a smart home revealed too much!
against my data model, and armed with a basic query a trained parrot with a stiff beak could write, i could learn far too much about a home’s occupants: when they were home and when they weren’t, when someone was in the shower or bath or pool (and, predictably, which bedroom they used), when they were sleeping, what time they awoke, and when they all left for work or school.
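to show just how little skill that inference takes, here’s a parrot-grade sketch in python. the interval readings, the idle-load baseline, and the threshold are all hypothetical – the point is the privacy exposure, not the method:

```python
# a deliberately naive occupancy guess from interval energy data.
# the readings and the idle-load baseline are hypothetical; the point
# is how little effort this inference takes.
BASELINE_KW = 0.4  # assumed idle load (fridge, standby devices)

# (hour, kW) samples for one day from a single home
readings = [(0, 0.3), (3, 0.3), (7, 2.1), (9, 0.4), (12, 0.4),
            (18, 3.2), (21, 2.5), (23, 0.5)]

# any hour well above the idle baseline suggests someone is home and active
occupied_hours = [hour for hour, kw in readings if kw > BASELINE_KW * 2]
print(occupied_hours)  # [7, 18, 21]: morning routine, evening at home
```

one threshold over one day’s data already sketches a household’s routine; correlate a few more sensor streams and the picture gets uncomfortably sharp.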
in order to launch a DA initiative successfully, it’s important to baseline your definition of what DA means and categorize conceptual solutions into manageable groups. what’s that stupid saying about eating an elephant? for instance, does DA mean…
- statistical modeling? what type?
- looking back, looking forward, or both?
- open access to raw data for individual or ad hoc usage?
- published pre-defined reporting for controlled distribution?
- customized platform for all the above and more?
- is an enterprise solution to be considered (i.e. centralized engine with distributed users)?
- will a COTS solution suffice?
- is a home-grown solution viable (consider design, development, maintenance, administration, etc.)?
with this simple baseline, we then define a logical approach to perform 4 basic tasks:
1) define the overall objective – understanding the objective frames the depth and breadth of potential solutions. we codify this objective by looking at two fundamental aspects of the initiative.
A) Understand the answer to these questions:
who are the consumers of resulting data analytics?
who will perform the analysis?
what data do we want to analyze?
how will the resulting analytics be used?
when is data relevant (e.g. real-time, latent, does your data have a shelf-life)?
where is the data coming from (i.e. is it locationally relevant)?
where is the data going?
B) Catalog the use case scenarios with actors, actions, inputs, and outputs:
- actors help define the who’s, which gives us a picture of the scope of resources involved (e.g. basic users, power users, administrators, etc.).
- actions help define the how’s, which helps us understand integration and GUI tooling and functionality.
- inputs get us looking at what data will be required – not just the obvious, but reference data, master data, metadata, correlated data, and comparative data. these help define the what’s, where’s, and when’s.
- outputs help answer the same questions as inputs, but for obviously different reasons.
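a use case catalog entry can be as simple as a record with those four facets. the field names and the sample entry below are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass

# minimal use case record mirroring the catalog above; the field names
# and the sample entry are illustrative, not a prescribed schema
@dataclass
class UseCase:
    name: str
    actors: list   # the who's: scope of resources involved
    actions: list  # the how's: integration and GUI needs
    inputs: list   # required data, raw plus reference/master/metadata
    outputs: list  # resulting analytics and where they go

catalog = [
    UseCase(
        name="daily load profile report",
        actors=["power user", "administrator"],
        actions=["schedule", "review", "publish"],
        inputs=["interval readings", "meter master data", "weather reference"],
        outputs=["pre-defined report for controlled distribution"],
    ),
]
print(catalog[0].name)
```

even a flat list like this forces the who/what/when/where questions from step A to get concrete answers before anyone shops for tools.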
2) define restrictions – remember the saying: good, fast, cheap; pick two. every organization has restrictions whether they are related to capital, resources, technology, legal, policy, whatever. an honest and complete understanding of constraints will help narrow proposed solutions and reduce wasted time. the most common i run into are these:
maintainability – will/can IT support this platform?
availability – what is my minimal downtime and what are my performance needs?
scalability – what are the capacity requirements today and can the platform scale when needed?
budget – self explanatory.
3) narrow the conceptual solution to a manageable subset of technologies and perform a basic gap analysis against what features/requirements/capabilities each potential solution offers. for instance, an article from InformationWeek highlights some of the players in this space. i’m agnostic with respect to software and solution providers, so this is not an endorsement.
4) draft a proposed implementation approach and a short-list of solutions. collaboratively establish a set of measurable criteria for each solution (pros v. cons) in whatever format communicates best to your audience. just remember that it needs to answer the big three: this is what we get, this what we don’t, and this is how much it costs.
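one way to sketch that scorecard in python – the criteria, weights, scores, and costs below are invented for illustration, and your own criteria should come out of steps 1 and 2:

```python
# a bare-bones weighted scorecard for short-listed solutions; every
# number here is made up for illustration
weights = {"fit_to_use_cases": 0.4, "maintainability": 0.3,
           "scalability": 0.2, "vendor_support": 0.1}

candidates = {
    "COTS suite A":   {"fit_to_use_cases": 7, "maintainability": 8,
                       "scalability": 6, "vendor_support": 9, "cost_k": 450},
    "in-house build": {"fit_to_use_cases": 9, "maintainability": 5,
                       "scalability": 8, "vendor_support": 3, "cost_k": 600},
}

# "this is what we get" (score) and "this is how much it costs" (cost);
# the gap analysis in step 3 supplies "this is what we don't"
for name, scores in candidates.items():
    total = sum(weights[c] * scores[c] for c in weights)
    print(f"{name}: score {total:.1f}, est. cost ${scores['cost_k']}k")
```

the format matters less than the discipline: every short-listed option gets scored against the same criteria, in a form your audience can argue with.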
lastly, if an in-house solution is selected, plan the full life-cycle and multiply your costs and estimated work schedule by 1.75. at the thirty-thousand-foot level, every in-house project needs at least the classic phases: requirements, design, development, testing, deployment, and ongoing maintenance and administration.
on a closing note, be cautious and skeptical about DA platforms and suites. many are good and robust, but others can be just a slick UI on top of a plain relational database (RDBMS). conceptually speaking, a proper repository architected to meet anticipated data analytics needs would separate raw ingestion, staging, and purpose-built analytic stores, rather than bolting reports directly onto a transactional schema.
don’t be afraid to ask yourself or your vendor tough questions. most of all, don’t be afraid of the answers.