
My thoughts on big data and data science: no, it's not hype

Denying that big data is a new paradigm (post year 2000) is like saying that the human population has been huge for a long time: if we could handle 10 million human beings a few thousand years ago, we can handle 10 billion today the same way, or even one trillion. It's the same as saying that data flowing at 10 million rows per day can be processed and analyzed the same way as data flowing at 10 billion or one trillion rows per day. Billions of rows per day is common in transaction data (credit cards), mobile, web traffic, sensor data, retail data, health data, NSA, NASA, stock trading and many more.

Each time a credit card is swiped or processed online, an analytic algorithm decides whether the transaction is fraudulent (and the answer must come in less than 3 seconds most of the time, with a low false negative rate). Each time you do a Google search, an analytic engine determines which search results to show you, and which ads to display. Each time someone posts something on Facebook, an analytic algorithm is run to determine whether it must be rejected (promotion, spam, porn, etc.). Each Tweet posted is analyzed by analytic algorithms (designed by a number of different companies) to detect new viral trends (for journalists), disease spread, intelligence leaks and many other things. Each time you browse Amazon, the customized content delivered to you is analytically "calculated" to optimize Amazon's revenue. Each time an email is sent, an analytic algorithm decides whether or not to put it in your spam box (that's intensive computation for Gmail). This is analytics at billions of rows per day. Evidently, a gigantic amount of pre-computation and many look-up tables are used to make this happen, but it is still "big data analytics". The analytic engineer knows that his ad matching algorithm must use the right metrics and the right look-up tables (which he should help design, if not automatically populate) to do the best possible computation given finite memory resources and the speed at which results must be delivered, typically measured in milliseconds. You just can't separate the two processes: data flow, and analytics or data science. Indeed, the term "data science" conveys the idea that data and analytics are bedfellows.
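To make the pre-computation idea concrete, here is a minimal Python sketch, not any real ad platform's method. It assumes a hypothetical click history of (keyword, ad, revenue) rows and a made-up scoring rule (average revenue per impression). The heavy work happens offline when the table is built; serving an ad is then a single dictionary look-up, which is what makes millisecond response times plausible.

```python
from collections import defaultdict


def build_lookup(history):
    """Offline step: for each keyword, keep the ad with the highest
    average revenue per impression (a hypothetical scoring rule)."""
    totals = defaultdict(lambda: [0.0, 0])  # (keyword, ad_id) -> [revenue, impressions]
    for keyword, ad_id, revenue in history:
        entry = totals[(keyword, ad_id)]
        entry[0] += revenue
        entry[1] += 1

    best = {}  # keyword -> (ad_id, score)
    for (keyword, ad_id), (revenue, impressions) in totals.items():
        score = revenue / impressions
        if keyword not in best or score > best[keyword][1]:
            best[keyword] = (ad_id, score)
    return {keyword: ad_id for keyword, (ad_id, _) in best.items()}


def serve_ad(lookup, keyword, default_ad="house_ad"):
    """Online step: one constant-time look-up, with a fallback ad
    for keywords that never appeared in the history."""
    return lookup.get(keyword, default_ad)


if __name__ == "__main__":
    # Made-up rows: (keyword, ad_id, revenue for that impression).
    history = [
        ("shoes", "ad_a", 0.10),
        ("shoes", "ad_b", 0.50),
        ("shoes", "ad_a", 0.20),
        ("books", "ad_c", 0.30),
    ]
    table = build_lookup(history)
    print(serve_ad(table, "shoes"))
    print(serve_ad(table, "garden"))
```

The design choice is the split itself: the statistical work (choosing the metric, deciding what goes in the table) and the data work (populating and serving the table) cannot be designed independently of each other.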

Also, big data practitioners working for start-ups usually wear multiple hats: data engineer, business analyst and machine learning / statistics / analytics engineer. The term "data scientist" suits them really well.

Finally, even with transactional data, if you want to split the data scientist role (in large companies) into silos - data versus analytics or business engineers - there is still an important issue: sampling. Analytics engineers can work on samples, but how small, how big or how good? Who determines what makes a good sample? Again, you need to be a data scientist to answer these questions, and the answer is: samples must be far bigger than you think (100 million rows in the contexts described above) and also much better selected. I have worked with an ad network company managing truly big data. They sent me a sample with about 3 million clicks. But it did not have a rich enough set of affiliate data (that is, many affiliates with enough data for each of them), so I could not clearly identify instances of affiliate collusion (a scheme leveraging botnets to share hijacked IP addresses among affiliates, for click fraud). I needed 50 million rows (clicks) to clearly identify this type of massive (but low frequency) fraud. This raises three issues:

  • If you are provided with a 3-million-row sample for your statistical analyses, it might be too small for you to notice some patterns. You will miss many important signals buried deep in the full data, and you won't know what you are missing.
  • If (in my case) using 50 million rows (rather than 3 million) helps me detect lots of new, interesting, valuable stuff, what if my sample had 500 million rows instead? I might discover even more, who knows?
  • At some point, increasing the sample size even further brings diminishing returns. A one-billion-row sample might not provide much more value than a 100-million-row sample (except maybe if it is sampled over a 12-month period rather than two weeks). Interestingly, in this case, obtaining advertiser data (with conversions) rather than ad network data is a great alternative (combining both advertiser and ad network data is even better), even if it means creating dummy (honeypot) advertiser accounts to monitor fraud. It then becomes an experimental design project, and a 100,000-row data set might be enough. It is the data scientist's responsibility to think about and propose an implementation of dummy advertiser accounts to solve the problem, leveraging his/her statistical, big data, and domain expertise.
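The effect of sample size on a rare pattern can be sketched with a few lines of Python. This is only an illustration under stated assumptions, not the analysis I ran: the fraud rate (one click in a million) and the number of fraudulent clicks needed to recognize the scheme (20) are hypothetical, and the count of fraudulent clicks in a uniform random sample is approximated by a Poisson distribution.

```python
import math


def prob_at_least(k, sample_size, rate):
    """Probability of seeing at least k events in the sample, using a
    Poisson approximation with mean sample_size * rate."""
    lam = sample_size * rate
    if lam == 0:
        return 1.0 if k <= 0 else 0.0
    # Sum P(X = j) for j < k in log space to avoid overflow and underflow.
    below = sum(
        math.exp(-lam + j * math.log(lam) - math.lgamma(j + 1))
        for j in range(k)
    )
    return max(0.0, 1.0 - below)


if __name__ == "__main__":
    fraud_rate = 1e-6   # hypothetical: one click in a million is part of the scheme
    needed = 20         # hypothetical: clicks required to recognize the pattern
    for sample_size in (3_000_000, 50_000_000, 500_000_000):
        p = prob_at_least(needed, sample_size, fraud_rate)
        print(f"{sample_size:>11,} clicks: P(at least {needed} fraud clicks) = {p:.6f}")
```

Under these assumptions the small sample almost never contains enough of the pattern, the larger one almost always does, and going larger still adds little, which is the diminishing-returns point made above.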

My point here is that samples, traditionally involving fewer than 10 million observations, are really far too small in a number of applications, or the wrong data is being used. Samples with 200 million rows might sometimes prove a good compromise. This is true for data that can be segmented into millions of small buckets, when you need statistical significance in as many buckets as possible. But you cannot apply the same statistical techniques to a 200-million-row data set as to a 10-million-row data set, because of the curse of big data. Google my article "The Curse of Big Data", as it explains the problem and provides a solution - interestingly, the solution is as much a data solution as a statistical solution, thus the term "data scientist" (rather than "statistician") to describe people working on such projects.
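The bucket argument can be illustrated with a short Python sketch. Everything numeric here is an assumption made for illustration: 100,000 buckets (far fewer than the millions mentioned above, to keep it fast), traffic shares that follow a Zipf-like law, and a threshold of 100 expected rows as a crude stand-in for "enough data for statistical significance".

```python
def bucket_shares(num_buckets, exponent=1.0):
    """Zipf-like traffic shares: the bucket of rank r gets weight 1 / r**exponent."""
    weights = [1.0 / (rank ** exponent) for rank in range(1, num_buckets + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def buckets_with_enough_rows(sample_size, shares, min_rows):
    """Count buckets whose expected row count, in a uniform random sample
    of the given size, reaches min_rows."""
    return sum(1 for share in shares if sample_size * share >= min_rows)


if __name__ == "__main__":
    shares = bucket_shares(num_buckets=100_000)  # hypothetical segmentation
    for sample_size in (3_000_000, 50_000_000, 200_000_000):
        covered = buckets_with_enough_rows(sample_size, shares, min_rows=100)
        print(f"{sample_size:>11,} rows: {covered:,} of {len(shares):,} buckets "
              f"have at least 100 expected rows")
```

With a long-tailed segmentation like this one, a small sample only covers the few largest buckets; most of the tail becomes usable only when the sample grows by one or two orders of magnitude.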


Comment by Alfred Ji-Ping Lin on January 20, 2014 at 12:37am

I agree with Vincent that big data and data science are not hype. I have a personal view, from my experience processing and analyzing hundreds of billions of individual health data records in Taiwan: you will get the same result whatever statistical predictive model you use, as long as you use big data; on the other hand, your results will depend a lot on which statistical model you use if your data are not large enough. In brief, don't forget the very essence of the "Law of Large Numbers", which already tells us that big data and data science are not hype.


© 2014 BigDataNews.com is a subsidiary of DataScienceCentral LLC and not affiliated with Systap
