Lak11 Week 2: Rise of “Big Data” and Data Scientists

These are my reflection and thoughts on the second week of Learning and Knowledge analytics (Lak11). These notes are first an foremost to cement my own learning experience, so for everybody but me they might feel a bit disjointed.

What was week 2 about?

This week was an introduction to the topic of “big data”. As a result of all the exponential laws in computing, the amount of data that gets generated every single day is growing massively. New methods of dealing with the data deluge have cropped up in computer science.  Businesses, governments and scientists are learning how to use the data that is available to their advantage. Some people actually think this will fundamentally change our scientific method (like Chris Anderson in Wired).

Big data: Hadoop

Hadoop is one of these things that I heard a lot about without ever really understanding what it was. This Scoble interview with the CEO of Cloudera made things a lot clearer for me.

[youtube=http://www.youtube.com/watch?v=S9xnYBVqLws]

Here is the short version: Hadoop is a set of open source technologies (it is part of the Apache project) that allows anyone to do large scale distributed computing. The main parts of Hadoop are a distributed filesystem and a software framework for processing large data sets on clusters.

The technology is commoditised, imagination is what is needed now

The Hadoop story confirmed for me that this type of computing is already largely commoditised. The interesting problems in big data analytics are probably not technical anymore. What is needed isn’t more computing power, we need more imagination.

The MIT Sloan Management Review article titled Big Data, Analytics and the Path from Insights to Value says as much:

The adoption barriers that organizations face most are managerial and cultural rather than related to data and technology. The leading obstacle to wide-spread analytics adoption is lack of understanding of how to use analytics to improve the business, according to almost four of 10 respondents.

This means that we should start thinking much harder about what things we want to know that we couldn’t get before in a data-starved world. This means we have to start with the questions. From the same article:

Instead, organizations should start in what might seem like the middle of the pro-cess, implementing analytics by first defining the insights and questions needed to meet the big busi-ness objective and then identifying those pieces of data needed for answers.

I will therefore commit myself to try and formulate some questions that I would like to have answered. I think that Bert De Coutere’s use cases could be an interesting way of approaching this.

This BusinessWeek excerpt from Stephen Baker’s The Numerati gives some insight into where this direction will take us in the next couple of years. It profiles a mathematician at IBM, Haren, who is busy working on algorithms that help IBM match expertise to demand in real time, creating teams of people that would maximise profits. In the example, one of the deep experts takes a ten minute call while being on the skiing slopes. By doing that he:

[..] assumes his place in what Haren calls a virtual assembly line. “This is the equivalent of the industrial revolution for white-collar workers,”

Something to look forward to?

Data scientists, what skills are necessary?

This new way of working requires a new skill set. There was some discussion on this topic in the Moodle forums. I liked Drew Conway’s simple perspective, basically a data scientist needs to be on the intersection of Math & Statistics Knowledge, Substantive Expertise and Hacking Skills. I think that captures it quite well.

Data Science Venn Diagram (by Drew Conway)
Data Science Venn Diagram (by Drew Conway)

How many people do you know who could occupy that space? The How do I become a data scientist? question on Quora also has some very extensive answers as well.

Connecting connectivism with learning analytics

This week the third edition of the Connectivism and Connective Knowledge course has started too. George Siemens kicked of by posting a Connectivism Glossary.

It struck me that many of the terms that he used there are things that are easily quantifiable with Learning Analytics. Concepts like Amplification, Resonance, Synchronization, Information Diffusion and Influence are all things that could be turned into metrics for assessing the “knowledge health” of an organisation. Would it be an idea to get clearer and more common definitions of these metrics for use in an educational context?

Worries/concerns from the perspective of what technology wants

Probably the most lively discussion in the Moodle forums was around critiques of learning analytics. My main concern for analytics is the kind of feedback loop it introduces once you become public with the analytics. I expressed this in a reference to Goodhart’s law which states that:

Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes

George Siemens did a very good job in writing down the main concerns here. I will quote them in full for my future easy reference.

1. It reduces complexity down to numbers, thereby changing what we’re trying to understand
2. It sets the stage for the measurement becoming the target (standardized testing is a great example)
3. The uniqueness of being human (qualia, art, emotions) will be ignored as the focus turns to numbers. As Gombrich states in “The Story of Art”: The trouble about beauty is that tastes and standards of what is beautiful vary so much”. Even here, we can’t get away from this notion of weighting/valuing/defining/setting standards.
4. We’ll misjudge the balance between what computers do best…and what people do best (I’ve been harping for several years about this distinction as well as for understanding sensemaking through social and technological means).
5. Analytics can be gamed. And they will be.
6. Analytics favour concreteness over accepting ambiguity. Some questions dont have answers yet.
7. The number/quantitative bias is not capable of anticipating all events (black swans) or even accurately mapping to reality (Long Term Capital Management is a good example of “when quants fail”: http://en.wikipedia.org/wiki/Long-Term_Capital_Management )
8. Analytics serve administrators in organizations well and will influence the type of work that is done by faculty/employees (see this rather disturbing article of the KPI influence in universities in UK: http://www.nybooks.com/articles/archives/2011/jan/13/grim-threat-british-universities/?page=1 )
9. Analytics risk commoditizing learners and faculty – see the discussion on Texas A & M’s use of analytics to quantify faculty economic contributions to the institution: http://www.nybooks.com/articles/archives/2011/jan/13/grim-threat-british-universities/?page=2 ).
10. Ethics and privacy are significant issues. How can we address the value of analytics for individuals and organizations…and the inevitability that some uses of analytics will be borderline unethical?

This type of criticism could be enough for anybody to give up already and turn their back to this field of science. I personally belief that this would a grave mistake. You would be moving against the strong and steady direction of technology’s tendencies.

SNAPP: Social network analysis

The assignment of the week was to take a look at Social Networks Adapting Pedagogical Practice (better known as SNAPP) and use it on the Moodle forums of the course. Since I had already played with it before I only looked at Dave Cormier‘s video of his experience with the tool:

[youtube=http://www.youtube.com/watch?v=ZHNM8FWrpLk]

Snapp’s website gives a good overview of some of the things that a tool like this can be used for. Think about finding disconnected or at-risk students, seeing who are the key information brokers in the class, use it for “before and after” snapshots of a particular intervention, etc.

Before I was able to use it inside my organisation I needed to make sure that the tool does not send any of the data it scrapes back home to the creators of the software (why wouldn’t it, it is a research project after all). I had an exchange with Lori Lockyer, professor at Wollongong, who assured me that:

SNAPP locally complies the data in your Moodle discussion forum but it does not send data from the server (where the discussion forum is hosted) to the local machine nor does it send data from the local machine to the server.

Making social networks inside applications (and ultimately inside organisations) more visible to many more people using standard interfaces is a nice future to look forward to. Which LMS is the first to have these types of graphs next to their forum posts? Which LMS will export graphs in some standard format for further processing with tools like Gephi?

Gephi is one of the tools by the way, that I really should start to experiment with sooner rather than later.

The intelligent spammability of open online courses: where are the vendors?

One thing that I have been thinking about in relation to these Open Online Courses is how easy it would be for vendors of particular related software products to come and crash the party. The open nature of these courses lends itself to spam I would say.

Doing this in an obnoxious way will ultimately not help you with this critical crowd, but being part of the conversation (Cluetrain anybody?) could be hugely beneficial from a commercial point of view. As a marketeer where else would you find as many people deeply interested into Learning Analytics as in this course? Will these people not be the influencers in this space in the near future?

So where are the vendors? Do you think they are lurking, or am I overstating the opportunity that lies here for them?

My participation in numbers

Every week I give a numerical update about my course participation (I do this in the spirit of the quantified self, as a motivator and because it seems fitting for the topic). This week I bookmarked 37 items on Diigo, wrote 3 Lak11 related tweets, wrote 5 Moodle forum posts and 1 blog post.