Lak11 Week 2: Rise of “Big Data” and Data Scientists

These are my reflections and thoughts on the second week of Learning and Knowledge Analytics (Lak11). These notes are first and foremost to cement my own learning experience, so for everybody but me they might feel a bit disjointed.

What was week 2 about?

This week was an introduction to the topic of “big data”. As a result of all the exponential laws in computing, the amount of data that gets generated every single day is growing massively. New methods of dealing with the data deluge have cropped up in computer science. Businesses, governments and scientists are learning how to use the data that is available to their advantage. Some people actually think this will fundamentally change our scientific method (like Chris Anderson in Wired).

Big data: Hadoop

Hadoop is one of these things that I heard a lot about without ever really understanding what it was. This Scoble interview with the CEO of Cloudera made things a lot clearer for me.

[youtube=http://www.youtube.com/watch?v=S9xnYBVqLws]

Here is the short version: Hadoop is a set of open source technologies (it is part of the Apache project) that allows anyone to do large scale distributed computing. The main parts of Hadoop are a distributed filesystem and a software framework for processing large data sets on clusters.
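To make the framework half of that concrete: below is a toy, pure-Python sketch of the MapReduce pattern that Hadoop's processing framework implements. Word counting is the canonical example; no actual Hadoop is involved, this just mimics the map, shuffle and reduce phases in a few lines.

```python
from collections import defaultdict

def map_step(document):
    """Emit a (word, 1) pair for every word in the document."""
    for word in document.lower().split():
        yield (word, 1)

def shuffle(pairs):
    """Group emitted values by key, as Hadoop does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_step(key, values):
    """Sum the counts for a single word."""
    return (key, sum(values))

documents = ["big data is big", "data about data"]
pairs = (pair for doc in documents for pair in map_step(doc))
counts = dict(reduce_step(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'big': 2, 'data': 3, 'is': 1, 'about': 1}
```

The point of the real thing, of course, is that the map and reduce steps run in parallel on a cluster over data that is far too big for one machine.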

The technology is commoditised, imagination is what is needed now

The Hadoop story confirmed for me that this type of computing is already largely commoditised. The interesting problems in big data analytics are probably not technical anymore. What is needed isn’t more computing power, we need more imagination.

The MIT Sloan Management Review article titled Big Data, Analytics and the Path from Insights to Value says as much:

The adoption barriers that organizations face most are managerial and cultural rather than related to data and technology. The leading obstacle to widespread analytics adoption is lack of understanding of how to use analytics to improve the business, according to almost four of 10 respondents.

This means that we should start thinking much harder about what things we want to know that we couldn’t get before in a data-starved world. This means we have to start with the questions. From the same article:

Instead, organizations should start in what might seem like the middle of the process, implementing analytics by first defining the insights and questions needed to meet the big business objective and then identifying those pieces of data needed for answers.

I will therefore commit myself to try and formulate some questions that I would like to have answered. I think that Bert De Coutere’s use cases could be an interesting way of approaching this.

This BusinessWeek excerpt from Stephen Baker’s The Numerati gives some insight into where this direction will take us in the next couple of years. It profiles Haren, a mathematician at IBM who is busy working on algorithms that help IBM match expertise to demand in real time, creating teams of people that would maximise profits. In the example, one of the deep experts takes a ten-minute call while on the ski slopes. By doing that he:

[..] assumes his place in what Haren calls a virtual assembly line. “This is the equivalent of the industrial revolution for white-collar workers,”

Something to look forward to?

Data scientists, what skills are necessary?

This new way of working requires a new skill set. There was some discussion on this topic in the Moodle forums. I liked Drew Conway’s simple perspective: basically, a data scientist needs to sit at the intersection of Math & Statistics Knowledge, Substantive Expertise and Hacking Skills. I think that captures it quite well.

Data Science Venn Diagram (by Drew Conway)

How many people do you know who could occupy that space? The How do I become a data scientist? question on Quora has some very extensive answers as well.

Connecting connectivism with learning analytics

This week the third edition of the Connectivism and Connective Knowledge course has started too. George Siemens kicked off by posting a Connectivism Glossary.

It struck me that many of the terms that he used there are things that are easily quantifiable with Learning Analytics. Concepts like Amplification, Resonance, Synchronization, Information Diffusion and Influence are all things that could be turned into metrics for assessing the “knowledge health” of an organisation. Would it be an idea to get clearer and more common definitions of these metrics for use in an educational context?
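As a back-of-the-envelope illustration of what such metrics could look like (the reply data and the metric definitions below are entirely made up), even crude counts over a forum reply graph start to quantify something like Influence and Amplification:

```python
from collections import Counter

# Hypothetical forum data: each tuple is (who replied, whose post they replied to).
replies = [
    ("ann", "george"), ("bob", "george"), ("carol", "george"),
    ("george", "ann"), ("bob", "ann"), ("carol", "bob"),
]

# A crude "Influence" proxy: how many replies each participant attracts.
influence = Counter(author for _, author in replies)

# A crude "Amplification" proxy: how many replies each participant sends out.
amplification = Counter(replier for replier, _ in replies)

print(influence.most_common())  # george attracts the most replies
```

Real definitions would of course have to be much more careful (weighting by reach, time, content), which is exactly why clearer and more common definitions would be valuable.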

Worries/concerns from the perspective of what technology wants

Probably the most lively discussion in the Moodle forums was around critiques of learning analytics. My main concern about analytics is the kind of feedback loop it introduces once you make the analytics public. I expressed this with a reference to Goodhart’s law, which states that:

Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes

George Siemens did a very good job in writing down the main concerns here. I will quote them in full for my future easy reference.

1. It reduces complexity down to numbers, thereby changing what we’re trying to understand
2. It sets the stage for the measurement becoming the target (standardized testing is a great example)
3. The uniqueness of being human (qualia, art, emotions) will be ignored as the focus turns to numbers. As Gombrich states in “The Story of Art”: “The trouble about beauty is that tastes and standards of what is beautiful vary so much”. Even here, we can’t get away from this notion of weighting/valuing/defining/setting standards.
4. We’ll misjudge the balance between what computers do best…and what people do best (I’ve been harping for several years about this distinction as well as for understanding sensemaking through social and technological means).
5. Analytics can be gamed. And they will be.
6. Analytics favour concreteness over accepting ambiguity. Some questions don’t have answers yet.
7. The number/quantitative bias is not capable of anticipating all events (black swans) or even accurately mapping to reality (Long Term Capital Management is a good example of “when quants fail”: http://en.wikipedia.org/wiki/Long-Term_Capital_Management )
8. Analytics serve administrators in organizations well and will influence the type of work that is done by faculty/employees (see this rather disturbing article of the KPI influence in universities in UK: http://www.nybooks.com/articles/archives/2011/jan/13/grim-threat-british-universities/?page=1 )
9. Analytics risk commoditizing learners and faculty – see the discussion on Texas A & M’s use of analytics to quantify faculty economic contributions to the institution: http://www.nybooks.com/articles/archives/2011/jan/13/grim-threat-british-universities/?page=2 ).
10. Ethics and privacy are significant issues. How can we address the value of analytics for individuals and organizations…and the inevitability that some uses of analytics will be borderline unethical?

This type of criticism could be enough for anybody to give up and turn their back on this field of science. I personally believe that this would be a grave mistake: you would be moving against the strong and steady direction of technology’s tendencies.

SNAPP: Social network analysis

The assignment of the week was to take a look at Social Networks Adapting Pedagogical Practice (better known as SNAPP) and use it on the Moodle forums of the course. Since I had already played with it before, I only watched Dave Cormier’s video of his experience with the tool:

[youtube=http://www.youtube.com/watch?v=ZHNM8FWrpLk]

SNAPP’s website gives a good overview of some of the things that a tool like this can be used for: think about finding disconnected or at-risk students, seeing who the key information brokers in the class are, making “before and after” snapshots of a particular intervention, etc.

Before I was able to use it inside my organisation I needed to make sure that the tool does not send any of the data it scrapes back home to the creators of the software (why wouldn’t it? it is a research project after all). I had an exchange with Lori Lockyer, professor at Wollongong, who assured me that:

SNAPP locally compiles the data in your Moodle discussion forum but it does not send data from the server (where the discussion forum is hosted) to the local machine nor does it send data from the local machine to the server.

Making social networks inside applications (and ultimately inside organisations) more visible to many more people using standard interfaces is a nice future to look forward to. Which LMS is the first to have these types of graphs next to their forum posts? Which LMS will export graphs in some standard format for further processing with tools like Gephi?

Gephi, by the way, is one of the tools that I really should start experimenting with sooner rather than later.
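As a sketch of what such an export could look like: GEXF is the XML format that Gephi opens natively, and a minimal file can be produced with nothing but the Python standard library. This is a bare-bones subset of the real GEXF specification, for illustration only:

```python
import xml.etree.ElementTree as ET

def forum_graph_to_gexf(edges):
    """Serialise a (replier, author) edge list to a minimal GEXF document.
    Sketch only: the real GEXF spec supports attributes, weights, dynamics."""
    people = sorted({person for edge in edges for person in edge})
    ids = {name: str(i) for i, name in enumerate(people)}

    gexf = ET.Element("gexf", {"xmlns": "http://www.gexf.net/1.2draft",
                               "version": "1.2"})
    graph = ET.SubElement(gexf, "graph", {"defaultedgetype": "directed"})
    nodes = ET.SubElement(graph, "nodes")
    for name in people:
        ET.SubElement(nodes, "node", {"id": ids[name], "label": name})
    edges_el = ET.SubElement(graph, "edges")
    for i, (src, dst) in enumerate(edges):
        ET.SubElement(edges_el, "edge", {"id": str(i),
                                         "source": ids[src],
                                         "target": ids[dst]})
    return ET.tostring(gexf, encoding="unicode")

xml_doc = forum_graph_to_gexf([("ann", "george"), ("bob", "george")])
print(xml_doc[:60])
```

An LMS that emitted this next to every forum would make the "export for further processing" scenario almost trivial.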

The intelligent spammability of open online courses: where are the vendors?

One thing that I have been thinking about in relation to these Open Online Courses is how easy it would be for vendors of related software products to come and crash the party. The open nature of these courses lends itself to spam, I would say.

Doing this in an obnoxious way will ultimately not help you with this critical crowd, but being part of the conversation (Cluetrain anybody?) could be hugely beneficial from a commercial point of view. As a marketeer, where else would you find as many people deeply interested in Learning Analytics as in this course? Will these people not be the influencers in this space in the near future?

So where are the vendors? Do you think they are lurking, or am I overstating the opportunity that lies here for them?

My participation in numbers

Every week I give a numerical update about my course participation (I do this in the spirit of the quantified self, as a motivator and because it seems fitting for the topic). This week I bookmarked 37 items on Diigo, wrote 3 Lak11 related tweets, wrote 5 Moodle forum posts and 1 blog post.

Lak11 Week 1: Introduction to Learning and Knowledge Analytics

Every week I will try and write down some reflections on the Open Online Course: Learning and Knowledge Analytics. These will be written for myself as much as for anybody else, so I have to apologise in advance for the fact that there will be nearly no narrative and a mix of thoughts on the contents of the course and on the process of the course.

So what do I have to write about this week?

My tooling for the course

There is a lot of stuff happening in these distributed courses and keeping up with the course required some setup and preparation on my side (I like to call that my “tooling”). So what tools do I use?

A lot of new material to read is created every day: tweets with the #lak11 hashtag, posts in all the different Moodle forums, Google Groups and Learninganalytics.net messages from George Siemens, and Diigo/Delicious bookmarks. Thankfully all of these information resources provide RSS feeds and I have been able to add them all to a specially made Lak11 folder in my Google Reader (RSS feed). That folder sorts its messages based on time (oldest first), giving me some understanding of the temporal aspects of the course and making sure I read a reply after the original message. A couple of times a day I use the excellent MobileRSS reader on my iPad to read through all the messages.
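That oldest-first merging is simple enough to sketch by hand. Assuming plain RSS 2.0 feeds with RFC 822 pubDate fields (which is what most of these services emit), combining feeds and ordering replies after originals could look like this:

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

def oldest_first(rss_documents):
    """Merge items from several RSS feed documents and sort them oldest
    first, mimicking the Google Reader folder setting described above.
    Sketch: assumes plain RSS 2.0 with RFC 822 pubDate fields."""
    items = []
    for doc in rss_documents:
        for item in ET.fromstring(doc).iter("item"):
            title = item.findtext("title")
            published = parsedate_to_datetime(item.findtext("pubDate"))
            items.append((published, title))
    return [title for published, title in sorted(items)]

feed = """<rss><channel>
  <item><title>Reply</title><pubDate>Tue, 18 Jan 2011 10:00:00 +0000</pubDate></item>
  <item><title>Original post</title><pubDate>Mon, 17 Jan 2011 09:00:00 +0000</pubDate></item>
</channel></rss>"""
print(oldest_first([feed]))  # ['Original post', 'Reply']
```

A real aggregator would fetch the feeds over HTTP and cope with Atom and missing dates, but the temporal trick is just this sort.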

There is quite a lot of reading to do. At the beginning of the week I read through the syllabus and make sure that I download all the PDF files to GoodReader on the iPad. All web articles are stored for later reading using the Instapaper service. I have given both GoodReader and Instapaper Lak11 folders. I do most of the reading of these articles on the train. GoodReader allows me to highlight passages and store bookmarks in the PDF file itself. With Instapaper this is a bit more difficult: when I read a very interesting paragraph I have to highlight it and email it to myself for later processing.

Each and every resource that I touch for the course gets its own bookmark on Diigo. Next to the relevant tags for the resource I also tag them with lak11 and weekx (where x is the number of the week) and share them to the Learning Analytics group on Diigo. These will provide me with a history of the interesting things I have seen during the course and should help me in writing a weekly reflective post.

So far for the “consumer” side of things. As a “producer” I participate in the Moodle forums. I can easily find all my own posts through my Moodle profile and I hope to use some form of screen-scraper at the end of the course to pull a copy of everything that I have written. I use this WordPress.com hosted blog to write and reflect on the course materials and tag my course-related posts with “lak11” so that they show up on their own page (and have their own feed in case you are interested). On Twitter I occasionally tweet with #lak11, mostly to refer to a Moodle or blog post that I have written or to try and ask the group a direct question.
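A screen-scraper of the kind mentioned above could be as simple as the sketch below. Note that the class="posting" selector is a guess at Moodle's forum markup, not something I have verified, and the parser deliberately ignores complications like void tags (`<br>`) inside a post:

```python
from html.parser import HTMLParser

class ForumPostExtractor(HTMLParser):
    """Collect the text of every element with class="posting".
    The class name is a hypothetical stand-in for whatever the real
    Moodle forum HTML uses; check the page source first."""
    def __init__(self):
        super().__init__()
        self._depth = 0   # >0 while we are inside a posting element
        self.posts = []

    def handle_starttag(self, tag, attrs):
        if self._depth:
            self._depth += 1
        elif ("class", "posting") in attrs:
            self._depth = 1
            self.posts.append("")

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.posts[-1] += data

parser = ForumPostExtractor()
parser.feed('<div class="posting">My first post</div><p>noise</p>'
            '<div class="posting">My second post</div>')
print(parser.posts)  # ['My first post', 'My second post']
```

Feeding it the HTML of my Moodle profile's post listing (fetched while logged in) would give me the archive I am after.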

What is missing? The one thing that I don’t use yet is something like a mind mapping or a concept mapping tool. The syllabus recommends VUE and CMAP, and one of the assignments each week is to keep updating a map for the course. These tools don’t seem to have an iPad equivalent. There are some good mind mapping tools for the iPad (my favourite is probably iThoughtsHD, watch this space for a mind mapping comparison of iPad apps), but I don’t seem to be able to fit using one into my workflow for the course. Maybe I should just try a little harder.

My inability to “skim and dive”

This week I reconfirmed my inability to “skim and dive”. For these things I seem to be an all or nothing guy. There are magazines that I read completely from the first page to the last page (e.g. Wired). This course seems to be one of these things too. I read every single thing. It is a bit much currently, but I expect the volume of Moodle and Twitter messages to go down quite significantly as the course progresses. So if I can just about manage now, it should become relatively easy later on.

The readings of this week

There were quite a few academic papers in the readings of this week. Most of them provided an overview of educational data mining or academic/learning analytics. Many of the discussions in these papers seemed quite superficial to me. They are probably good references to keep and have a wealth of bibliographical material that I could look at at some point in the future. For now, they lacked any true new insights for me and appeared to be pretty basic.

Live sessions

Unfortunately I wasn’t able to attend any of the Elluminate sessions and I haven’t listened to them yet either. I hope to catch up this week with the recordings and maybe even attend the guest speaker live tomorrow evening.

Marginalia

It has been a while since I last actively participated in a Moodle facilitated course. Moodle has again proven to be a very effective host for forum based discussions. One interesting Moodle add-on that I had not seen before is Marginalia, a way to annotate forum posts in Moodle itself; annotations can be private or public. Look at the following video to see it in action.

[blip.tv ?posts_id=4054581&dest=-1]

I wonder if I will use it extensively in the next few weeks.

Hunch

One thing that we were asked to try out as an activity was Hunch. For me it was interesting to see all the different interpretations that people in the course had about how to pick up this task and what the question (What are the educational uses of a Hunch-like tool for learning?) actually meant. A distributed course like this creates a lot of redundancy in the answers. I also noted that people kept repeating a falsehood (that you need to use Twitter/Facebook to log in). My explanation of how Hunch could be used by the wary was not really picked up. It is good to be reminded at times that most people in the world do not share my perspective on computers and my literacy with the medium. Thinking otherwise is a hard-to-escape consequence of living in a techno-bubble with the other “digerati”.

I wrote the following on the topic (in the Moodle forum for week 1):

Indeed the complete US-centricness of the service was the first thing that I noticed. I believe it asked me at some point on what continent I am living. How come it still asks me questions to which I would never have an answer? Are these questions crowdsourced too? Do we get them randomly or do we get certain questions based on our answers? It feels like the former to me.

The recommendations that it gave me seemed to be pretty random too. The occasional hit and then a lot of misses. I had the ambition to try out the top 5 music albums it would recommend me, but couldn’t bear the thought of listening to all that rock. This did sneak a little thought into my head: could it be that I am very special? Am I so eclectic that I can defeat all data mining efforts? Am I the Napoleon Dynamite of people? Of course I am not, but the question remains: does this work better for some people than for others?

One other thing that I noticed was how the site seemed to use some of the tricks of an astrologer: who wouldn’t like “Insalata Caprese”? That seems like a safe recommendation to me.

In the learning domain I could see an application as an Electronic Performance Support System. It would know what I need in my work and could recommend the right website to order business cards (when it sees I go to a conference) or an interesting resource relating to the work that I am doing. Kind of like a new version of Clippy, but one that works.
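For a feel of what could be under the hood of such recommendations: Hunch's actual algorithms are not public, but the textbook nearest-neighbour approach looks roughly like this (all names and ratings invented for illustration):

```python
from math import sqrt

# Hypothetical ratings: user -> {item: rating}.
ratings = {
    "me":    {"moodle-book": 5, "gephi-tutorial": 4},
    "alice": {"moodle-book": 5, "gephi-tutorial": 5, "hadoop-guide": 4},
    "bob":   {"jazz-album": 5, "rock-album": 4},
}

def cosine(u, v):
    """Cosine similarity between two sparse rating vectors."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm

def recommend(user):
    """Suggest unseen items from the most similar other user."""
    others = [(cosine(ratings[user], ratings[o]), o)
              for o in ratings if o != user]
    _, nearest = max(others)
    return [item for item in ratings[nearest] if item not in ratings[user]]

print(recommend("me"))  # ['hadoop-guide']
```

This also makes the earlier complaint concrete: with sparse or eclectic ratings the "nearest" user is barely near at all, and the suggestions come out looking random.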

By the way, in an earlier blog post I have written about how recommendation systems could turn us all into mussels (although I don’t really believe that).

Corporate represent!

Because of a very good intervention by George Siemens, the main facilitator of the course, we are now starting to have a good discussion about analytics in corporate situations here. The corporate world has learning as a secondary process (very much as a means to a goal) and that creates a slightly different viewpoint. I assume the corporate people will form their own subgroup in some way in this course. Before the end of next week I will attempt to flesh out some more use cases following Bert De Coutere’s examples here.

Bersin/KnowledgeAdvisors Lunch and Learn

At the end of January I will be attending a free Bersin/KnowledgeAdvisors lunch and learn titled Innovation in Learning Measurement – High Impact Measurement Framework in London (this is one day before the Learning Technologies 2011 exhibit/conference). I would love to meet other Lak11 participants there. Will that happen?

My participation in numbers

Every week I will try and give a numerical update about my course participation. This week I bookmarked 33 items on Diigo, wrote 10 Lak11 related tweets, wrote 25 Moodle forum posts and 2 blog posts.

Workflow Driven Apps Versus App Driven Workflow

Arjen Vrielink and I write a monthly series titled: Parallax. We both agree on a title for the post and on some other arbitrary restrictions to induce our creative process. This month we write about how the constant flux of new apps and platforms influences your workflow. We do this by (re-)viewing our workflow from different perspectives. After a general introduction we write a paragraph of 200 words each from the perspective of 1. apps, 2. platform and 3. workflow itself. You can read Arjen’s post with the same title here.

Instapaper on my iPhone

To me a workflow is about two things mainly: the ability to capture things and the ability to time-shift. Both of these need to be done effectively and efficiently. So let’s take a look at three separate processes and see how they currently work for me: task/todo management, sharing with others and reading news and interesting articles (not books). So how do I work nowadays for each of these three things?

Workflow
I use Toodledo for my task/todo management. Whenever I “take an action” or think of something that I need to do at some point in the future I fire up Toodledo and jot it down. Each item is put in a folder (private, work, etc.), gets a due date (sometimes with a timed reminder to email if I really cannot forget to do it) and is given a priority (which I usually ignore). At the beginning and end of every day I run through all the tasks and decide in my head what will get done.

For me it is important to share what I encounter on the web, and my thoughts about it, with the rest of the world. I do this in a couple of different ways: explicitly through Twitter, through Twitter by using a Bit.ly sidebar in my browser, in Yammer if it is purely for work, on this WordPress.com blog, through public bookmarks on Diigo, by sending a direct email or by clicking the share button in Google Reader.

I have subscribed to 300+ RSS feeds, and often when I am scanning them I find something interesting but don’t have the opportunity to read it at that time. I use Instapaper to capture these articles and make them available for easy reading later on. Instapaper doesn’t work with PDF based articles, so I send those to a special email address so that I can pick them up with my iPad and save them to GoodReader when it is convenient.

Platform
“Platform” can have multiple meanings. The operating system was often called a platform. When you heavily invested in one platform it would become difficult to do any of your workflows with a different platform (at my employer this has been the case for many years with Microsoft and Exchange: hard to use anything else). Rich web applications have now turned the Internet itself into a workflow platform. This makes the choice for an operating system nearly, if not totally, irrelevant. I regularly use Ubuntu (10.04, too lazy to upgrade so far), Windows Vista (at work) and iOS (both on the iPhone and the iPad). All of the products and services mentioned either have specialised applications for the platform or are usable through any modern web browser. The model I prefer right now is one where there is transparent two-way synching between a central server/service and the different local apps, allowing me access to my latest information even if I am not online (Dropbox for example uses this model and is wonderful).

What I have noticed though, is that I have strong preferences for using a particular platform (actually a particular device) for doing certain tasks. The iPad is my preference for any reading of news or of articles: the “paginate” option on Instapaper is beautiful. Sharing is best done with something that has a decent keyboard and Toodledo is probably used the most with my iPhone because that is usually closest at hand.

Apps
Sharing is a good example of something where the app very much drives my behaviour: the app where I initially encounter the thing I want to share needs to support my sharing means of choice. This isn’t optimal at all: if I read something interesting in MobileRSS on the iPad that I want to share on Yammer, then I usually email the link from MobileRSS to my work email address; once at work I copy it from my mail client into the browser version of Yammer and add my comments. This is mainly because Yammer (necessarily) has to be closed off from the rest of the world with its APIs.

Services that create the least hiccups in my workflow are those that have a large separation between the content/data of the service and the interface. Google Reader and Toodledo both provide very complete APIs that allow anybody to create an app that accesses the data and displays it in a smart way. The disadvantage of these services is that I am usually dependent on a single provider for the data. In the long term this is probably not sustainable. Things like Unhosted are already pointing to the future: an even stricter separation between data and app. Maybe in that future, the workflow can start driving the app instead of the other way around.

Learning and Knowledge Analytics 2011: I Will Participate

Mining Social Networks (The Economist/Andy J. Miller)

George Siemens has written about the upcoming Learning and Knowledge Analytics 2011 course (#lak11). After reading the very interesting draft syllabus I have decided to actively participate. This means you should be seeing reflections about the course in this very blog soon. The dedicated Moodle site for the course asks participants to introduce themselves and write about their course expectations. I have posted the following:

I am a 34 year old guy from Amsterdam in the Netherlands. I work as the “Innovation Manager for Global Learning Technologies” at Shell International (at the headquarters in The Hague). Before this job I was heavily involved with the Moodle project as an e-learning consultant working for the Dutch Moodle Partner (Stoas Learning). Before that I was a teacher at a high school in Amsterdam (I taught PE and project based education).

I love technology and am deeply interested in how it affects society. One of my business cards uses my favourite quote (from Yochai Benkler): “Technology creates feasibility spaces for social practice” (see here for more context). To me, this open course is an example too of a practice enabled by technological possibilities.

My blog can be found at http://blog.hansdezwart.info and you should also find links to my other social networking presences there. I try to blog regularly and what I write about this course is here.

I intend to actively participate in this course. For me this means:

  • Spending time to read and annotate all the course materials during my commute (1.5 hours each way) on my iPad.
  • Writing reflections at least once a week on my blog.
  • Doing all the suggested activities and participating actively in the Moodle forums.
  • Trying to attend the weekly live Elluminate sessions (if the timezone agrees with my schedule) or at least watching the recordings.

If I manage to do the above, then the course will be a success for me. The topic is inherently fascinating to me and I would love to discover how learning and knowledge analytics could help my professional practice.

Looking forward to meeting other participants and learning together!

It would be great if some of my readers would also be able to join!

Blogging 2010 in Review (Stats Generated by WordPress)

Apologies, the below is automatically created by WordPress. It is mainly interesting for me, myself and I…

The stats helper monkeys at WordPress.com mulled over how this blog did in 2010, and here’s a high level summary of its overall blog health:

Healthy blog!

The Blog-Health-o-Meter™ reads Wow.

Crunchy numbers


About 3 million people visit the Taj Mahal every year. This blog was viewed about 25,000 times in 2010. If it were the Taj Mahal, it would take about 3 days for that many people to see it.

In 2010, there were 29 new posts, growing the total archive of this blog to 103 posts. There were 50 pictures uploaded, taking up a total of 3 MB. That’s about 4 pictures per month.

The busiest day of the year was December 6th with 282 views. The most popular post that day was So what did I learn at Online Educa 2010?.

Where did they come from?

The top referring sites in 2010 were twitter.com, hansdezwart.info, moodle.org, siloinsiproche.com, and dommel-valley.org.

Some visitors came searching, mostly for imdb api, teaching, moodle 2.0, segway, and elgg.

Attractions in 2010

These are the posts and pages that got the most views in 2010.

1. So what did I learn at Online Educa 2010? (December 2010, 10 comments)
2. Did You Know Moodle 2.0 Will….? (Online Educa 2009) (December 2009, 34 comments)
3. Where is IMDB’s API? (May 2009, 4 comments)
4. Moodle Books from Packt Publishing (January 2009, 12 comments)
5. The Future of Moodle and How Not To Stop It (iMoot 2010) (February 2010, 11 comments)