This afternoon I attended a session at info.nl in Amsterdam with Brewster Kahle, who wants to create “Universal Access to All Knowledge”. He founded The Internet Archive, a non-profit library with about 150 people. It is best known for its Wayback Machine (collecting about 5 billion web pages a month, amazingly still fitting in a container).
They are convinced that it is feasible to store all the world’s knowledge. Texts are being digitized (i.e. scanned) for display on screen (see Open Library for examples) and are openly available. The Internet Archive has built its own scanners, pushing the cost per scanned page (mostly labour) down to about 10 cents. Its scanning centers have now made 3,000,000 free ebooks available online (incl. 500,000 for the blind/dyslexic and 250,000 modern books available for lending) and they have about 8 million more to go. They have also built a bookmobile that can download and print a book for about one dollar.
They are also focusing on archiving all audio, offering unlimited storage and unlimited bandwidth, for free and forever, to bands who want to store their tapes online. They have over 1 million audio items in over 100 collections. They are doing similar things with moving images, making permanent archives of video sites that have gone out of business, home movies and even television (do check that one out: it makes TV news quotable and even includes a lending model for physical DVDs of TV news).
They store their 10 petabytes of data redundantly and also keep 600,000 books in a physical archive (growing fast, of course).
Brewster also talked a little about his case against the US government. He had received a national security letter from the FBI, which was deemed unconstitutional, with a bit of help from the EFF and from the fact that the Internet Archive is a library.
Daniel Erasmus from Digital Thinking Network (DTN) gave a short presentation on NewsConsole, which uses a big data approach and aims to collect all the world’s news and put it in an interface that makes it easy to interact with. I’ve been using it for a while to find news in the field of learning technology. I particularly liked his key lessons from working with big data:
- SQL won’t cut it
- Big data is messy, a lot of effort goes into cleaning it up
- Moving a petabyte of data is very expensive and difficult, so store it correctly the first time
- Testing on small subsets doesn’t work, because you get unexpected bottlenecks when you scale
- It is a humbling problem
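
To get a feel for the lesson about moving a petabyte, here is a rough back-of-the-envelope sketch of my own (it is not from Daniel's talk). It estimates how long a transfer would take over a single network link. The link speeds, the decimal definition of a petabyte and the `transfer_days` helper are all my assumptions, and real transfers will be slower because of protocol overhead and shared links.

```python
def transfer_days(size_bytes: float, link_bits_per_second: float,
                  efficiency: float = 1.0) -> float:
    """Days needed to move size_bytes over a link at a sustained rate.

    efficiency is the fraction of the nominal link speed actually achieved
    (1.0 is an idealised assumption).
    """
    if link_bits_per_second <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link speed must be positive and efficiency in (0, 1]")
    seconds = size_bytes * 8 / (link_bits_per_second * efficiency)
    return seconds / 86_400  # seconds per day


PETABYTE = 10**15  # decimal petabyte (assumption; a binary pebibyte is larger)

# Hypothetical link speeds, chosen only for illustration.
for label, rate in [("100 Mbit/s", 100e6), ("1 Gbit/s", 1e9), ("10 Gbit/s", 10e9)]:
    print(f"{label}: {transfer_days(PETABYTE, rate):.1f} days")
```

Even under these idealised assumptions, a fully saturated 1 Gbit/s link needs about three months for a single petabyte, which makes the advice to store it correctly the first time easy to appreciate.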