Where data goes when it dies and other musings

I’ve been wanting to write about Ma.gnolia’s catastrophic data loss last week ever since it happened, but wasn’t quite sure how I wanted to approach it. Larry (Ma.gnolia founder and the sole person who maintained the site) is a good friend of mine, and Ma.gnolia was one of Citizen Agency’s first clients. It’s been painful to see him struggle through this, both personally and professionally, and it’s about the worst possible [preventable] thing that can happen to a Web 2.0 service.

Still, kept in context, it’s made me reconsider some things about the nature and value of open, networked data.

I. How I Learned to Stop Worrying and Love the Bomb

According to Google’s cache of my profile on Ma.gnolia, I had accrued 5758 bookmarks and 6162 tags since I first started using the service August 08, 2004. That’s a lot of data capital to have instantly wiped out. You might think that I’d be angry, or disappointed. But I’m surprising zen about the whole thing. Even if I never got any of my bookmarks back, I don’t think I’d be that upset, and I’m not sure why.

If Flickr went down, I’d be pretty pissed. But Ma.gnolia for me was primarily a tool for publishing — something that I used to broadcast pointers to things that I took a momentary fancy in. There’s a lot of history in my bookmarks, no doubt. In some ways, it’s a record of all the things that I’ve read that I thought might be worth someone else reading (hence why my bookmarks are public), and clearly is a list of things that have affected and informed my thinking on a broad array of topics.

But, the beauty of bookmarks is that they’re secondary references to other things. The payload is elsewhere and distributed. So in some ways, yeah, I mean, there’s a lot of good data there that’s been lost (at least for the moment). But, the reality is that the legacy of my bookmarks are forever imbued in my brain as changes in how my synapses fire. The things that I can’t remember, well, perhaps they weren’t that important to begin with.

II. Start over; the blank slate.

Leopard Blank Slate

With the money I won from the Google/O’Reilly Open Source award last summer, I decided I’d break down and by myself a new MacBook Pro. As I was initially setting it up, I figured I’d transfer my previous system setup over from my Time Machine backup and just pick up from where I left off.

I did this, but once I logged in, the new MacBook lost it’s feeling of newness, and I felt encumbered. What amounted to bit-for-bit data portability left me feeling claustrophobic and restricted. I wanted the freedom of a clean system back; somehow buying a new machine wasn’t just about better performance, but about giving myself license to forget and to start over and to make new mistakes.

I wiped the hard drive and reinstalled OS X with the minimum options. I’ve installed about ten apps so far, and I intend to hold off on anything that I don’t feel an absolute need to install, taking a hint from Ethan Kaplan:

Twitter / Ethan Kaplan: @factoryjoe only install a ...

III. And the band played on

While I love the form-factor of my MacBook Air (now my previous system), the first generation just isn’t fast enough or beefy enough for the way that I use a Mac. It’s great for email and traveling and it really is the machine that I want to be using — just with better performance (though I hear the new models are much better).

Because the hard drive on the thing is pretty miniscule by today’s standards (80GB), I quickly maxed it out with music, videos, photos and screenshots. I was down to about 6GB of space, and OS X crawls when it can’t cache the shit out of everything so I decided to take aggressive action and deleted my entire 30GB iTunes library.

Command-A. Command-Delete. Empty Trash.

And then it was done.

Now, I still need iTunes for iPhone syncing, but now I have no local music store. With the combination of Spotify, SimplifyMedia and Pandora (using PandoraJam or PandoraBoy), I’ve got a good selection of music wherever I’ve got wifi.

The act of deleting my entire music library (okay fine, I do have a complete backup on my Mac Mini media center) was cathartic. All that data… in an instant, gone. All those ratings, all that metadata, all those play counts revealing my accumulated listening habits. Gone (well, except for my Last.fm’s profile).

Of course, it’s not like I had original, irreplaceable copies of these tracks. There are copies upon copies out there. And knowing this, I intentionally destroyed all this data without really worrying about whether I’d ever be able to re-experience or relive my music again. In fact, I didn’t even give it a thought.

But my system sure seems a bit faster now.

IV. Microformats are the vinyl of the web

Vinyl is 4 Ever by Bruce Berrien

The first thing that I thought about when I heard that Ma.gnolia had had “catastrophic data loss” was that Google and Yahoo probably had pretty good caches of the site, especially given its historically high PageRank. The second thing that I thought about was that, since the site was microformatted with XFN, xFolk and other formats, recovering structured data from these caches would likely be most reliable way of externally reconstituting Ma.gnolia, in lieu of other, more conventional data retrieval methods.

Though Larry is still engaged in a full out recovery process, it gave me some sense of pride and optimism that we had had the forethought to mark up Ma.gnolia with microformats. Indeed, this kind of archival purpose was something that Tantek had presaged in 2006:

Microformats from the beginning in my mind are serving two very important purposes.

  1. Microformats provide simple ways of identifying larger chunks of information on the Web for easily and immediately publishing, sharing, moving, aggregating, and republishing.
  2. Microformats are perhaps a step forward in providing building blocks for the longevity of higher fidelity information as well.

In talking with Tantek about this, he pointed out some interesting things about many modern web services, lamenting their apparent lack of concern over longevity. For example, clearly there is a great deal of movement afoot to advance the state of distributed social networking, as evidenced by XML and JSON-based protocols like Portable Contacts and Activity Streams. But these are primarily transaction-based protocols, and archive poorly (another argument for RESTful architectural, certainly).

I would therefore agree with Tantek’s oft-repeated admonishment that services that are serious about their data should always start by marking up their sites with microformats and then add additional APIs to provide functionality (as TripIt did). It’s simply good data hygiene. It’s also about the separation between form and function (or data and interactivity). And with emerging technologies like , people can now build arbitrary mashups from the HTML on your homepage, without even having to know about your custom API.

It also means that, in the event of catastrophe (Ma.gnolia’s case) or dissolution of a service (as in the cases of Pownce, Journalspace or Consumating), there is some hope for data refugees left out in the cold.

When APIs go dark, how do you do a data backup? (Answer: you often can’t.) With public, microformatted content, there will likely be a public archive that can be used to reconstitute at least portions of the service. With dynamic APIs and proprietary data formats, all bets are off.

V. Death and data reincarnation

With both the intentional and unintentional destruction of data recently, it’s given me lots to ponder about in terms of the value, relevance, importance and longevity of data.

I talk about “data capital” like it matters, because I suppose I want it to, and hope that someday it does make a difference just how much of yourself you share with the world, simply because it’s better to share than not to.

And now I’m in this funny situation where, because I did share, and shared openly (specifically on Ma.gnolia), there is the very real possibility of reincarnating my data from the ether of the web. It could just be that all the private data, including messages, private bookmarks and thanks are forever gone, because they were kept private. But those things which were made available to anyone and everyone, through that simple aspect, can be reconstituted by extracting their essence from the caches of the internet’s memory banks.

You think about photographs of people who have died, and of videos and other media. In the past several years we’ve had to start thinking about what happens to social networking profiles on Facebook, MySpace and Twitter of people who are no longer with us. Over time, societies have invented symbols and rituals to commemorate the dead, and often use items imbued with the deceased’s social residue to help them remember and recall and relive.

How do that work when those items are locked away in incompatible and proprietary data stores? How do we cope when technology gets between humans and their humanity?

The web is a fragile place it turns out, in spite of its redundancy and distributed design.

Efforts that threaten to close it up, lock it down or wall it into proprietary gardens are turning the web against us, against history and against civilization and the collective memory. This is perhaps one reason of the primary reasons why the open web is so important to me, and factors in so centrally to my work. As I grow older, perhaps I won’t always have perspective on which things will be the most important to me, but it’s critical that in the future, I don’t inhibit my and my progeny’s ability to access my digital legacy.

Ma.gnolia logoI find it fitting that Ma.gnolia uses an organic symbol as its logo. It has, for all intents and purposes, died.

But there is a silver lining here, and I think Larry intuitively understands: in the Ma.gnolia Open Source (M2) project, he had already sowed the seeds for Ma.gnolia’s rebirth. Though it is lamentable that a such disaster would occur, I believe that creative destruction is absolutely necessary to natural systems, as forest fires are critical to the lifecycle of forests.

I also believe that things happen for a reason and that the soil of this tragedy will lead to a new start and new growth. It’s not accidental that the design of M2 called for a distributed, redundant mesh of independent bookmarking service endpoints. If anything, this situation provides Larry license to start anew, proving the necessity of death, and the wisdom of genetic inheritance and variation.

Advertisements