Thursday, December 25, 2008

A Tag-based Filesystem

Filenames are nice and all, but they're a terrible way to store metadata. Imagine having hundreds of personal files (a fresh install of any OS already gives you far more than that, but those aren't yours), all of which have some things in common and some things unique. If you use the filename to store the metadata on each, you can reach a point where two files need the same name. That only works if they're in separate directories, but what if they're both photos of Ted drunk on the same vacation on the same night? The best you can really get is "Ted drunk 1.jpg" and "Ted drunk 2.jpg", and while it works, the numbering is essentially arbitrary and meaningless.

On most Unix filesystems, which are case-sensitive, you could mess with the case of each letter, but that's just stupid and makes the result a pain to deal with. On Windows, the filesystem is case-insensitive but case-preserving. This means you can't have a readme.txt and a Readme.txt in the same directory on Windows. (Try it and see.)

The solution would be a special filesystem. It would allow the user to tag each file with relevant metadata, and define a tag hierarchy that could be used to build a virtual directory structure. We could drop the filename entirely as a way of finding files. We'd still have to store it, for compatibility with other systems (think uploading an image to your website or something), but that's the only case where the filename would need to be accessed. The filesystem could pretty much just be a relational database with tags and MD5/SHA-1 hashes to uniquely identify files, and then the file stored as a blob (that's "Binary Large Object", for the uninformed).
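To make the "relational database with tags, hashes, and blobs" idea a bit more concrete, here's a minimal sketch using SQLite. The table names, column names, and the open_store/add_file helpers are all made up for illustration, and I'm assuming SHA-1 as the identifying hash. It's a toy, not a filesystem.

```python
import hashlib
import sqlite3

def open_store(path=":memory:"):
    """Create the (hypothetical) schema: one table of files, one of tags."""
    db = sqlite3.connect(path)
    db.executescript("""
        CREATE TABLE IF NOT EXISTS files (
            hash TEXT PRIMARY KEY,  -- SHA-1 hex digest of the content
            name TEXT NOT NULL,     -- original filename, kept for compatibility
            data BLOB NOT NULL      -- the file itself
        );
        CREATE TABLE IF NOT EXISTS file_tags (
            hash TEXT NOT NULL REFERENCES files(hash),
            tag  TEXT NOT NULL,
            PRIMARY KEY (hash, tag)
        );
    """)
    return db

def add_file(db, name, data, tags):
    """Store a blob under its content hash and attach the given tags."""
    digest = hashlib.sha1(data).hexdigest()
    # Identical content hashes to the same key, so it is stored only once.
    db.execute("INSERT OR IGNORE INTO files VALUES (?, ?, ?)", (digest, name, data))
    db.executemany(
        "INSERT OR IGNORE INTO file_tags VALUES (?, ?)",
        [(digest, tag) for tag in tags],
    )
    db.commit()
    return digest
```

One side effect of keying on the hash: two byte-for-byte identical files collapse into a single row, which may or may not be what you want.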

Having the hashes gives our filesystem built-in file integrity checking. By comparing the current hash of the file with what it's expected to be, you can easily see if the file has changed. Since some websites are nice enough to give you the MD5 hash of the file you're downloading, we could have something on the order of a Firefox extension that could check the file after it downloads. Or I guess an Opera widget, if that's your browser.
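The check itself is only a few lines. Here's a rough sketch with Python's hashlib; the function names file_digest and is_intact are mine, and the expected hash is assumed to be a hex string like the ones download pages publish.

```python
import hashlib

def file_digest(path, algorithm="sha1", chunk_size=65536):
    """Hash a file in chunks so large files don't have to fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def is_intact(path, expected, algorithm="sha1"):
    """True if the file's current hash matches the expected hex digest."""
    return file_digest(path, algorithm) == expected.strip().lower()
```

A browser extension would just call something like is_intact(downloaded_path, hash_from_page, "md5") once the download finishes.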

Programs could still say "hey, make a directory in Program Files with this name and put all this stuff in it." What this would translate to in the filesystem is the aforementioned tag hierarchy, which is used to build the virtual directory structure. On Windows, Firefox installs to C:\Program Files\Mozilla Firefox. Our filesystem would translate this into the tags C, Program Files, and Mozilla Firefox. In the hierarchy, C would be the parent tag to Program Files, which would be the parent tag to Mozilla Firefox. You could even have another tag for that hard drive (other than C, which is rather meaningless and thus defeats the purpose of having tags, but on Windows systems would have to be there for compatibility), so if you're like me and you have one drive for your OS and applications, another drive for games, and a third for downloads, you could label them as such, kinda like you already can.
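The path-to-tags translation described above could look something like this. The helper names path_to_tags and parent_map are invented for the example, and I'm leaning on pathlib's PureWindowsPath to split the path, so it runs on any OS.

```python
from pathlib import PureWindowsPath

def path_to_tags(path):
    """Split a Windows path into an ordered chain of tags, drive first."""
    p = PureWindowsPath(path)
    tags = []
    if p.drive:
        tags.append(p.drive.rstrip(":"))  # "C:" becomes the tag "C"
    # parts[0] is the anchor ("C:\") when there is one; skip it.
    tags.extend(p.parts[1:] if p.anchor else p.parts)
    return tags

def parent_map(tags):
    """Map each tag in the chain to its parent tag."""
    return {child: parent for parent, child in zip(tags, tags[1:])}

chain = path_to_tags(r"C:\Program Files\Mozilla Firefox")
# chain is ["C", "Program Files", "Mozilla Firefox"]
```

Note that keying parents by tag name alone breaks down as soon as two directories share a name, so a real implementation would need tag IDs rather than bare strings.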

You could also have other tags that don't fit into the hierarchy, that can show up anywhere. For instance, think of your directory full of vacation photos. You probably have multiple vacations stored in there. You could tag the photos from each with the place you went, the people in the photo, the specific place the photo was taken, the date the photo was taken, and so forth. Then if you say to yourself one day "Let's look at that funny vacation photo of Ted dancing drunk on the kitchen counter", you could just look at all the photos with the tags Ted and drunk, and then from there use an image viewer to set up a slideshow. See how easily you could locate one specific file with tags?
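Finding "everything tagged Ted and drunk" is just a set intersection. A quick sketch, assuming a hypothetical in-memory index that maps each tag to the set of file IDs carrying it (the IDs here are placeholders, they'd be hashes in practice):

```python
def find_by_tags(index, *wanted):
    """Return the IDs of files that carry every one of the wanted tags."""
    if not wanted:
        return set()
    # A tag nobody uses contributes an empty set, so the result is empty too.
    sets = [index.get(tag, set()) for tag in wanted]
    return set.intersection(*sets)

index = {
    "Ted": {"photo1", "photo2", "photo3"},
    "drunk": {"photo2", "photo3", "photo9"},
    "kitchen": {"photo3"},
}
matches = find_by_tags(index, "Ted", "drunk")
# matches is {"photo2", "photo3"}; adding "kitchen" narrows it to {"photo3"}
```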

Files could be moved easily by simply changing a tag. You could have one file show up in multiple places (wherever it's relevant) simply by adding an extra tag to that file. Some tags could be generated automatically, such as the file type, filename, and the virtual directory structure. A program could find its files easily regardless of their name or any custom tags that might be on the file, and could even check the integrity of its libraries and such before loading them.
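In code, "moving" and "showing up in multiple places" are both tiny operations on a file's tag set. A sketch, assuming a hypothetical dict that maps file IDs to their tags; move and listing are names I made up:

```python
def move(tags_of, file_id, old_tag, new_tag):
    """'Move' a file by swapping one tag for another; no data is copied."""
    tags = tags_of[file_id]
    tags.discard(old_tag)
    tags.add(new_tag)

def listing(tags_of, tag):
    """Everything that would appear in the virtual directory for a tag."""
    return sorted(f for f, tags in tags_of.items() if tag in tags)

tags_of = {"photo1": {"Downloads"}, "photo2": {"Vacation"}}
move(tags_of, "photo1", "Downloads", "Vacation")  # now listed under Vacation
tags_of["photo1"].add("Ted")                      # and also under Ted
```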

There would, of course, be a browser and a search utility, but they'd basically be the same thing. It would be a bit weird to use with existing OSes. That's probably why the existing tag-based filesystem projects are simply virtual filesystem overlays done in FUSE.

This could make file type designations really easy, as there would just be an entry in the filesystem for the type of the file, probably the MIME type just to make other things easier.
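For files coming in from systems that only have filenames, the type entry would have to be filled in somehow. One stopgap, and this is my assumption rather than part of the design, is to guess it from the extension with Python's mimetypes module. The mime_tag helper and the fallback value are just for illustration.

```python
import mimetypes

def mime_tag(filename, default="application/octet-stream"):
    """Guess a MIME type from the filename, falling back to a generic one."""
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed or default

photo_type = mime_tag("Ted drunk 1.jpg")  # "image/jpeg"
```

Once the type is stored as metadata, the extension stops mattering and nothing needs to guess again.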

This would also pose some problems. Viruses, spyware, and other malicious files wouldn't have the common courtesy to tag themselves as such, and could probably hide from the user very easily by setting a zero-length tag. In addition, asshole software vendors could use the aforementioned zero-length tag trick to hide whatever they want from the user. There would have to be a check in the filesystem for zero-length tags to alert the user to them, and possibly even automatically generate tags based on the given filename when it's written to disk to eliminate this entirely. Also, there would have to be a way to prevent malicious programs from messing with the metadata and potentially corrupting the entire filesystem.
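The zero-length tag check and the filename fallback could be as simple as the sketch below. The function names are hypothetical, and splitting the filename on non-word characters is just one guess at how the automatic tags might be generated.

```python
import re
from pathlib import PurePath

def tags_from_filename(filename):
    """Derive fallback tags from the words in a filename (extension dropped)."""
    stem = PurePath(filename).stem
    return [word.lower() for word in re.split(r"[\W_]+", stem) if word]

def effective_tags(filename, tags):
    """Drop empty or whitespace-only tags; fall back to filename-derived ones."""
    cleaned = [t.strip() for t in tags if t.strip()]
    return cleaned or tags_from_filename(filename)
```

So a file written as "setup_helper.exe" with only a zero-length tag would still end up findable under setup and helper.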

A usability problem is that it would be difficult to transfer files over the internet between two computers both using this filesystem and keep the metadata intact. We shouldn't require the user to export the metadata to a separate file and distribute it along with the file for other users to import (though perhaps this functionality would be good to have for other purposes), and embedding it in the file itself would require extending the file format and would thus break compatibility with other systems which wouldn't know what to do with the added data. Perhaps a solution could be something similar to an archive format, like tar. It would have its own extension, and simple utilities could be written for other systems to extract the file out and show the metadata.
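Here's roughly what the tar-like wrapper could look like, using plain tar from Python's tarfile module with the metadata as a JSON member next to the file. The member names "metadata.json" and "content", and the pack/unpack helpers, are my own invention, not any existing format.

```python
import io
import json
import tarfile

def pack(name, data, tags):
    """Bundle a file's bytes and its tags into a single tar archive (bytes)."""
    meta = json.dumps({"name": name, "tags": sorted(tags)}).encode("utf-8")
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for member_name, payload in (("metadata.json", meta), ("content", data)):
            info = tarfile.TarInfo(member_name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()

def unpack(bundle):
    """Recover (name, data, tags) from an archive produced by pack()."""
    with tarfile.open(fileobj=io.BytesIO(bundle), mode="r") as tar:
        meta = json.load(tar.extractfile("metadata.json"))
        data = tar.extractfile("content").read()
    return meta["name"], data, meta["tags"]
```

Since it's ordinary tar underneath, a system that has never heard of this filesystem can still open the bundle, pull out the file, and read the tags as text.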

Quite the Christmas day post, huh?
