19 March 2009

Digital archiving - about metadata

The more I entangle myself in thoughts about keywording, the more I realise how hard it is to keep all this metadata stuff apart. Different standards, different types of metadata, duplication of data between standards... and that's just for starters.
So, in an attempt to regain a sort of bird's-eye view of what this is all about, I wanted to write it down. If it's useful to you as a reader too, I'm glad.

First, what is metadata anyway?
In my mind, it's as simple as "information about an image". With this definition, everything in an image data file that is not the image itself is metadata. I have included three graphics at the bottom of this post to illustrate how I imagine metadata is stored in various file formats. You may want to have a peek at those while you read.

Most image file formats have placeholders for metadata: little compartments where all kinds of information can be stored. There is actually a scary side to that: it is perfectly possible to embed malicious program code inside these compartments. Sony's PSP game console, for example, has been attacked by such code embedded inside TIFF files. But I digress. It just goes to show that practically anything can be embedded inside image files. The upside is that there is room for plenty of metadata. If only everyone could agree on how and what to put in there.

That's where the standards come in, of course. As far as I know there are three different standards to pay heed to: EXIF, IPTC and XMP. So I'll try to make heads or tails of these.

Every bit of info contained in these standards, by the way, can be referred to as a "tag".

EXIF specifies all kinds of technical info, like shutter speed and aperture for example. EXIF can be supplied by the camera automatically at the time of shooting. The tag set is the standard.

IPTC is a way of supplying various sorts of additional info that the photographer can provide, such as location, subject, copyright, title of image, keywords, free-text description and so on. As with EXIF, the tag set is the standard. The list of possible tags is longer than any photographer cares to know about; just have a look at the screen dump from ThumbsPlus v7 below.

Figure 1. The IPTC registration panel in ThumbsPlus

I find IPTC quite confusing, because there is no intuitive way of knowing exactly how each of the fields is supposed to be used. Take the tags "object name", "title", "headline" and "caption", for example. For all practical purposes, one or two of these will do. But which ones? The same kind of overlap exists between "caption", "keywords" and "supplemental categories". When I'm through with this study of archiving software I might have a clue, but how much time should a lazy photographer be required to spend just to embark on the tedious job of keywording images? Not this much, surely?
It makes me think that I will be partial to any software that takes as much of the thinking out of this process as possible.

XMP has a great advantage over the other two, in that it doesn't have to be embedded into the image file itself. XMP can also be stored in a sidecar file, a separate file that sits next to the image. I have slapped my forehead several times already (Homer Simpson style) for not realising until now the huge bonus this brings: it opens the door to keywording raw files. For file formats that support embedding, however, XMP will just occupy another one of those compartments I mentioned earlier.
XMP will gladly duplicate anything you can find in EXIF and IPTC, but it can also add a lot more. The X in XMP stands for eXtensible, and in principle it's so open that any software developer can add their own (proprietary) tags. So it's not really the tag set that constitutes the standard, but the method used to add tags and information into the compartment. That's both its beauty and its curse, it seems. Most of the tags used in XMP are described by Adobe and adhered to by everyone else. What's funny, though, is that Adobe doesn't quite adhere to them itself. That is, Lightroom has expanded the tag set somewhat for its own purposes, and the extra tags are not correctly implemented by Adobe Bridge (the file browser in Photoshop). :-)
In general, it seems that the software industry has been quite reluctant to adopt XMP, trying instead to keep IPTC as the main way of including metadata. There are some very interesting differences between my shortlisted programs in how they deal with IPTC and XMP, which I will address when considering the individual products.
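
To make the sidecar idea a bit more concrete, here is a minimal sketch in Python of what writing keywords to an XMP sidecar could look like. It is not how any particular program does it. The function names (build_xmp, sidecar_path, write_sidecar) are made up for illustration, and I am assuming the sidecar is named by swapping the raw file's extension for .xmp; some programs append .xmp to the full file name instead. Real XMP packets are usually also wrapped in an xpacket header and trailer, which this sketch leaves out. The keywords go into the Dublin Core "subject" tag as an unordered list (an RDF "Bag").

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Namespaces used by XMP for the wrapper, RDF and Dublin Core.
NS = {
    "x": "adobe:ns:meta/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}
for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)


def build_xmp(keywords):
    """Return an XMP document (as a string) holding the keywords in dc:subject."""
    xmpmeta = ET.Element(f"{{{NS['x']}}}xmpmeta")
    rdf = ET.SubElement(xmpmeta, f"{{{NS['rdf']}}}RDF")
    desc = ET.SubElement(rdf, f"{{{NS['rdf']}}}Description")
    desc.set(f"{{{NS['rdf']}}}about", "")
    subject = ET.SubElement(desc, f"{{{NS['dc']}}}subject")
    bag = ET.SubElement(subject, f"{{{NS['rdf']}}}Bag")
    for keyword in keywords:
        item = ET.SubElement(bag, f"{{{NS['rdf']}}}li")
        item.text = keyword
    return ET.tostring(xmpmeta, encoding="unicode")


def sidecar_path(raw_path):
    """Assumed naming convention: same name as the raw file, extension .xmp."""
    return Path(raw_path).with_suffix(".xmp")


def write_sidecar(raw_path, keywords):
    """Write the keywords next to the raw file; the raw file itself is untouched."""
    path = sidecar_path(raw_path)
    path.write_text(build_xmp(keywords), encoding="utf-8")
    return path
```

Calling write_sidecar("DSC_0001.NEF", ["norway", "fjord"]) would leave the NEF alone and put the keywords in DSC_0001.xmp beside it, which is exactly why sidecars make keywording raw files possible.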

My next post, however, will elaborate on the virtues of hierarchical keywords, which was part of my wishlist in the second post in this series. I will explain in a bit more detail what I am thinking about, and how it can be fitted into XMP and IPTC.

Figure 2. A raw file will contain EXIF data supplied by the camera, plus whatever proprietary information the camera vendor puts in. Nikon, for example, is infamous for encrypting the vendor-specific info in this part of their raw (*.NEF) files. It can be argued that the embedded thumbnail is actually a sort of metadata too, since it is just a representation of the real thing. I've never seen it referred to that way, though; it's just my personal opinion.
XMP data may be added to the file as a sidecar, while IPTC cannot be added.

Figure 3. TIFF and DNG files will embed just about anything. There may be proprietary camera information in here too, I think, but I believe that is supposed to go into the XMP compartment. If I'm wrong, it just means that there could be another blue box on top. :-)
DNG is based on TIFF, by the way, and both formats are controlled by Adobe...

Figure 4. A JPEG file will embed just about anything too, except for a thumbnail as far as I know.
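
For anyone who wants to see the compartments for themselves, here is a rough Python sketch that lists the segments at the start of a JPEG file and labels the ones holding EXIF, XMP and IPTC. It relies on the usual conventions: EXIF and XMP live in APP1 segments and are told apart by their signature strings, while IPTC normally sits inside a Photoshop resource block in APP13. The function names are my own, and the parser is deliberately simple: it stops at the start of the image data and does not handle padding bytes or other oddities, so treat it as an illustration and not as a robust reader.

```python
import struct

# (marker byte, signature at the start of the payload, label)
SIGNATURES = [
    (0xE1, b"Exif\x00\x00", "EXIF"),
    (0xE1, b"http://ns.adobe.com/xap/1.0/\x00", "XMP"),
    (0xED, b"Photoshop 3.0\x00", "IPTC (Photoshop resource block)"),
]


def describe(marker, payload):
    """Label a segment by its marker and the signature its payload starts with."""
    for sig_marker, signature, label in SIGNATURES:
        if marker == sig_marker and payload.startswith(signature):
            return label
    return "other"


def list_jpeg_segments(data):
    """Return (marker, label, payload_length) for each segment before the image data."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing start-of-image marker")
    segments = []
    pos = 2
    while pos + 4 <= len(data) and data[pos] == 0xFF:
        marker = data[pos + 1]
        if marker in (0xDA, 0xD9):  # start of scan or end of image: stop here
            break
        # The two-byte length counts itself but not the marker.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        payload = data[pos + 4:pos + 2 + length]
        segments.append((marker, describe(marker, payload), len(payload)))
        pos += 2 + length
    return segments
```

Feeding it the bytes of a file, for example list_jpeg_segments(open("photo.jpg", "rb").read()) with photo.jpg being any JPEG you have at hand, gives one entry per compartment in the order they appear in the file.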
