I've had a couple of conversations recently with Bruce D'Arcus, in particular about the tendency for people working on bibliographic software of one sort or another to get stuck on the BibTeX data model. We are agreed that that data model is basically broken; it was never perfect and it has become less so as time has passed.
As an aside before I continue, I once had cause to try to modify a BibTeX style (chicago, which seemed at the time to have a rather nasty formatting bug). It turns out that BibTeX styles are written in a postfix stack language. If I hadn't learned to use an HP reverse Polish calculator as a kid I would have been completely bamboozled by it. I got somewhere with my modifications, but then I came across a script which would ask a series of questions and build a BibTeX style accordingly, so I was released from having to learn a truly obscure language properly.
Bruce's weblog is probably a good place to watch for news on ongoing efforts to develop new ways of encoding bibliographic data. Most of these are developing XML schemas for the purpose, while there is at least one RDF based effort. Unfortunately many of these are doing no more than expressing the old BibTeX data model in XML or RDF.
Bruce himself is a MODS evangelist. I haven't had time to get my head round MODS, and the fact that I feel I need to spend time doing so worries me slightly. From what I can gather, MODS is largely a simplification of the MARC 21 (MAchine-Readable Cataloging) bibliographic format, and is developed by the US Library of Congress. I say largely a simplification because, for example,
Most MODS elements have equivalent ones in MARC 21, although there are a few that do not exist in MARC 21 because they were felt to be particularly important for digital resources, a particular target for MODS. [MODS User Guidelines]
This close relationship with MARC appears to carry over a lot of cruft, and I fear that MODS as a result is just too confusing to see widespread adoption in personal tools. This impression might in part be because MODS is explained by reference to MARC, but that doesn't help much, since it suggests that a full understanding of MODS would require considerable delving into MARC.
So what's wrong with BibTeX, anyway? What do we need as a modern replacement? Much current interest in bibliography management is around formatting bibliographies in XML and HTML documents, so an XML-based format would seem to be a starting point. This requirement would be served well enough by a syntax translation to XML. Unfortunately BibTeX has a hard-coded set of publication types which only serves a fairly narrow set of users (physical scientists), and that set has not, for example, been expanded to allow easy reference to online resources.
Things get even more interesting if you look beyond simple reference management and bibliography formatting. I'm sure I've whinged here before about the absence of software which really supports the research process. Pybliographer, like EndNote, is a simple card file at heart. In the longer run, bibliographic data needs to be part of a rich information management environment, not a standalone application designed to ease the pain of formatting references to a journal's specification. But it seems that the requirements for these two kinds of tool might conflict.
Since I've been thinking about RDF recently, and since I'm pretty sure that graph semantics such as those of RDF are essential in an information management tool, I started thinking about using RDF in this context this afternoon. Here's a first take, using FOAF for names (a combination of foaf:Person nodes and VCard name details might be better; for that matter, I wonder whether FOAF shouldn't be using VCard elements anyway). The schema is left implicit.
Ideally an RDF model would model publications and the people and organisations involved in making them sensibly. A publication node, instead of having the name of an author listed as a string, should be linked directly to a node representing the person in question. This way my bibliographic database can mesh with, for example, my address book, and I should be able to click on an author in a view of a bibliographic record and be taken to that author's contact details if I have them.
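To make the idea concrete, here is a minimal sketch in plain Python of a publication node linked to a person node rather than to a name string. It is not the RDF example from this post: the `http://example.org/bib/` namespace, the `title` and `author` properties, the paper title and the mailbox are all invented for illustration. Only the FOAF namespace and its `name` and `mbox` properties are real.

```python
# A graph as a set of (subject, predicate, object) triples.
FOAF = "http://xmlns.com/foaf/0.1/"
EX = "http://example.org/bib/"  # hypothetical namespace, for illustration only

PUB = EX + "pub/harvey2004"      # made-up publication node
PERSON = EX + "person/hharvey"   # made-up person node

triples = {
    (PUB, EX + "title", "An Example Paper"),        # invented title
    (PUB, EX + "author", PERSON),                   # link to a node, not a string
    (PERSON, FOAF + "name", "Hamish Harvey"),
    (PERSON, FOAF + "mbox", "mailto:hamish@example.org"),  # invented address
}


def objects(subject, predicate):
    """Return every object attached to subject via predicate, sorted."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


def author_details(pub):
    """Follow the author links from a publication to each person's properties."""
    return {
        person: {p: o for s, p, o in triples if s == person}
        for person in objects(pub, EX + "author")
    }
```

Because the author is a node, the same node could also be the subject of address book entries, which is what would let a bibliographic view jump to contact details.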
Here is an example of the conflict mentioned above. Instead of recording author names as strings, I record a reference to a node representing the author, to which the details of the author's name are attached. When I need to generate a reference list, how can I ensure that the author's name is rendered correctly in each instance? Somehow the link from publication to author also needs to record the way the name was listed on the publication itself (Hamish Harvey, or H. Harvey?).
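One possible way round this, sketched below in plain Python, is to make the link itself a small record: it points at the person and also carries the name as printed on that particular publication. In RDF terms this would mean an intermediate node between publication and person. The class and field names (`Person`, `Authorship`, `name_as_printed`) are my own invention for the sketch, not part of any existing vocabulary.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Person:
    """One node per real person, shared across publications."""
    canonical_name: str


@dataclass(frozen=True)
class Authorship:
    """The link from a publication to a person (hypothetical structure)."""
    person: Person
    name_as_printed: Optional[str] = None  # how this publication listed the name


def rendered_name(authorship):
    """Prefer the name as printed; fall back to the person's canonical name."""
    return authorship.name_as_printed or authorship.person.canonical_name


def format_authors(authorships):
    """Render an author list in the order given."""
    return ", ".join(rendered_name(a) for a in authorships)
```

The person node stays unique, so the address book link survives, while each reference can still be rendered exactly as the publication listed it.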
Problems like this must be surmountable, surely. More fundamental from a data modelling point of view is the use of specific publication types (even if extended from BibTeX's limited set). Bruce argues that this focus on publication types is fundamentally flawed, and I think he might have a point.
One tricky aspect of using RDF is that the same RDF can be serialised in XML in any number of different ways. For XSL trickery to work in rendering reference lists, I imagine that further constraint would be necessary. One might perhaps write an XML schema, a valid instance of which is also valid RDF. If an RDF store were used for holding reference information, then the serialisation from this store would have to be programmed to produce RDF/XML which conforms to this schema, otherwise later XML-only processing tools would choke. The example was not written with this in mind, but it shows the RDF graph reasonably clearly, I think.
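As a rough sketch of what such a constrained serialiser might look like, the Python below always emits one shape of RDF/XML: one rdf:Description element per subject, subjects sorted, properties sorted within each subject. The input format (tuples carrying a namespace and local name for each predicate) and the `http://example.org/bib#` namespace are assumptions made for the sketch. It does not handle blank nodes, datatypes or language tags.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EX = "http://example.org/bib#"  # hypothetical vocabulary namespace


def serialise(triples):
    """Serialise triples to one fixed RDF/XML shape.

    Each triple is (subject_uri, (namespace, local_name), value, is_resource).
    """
    ET.register_namespace("rdf", RDF)
    ET.register_namespace("ex", EX)
    root = ET.Element(f"{{{RDF}}}RDF")

    # Group by subject so each subject gets exactly one Description element.
    by_subject = {}
    for subject, predicate, value, is_resource in triples:
        by_subject.setdefault(subject, []).append((predicate, value, is_resource))

    for subject in sorted(by_subject):
        desc = ET.SubElement(
            root, f"{{{RDF}}}Description", {f"{{{RDF}}}about": subject}
        )
        for (ns, local), value, is_resource in sorted(by_subject[subject]):
            prop = ET.SubElement(desc, f"{{{ns}}}{local}")
            if is_resource:
                prop.set(f"{{{RDF}}}resource", value)  # link to another node
            else:
                prop.text = value                      # plain literal
    return ET.tostring(root, encoding="unicode")
```

Since the output no longer depends on the order in which the store hands over its triples, an XSL stylesheet or XML schema could be written against this one shape.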
Enough rambling. Suggestions of improvements to the RDF gratefully received, though I don't anticipate finding time to think about this again for a week or three.
Hi Hamish,
I found this entry from Bruce's blog. I'm the one who marked up bibTeX in the OWL ontology language.
As I work in a neuroscience lab, bibTeX seemed like a natural starting point. But after reading more about the state-of-the-art in bibliography research, I do agree that the bibTeX model is rather limiting. In fact, one of the concerns you brought up (inability to easily mark-up online resources, problems with hard-coding author names) led me to think along the same lines as you, i.e., including the FOAF vocabulary.
If you want more constraints it's probably best to describe the general form of the bibliography in OWL (http://www.w3.org/2001/sw/WebOnt/) which allows you to define what is a subclass of what, what domains and ranges properties are allowed to have, and so on. You can see an example with my bibTeX ontology.
One other issue I've been dealing with is how to handle the order of authors. RDF parsers (and the RDF graph model in general, I believe) do not retain the ordering of nodes in the source file. There might be ways to get around this with some other RDF terms (like rdf:Seq), but I haven't had the time to work out how exactly.
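A minimal sketch of the rdf:Seq idea, in plain Python with a set standing in for an unordered graph: each member hangs off the sequence node through a numbered membership property (rdf:_1, rdf:_2, and so on), so the position lives in the predicate and no longer depends on file order. The function names and node identifiers are hypothetical.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def seq_triples(seq_node, members):
    """Encode an ordered list as an rdf:Seq using rdf:_1, rdf:_2, ..."""
    triples = {(seq_node, RDF + "type", RDF + "Seq")}
    for index, member in enumerate(members, start=1):
        triples.add((seq_node, f"{RDF}_{index}", member))
    return triples


def seq_members(triples, seq_node):
    """Recover the members in order from an unordered set of triples."""
    prefix = RDF + "_"
    indexed = []
    for subject, predicate, obj in triples:
        suffix = predicate[len(prefix):]
        if subject == seq_node and predicate.startswith(prefix) and suffix.isdigit():
            indexed.append((int(suffix), obj))  # numeric sort, so _10 follows _9
    return [obj for _, obj in sorted(indexed)]
```

A publication would then point at the sequence node through its author property, and a formatter would call something like `seq_members` to get the authors back in their printed order.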
Feel free to drop me a line if you're interested in talking more.
Nick
Posted by: Nick Knouf | January 12, 2004 at 02:19 PM