Wednesday, December 26, 2007

HTML, XML, OSIS, XSEM, ThML, USFM...: What a biblical scholar should know

If the title of this post hasn't already caused your eyes to glaze over, read on: it is intended to provide a brief overview of what is happening with biblical encoding systems. This is a highly simplified description, subject to all the caveats that come with such generalizations, but it should at least familiarize a biblical scholar with the field of biblical text encoding and equip him/her to name-drop acronyms with the geekiest of them.

We have all become quite accustomed to reading digital versions of texts related to biblical studies, but there is considerable attention being paid to how that text is presented and how it may be enhanced. For viewing on the web (and in some programs), the system used is HTML=HyperText Markup Language. HTML mostly describes how the data looks on a page: paragraphs, lists, bold, italic, etc. These codes are consistent and are established by a worldwide consortium (the W3C), which matters because it lets different browsers know how to display the data. To make more sophisticated styles and to provide for global (i.e., across a whole web site) changes to styles, HTML is often enhanced with CSS=Cascading Style Sheets.

If HTML is focused on how data is displayed, XML=eXtensible Markup Language is interested in describing the data and indicating what type of data it is. XML can be used alongside HTML to describe both how the data is displayed and what kind of data it is. While HTML has broadly accepted standards, XML tags can be defined by the content creator. In general terms, you can then combine HTML and XML to come up with XHTML.
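
To make the contrast concrete, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree. The verse element and its book, chapter, and number attributes are made up for illustration and do not belong to any of the standards discussed in this post.

```python
import xml.etree.ElementTree as ET

# Presentational markup: it only says that the reference should look italic.
html_snippet = "<p><i>John 3:16</i> For God so loved the world...</p>"

# Descriptive markup: hypothetical tag and attribute names (invented for this
# sketch) that say what each piece of data actually is.
xml_snippet = (
    '<verse book="John" chapter="3" number="16">'
    "For God so loved the world..."
    "</verse>"
)

html_root = ET.fromstring(html_snippet)
xml_root = ET.fromstring(xml_snippet)

# From the HTML, a program can only recover "some text that is italic".
italic_text = html_root.find("i").text

# From the XML, a program can ask for book, chapter, and verse directly.
reference = "{} {}:{}".format(
    xml_root.get("book"), xml_root.get("chapter"), xml_root.get("number")
)

print(italic_text)
print(reference)
```

Both lines print the same reference, but only in the second case does the software know that it is looking at a Scripture reference rather than at a piece of italic text.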

Here is where it becomes interesting for biblical texts, because XML can help us make all sorts of distinctions about what is going on with a text. What might standardized XML tagging do for scriptural texts? It could be used to indicate

  • A Scripture citation (We need standard ways to refer to each biblical book, how to designate chapters and verses, what punctuation to use for separating chapter:verse, etc.)
  • Greek or Hebrew lemmas underlying English translations
  • When Scripture is citing other Scripture, e.g., when the NT is citing an OT text.
  • Who is speaking (You could, therefore, conduct a search looking only at the words of David or Jesus or Peter...)
  • Which translation you are citing or alternative readings or when there is a summary heading that is not actually part of the text or... The list of possibilities is quite long.
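
As a rough sketch of how a couple of these possibilities might work in practice, the Python example below tags who is speaking and where Scripture cites Scripture, then searches on those tags. The speech and quote elements and the who and source attributes are hypothetical names chosen for this illustration; a real standard would define its own.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag and attribute names, invented for this sketch.
sample = """
<chapter book="Matt" number="4">
  <verse number="3">The tempter said, <speech who="Satan">Command these stones to become bread.</speech></verse>
  <verse number="4">He answered, <speech who="Jesus">It is written, <quote source="Deut 8:3">One does not live by bread alone.</quote></speech></verse>
</chapter>
""".strip()


def words_of(root, speaker):
    """Return (verse number, spoken text) pairs for one speaker."""
    results = []
    for verse in root.iter("verse"):
        for speech in verse.iter("speech"):
            if speech.get("who") == speaker:
                # itertext() also collects text nested inside <quote>.
                text = " ".join("".join(speech.itertext()).split())
                results.append((verse.get("number"), text))
    return results


def citations(root):
    """Return (verse number, cited passage) pairs where Scripture cites Scripture."""
    return [
        (verse.get("number"), quote.get("source"))
        for verse in root.iter("verse")
        for quote in verse.iter("quote")
    ]


root = ET.fromstring(sample)
print(words_of(root, "Jesus"))
print(citations(root))
```

The point is not the particular tag names but that, once the text is tagged consistently, a search for "the words of Jesus" or "NT citations of Deuteronomy" becomes a simple lookup.
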
Now, this is all wonderful information that can be embedded within a text and summoned as needed, and there are people doing this work, but there is not as yet an agreed-upon standard used by all biblical scholars and publishers. There is a very helpful table and summary by Kahunapule Michael Johnson, but I will summarize the summary to save you some time.
  • SFM=Standard Formatting Markers: Much as HTML uses tags, this uses backslash codes to define elements. There were no standards with SFM, so it has been superseded by what follows.
  • ThML=Theological Markup Language: I'll include this here as another attempt at providing the kind of encoding we are discussing. This system is used on the CCEL site (another acronym you probably should know and probably already do: Christian Classics Ethereal Library), and it is a clever implementation for which they have quite a few Bibles and other books available. There is a ThML Viewer and an updated ThML Reader, but the Reader has not been fixed to work on systems with IE7 installed. (And you probably do have IE7 installed.) So, this is an interesting project, but it is limited and not likely to become a standard.
  • XSEM=XML Scripture Encoding Model: This system was proposed by the important and influential SIL International organization. It does not seem to have gathered much support, and on their own web site, they now appear to be promoting the OSIS standard.
  • OSIS=Open Scripture Information Standard: Since OSIS is co-sponsored by the American Bible Society and the Society of Biblical Literature, it has significant support for becoming a standard. It is, however, quite complex (which is both a strength and a drawback).
  • USFM=Unified Standard Format Markers: This is a simplified system of marking using backslash codes, but it can easily be converted to USFX, which is an XML format. As Johnson notes, it is easier to convert from USFM to OSIS than from OSIS to USFM/X, so it is a good choice for now. There is also a free WordSend program that can be used to move between USFM/X and Microsoft Word DOC / RTF / HTML.
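
To give a feel for what backslash codes look like and why they convert so readily to XML, here is a minimal Python sketch. It handles only the \id, \c, and \v markers, and the XML it produces uses simplified element names made up for this example; it is not the actual USFX or OSIS schema, and it is not how WordSend works.

```python
import re
import xml.etree.ElementTree as ET

# A tiny USFM-style sample: \id = book code, \c = chapter, \p = paragraph,
# \v = verse number followed by the verse text.
usfm_sample = r"""\id JHN
\c 3
\p
\v 16 For God so loved the world...
\v 17 For God did not send his Son into the world to condemn the world...
"""

# A backslash, the marker name, then whatever follows on the line.
MARKER = re.compile(r"^\\(\w+)\s*(.*)$")


def usfm_to_xml(text):
    """Convert a few USFM markers to a simplified, made-up XML structure."""
    root = ET.Element("book")
    chapter = None
    for line in text.splitlines():
        match = MARKER.match(line.strip())
        if not match:
            continue
        marker, rest = match.groups()
        if marker == "id" and rest:
            root.set("id", rest.split()[0])
        elif marker == "c":
            chapter = ET.SubElement(root, "chapter", number=rest.strip())
        elif marker == "v" and chapter is not None:
            number, _, verse_text = rest.partition(" ")
            verse = ET.SubElement(chapter, "verse", number=number)
            verse.text = verse_text.strip()
        # Every other marker (such as \p) is ignored in this sketch.
    return root


book = usfm_to_xml(usfm_sample)
print(ET.tostring(book, encoding="unicode"))
```

Because each backslash code already marks what a line is (book, chapter, verse), turning it into XML elements is mostly a matter of renaming, which is part of why going from USFM toward a richer format is the easier direction.
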
How much of this stuff does a biblical scholar need to know? Probably very little, if any. It is, however, worth knowing about, because it indicates what possibilities exist for enhancing digital texts related to biblical studies.
