Extremely Annotational RDF Markup (EARMARK) is a meta-syntax for non-embedded markup that can be used for stand-off annotations of textual content with fully W3C-compliant technologies. EARMARK is based on an ontologically precise definition of markup that instantiates the markup of a text document as an independent OWL document, kept outside the text strings it annotates. Through appropriate OWL and SWRL characterizations, it can define structures such as trees or graphs and can be used to generate validity constraints (including co-constraints currently unavailable in most validation languages).
The EARMARK API (version 0.4) is downloadable from SourceForge.
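The stand-off idea described above can be illustrated with a minimal Python sketch. It is not the EARMARK API or its OWL encoding: the names `Range`, `Annotation` and `locate` are hypothetical, and the text is simply two verses of Paradise Lost joined with a space. The point is only that annotations point at spans of a separately stored text, so a verse and a phrase that crosses the verse boundary can coexist without any nesting requirement.

```python
from dataclasses import dataclass

# The text is stored on its own; no markup is embedded in it.
# Two verses of Paradise Lost, joined with a space for this sketch.
TEXT = ("Of Man's first disobedience, and the fruit "
        "Of that forbidden tree whose mortal taste")


@dataclass(frozen=True)
class Range:
    """Stand-off pointer to text[begin:end] (hypothetical name)."""
    begin: int
    end: int

    def content(self, text: str) -> str:
        return text[self.begin:self.end]


@dataclass(frozen=True)
class Annotation:
    """A named markup item that points at a range instead of wrapping it."""
    name: str
    target: Range


def locate(text: str, fragment: str) -> Range:
    """Build a Range for the first occurrence of fragment in text."""
    begin = text.index(fragment)
    return Range(begin, begin + len(fragment))


verse = Annotation("verse", locate(TEXT, "Of Man's first disobedience, and the fruit"))
phrase = Annotation("phrase", locate(TEXT, "the fruit Of that forbidden tree"))

# Both annotations read their content from the same untouched text.
verse_text = verse.target.content(TEXT)
phrase_text = phrase.target.content(TEXT)
```

Because neither annotation modifies `TEXT`, adding a third or fourth hierarchy over the same verses would not disturb the existing ones.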
- EARMARK Ontology: http://www.essepuntato.it/2008/12/earmark
- Pattern Ontology, i.e., an ontology that formally defines patterns for segmenting a document into atomic components, so that they can be manipulated independently and re-flowed in different contexts: http://www.essepuntato.it/2008/12/pattern
- EARMARK Overlapping Ontology, i.e., an ontology for modelling and inferring overlapping scenarios in EARMARK documents: http://www.essepuntato.it/2011/05/overlapping
All the examples based on the first three verses of Paradise Lost by John Milton (the EARMARK document is available here) are available at http://www.essepuntato.it/2010/04/ParadiseLost/test.
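As a rough illustration of what an "overlapping scenario" means at the level of text spans, the following Python sketch classifies pairs of half-open `(begin, end)` ranges. It is an assumption-laden simplification: the function names and the three labels are hypothetical and do not reproduce the classes or inferences of the Overlapping Ontology itself.

```python
from itertools import combinations


def relation(a, b):
    """Classify two half-open (begin, end) ranges."""
    a_begin, a_end = a
    b_begin, b_end = b
    if a_end <= b_begin or b_end <= a_begin:
        return "disjoint"
    a_in_b = b_begin <= a_begin and a_end <= b_end
    b_in_a = a_begin <= b_begin and b_end <= a_end
    if a_in_b or b_in_a:
        return "nested"
    # The ranges share some text, but neither contains the other.
    return "overlapping"


def overlapping_pairs(named_ranges):
    """Return the name pairs whose ranges properly overlap."""
    return [
        (name_a, name_b)
        for (name_a, a), (name_b, b) in combinations(named_ranges.items(), 2)
        if relation(a, b) == "overlapping"
    ]


# Hypothetical ranges over some text: two verses and a phrase that
# starts in the first verse and ends in the second.
ranges = {"verse1": (0, 42), "verse2": (43, 84), "phrase": (33, 65)}
pairs = overlapping_pairs(ranges)
```

Here `pairs` lists the phrase against each verse, while the two verses are reported as disjoint and therefore omitted.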
Other examples of EARMARK documents are available at:
Tools and applications
- XML2EARMARK is a Web application that converts an XML document into an EARMARK one (linearised in RDF/XML). Available at http://www.essepuntato.it/xml2earmark
- Fretta is a Java framework for converting EARMARK documents (even with multiple overlapping hierarchies) into XML ones according to user specifications. Available soon.
Abstract. Overlapping structures in XML are not the symptoms of a misunderstanding of the intrinsic characteristics of a text document, nor the evidence of extreme scholarly requirements far beyond those needed by the most common XML-based applications. On the contrary, overlaps have started to appear in a large number of incredibly popular applications hidden under the guise of syntactical tricks to the basic hierarchy of the XML data format. Unfortunately, syntactical tricks have the drawback that the affected structures require complicated workarounds to support even the simplest query or usage.
In this paper we present EARMARK, an approach to overlapping markup that simplifies and streamlines the management of multiple hierarchies on the same content, and provides an approach to sophisticated queries and usages over such structures without the need of ad-hoc applications, simply by using Semantic Web tools and languages.
We show how relevant tasks (e.g., the identification of the contribution of an author in a word processor document) are of substantial complexity when using the original data format, and become more or less trivial when using EARMARK. Finally, we evaluate positively the memory and disk requirements of EARMARK documents in comparison to Open Office and Microsoft Word XML-based formats.
Abstract. An increasing part of research in the Semantic Web has been mainly directed at making data become the main concept of the Web, thereby moving the focus off documents, its previous object of discourse, and off the purely-structural markup declared for them. Plenty of languages and specifications, such as RDFa, support this transition to the Web of Data, and work by inserting additional markup into Web documents to enhance them with a semantic connotation. Yet, while research effort is mainly directed towards adding semantic annotations around text, little attention is being paid to the possibility of expressing the actual structures of the documents in a form suitable for the Semantic Web. EARMARK is a model for explicitly expressing structural assertions of markup and documents, allowing a straightforward integration of the semantics of the markup and the semantics of the content of a web document. The well-formedness of the hierarchy becomes an explicit assertion, and not a requirement of the markup syntax, and similarly the analysis of the validity of markup structures or the adherence to content model patterns becomes a matter for further semantic analysis. In this paper we present an exhaustive description of EARMARK and we show a framework for using OWL ontologies that implement particular markup properties (such as markup schemas) to demonstrate the compliance of EARMARK documents with those properties.
Abstract. The correct interpretation of markup semantics is necessary for the semantic interpretation of linguistic expressions that use markup in their structuring and for enabling sophisticated operations on markup documents, such as semantic validation, multi-format document conversion and searching on heterogeneous digital libraries. The semantics of XML-based markup languages is usually provided informally, for example through textual descriptions in the specification of the language. While the syntax of XML-based languages is entirely machine-readable, its semantics is obscure for machines. Semantic Web technologies can be useful for filling the gap between the well-defined syntax of a language and the informal specification of its semantics. In this paper we show how to integrate LMM, an OWL vocabulary that represents some core semiotic notions, with EARMARK, a model for the specification of semantic and structural characteristics of markup languages, in order to provide a better understanding of the semantics of markup.
Abstract. In order to make semantic assertions about the text content of a document we need a mechanism to identify and organize the text structures of the document itself. Such a mechanism would closely resemble a document-oriented markup language and would be free of the classical constraints of an embedded markup language, having no limitations given by sequentiality, containment, or contiguity of text fragments. In the past years we developed EARMARK, our OWL proposal for expressing arbitrary semantic annotations about the structure and the text content of a document. In this paper we describe FRETTA, our mechanism for rendering arbitrary EARMARK annotations (including non-sequential, non-hierarchical and non-contiguous ones) in XML, bringing into a unifying framework half a dozen syntactic tricks used in the literature to handle overlapping structures in a strictly hierarchical language.
Abstract. In this paper we propose a novel approach to markup, called Extreme Annotational RDF Markup (EARMARK), using RDF and OWL to annotate features in text content that cannot be mapped with usual markup languages. EARMARK provides a unifying framework to handle tree-based XML features as well as more complex markup for non-XML scenarios such as overlapping elements, repeated and non-contiguous ranges and structured attributes. EARMARK includes and expands the principles of XML markup, RDFa inline annotations and existing approaches to overlapping markup such as LMNL and TexMecs. EARMARK documents can also be linearized into plain XML by choosing any of a number of strategies to express a tree-based subset of the annotations as an XML structure and fitting in the remaining annotations through a number of “tricks”, markup expedients for hierarchical linearization of non-hierarchical features. EARMARK provides a solid platform for providing vocabulary-independent declarative support to advanced document features such as transclusion, overlapping and out-of-order annotations within a conceptually insensitive environment such as XML, and does so by exploiting recent semantic web concepts and languages.
Abstract. One of the most evident tenets of the literature on overlapping markup is that the philosophy of documents as trees (as dictated by meta-markup languages such as SGML and XML) is a simplification that sometimes fails and requires corrections. These corrections have been proposed at the markup level (e.g., milestones, segmentation), at the meta-markup level (e.g., LMNL, TexMecs, XCONCUR, etc.) or at the level of the abstract model (e.g., GODDAG). Unfortunately full GODDAGs do not allow linearizations in general, and as such a restricted version of GODDAG, r-GODDAG, has been proposed that is guaranteed to be linearizable (in TexMecs) and still allows many nice features beyond trees.
In this paper we argue that the problem of linearizing more-than-hierarchical structures lies basically in the embedding of markup within content, and that no such problem arises with an appropriate stand-off approach, which is able to represent full GODDAGs without restrictions. This gives ample opportunities to deal with interesting markup features that are describable with GODDAGs but not with r-GODDAGs, such as non-contiguous elements and virtual elements.
Besides, we discuss whether a specific constraint of full GODDAGs is really necessary once all residual hopes of embeddability are given up, and we further propose a minimal extension to GODDAG, genially called “extended GODDAG” (e-GODDAG) that, by removing the requirement for names in non-terminal nodes, adds support for additional interesting markup features such as content repetitions. In truth, e-GODDAGs are even less embeddable than full GODDAGs, but they are just as easily dealt with by using stand-off markup.
We further propose a meta-syntax for non-embedded markup, called EARMARK, that can be used for stand-off annotations of textual content, and that naturally represents e-GODDAGs with fully W3C-compliant technologies. EARMARK is based on an ontologically precise definition of markup that instantiates the markup of a text document as an OWL document, and through appropriate OWL and SWRL characterizations it can define structures such as trees, r-GODDAGs, full GODDAGs and e-GODDAGs, and can be used to generate validity constraints (including co-constraints), and to verify adherence to content model patterns.
As mentioned, in general the embedding of a full EARMARK document is not possible, but approaches can be taken in that direction: just as segmentation and fragmentation are strategies to embed in a strictly-hierarchical language an r-GODDAG-specific feature such as overlapping elements, a number of strategies exist to provide embedding of GODDAG and e-GODDAG features in less expressive syntaxes. In the final part of the paper we discuss our wish to provide at the metalanguage level a series of embedding strategies for the non-hierarchical features of EARMARK, i.e. a number of language-independent mechanisms to express e-GODDAG structures in XML (as well as in TexMecs and in LMNL) that can be recognized as such (i.e., as strategies, as tricks) by tools and readers alike, especially for further uses of such documents.
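The fragmentation strategy mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not FRETTA or any EARMARK tool: the element names `text`, `line` and `phrase`, the linking attribute `fragment-of` and its value `p1` are all hypothetical. A phrase that crosses a line boundary is clipped against each line, and every resulting piece is emitted as a separate element nested inside its line, so the output is well-formed XML.

```python
from xml.sax.saxutils import escape

TEXT = ("Of Man's first disobedience, and the fruit\n"
        "Of that forbidden tree whose mortal taste")

# Stand-off ranges: two lines, and one phrase that overlaps both.
first_len = TEXT.index("\n")
LINES = [(0, first_len), (first_len + 1, len(TEXT))]
PHRASE = (TEXT.index("the fruit"), TEXT.index(" whose"))


def fragment(span, containers):
    """Clip span to each container, keeping the non-empty pieces."""
    pieces = []
    for c_begin, c_end in containers:
        lo, hi = max(span[0], c_begin), min(span[1], c_end)
        if lo < hi:
            pieces.append((lo, hi))
    return pieces


def render(text, lines, phrase):
    """Emit lines as the main hierarchy, with the phrase fragmented."""
    out = []
    for l_begin, l_end in lines:
        body = []
        cursor = l_begin
        for lo, hi in fragment(phrase, [(l_begin, l_end)]):
            body.append(escape(text[cursor:lo]))
            # Every fragment carries the same (hypothetical) linking attribute.
            body.append('<phrase fragment-of="p1">'
                        + escape(text[lo:hi]) + "</phrase>")
            cursor = hi
        body.append(escape(text[cursor:l_end]))
        out.append("<line>" + "".join(body) + "</line>")
    return "<text>" + "".join(out) + "</text>"


xml_out = render(TEXT, LINES, PHRASE)
```

A reader of `xml_out` has to rejoin the fragments that share the same `fragment-of` value to recover the original phrase, which is the kind of workaround the stand-off representation avoids.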
Abstract. Many applications handle XML documents where multiple overlapping hierarchies are necessary and make use of a number of workarounds to force overlaps into the single hierarchy of an XML format. Although these workarounds are transparent to the users, they are very difficult for applications reading these formats to handle. This paper proposes an approach to document markup based on Semantic Web technologies. Our model allows the same expressiveness as XML and any other hierarchical meta-markup language, and, rather than requiring complex workarounds, allows the explicit expression of overlapping structures in such a way that search and manipulation of these structures does not require any specific tool or language. By simply using mainstream technologies such as OWL and SPARQL, our model – called EARMARK (Extremely Annotational RDF Markup) – can perform rather sophisticated tasks with no special tricks.