Journal Article Tag Suite Conference (JATS-Con) Proceedings 2016 [Internet]. Bethesda (MD): National Center for Biotechnology Information (US); 2016.

An Implementation of BITS: The Cambridge University Press Experience


Cambridge University Press's history of using mark-up for academic book content resulted in a proprietary model, loosely rooted in NLM DTDs. With new requirements and new types of content, the DTD and related business rules were added to. These changes became frequent and unpredictable, creating pain-points in production workflows and resulting in a heavy burden, both internally, in keeping automated processes and human stakeholders up to date, and externally, in maintaining suppliers' knowledge. A decision was taken to review the situation. Outside consultants were engaged to perform the review. It became clear that the choice of model and its manner of use constituted only one part of the picture, and that a review of the entire process would be beneficial. BITS was chosen as the DTD to use, rather than redefining proprietary DTDs. As an emerging industry standard, it is closely aligned with the Journals NISO standard (NISO Z39.96-2012), already in use at Cambridge. It was also felt that as the standard grew, benefits would arise from the input and requirements of other publishers. During implementation, consideration was given to, among other aspects:

  • control of use (sub-setting or Schematron)
  • use of MathML and other similar standards
  • schemes for persistent element identification
  • approaches to metadata encoding
  • output intentions and specifications
Workflows were amended from copy editorial through to final delivery, with an emphasis on control of mark-up throughout. This resulted in a considerably better-defined and predictable process for external providers and internal stakeholders, and has resulted in the creation of a robust foundation for book production into the future.

Background

Cambridge has had more than 20 years of experience with the challenges of data capture using our own mark-up: from proprietary typesetting software with embedded syntax tagging, through SGML capture of article metadata for online publication in the late 90s, to full-text capture for Books and Journals from supplier typesetting systems. Around the turn of the millennium it was decided to produce a full-text XML DTD that could accommodate both journal and book content. The result was the CamML (Cambridge Mark-up Language) DTD. A bespoke rule-based system, CAVEAT (Cambridge Validation Error Assessment Tool), was used to provide automated QA validation.

This DTD proved to be extremely complex, and particularly so for the typesetters, who were encouraged to employ XML-first workflows. After a few years of development with chosen vendors, it was felt that CamML was too unwieldy.

As an example of the problems faced, consider the following, expressing the start of a simple paragraph containing styled text:

Fig. 1. An example of CamML, illustrating its verbosity.

<paragraph id="ttq-t0r-av1-xy4">
    <lineatedText role="break">
        <line/>
    </lineatedText>With contributions from <render fo:font-weight="bold">Peter 
    Rosenbaum</render> <render fo:text-transform="uppercase">MD FRCP</render>,
    [...]

Subsequently two distinct DTDs were created, one for journals (CJML) and one for books (CBML). These DTDs were loosely based on the NLM models, and were intended to be easier for typesetters to implement within their systems.

QA control was achieved through a combination of DTD validation and rule-based grammars, with the bulk of model constraints expressed in the DTD and the rules applying more granular constraints. The rule-based grammars were initially implemented in CAVEAT, but were later replaced with Schematron.

Although the primary gain in CBML was its nature as a model specific to books, many other improvements were made. For instance, where CamML represented books as a collection of separate documents, each corresponding to a chapter (or equivalent), CBML instead contained the entire text in a single document; also, where CamML deliberately omitted static item numbering, using instead markup similar to "<fig><label>Figure <?number-here?>.</label>[...]", CBML encoded literal text for all numbering. Both these changes, and others, allowed much simpler transformations without compromising the utility of the markup.

Fig. 2. Two examples of CBML markup.

Note the use of processing instructions to represent page breaks and anchors for floating items.

<?new-page 3?>
<chapter id="c02808-1-1">
    <label>1</label>
    <title>Transatlantic Perspectives</title>
    <subtitle>Fundamental Themes and Debates</subtitle>
    <alt-title type="left-running">Larry A. DiMatteo, Qi Zhou, and Séverine Saintier</alt-title>
    <alt-title type="right-running">Transatlantic Perspectives</alt-title>
    <author-group>
        <author>
            <name><given-names>Larry A.</given-names> <surname>DiMatteo</surname></name>
        </author>
        <author>
            <name><given-names>Qi</given-names><surname>Zhou</surname></name>
        </author>
        <author>
            <name><given-names>Séverine</given-names><surname>Saintier</surname></name>
        </author>
    </author-group>
    <body>
        <sec id="c02808-1-3">
            <p>This chapter provides the reasons and purposes for the writing of this book
            [...]

<sec id="c02808-2-12">
    <label>III.</label>
    <title>Connections between Competing Theories of Contract</title>
    <p>In reading academic treatments of discrete
      contract theories, connections with other theories, as well as the historical merging of
      certain theories into others, are often overlooked. Looking for such connections and
      historical developments may, however, assist in finding a way through the theoretical
      maze.</p>
    <?new-page 31?>
    <?anchor c02808-2-15?>
    <p><target id="c02808-0031"/>What
      connections between the various theories of contract are discernible? One area of inter-theory
      connectivity is discernible when one examines whether specific contract theories emphasize (on
      the one hand) party autonomy and the will, or (on the other) the consequences and effects upon
      parties of contracting. Using these two criteria, it is possible to classify contract theories
      as indicated in <xref href="c02808-2-15">Table 2.1</xref>.
        <table-wrap id="c02808-2-15" position="float">
            <label>Table 2.1.</label>
            <title>A bipolar taxonomy of contract theories</title>
            <table alt-graphic="02808tbl2_1" frame="topbot" orient="port" pgwide="1">
                <tgroup cols="2">
                <colspec colname="1"/>
                <colspec colname="2"/>
                <thead>
                    [...]

With new requirements and new types of content for books, where styles could vary greatly between books, the DTD was regularly amended and QA rules added, removed or adjusted. These changes to the DTD and validation scripts became too frequent, creating a pain-point in our production workflows. Keeping all automated processes (such as EPUB and HTML transformations) up to date required a large amount of work, and it was difficult for our suppliers to keep up while books were in production. Meanwhile, in 2012, journals moved from CJML to JATS, the beta version of the new NISO standard (ANSI/NISO Z39.96-2012).

Having a proprietary DTD allowed us to create new models and structures where needed, enabling us to define our own vocabulary to fit our very specific requirements. Most of the development had been influenced by inward-facing workflow requirements. However, this eventually became a burden for Cambridge to maintain, as each change created backward-compatibility concerns and imposed maintenance requirements on transformations and other processing scripts.

The need for a new structure or element would sometimes come directly from the typesetter, querying how to capture content not specified in the current documentation or not given relevant context by the editorial codes; often, though, it surfaced during QA of the EPUB. Being this far downstream in the workflow created an unavoidable urgency. This led to difficulties in maintaining the model: changes made in haste to accommodate one title would often have unintended consequences for the validity of another.

This difficulty was further compounded by the absence of control over content entering the CBML ecosystem; although a Microsoft Word template file existed to provide a framework for content, it was not actively maintained, and as a result became little more than a loose guide, frequently subject to ad-hoc amendments for specific projects and products by suppliers. Eventually this led to various diverging lines of descent, and the utility of the template was lost.

As the burden of maintaining the existing DTD and transformation scripts started to impact production schedules, the pressure for change grew. At this point it was suggested that Books consider moving to the emerging BITS DTD (NCBI/NLM Books Interchange Tag Suite). It was then that the decision was taken to commission a review using external consultants.

The Review

In 2013, it was decided that an external consultant would be able to give an industry-aware, unbiased recommendation; after a procurement process, consultants Apex CoVantage were chosen to help review our XML model and recommend the best way forward. It became clear during the review that it was necessary to look beyond the XML model into workflows both in production and activities upstream of EPUB and online product creation.

The review by Apex of XML use in Academic production, and the recommendations that came out of it, raised a number of discussion points about our model and workflows. To enable the business to tackle these points, several groups were set up, each focusing on a particular area: editorial workflows, production workflows, typesetter workflows, internal XML QA workflows, DTD development and changes, and platform impacts.

Out of this discussion came a list of issues to be dealt with and questions to be answered; these are summarised below:

  • CBML to BITS
    • Who creates the BITS model? Do we outsource or keep in-house?
    • Validation process of the new model – who and how long would this take?
    • How long would it take to create, QA and test?
    • Do we go for a more modular approach, so that we have multiple scripts to cope with the variety of content?
    • Implications for design-heavy textbooks.
  • EPUB
    • Create EPUB 3 with EPUB 2 fallbacks
    • Align EPUB creation with a modular infrastructure in terms of CSS and scripts (see above)
    • Discontinue use of ADE for content checks
  • Typesetters
    • What implications would this change have on typesetter workflows which have been developed to be aligned to CUP’s?
    • How long would they need to QA and validate the new model?
    • Implications of multiple processes instead of one master for all
    • Would the quality of the deliverables (print PDF, POD PDF) be affected?
  • Workflows
    • Change QA to have more emphasis on Schematron instead of parsing
    • Multiple QA processes instead of one for Books and Journals
    • Multiple Word templates for copyediting instead of one master
    • Multiple scripts for conversion to EPUB instead of one master
    • Development of CSS. Do we outsource or keep in-house?
    • Normalisation & CE - update Word style library and template
      • Consider licensing 3rd party software for in-house processing of references
      • Apply rules for t/s at Normalisation so that results are consistent across all suppliers
      • Eliminate use of SNBs (style name boxes) during CE stage
      • Update template to align with BITS, e.g. styles should be distinguished via formatting
      • CE in draft view in Word instead of Print Layout view
      • Consider eliminating line numbers
      • Library of styles should be 100% aligned with BITS

After the decision was taken to proceed, an initial schedule of five months was given; this was recognised as aggressive. In the event, implementation of the first front-list book to be copy-edited and typeset with BITS took nine months from agreement to proceed. However, since production schedules can be lengthy, we are still encountering new types of content in production; the Schematron ruleset and transformations therefore remain under development, and as of February 2016 it is expected that the CUP-BITS ecosystem may take another year to settle down.

Editorial

The previous Word template was flat. Codes were created to denote structure (e.g. extract-bulletlist for a bulleted list within an extract), which meant an ever-increasing list of codes as new structures were encountered. Change was needed to redefine all the copy-editorial codes in alignment with BITS and to devise a scheme flexible enough to cater for structured, nested content. This would enable a more prescriptive and more consistent deliverable for typesetting across styles of books. Reducing queries about encoding at this stage benefits the whole process.

Conceptual hurdles were present when implementing these changes; editorial staff who had not previously needed to understand many technical content concerns needed assistance in reaching a new understanding of what was required of authors and copy-editors in order to make best use of the new models and processes. Most of these challenges revolved around the division between form and content; for instance, where technical staff are often inclined to be 'lumpers' (i.e. combining related concepts and constructs into a single object), editorial staff often prove to be 'splitters' (e.g. perhaps considering "bulleted-list-in-a-paragraph" to be an inherently different concept from "bulleted-list-in-a-numbered-list"). These differences are in many ways persistent, and are often in fact beneficial when taken in combination with one another; even so, the assessment of workflows and processes at a high level has permitted a greater mutual understanding between staff in different areas. In particular, the concept of 'container' styles in the Word template has provided a particularly useful bridge between the world of XML architecture and copy-editing; in brief, this approach provides a number of pairs of styles whose sole job is to mark out extents within a Word document (for example "ExtractBegin" and "ExtractEnd"), inside which the meanings of generic styles such as paragraphs and lists are modulated to some extent.

Typesetter/Production

For expedience, it was decided that suppliers would each create their own conversion script to the Cambridge BITS standard, ensuring content complied with the Schematron ruleset provided by Cambridge. The initial intention was to develop a Word-to-BITS processor of our own to provide to suppliers; owing to time and resource constraints this has not properly been attempted, but nor has it been abandoned as an ideal.

DTD development and changes

Having decided not to sub-set BITS, documentation and Schematron rules needed to be created to restrict the model to fit our requirements. This means that files captured to the CUP-BITS standard still parse against the BITS v1.0 DTD, which is important for ensuring that our files can, if needed, be understood with relative ease by third parties in receipt of them.
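
As a minimal sketch of what this means in practice (the system identifier and the book-type value shown are illustrative, not the precise CUP configuration), a CUP-BITS document carries an ordinary BITS 1.0 document type declaration, so any consumer holding the public DTD can parse it:

<!DOCTYPE book SYSTEM "BITS-book1.dtd">
<book dtd-version="1.0" book-type="monograph">
    [...]
</book>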

Platform development

The BITS vocabulary has proven useful in constructing a normalised metadata model common to content expressed in varying DTDs for processing by a single platform. The vocabulary's nature as part of a predetermined and non-local standard promises especial benefits in interactions with external vendors and suppliers, by helping to smooth out uncertainties and ensuing differences in interpretation.

In future, this approach may be improved upon through the use of NVDL schemas* to combine structures from disparate models into single documents. This would, for instance, have highlighted - but also potentially obviated! - a validity mistake that was made when expressing book metadata in the BITS vocabulary within non-BITS documents.

Decisions and Implementation

Decisions

Other than BITS, the strongest contender was an XHTML-based model. This was discounted as it did not fit with our historical efforts in close validation and semantic value, or with existing, heavily automated XML pipelines. This option would also have needed more work to achieve the modelling that a DTD provides, relying completely on an even more extensive Schematron ruleset.

The decision was therefore made to move to BITS, rather than redefining proprietary DTDs. BITS, as an emerging industry standard, is closely aligned with the Journals NISO standard, NISO Z39.96-2012, already in use at CUP. It was also felt that as the standard grew, benefits would arise from the input and requirements of other publishers.

Journals had already started to utilise JATS in 2012 and the ability to align content models and coding best practice across Books and Journals was considered a worthwhile goal, particularly as the desire grows to serve books, chapters and journal articles to the end user within the same environment.

BITS and JATS are informed by the structure of academic STM publications, but Cambridge University Press has a high proportion of HSS titles. The DTD was nevertheless found entirely flexible enough to describe content from these fields; moreover, the scarcity of proven and mature models that excel in capturing the wide variety of content configurations in these areas meant that there would have been little benefit in attempting to choose a separate model for them.

While the project managed to apply the new workflows and standards to most Academic book products, the complex, design-heavy Textbooks were not included; the decision was taken to revisit this content once the new workflows had bedded in, as it was clear this stream would need particular and deeper attention.

Importantly, this move complied with the high-level strategy of maintaining a single source for EPUB and HTML outputs.

The creation of the BITS model was felt best placed in the hands of the in-house XML team, partly because, once up and running, this team would be required to maintain and develop the models and scripts, and partly because one of our staff, Alison Litherland, had been involved in feeding into the BITS standard prior to release. Similarly, CSS work was kept in-house, as there was a desire to have a clear understanding of the styles applied from print through to EPUB and HTML.

Consideration was given to employing third-party software to encode references in Word to enable early validation, but it was decided that, since typesetters already provided this service during normalization of the Word document, little would be gained for the additional cost.

Implementation

Editorial mark-up codes

The main conceptual innovation in the Word template was the ability to mark up a large variety of Word styles using "container" markers: pairs of corresponding paragraph styles that mark out extents within the document. Content within these containers is then marked up as usual with standard styles, meaning that structure and nesting can be inferred from the wider context of particular styles.

For example, a container marker style can be defined for, say, a section of boxed text, with beginning and end markers BoxBegin and BoxEnd. Within this, regular paragraph styles can be used, such as ParaFirst (for flush-set first paragraphs) and Para (for subsequent indented paragraphs). This allows the environment of a box to be considered when mapping to BITS and to HTML/CSS structures.
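
As a sketch of this in practice (the BoxTitle style and the exact BITS mapping are our illustrative assumptions, not the precise CUP specification), styled manuscript content such as:

BoxBegin
BoxTitle    Key points
ParaFirst   The first paragraph of the box.
Para        A subsequent, indented paragraph.
BoxEnd

might be inferred into BITS along these lines:

<boxed-text>
    <caption><title>Key points</title></caption>
    <p>The first paragraph of the box.</p>
    <p>A subsequent, indented paragraph.</p>
</boxed-text>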

While this created a level of complexity, it did provide a very extensible model that could describe a variety of content without much development. This was clearly a better system, but it should be noted that it is still only as good as the copy-edit codes originally applied. Producing rich documentation remained a challenge, as does human error.

An analysis of styles and BITS mapping in the context of WordML was undertaken, with feedback given to production requesting styles not present in the samples.

The initial Word template of new editorial codes provided by Apex was used to create test BITS documents. This in turn helped with the writing of usage support documentation.

Rollout to typesetters

Transition to the new DTD standard was a major challenge for typesetters using an XML-first workflow. It would not be possible to have all content supplied in BITS after a given date: books already in production using CBML had to continue along that workflow. Maintaining the old and new workflows in parallel was essential, but nonetheless had an impact on resourcing. Shortly before the switch-over date, development and maintenance of the existing standard (DTD, Schematron, XSLT for EPUB and HTML output) was frozen, with only critical work being carried out. This helped to limit resource burn, but it was still a difficult period. Some of the justifications for moving away from the CBML model (such as the emergence of problems late in a product's lifecycle - see comments above) kept pressure up to the final stages of implementation, and a few titles with exceptionally long production schedules are still arriving in CBML as of early 2016.

Each of the typesetters was given test content to convert, producing XML to validate against the new Schematron ruleset. They were asked to provide initial samples of available content, and to provide pre-proof XML for around three months after go-live. Iterations of content supply helped hone the instructions and usage notes for documentation.

Challenges faced:

  • Communication – with different streams focusing on their own areas, regular meetings were needed to ensure different aspects of development were aligned;
  • Documentation – a large piece of work, often underestimated if not ignored. Documentation was developed as samples were exchanged internally and as typesetters supplied samples of "real" content during the development stages. Moving from project to business-as-usual, it is crucial to keep documentation up to date with the latest instructions; it forms a solid basis for communication with suppliers;
  • Clarification – checking and rechecking;
  • Regular phone calls with production and technical staff.
Workflows

The deliverable for typesetting is now a standard copy-edited file with element mappings and instructions for creating CUP-BITS XML. Suppliers are provided with the Schematron ruleset, against which they validate BITS documents before delivery. Files are requested and delivered through the usual routes to our asset management system (CAMS); upon ingest, the files are validated using the same Schematron ruleset that the typesetters use. A manual QA takes place on the content as part of the automated EPUB scripts; this is useful as it throws up coding inconsistencies and other creative interpretations of the DTD by the suppliers. This process is used for first-proof files and then repeated for final files, with a reduced set of manual checks.

Content standards

One major decision was to use the full DTD and to control its use not by sub-setting, as had been done with JATS for Cambridge Journals, but through an extensive Schematron ruleset. This allowed for a variety of content modelling across book types. It was also felt important to provide a DTD rather than an RNG or XSD schema.

Specifying the manner of use of BITS elements was not a complex challenge, but it required significant time investment. For example, for all "-type" attributes, a list of allowed values had to be defined and appropriate Schematron rules written; these values also needed to be identifiable in copy-edit codes.
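
For illustration, a rule of this kind might look as follows (a sketch: the element and the value list are invented for the example, not the actual CUP-BITS lists):

<pattern id="attr.list-type">
    <rule context="list[@list-type]">
        <assert role="error" id="attr.list-type.01"
                test="@list-type = ('bullet', 'order', 'simple')">
            Attribute list-type must be one of: bullet, order, simple.</assert>
    </rule>
</pattern>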

Decisions also had to be made on basic modelling and use. Examples include the following (a markup sketch combining several of these decisions appears after the list):

  • Figures and tables to be placed after the paragraph containing their first citation, rather than using <floats-group>;
  • Use <named-content> for page breaks, adding information about the page sequence in the named-content id;
  • Capture half title, title and imprints as plain text, as well as pulling the information out into metadata, with Schematron rules to check that these match;
  • Capture chapter author metadata at book level, linking author names in the chapters to the main author information at book level;
  • Use element <x> for generated punctuation (between author names etc.);
  • Use mixed-citation in references; use element <abbrev> to identify acronyms;
  • Maths as MathML and/or image, with simple inline expressions captured as plain text; TeX to be used instead of MathML in LaTeX workflows;
  • Use private-char for characters that have no Unicode value, with an image to capture the character;
  • BITS has the option of capturing tables in OASIS or HTML format. CBML used OASIS, but for the BITS implementation HTML was chosen, largely because less transformation is needed for our main exports of EPUB and platform HTML.
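
The sketch below combines several of these decisions (the element names are real BITS/JATS vocabulary; the ids, attribute values and text are invented for illustration):

<p>Text continuing across a page turn<named-content content-type="page-break"
    id="page-0031"/> with the page sequence carried in the id; an acronym such
    as <abbrev>WHO</abbrev> is identified explicitly.</p>

<mixed-citation>
    <person-group person-group-type="author">
        <string-name><given-names>Larry A.</given-names>
        <surname>DiMatteo</surname></string-name><x> and </x>
        <string-name>[...]</string-name>
    </person-group> [...]
</mixed-citation>
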
Schematron implementation

Schematron had been used previously with CBML, but the decision not to sub-set the BITS DTD made the ruleset much more extensive. A key aspect of the Schematron implementation was the decision to serve the Schematron ruleset over HTTP and associate it with BITS documents using an xml-model processing instruction. Older versions of the ruleset are retained at static URLs (following the form http://server/YYYY/MM/DD), allowing us to preserve the validity of documents delivered to older standards. This was a perennial problem with CBML, where only a single instance of the Schematron ruleset existed, tied to the most recent release of the DTD and therefore often causing older documents to become invalid.
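
In sketch form, each document's prolog then associates it with a dated ruleset (the server name is invented; the date follows the form described above):

<?xml-model href="http://server/2016/02/01/cup-bits.sch" type="application/xml"
    schematypens="http://purl.oclc.org/dsdl/schematron"?>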

As well as regular control of attribute and text node values, the control of models and contextual restriction of element usage has been achieved through Schematron rules; although full replication of DTD constraints in XPath is a complex and elusive goal, it has been found that, given clear documentation and mappings, it is usually sufficient to impose relatively lightweight constraints on the high-level structure of a BITS document, without drilling especially deeply into the meat of the content.

Careful use of patterns in Schematron has been found especially productive; this permits separation of concerns such that constraints which would have traditionally been expressed in the DTD may be precisely targeted without compromising Schematron's ability to express other constraints. For example, one pattern has been defined that contains rules testing all elements with ID attributes for unacceptable ID syntaxes, while a separate pattern contains rules testing specific elements (i.e.: a subset of all elements) for disallowed element structures.
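
A compressed sketch of that separation (the rule contents are invented for illustration):

<pattern id="id-syntax">
    <rule context="*[@id]">
        <assert role="error" test="matches(@id, '^[a-z][a-z0-9.-]*$')">
            Element <name/> has an ID with unacceptable syntax.</assert>
    </rule>
</pattern>

<pattern id="element-structures">
    <rule context="fig">
        <report role="error" test="descendant::fig">
            Figures must not contain other figures.</report>
    </rule>
</pattern>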

Schematron rules range from simple and focussed tests of a single element's descendants (see Fig. 3) to complex constraints referring to external lookups and applying across many elements (see Fig. 4, in which $required.attribs is a variable read from a static external document, containing a list of elements with attributes whose presence is required by CUP but merely permitted by the DTD).

Fig. 3

<rule context="book-part[@book-part-type eq 'chapter']">
    <report role="error" id="str.book-parts.02"
        test="descendant::book-part[@book-part-type eq 'part']">Chapters must not contain Parts.</report>
</rule>
[...]

Fig. 4

<rule context="*[some $r in $required.attribs/descendant::cup:*
        satisfies ($r/local-name() eq ./local-name())]" id="str.global.attr">
    <let name="att.spec" value="$required.attribs/descendant::cup:*[local-name() [...]

Where particular edge cases emerge, the need to consider a product's content carefully prior to entering production often proves beneficial across multiple business streams.

Schematron also controls book-type-specific modelling. For example, we might need explicit tagging for dosages and contraindications in medical books. The Schematron ruleset controls this using rules that modulate the testing of an element depending on the wider context of the document, as shown in Fig. 5 below (where the variable $book-type contains the value of /book/@book-type, required by Schematron to be present in every CUP-BITS document):

Fig. 5

<rule id="ilr.judges" context="p[@content-type eq 'judges'][$book-type eq 'ilr']">
    <assert id="ilr.judges.01" role="error"
        test="descendant::named-content[@content-type eq 'judge']">The list of judges [...]

Difficulties in validation

The only drawback of note to the combined Schematron/DTD approach is in element ordering; although XPath is capable of expressing a sequencing constraint, attempting to use it in this way is inevitably verbose and problematic for maintenance. For example, consider a DTD element model "(title, p+)": a Schematron rule with the context title might trivially and accurately (albeit without prohibiting other elements) test "following-sibling::p" in an assert; however, now consider "(rb, (rt | (rp, rt, rp))+)", where the rapid increase in complexity is plain.
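
For the simple model, such a rule might read as follows (a sketch, assuming a sec context for illustration):

<rule context="sec/title">
    <assert role="error" test="following-sibling::p">
        A section title must be followed by at least one p.</assert>
</rule>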

This leads to a need to treat the DTD as a coarse whitelist of elements which may be present, and Schematron as a finer blacklist of structures which are not desired; this distinction makes demands on both developers and users to distinguish between the two categories of error. We attempt to ease this by using Schematron's role structure:

  • Generic structures that are valid against the DTD but are either disallowed or required in CUP-BITS (e.g. "Element X must have exactly one child element Y.") are tested in asserts and reports whose role is 'fatal' and presented in QA summaries as bold red text on black;
  • More specific structures that are required, but must follow a certain configuration (e.g. "When attribute X of element Y has value Z, element Y must have no child elements.") have a role of 'error' and are presented in red text;
  • Uncertain problems which may or may not require attention have a role of 'warn', in the usual Schematron fashion, and are presented in amber text;
  • The 'info' role is used - mainly at the root level of the document - to provide global information about the book (e.g. "This book has 12 chapters."); these messages are presented in green text (a sketch of such a rule follows this list).
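
As a sketch of the root-level 'info' usage (the counting expression is our illustration):

<rule context="book">
    <report role="info" test="true()">This book has
        <value-of select="count(//book-part[@book-part-type eq 'chapter'])"/>
        chapters.</report>
</rule>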

It is also essential to ensure that CUP-BITS documents are always validated against both the DTD and the Schematron ruleset; this is not necessarily as straightforward a requirement as it appears, especially where complex transformation pipelines are involved.

Modelling

Cambridge has a variety of publications with different styles. As part of the project it was decided to formally distinguish half a dozen book 'streams'; these are referred to in the root of each document (<book book-type="XXXX">). This allows specific models to be controlled via Schematron, which will in turn permit, for example, separate sets of CSS styles to be applied to differing types of content.
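
In sketch form, the stream is declared on the root element and picked up as a global variable in the ruleset (the 'ilr' value is taken from Fig. 5; the let binding reflects the mechanics described there):

<book dtd-version="1.0" book-type="ilr">
    [...]
</book>

<let name="book-type" value="/book/@book-type"/>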

Workflows were amended from copy editorial through to final delivery, with an emphasis on control of mark-up throughout. This resulted in a considerably better-defined and predictable process for external providers and internal stakeholders, and has resulted in the creation of a robust foundation for book production into the future.

Schemes for persistent element identification

During the project, thought was given to creating a system of element IDs for CUP-BITS that would allow persistence through multiple iterations of a document, with minimal generation difficulty and monitoring overhead. It is intended that CUP-BITS documents can be delivered incrementally (i.e. deliveries may comprise less than the entire content of the book) and that constituent parts may be extracted and reused in other products (e.g. collections of chapters from different books). To achieve this, above some structural level (to be determined), elements must have an identity outside the containing document: one that remains persistent throughout redeliveries and can be transferred outside the original document.

It is assumed that future business requirements may call for direct URI linking into the contents of output formats. For storage reasons, it is desirable not to construct a system that requires lengthy strings to be added to each element. Copy-editorial codes have been used to create these element IDs, on the basis that they provide a pre-XML identity for any given piece of content, and therefore a traceable line of descent from author to output.

Output intentions and specifications

Having imposed a clear line of descent from manuscript to output, through strict mappings of Word style to BITS element to output structure (e.g. HTML elements plus predetermined CSS), it has become possible to have clearer expectations of a book's final electronic form. An example follows:

Fig. 6. An example of a line of descent in the CUP-BITS ecosystem.

Table 1. The lifecycle of one concept, from Word style through BITS to output.

Concept:          A generic text box
Word style:       08.01 BoxBegin [...other styles as necessary...] 08.02 BoxEnd
BITS structure:   boxed-text
HTML structure:   EPUB: div[@class eq 'boxed-text']
                  Web:  section/div[@class eq 'boxed-text']

CSS (EPUB):
{
  border: 0.075em solid #b7b2ad;
  background-color: #faf9f9;
}

CSS (Web):
{
  margin: 1em 0.5em 1em 0.5em;
  padding: 1em 0 1em 0;
  -webkit-border-radius: 0.25em;
  -moz-border-radius: 0.25em;
  -o-border-radius: 0.25em;
  -ms-border-radius: 0.25em;
  border-radius: 0.25em;
  border: 1px solid #b7b2ad;
  background-color: #FFFFFF;
  page-break-inside: avoid;
}

Efforts are still ongoing to realize this more fully, but it has required careful input from multiple business streams, and the result has been more predictable output; among other benefits, this permits problems to be caught well ahead of them becoming urgent.

Maintaining existing outputs

New scripts were needed to continue the XML to EPUB workflows already employed and controlled by the Press. The new standard of mark-up also needed to be applied to platform workflows - both existing platforms and those under development. Along with content production workflows, consideration was therefore also given to development practices connected to transformation pipelines and QA scripts; this included creation of a more thorough specification for EPUB output than previously existed, clearer internal documentation of development procedures, greater formalisation of change control mechanisms, and other similar concerns.

Reflection

BITS has now been implemented for front-list content for over a year and is firmly out of project status and into business as usual.

Resourcing for the project alongside support for the existing model was challenging. Outstanding development for CBML workflows and transforms needed to be re-assessed and prioritised appropriately to allow resources to be diverted to implementing BITS. Largely this worked, but the resourcing required for a project like this, and in particular its impact on business as usual, should not be underestimated.

Don’t underestimate the task: set realistic targets. A change as fundamental as switching DTDs, especially DTDs embedded in automated workflows, is not trivial. While creating teams to focus on each area of the workflow prevented work from getting bogged down in the big picture, the tasks, specification and development needed at the granular level were still more than initially expected for the required timeframe. It is advisable to know the tasks in fairly good detail, along with the available resource, before committing to a fixed timeframe.

Reviewing the whole workflow forced us to look at the copy-editorial codes. The previous system was flat and ever-growing, while the new system, although complex, is considered more manageable. That said, the breadth of content we publish remains an ongoing challenge; HSS plays, legal case reports, facing-page translations and STEM titles all require specific treatment, but with the foundations and rules laid out we have a firm basis for developing capabilities as requirements move on.

A well-formed model to allow incremental publishing of chapter/book-part content is within relatively easy reach. A number of products are now emerging which expect to publish incrementally - either in a journal-like ahead-of-print manner or as continuous publication of a single product. BITS allows for this by permitting an alternative root element, book-part-wrapper, containing discrete book-part elements whose model is closely analogous to the article element in JATS; it is anticipated that this capacity will be crucial to future electronic publishing concerns at CUP.
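
A minimal sketch of a document using this alternative root (metadata and body content elided):

<book-part-wrapper dtd-version="1.0">
    <book-meta>[...]</book-meta>
    <book-part book-part-type="chapter" id="chapter-01">
        <book-part-meta>[...]</book-part-meta>
        <body>[...]</body>
    </book-part>
</book-part-wrapper>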

Services such as Scholars Portal are implementing BITS as their standard for ingesting book content (Zhao et al. 2015), in the same way that they have employed JATS for journals. Being able to export BITS with little or no transformation (and correspondingly little or no surface for the introduction of errors) will be a benefit, both to us and to them.

An unexpected benefit of this project was the greater connection and understanding between editorial and XML development staff; it is a boon for staff to have a greater appreciation of all the processes before and after their own area.

What next?

With BITS now solidly embedded in day-to-day work at the Press, development of new models for new content and new functionality requirements has become the norm. Assessing more challenging content, such as the design-heavy, media-rich textbooks that were not part of the original project, will also require attention, as will, among others, dictionaries and online-only content.

It is intended to revisit the Journals JATS model with a view to closer alignment with the BITS implementation, most specifically in utilizing the full DTD and controlling modelling via Schematron. This is expected to be a large piece of work, though probably not as complex, as the production process is quite different regarding the options for XML-first creation and copy-editing processes.

References

  1. ANSI/NISO Z39.96-2012, JATS: Journal Article Tag Suite. Published December 2012. http://jats.niso.org
  2. NCBI/NLM Books Interchange Tag Suite (BITS). Published December 2013. http://jats.nlm.nih.gov/extensions/bits/tag-library/1.0/
  3. Wei Zhao, Ravit H. David, Sadia Khwaja and Qinqin Lin. JATS for Ejournals and BITS for Ebooks: Adopting BITS for Scholars Portal Ebook Repository. JATS-Con 2015. http://www.ncbi.nlm.nih.gov/books/NBK280069/

Footnotes

* See ISO/IEC 19757-4, NVDL (Namespace-based Validation Dispatching Language); www.nvdl.org.

Copyright Notice

The copyright holder grants the U.S. National Library of Medicine permission to archive and post a copy of this paper on the Journal Article Tag Suite Conference proceedings website.
