Journal Article Tag Suite Conference (JATS-Con) Proceedings 2015 [Internet]. Bethesda (MD): National Center for Biotechnology Information (US); 2015.
The Long Road to JATS
Nature Publishing Group/Palgrave Macmillan has over a million articles, 180 journals, three in-house DTDs, numerous workflows and production systems, as well as teams based across the world. The challenge: How do we move to using JATS as our single DTD, introduce a streamlined production process and increase the number of journals we publish? This paper describes our journey so far, the challenges we faced, the XML tools we have used, the decisions we made and the reasons for them, and the work still to be done.
A note on terminology
We have tried, as much as possible, to avoid using company jargon and project names in this paper. However, it has been necessary to refer to our in-house DTDs by their names - "NPG" (used by Nature-branded research journals) and "AJ" (used by academic journals).
Also, to save space and avoid confusion, where this paper uses the name "JATS", it refers to both NLM v3.0 and JATS v1.0. We started development work on our projects in March 2012 when NLM v3.0 was the current version of the DTD. We switched to using JATS v1.0[1] when it was published in August 2012.
Introduction
In the late 2000s the journal article publishing system in Macmillan was a fragmented collection of disjointed and disparate publishing tools, written in different languages, with no common architecture, which required many manual steps to complete article publication. Similarly, there were three in-house article DTDs, inherited from previously separate business units, which were poorly designed and documented. The article XML produced was inconsistent and error-prone. The company recognised that if it was to achieve its business goals of improving the user experience and decreasing the costs of publishing, it would need to address all these issues. An ambitious program was initiated to build a new, coherent publishing platform based on a modern service-based architecture, developed using Agile Scrum methodology, test-driven development and new data models. A key feature of this development was that the components were to replace the legacy tools in stages, ensuring that the ongoing article publication flow was uninterrupted. This meant that the new components would need to integrate with the existing platform.
A significant aspect of this program was to develop a mature journal article data model, which catered for all our data structures, whilst being flexible, extensible and comprehensible. We have adopted the JATS Journal Publishing DTD as it fulfils all of these requirements. This paper describes the process of selecting JATS, the components of our new publishing platform which interact with our XML, our project to convert our legacy XML to JATS, and the challenges we have encountered.
JATS mappings, the early years
Journal publishing at Macmillan Publishers in 2008
- Two publishers: Nature Publishing Group and Palgrave Macmillan
- 143 journals, > half a million articles
- Disconnected in-house publishing systems in different languages (Perl, Java, Ruby, .NET, XSLT, Apache Velocity)
- Online publishing teams in different countries and time zones (UK, USA, India, Japan)
- Development publishing teams in different countries and time zones (UK, USA, India)
- Five XML suppliers
- Three custom/proprietary DTDs, poorly documented, poorly designed
- Unnecessary journal-specific constructs
- Reliance on visual QA of content
The list above shows the complex nature of journal publishing at Macmillan in 2008. The business wanted to deliver an increasingly high-quality experience for its customers and improve workflow efficiency, while expanding the number of journals in its portfolio. There was an understanding that the publishing systems we used were based on old technologies which were not suitable for delivering these objectives. One difficult aspect was that the various publishing tools were fragmented, requiring numerous manual steps to complete the publication workflow. These manual steps led to frequent failures and made problems hard to identify and fix.
Why NLM DTD?
The business recognised that in order to achieve these objectives, most if not all of the publishing system tools would need to be rebuilt from scratch, and that this would be simplified if there was a single XML schema. The existing custom DTDs came from different business units which had been merged to form Nature Publishing Group, and had been developed by a number of different people over the years. They were developed without a consistent design approach, and as a result had a number of significant problems which ruled them out of consideration.
Developing a new DTD/Schema from scratch was also not a serious option. We wanted an existing DTD that we could adopt and adapt to our needs with minimal effort. The NLM DTD was already being used by other publishers, and was under active development, so it seemed an obvious choice.
Feasibility report, 2008
The XML team contracted Paul Appleby to produce a feasibility report for converting the two main NPG DTDs to the then-current NLM DTD, version 2.3. The report was based on analysis of the DTDs only, without reference to any XML content. The report proposed mappings for every element and attribute, suggested extensions where no direct mapping was possible, and customisations to add extra validation for restricted value lists. The report stated that extensions were required for about 50 elements in each source DTD, and proposed that Schematron rules could be used to avoid customisations.
Early in 2009, the decision was made to adopt NLM 2.3 as the target DTD. This would require substantial extensions to the DTD for NPG-specific functionality. Unfortunately (or fortunately given the later release of NLM 3.0), we had no project approval from senior management to proceed.
DTD-based mapping to NLM 3.0, 2010
In 2010, we contracted Heather Rankin, who had previously led the XML team, to create a full mapping of the NPG and AJ DTDs to NLM 3.0. Heather's greater knowledge of our actual use of elements allowed improved mapping and identified some unused elements. More importantly, NLM 3.0 was a significant improvement on NLM 2.3, providing greater coverage of NPG DTD elements and greatly reducing the number of required extensions. In particular, the addition of the @specific-use attribute, the introduction of <custom-meta> and improved element structures all helped to cater for our existing data. Heather also examined the usage of elements in our XML, identifying some unused structures (for example <isbn> within front matter) and recommending removal of elements used for content processing.
Content-based mapping to JATS 1.0, 2011
In 2011, the XML team was completely re-staffed, and we were asked to review and finalise the mappings. We quickly discovered that the previous mappings had not taken into account the actual use of the AJ and NPG elements in the XML. We found numerous examples where the actual use of an element differed from its definition in the DTD, impairing the previous mappings. For example, the AJ DTD <sc> element (Script text) was actually used for small caps. We also eliminated more unused element structures and combinations. Where previous mappings had made recommendations or suggested further research, we made final decisions based on our analysis. We were able to entirely eliminate any requirement for extensions to the DTD, and also migrate to JATS 1.0 when it was released. We thank Bruce Rosenblum and Alex Brown for their helpful input.
One of the outputs of this work was a long list of content errors or "dirty" data. This is where elements were used for the wrong purpose, contained incomplete or invalid data, or contained data properly belonging in other elements.
With ongoing staff changes and competing demands from other work, it took a whole year to complete the analysis and mapping.
Why JATS? Why Journal Publishing?
In his 2008 report, Paul Appleby focussed on mapping to the NLM 2.3 DTD. However, his report briefly discussed the case for either retaining the existing in-house DTDs or selecting another DTD. These were serious alternatives as NLM 2.3 did not cater natively for all the data structures we had. NLM 3.0/JATS was a significant improvement and does cater for all those structures, with minimal use of generic elements like <named-content>. This difference made it the only practical choice.
We chose the blue Journal Publishing DTD because it offered just the right level of structure. One of the prime aims of migration to a new DTD was to simplify the code required for processing the XML, and we felt that the Archiving version was too loose, and would have required extra Schematron rules to enforce correct markup.
Other choices
We chose to use the OASIS Exchange (CALS) table model instead of the HTML table model, as it had been incorporated into both the AJ and NPG DTDs and so we had already been using it for some years. We also felt that it was slightly better, as it catered for table subsections (we felt that nested tables, as in the HTML model, were not quite equivalent). We also decided to implement Schematron validation in our production pipeline, as it offered additional validation, good error reporting and a facility for applying business rules, and was easy to integrate.
Why did it take so long to get started?
- Completion of DTD mapping
- Staffing changes
- Company re-structure
- Lack of buy-in from senior management
- Project approval board changes
- Poor adoption of Agile development process
- Competing projects
- Focus on building the Content Hub
We had completed the DTD mapping to JATS by early 2012. Since then, we have gone through a number of organisational changes which blocked the inception of the project. The biggest of these was the re-structuring of Macmillan which led to changes in the project approval processes. In one way these delays were frustrating, but they also led to many benefits:
- Increase in staffing
- Incorporation of the project in company roadmap
- Adoption of Agile Scrum development process
- Test-driven development
- JATS replaced NLM
- Better system architecture - development of the Content Hub
- Improved conversion workflow
- Embedding of an XML DB (MarkLogic) into the publication workflow
Program Inception
In March 2012, a wide-ranging and ambitious program was launched with the aim of improving customer experience. "Customer" had three meanings - the subscribers to our journals and content; editors and external societies who want to publish new journals; and Macmillan staff who use the systems we develop. Each customer had different, but often overlapping, needs.
Subscribers' needs
1. To easily read articles
Analysis of our website suggested that most users download the article pdf to read, rather than reading the online version. There were a variety of reasons for this:
- ability to read the article "on the go" - our website did not resize well on tablets or mobile devices, and only a few of our journals were available on our iOS mobile app.
- ability to annotate a paper copy
- the pdf was the "version of record": a lot of time was spent on checking that the pdf was correct, and there would often only be a minimal check of the online version, which resulted in that version occasionally having inaccuracies
We also had a variety of designs for our journal and article pages. This was due to the way we created new designs over the years. A new design would often only apply to articles in the NPG DTD, as the cost of developing against two DTDs at once was prohibitive. New designs were not deployed retroactively. This was for two reasons - article pages were manually generated, so would need to be reprocessed; but more importantly, we didn't have the resources to check that the new version of the article was a faithful rendition of the original.
This resulted in big differences in how article pages looked, even within the same journal. Nature, as our flagship journal, got most of the new designs and functionality. Other journals had to wait months, or even years, until they were upgraded. Our online-only open-access journals were usually at the bottom of the priority list for upgrade, and their design had not changed for many years.
2. To find what they're looking for quickly and accurately
When running a search, the results page took more than 10 seconds to load. Once the page was loaded, the 'correct' article would often be low down on the list. As a result of these two issues, only 10% of searches resulted in a user clicking through to an article page.
Editors' and Societies' needs
1. To launch new journals quickly and efficiently
Getting from initial concept to the full launch of a journal site with articles could take many months. As well as publishing our own Nature-branded journals, we also host society-owned journals on our website. Our publishing managers who work with societies wanted to offer fast journal set-up times, in order to increase our income.
2. To have access to the latest designs and functionality
Our rather old, tired designs for society journals were not helpful when trying to convince a society to publish their journal on our site. New functionality was also unlikely to be added after the initial journal launch. Increased competition from other publishers meant we needed to give societies the very latest our site could offer.
3. To reduce time from acceptance of an article to publication
The number of manual steps involved in the production process, poor validation and system failures all added to the time it took to publish an article. Authors want to be able to see their article in a journal as soon as possible.
Macmillan staff
1. To launch new journals quickly and efficiently
Setting up a journal required a number of people to log work tickets with numerous teams in order to get the journal details added to our different systems. Delays could occur if there were discrepancies in the details, or if a dependent task was not done first. Journal flat pages were created by the production team based on instructions from editors/publishers, and the journal microsite design had to be created from scratch each time. If an editor or society required new functionality for their journal, this also increased the amount of set-up time.
2. To remove manual processes for day-to-day tasks
A lot of our production staff time was wasted on:
- moving files around our systems
- manually validating XML files in oXygen
- emailing typesetters with validation reports and/or fixing repeated errors in XML files
- manually correcting HTML files to make the online version of an article correct, and
- sometimes "fudging" the XML markup to make the online version appear correct, as it would take too long to fix the XSLT conversion
By automating our systems as much as possible, production staff would have time to use their skills and knowledge to ensure that our article content and journal microsites are correct.
The projects
In order to start meeting some of these needs, the initial goals of the program were:
- introduce a JATS XML workflow for all new content
- convert archive XML to JATS
- create a new publishing platform which renders articles in JATS XML directly from MarkLogic
- reduce new journal set-up time
The plan was to do most of the work using in-house development teams. Other teams worked on the journal set-up tool and publishing platforms, whilst our team started work on the JATS migration.
Introduce a JATS XML workflow for new content
Existing workflow
We focussed on the latter part of the production process - submission of XML, through correction to publication. Later parts of the project would deal with the pre-XML workflows. There are many different production workflows in use at Macmillan. This is due to the variety and number of journals we have, as well as the different functionality that these journals support. A simplified, generic workflow is given below (see also Figure 1):
- Once a typesetter has finished work on an article, they use FTP to send article XML and assets to us, as well as sending an email confirming delivery. Most of our typesetters are in Asia, whilst three of our Production teams are in the UK and US. This means that, because of the time difference, the article can sit on our FTP site for a few hours before being picked up by Macmillan staff.
- A production editor downloads the article XML and assets, validates the XML against the relevant DTD and checks all the expected assets are present. Depending on the journal, any required corrections are done in-house, or requested from the typesetter. Again, the time difference can cause further delays. Once the XML is valid and all assets are present, they are manually saved to the fileserver.
- Tools are run which populate our relational databases with article information and convert high-resolution images into all the necessary sizes. Once these are completed successfully, HTML pages are created - full article, figures/tables, abstract only etc.
- Once all this is complete, a visual check of the online article can be performed. For every correction required to the XML, HTML pages have to be recreated and checked again. Again, if corrections have to be sent back to the typesetter, another delay ensues.
- Once an article is finally signed off, more tools are run to ingest the XML to MarkLogic and generate XML for third parties, such as Crossref and PubMed. On publication day, another manually-initiated tool copies the XML to live MarkLogic and puts all the article assets and HTML pages on the live fileserver; third party files are sent.
Fig. 1. Existing workflow
Target workflow
Our end goal was to eliminate as many manual steps as possible from the production process. We also wanted to remove the "staging" environment, so that all article XML and assets are stored in a single "Content Hub". Publication dates would determine which articles are shown on the live website. See also Figure 2.
- Typesetter submits JATS XML and assets via an asset service. The article XML is automatically validated against the JATS DTD. If it passes, it is then validated against our Schematron rules. If it fails either stage, an error report is returned. Submitted assets are verified as being declared in the article XML, and missing assets are also flagged. Valid article XML and assets are stored automatically in the "Content Hub".
- Legacy tools would be removed - image resizing done 'on-the-fly' by the rendering process; article information obtained from the Hub rather than a relational database; HTML article and figure/table pages also generated 'on-the-fly'.
- Webcheck can take place as soon as the article XML and assets are in the Content Hub. The check and correction cycle would still be fairly manual, but individual assets can be resubmitted to the Hub as soon as they're fixed.
- Once the article is signed-off, a publication date/time is scheduled. Once this date/time is reached, the article automatically becomes visible on the live website. An automated creation and dispatch of files for third parties also takes place at this point.
Fig. 2. Target workflow
But how to get from our existing workflows to our target flow? We had 180 journals, two current DTDs plus one archive DTD, five suppliers, and only one development team – a big bang just wasn’t going to work.
Proposed interim workflow
The first suggestion was for all articles for all journals to be submitted as JATS XML and validated, but for all other production systems to be retained. The flow would be (see also Figure 3):
- New step: Typesetter submits JATS XML and receives a validation report
- Typesetter sends other article assets via FTP.
- New step: Valid JATS XML is stored in MarkLogic. The article is transformed to the required in-house DTD and stored on the fileserver.
- The production editor checks all assets are present and saves them to the fileserver. Legacy tools are run, HTML pages created and a webcheck performed. All corrections are done in-house on the NPG/AJ version of the XML. Third-party XML is generated.
- New step: NPG/AJ XML is transformed back into JATS and ingested back into staging MarkLogic.
- Live publication processes are as before, but with JATS XML ingested to live MarkLogic.
Fig. 3. Proposed interim workflow
This meant our team had two interdependent project goals:
- create four sets of XSLT to convert: JATS to NPG; NPG to JATS; JATS to AJ; AJ to JATS, and
- build the tools to support this flow
The development of workflow support tools and transforms was run in parallel streams using Scrum[2] agile development methodologies.
1. Creating XSLT
A backlog of anticipated transforms was built up from the mapping documentation. Samples of minimal article content that would validate against the AJ, NPG and JATS DTDs were taken as starting points. Structures of increasing complexity, classified as simple, moderate and complex mappings, were then gradually added to the backlog.
Contract XSLT developers were recruited to develop the transformations in-house once the backlog of transform mappings had reached an acceptable size. For each element mapping, samples of JATS XML were mocked up along with corresponding samples in AJ or NPG. In addition to the developers, Quality Assurance testers also used the samples to verify that the transform code met expectations and to build a regression suite to catch any mistakes.
Test driven development[3] was adopted to ensure that the transformation code base was developed to a high quality.
XMLUnit[4] was chosen as the unit testing framework as it enabled easy integration into our existing Jenkins[5] continuous integration system. The success of a transform was determined by checking for the presence of a predetermined XPath in the result of transforming a given input XML.
In addition, XMLUnit was used to create integration tests. It provided a means of comparing the result of transforming a whole article against the target samples provided, and confirming whether they were identical.
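As an illustration of these two kinds of test, the sketch below uses XMLUnit 1.x with JUnit 4; the file locations, the minimal JATS sample and the NPG target XPath are invented for the example and do not reflect our actual code base.

import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.custommonkey.xmlunit.Transform;
import org.custommonkey.xmlunit.XMLAssert;
import org.custommonkey.xmlunit.XMLUnit;
import org.junit.Test;

public class JatsToNpgTransformTest {

    // Hypothetical locations and sample - the real project layout differed.
    private static final File STYLESHEET = new File("xslt/jats-to-npg.xsl");
    private static final String JATS_SAMPLE =
          "<article><front><article-meta><title-group>"
        + "<article-title>Test article</article-title>"
        + "</title-group></article-meta></front></article>";

    @Test
    public void articleTitleIsMappedToNpg() throws Exception {
        // Unit test: run the transform and assert that a predetermined XPath
        // is present in the result (the NPG path here is illustrative).
        String npgResult = new Transform(JATS_SAMPLE, STYLESHEET).getResultString();
        XMLAssert.assertXpathExists("/article/fm/atl", npgResult);
    }

    @Test
    public void wholeArticleMatchesTargetSample() throws Exception {
        // Integration test: compare a whole transformed article against a
        // hand-crafted target sample, ignoring whitespace differences.
        XMLUnit.setIgnoreWhitespace(true);
        String expected = new String(
                Files.readAllBytes(Paths.get("samples/expected-npg.xml")),
                StandardCharsets.UTF_8);
        String actual = new Transform(JATS_SAMPLE, STYLESHEET).getResultString();
        XMLAssert.assertXMLEqual(expected, actual);
    }
}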
Verifying that whole-article transformation results were valid against the target DTDs was challenging initially, as the AJ, NPG and JATS DTDs were stored in a separate codebase. This was solved by implementing a bespoke mechanism for referencing the XML catalog files contained within the schema codebase.
The need to convert from JATS to Legacy and then back to JATS also provided a quick way of testing the transforms. Comparing input JATS against final result XML would highlight any areas of data loss and missing transformation templates. A Java tool was also created to round trip through the transforms by taking either JATS or legacy as input and creating a report of the differences found between input and final versions.
Challenges
The necessity to meet ongoing business needs during the project meant that only one member of the core Macmillan XML team was available, as both Product Owner and Subject Matter Expert, to guide the development of both the transforms and the workflow tools. Ensuring that there was a sufficient backlog of transformation mappings with examples for the contractors to work on proved problematic. This was recognised by the business and a dedicated Product Owner was appointed towards the end of 2012.
In addition to coding for the mapping between DTDs, the transformations also incorporated code to apply corrections to incorrect mark-up / "dirty" data. However, applying a correction meant that the content or structure of the resulting XML would not round-trip, so differencing could not be used to verify the accuracy of the transform. This was mitigated by adding unit tests that only asserted the output of a single transformation.
2. Building the tools
XML Gateway
From March 2012 to January 2013, we worked on creating an "XML Gateway". This had a simple browser-based user interface, where XML articles could be uploaded. It allowed the submission of single articles, or multiple articles within a zip archive.
The XML would be sent to the validation service (see below), and a success or error report would be displayed on screen. These reports were also stored and were available to be retrieved from the 'report history' tab, if required.
Another page on the site hosted our company-specific JATS Tagging Instructions (see below).
Once an XML article passed validation, it was stored in MarkLogic.
The JATS XML was sent to the transformation service, which returned XML in the required legacy DTD. This version of the article XML was then saved to the fileserver. Once the production team had downloaded the rest of the article assets, they were stored in the same location as the transformed XML. (See below for more information on the transformation service.)
As we continued work on the XSLT and the XML Gateway, we began to have concerns about using a round-trip process to convert articles back to JATS after correction. The potential to introduce semantically-incorrect markup in the AJ or NPG XML, in order to make an article look 'right' online, was quite high. This could cause the XSLT to fail completely, produce invalid JATS XML, or ignore unexpected elements. Fortunately, the business decided against introducing JATS for all journals, and to only use it initially for new journals. These new journals would be hosted on the new publishing platform, so we did not need to support a lot of the old production processes.
The XML Gateway was used extensively during 2013 for loading JATS articles needed for testing the new publishing platform. It was also used for loading 'real' articles for our first two JATS-based journals (Molecular Therapy — Methods & Clinical Development and Horticulture Research), from December 2013 until March 2014. It was deprecated in March 2014 with the completion of the "Content Gateway".
We knew that the XML Gateway was only going to be a temporary submission system, so we created standalone services for validation and transformation, which could be reused by future applications.
Validation service
The validation service takes a JATS article and first validates it against the DTD. If an article is invalid, an error report is generated and no further validation takes place. If an article is valid against the DTD, it is then validated against the Schematron rules.
If multiple articles are submitted at the same time, the service iterates through them and produces a report for all articles. This report is split into three parts: articles which fail DTD validation; those which pass DTD, but fail Schematron validation; and those which pass both validation stages.
Although we had not used Schematron for our AJ/NPG XML validation process, we realised that it would be invaluable in reducing XML correction time in a number of ways:
- allow us to enforce company-specific rules about JATS mark-up
- detect errors in mark-up that would require a long time to check manually, e.g. a reference to citation number "16", but with an <xref> @rid of "15"
- incorporate the ontologies as look-up files, to verify that journal metadata is correct and that only allowed subject terms are used
- identify repeated errors, working with the production team, and add Schematron rules to stop them (a sketch of such a rule follows this list).
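As a minimal sketch of the second kind of check, the rule below compares the citation number in the link text with the digits in the @rid. The assumption that bibliographic @rid values carry the citation number (for example "b15") is ours for illustration and may not match our actual conventions.

<schema xmlns="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <pattern id="citation-xref-consistency">
    <rule context="xref[@ref-type='bibr']">
      <!-- Illustrative assumption: @rid values such as "b15" end in the citation number -->
      <assert test="not(matches(normalize-space(.), '^[0-9]+$'))
                    or normalize-space(.) = replace(@rid, '[^0-9]', '')">
        Citation text "<value-of select="normalize-space(.)"/>" does not match the
        numeric part of @rid "<value-of select="@rid"/>".
      </assert>
    </rule>
  </pattern>
</schema>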
Transformation service
The transformation service takes JATS XML and converts it to the required legacy DTD using the XSLT developed by our contractors. We wanted to minimise the amount of information hard-coded in the Gateway and associated services, so we use an XML config file to map each journal id to either the AJ or NPG DTD. This config file can be deployed independently of the Gateway codebase and therefore reduces the risk of errors being introduced, as well as reducing the amount of time needed to add new journals.
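The structure of that config file is not reproduced here; purely as an illustration (the element names and journal ids below are hypothetical), it amounts to a simple lookup from journal id to target DTD:

<journal-dtd-map>
  <!-- Hypothetical entries: each journal id resolves to one of the legacy DTDs -->
  <journal id="nplants" dtd="NPG"/>
  <journal id="hortres" dtd="AJ"/>
</journal-dtd-map>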
The transformation service also has the capability to transform AJ or NPG XML to JATS, as this was an early requirement of the project. This has never been used, due to the concerns about the round-trip process. However, the AJ-to-JATS and NPG-to-JATS XSLT are now being used to support the archive conversion, so the work has not been wasted.
Content Gateway
From June 2013 to March 2014, we worked on creating a "Content Gateway". As well as JATS XML, it is able to handle all the assets associated with an article, and it removes the manual steps involved in processing them.
An FTP folder was created which accepts a zip archive and an associated 'done' file. The archive can contain single or multiple articles. A manifest file must also be included, listing the filenames of all the contents of the archive - article XML, article pdf, figure/table images, illustrations, supplementary material etc. The manifest also contains at least one email address to which validation reports can be sent.
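The manifest format itself is not shown in this paper; the sketch below is purely illustrative, with invented element and file names, to show the kind of information it carries:

<manifest>
  <!-- At least one address to which validation reports are sent -->
  <email>typesetter@example.com</email>
  <file>nplants20151.xml</file>
  <file>nplants20151.pdf</file>
  <file>nplants20151-f1.jpg</file>
  <file>nplants20151-s1.pdf</file>
</manifest>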
Article XML is sent to the validation service, as before. If it passes validation, then additional validation checks are carried out on the article assets. These check that the article pdf, plus all images and supplementary material declared in the article XML, have been submitted. They also check if any unexpected files have been included.
If an article fails at any validation step, then a failure email with a report is automatically generated and sent to the typesetter. If the articles within a zip are all valid, then a success email is sent.
Once a zip has passed validation, JATS XML is stored in MarkLogic, and transformed XML stored on the fileserver, as before. In addition, the article assets are also stored in the relevant folders on the fileserver. At this point, a storage success/failure email is sent to the production team.
As we no longer have a browser-based interface for hosting the JATS Tagging Instructions, we decided to make our GitHub repository public. This means that our typesetters and third-parties always have access to the latest version of the Tagging Instructions, plus our Schematron rules and in-house DTDs.
Current hybrid workflow
We now use a 'hybrid' workflow which has reduced the number of manual steps in our production process, but still has to support some legacy applications (see Figure 4).
- Typesetter submits JATS XML and all article assets and receives a validation report
- Valid JATS XML is stored in MarkLogic. The article is transformed to the required in-house DTD and stored on the fileserver. All other article assets are also saved to the fileserver.
- The production editor runs some legacy tools and a webcheck is performed. All corrections are done in the JATS version of the XML (either by the typesetter or in-house), and the whole article is resubmitted to the Gateway. Once the article has been signed-off, third-party XML is generated from the AJ or NPG version of the article.
- Live publication processes are run as before, but with JATS XML ingested to live MarkLogic.
Fig. 4. Current hybrid workflow
Content Hub
The new Content Gateway, Validation and Transformation services have been connected to previously existing services, such as the Content Ingestion service, the Triplestore and the Hub API, to create the scalable architecture that makes up the Macmillan Content Hub.
The central piece of the Content Hub is a MarkLogic database which holds all articles published on Nature.com and Palgrave-Journals.com. XQuery was used within the Content Ingestion service to build an index of normalised metadata, enabling fast searching across all records regardless of which DTD the content is in.
An integral part of the Content Hub is the NPG Ontology, a continually expanding set of taxonomies and vocabularies covering the data in our graph. These cover publications, journals, subjects, article types, article relations, publication events and document components. These ontologies are referenced by our workflow tools, and in particular by our validation service, to ensure the validity of journal titles, subjects, article types and article relations. The ontology also drives the ETL process that populates our triplestore with RDF data. Some of this data is then incorporated into our article metadata XML held in MarkLogic, to be used for search and article rendering. Further details on the development of the ontology and our linked data architecture can be found in a paper presented to the 13th International Semantic Web Conference[6]. [See Figure 5.]
Fig. 5. Current state of NPG Core Ontology
The Big Picture
Our work was not carried out in isolation. Other developments have made significant improvements to the way we manage our content.
New publishing platform
The new publishing platform has two major pieces of functionality:
1. Article pages are now generated "on the fly", and we no longer have physical HTML files stored on the fileserver:
- the platform requests JATS articles from MarkLogic and generates HTML pages, using JSON as an intermediate step
- it incorporates an image-resizing tool, so that production teams no longer have to create images in different sizes with our legacy tools.
- it pulls in additional article information from the Content Hub to generate links to related articles.
- it has been designed so that content resizes 'nicely' on PC, tablet, and mobiles.
- it has simplified URLs (www.nature.com/articles/article-id) to make it easier for our customers to find content. This format also works with legacy content: the new-style URL resolves to the relevant legacy article URL.
Table 1 shows the improvement in web performance that the new platform has given us.
2. An internal administration interface:
- editors can choose the layout of their journal homepage based on templates
- they can also upload and edit their own flat pages, journal-specific information, cover images etc, rather than requesting production time to do this.
The platform was originally designed to support our open-access online-only journals.
With the success of these new journals, it was decided that subscription-only, issue-based research journals would also be hosted on the new platform. The administration interface was developed further to allow editors to create issues and add articles to an issue. This information no longer has to be added into the XML by the typesetter, as the issue information is stored in the Content Hub and the XML is updated automatically.
We are now starting the process of transitioning all our legacy journals to the new platform. The archive conversion will run in parallel with this process.
New platform = new journals
From January 2014 to January 2015 we launched 10 new journals on the new platform, including:
- four open-access online-only society journals
- Scientific Data - our first open-access, peer-reviewed publication for descriptions of scientifically valuable datasets
- npj Primary Care Respiratory Medicine - the first in a new series of partner journals
- Palgrave Communications - the first Palgrave title
- Nature Plants - our first issue-based, subscription research journal
Archive conversion
The initial plan was for the work to be carried out by an external vendor during 2013/2014 using our mapping documents to convert our legacy content. We realised there would be a number of problems with this approach:
- duplication of effort - we were already writing XSLT transforms to create JATS from legacy content, so there seemed little point in asking an external company to do it too.
- our company-specific requirements for JATS mark-up were still developing and being refined. We did not want to convert a vast amount of content and then not be able to use it without further work.
- lack of resources for QA - all members of the XML team were involved in other projects and it would be difficult to provide support to the QA and production teams during the conversion process
- at that stage, the new publishing platform was an unproven product - it was uncertain if other journals would make the switch to JATS and the new platform
Now that we are ready to start the transition of legacy journals to JATS and the new platform, we have adopted an alternative approach to the conversion:
- the Content Hub team have recently developed an "asset service" to return article XML from MarkLogic. This means that the new publishing platform no longer queries MarkLogic directly, but just requests an article from the asset service.
- we have added the NPG/AJ to JATS XSLT to MarkLogic. If a request is made to the asset service for a legacy article and a "type=jats" header is given, the article is transformed "on the fly" and a JATS version is returned. This means the new platform can now display legacy articles with no additional development work needed by that team (a minimal sketch of this idea follows this list).
- the conversion will be done on a journal-by-journal basis, with additional transform work taking place as required
- we will use an external vendor for performing a visual QA - comparing the original article on the old platform with the JATS-based article on the new platform. Once we're happy that no data has been lost, we'll physically convert the articles to JATS and store them in MarkLogic. At the same time, our typesetters will start supplying JATS XML for that journal
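A minimal sketch of the on-the-fly transformation mentioned above is given below; the function name, the stylesheet URI and the test used to detect legacy markup are ours, for illustration only.

(: Return an article as JATS, transforming legacy NPG XML on the fly.
   Stylesheet URI and legacy-markup test are illustrative only. :)
declare function local:get-article-as-jats($uri as xs:string) as document-node()*
{
  let $doc := fn:doc($uri)
  return
    if (fn:exists($doc/article/fm))  (: legacy NPG front matter :)
    then xdmp:xslt-invoke("/xslt/npg-to-jats.xsl", $doc)
    else $doc  (: already JATS :)
};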
We are at the very start of this process, and hope to report to a future JATS Con on the challenges and successes of this approach.
Search
The development of "new" search based on our Content Hub API has significantly improved this facility for our customers:
- The results page load time has been reduced from 10 seconds to less than 1 second.
- there is ongoing development work by the Content Hub and Search teams to refine the results pages, so that more relevant results are displayed first
- whilst the new search was still in beta (with only 10% of users automatically using it), click-through rates increased from 10% to 25%.
"New" search is now available as the default for all our users and we expect to see click-through rates continuing to improve.
Product set-up tool
A product set-up tool was created to reduce the time taken to launch a new journal:
- there is a simple admin interface which allows an editor or publisher to add all relevant journal information
- the tool generates work tickets automatically containing only the information required by other teams
- the progress of these tickets can be tracked through the tool interface, and any delays dealt with quickly
As a result of this tool and other program work described above we have:
- reduced the time to create a marketing site from two months to two weeks
- reduced the time to create a full journal site with articles from eight months to four weeks
Opportunities, problems and challenges
Company-specific Tagging Instructions
In the past, attempts were made to document mark-up requirements for NPG and AJ XML. These were hard to maintain due to the frequent addition and modification of elements. Also, different production teams adapted the use of elements, resulting in inconsistency in how they were used. The most recent version of the documentation for our legacy DTDs dates from 2007.
With the switch to JATS, we have ready-made documentation as covered by the JATS Tag Library[7]. In addition to this, we maintain a set of company-specific tagging instructions following a similar format to PMC's tagging guidelines[8]. As we develop models and functionality for articles hosted on our new platform, we create additional tagging instructions.
The instructions are backed up by the Schematron rules, both of which are available to our typesetters from our public GitHub repository.
Standardization of XML and article asset storage
Our legacy articles are stored in MarkLogic based on their location on the fileserver. Online-only articles were stored as if they were part of a volume and issue, and we did not wish to continue this 'logic' with our new JATS articles. We simplified and standardized the storage location to be /journal-id/year/article-id/xml in both MarkLogic and the filesystem. This allowed us to build our systems based on this logic, rather than having to hard-code for each different journal.
As part of the ingest to MarkLogic, metadata is generated to hold the fileserver location information for image assets, supplementary material and article pdfs. Again, this could be automatically generated based on the logic we had created for the article location. This metadata allows our new publishing platform to know where to look on the fileserver to get these assets.
MathML
Our NPG DTD did not support MathML. Our AJ DTD did have MathML, but the old platform could not render it, so it was never used. With the introduction of JATS, we decided that we should use MathML rather than images for all formulae.
The new publishing platform uses MathJax to render the MathML and our customers no longer have to put up with poor, inaccessible images.
In order to support conversion of MathML in JATS XML to our legacy NPG DTD, we had to:
- add support for MathML to the NPG DTD. Our standard practice for legacy DTDs is not to use namespaces. However, if we did this for MathML we would have had a clash of element names (<abs>), so we decided to maintain the MathML namespace (see the example after this list).
- add this updated version of the NPG DTD to our legacy systems - this proved problematic due to the differing folder structures used by the systems. In order to ensure all the systems continued to parse the NPG DTD properly, we had to create two different ways of referencing the MathML DTD.
- Fortunately, no such extra work was required to convert MathML for the AJ DTD.
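For illustration only, an inline formula in a JATS article with namespaced MathML looks something like the fragment below (the formula and surrounding text are invented).

<p>... where the energy is given by
  <inline-formula>
    <mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML">
      <mml:mrow>
        <mml:mi>E</mml:mi>
        <mml:mo>=</mml:mo>
        <mml:mi>m</mml:mi>
        <mml:msup>
          <mml:mi>c</mml:mi>
          <mml:mn>2</mml:mn>
        </mml:msup>
      </mml:mrow>
    </mml:math>
  </inline-formula>.</p>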
Continuous modelling
While the XML team was integral to planning and developing the tools and XML transforms, we also had to deliver XML modelling for new features and new journals. A particular case in point was the journal Scientific Data, which required a number of new XML structures:
- Data citations - title, author list, repository identifier, deposit identifier, year
- ISA-tab (study metadata) file - linking to the article and its sections
- Study parameters - selected data from ISA-tab file copied to XML file (see Figure 6)
Fig. 6. Modelling of Study Parameters for Scientific Data
All these required modelling in JATS for rendering on the new platform, and modelling in and conversion to NPG DTD for sending to third parties such as PubMed Central.
In addition, a special application was written to synchronise some content between the ISA-tab files and the XML. The Study Parameters and related articles were copied from the ISA-tab file into the XML, and the article title, DOI, keywords, licence, contributor details, submission and release dates and supplementary file information were copied from the article XML into the ISA-tab files.
Four DTDs
With the gradual transition to JATS we are now working with three live DTDs and one archive DTD.
As part of the ingestion of article XML into MarkLogic, we use XQuery to create an additional layer of metadata which is standardized across all the DTDs. This information is used by our old search and publishing platform, which query MarkLogic directly.
Example XQuery to identify corresponding authors in JATS and NPG XML:
For JATS XML:

let $corresponding-authors :=
  element meta:corresponding-authors {
    for $author in $article/article/front/article-meta/contrib-group
                     //contrib[@contrib-type="author"][@corresp="yes"]
    return <meta:author>{jats:get-author-children($author)}</meta:author>
  }

declare function jats:get-author-children($author as node()) as node()*
{
  (
    if ($author/@corresp eq "yes") then attribute corresp {"yes"} else (),
    element meta:first { $author/name/given-names/fn:string() },
    element meta:last  { $author/name/surname/fn:string() },
    element meta:full  {
      fn:string-join(
        ($author/name/given-names, $author/name/surname, $author/name/suffix), " ")
    },
    for $email in $article/article/front/article-meta/author-notes
                    /corresp[@id = $author/xref[@ref-type eq "corresp"]/@rid]/email
    return element meta:email { $email/text() }
  )
};

For NPG XML:

let $corresponding-authors :=
  element meta:corresponding-authors {
    for $author in $article/article/fm/aug//cau
    return <meta:author>{npg:get-author-children($author)}</meta:author>
  }

declare function npg:get-author-children($author as node()) as node()*
{
  (
    if (fn:node-name($author) eq xs:QName("cau")) then attribute corresp {"yes"} else (),
    <meta:first>{$author/fnm/fn:string()}</meta:first>,
    <meta:last>{$author/snm/fn:string()}</meta:last>,
    <meta:full>{fn:string-join(
      ($author/fnm/fn:string(), $author/snm/fn:string(), $author/suff/fn:string()), " ")}</meta:full>,
    for $email in $article/article/fm/aug/caff[coid/@id = $author/corf/@rid]/email
    return element meta:email { $email/fn:string() }
  )
};
with both returning the standard output:
<meta:corresponding-authors>
  <meta:author corresp="yes">
    <meta:first>Yasushi</meta:first>
    <meta:last>Ishida</meta:last>
    <meta:full>Yasushi Ishida</meta:full>
    <meta:email>yaishida2009@yahoo.co.jp</meta:email>
  </meta:author>
</meta:corresponding-authors>
New search and the new publishing platform do not query MarkLogic directly; they use article information returned by the Content Hub API. The Content Hub creates an additional layer of 'semantic' metadata based on the ontologies. Any article fragments are standardised to JATS mark-up, regardless of the original DTD, e.g.
{ "article": { "id": "nplants20151", "titleXml": "<article-title>Plant hormones: On-the-spot reporting</article-title>", ... "hasContributor": { "id": "rainer-waadt-nplants20151", "type": "contributors", "name": "Rainer Waadt", "isCorresponding": true }, ... "hasSummary": { "bodyXml": "<abstract><p>The development of a new jasmonate reporter further extends the tools that add greater detail to the investigation of plant hormones. Such reporters for the various types of plant hormones, exploiting different aspects of their activity, will help us to eventually study hormone signalling, distribution and dynamics in intact tissue.</p> </abstract>", "hasSummaryType": { "id": "standfirst", } }, ... "hasFigure": { "id": "nplants20151-f1", "captionXml": "<p><bold>a</bold>, Expression-based reporters consist of hormone-responsive element (HRE) repeats as the sensory module (SM) fused to a minimal p35S promoter and a reporter gene. Hormone signalling triggers transcription factor (TF)-binding to HREs and reporter-gene expression. <bold>b</bold>, Protein-degradation-based reporters are fluorescent (FP) or luminiscent (for example luciferase, Luc) proteins fused to a sensory module that is recruited to the hormone-specific SCF-complex (ASK–Cullin–F-box) in a hormone (H) concentration-dependent manner, ubiquitinated (U) and proteolysed. <bold>c</bold>, FRET-based reporters consist of a hormone sensory module flanked by a donor and acceptor fluorescent protein FRET-pair. Hormone-binding triggers structural and spectral changes in the reporter.</p>", "titleXml": "<title>The toolbox of plant hormone reporters.</title>", "hasImageAsset": { "id": "nplants20151-f1.jpg", } }
Transformation mappings for archive conversion
Other challenges were presented by the way our legacy tools were designed to render articles. For example:
- if the article XML contained an empty "conflict of interest" statement, our HTML generating tool would automatically insert the standard statement for that journal
- if all the authors of an article are only affiliated to one institution, the explicit <xref> link can be omitted in the NPG XML. The HTML generating tool will automatically generate the link on the article page.
We did not want to continue hard-coding this type of information for our new publishing platform, so we updated our NPG to JATS XSLT accordingly: the text of the 'conflict of interest' statement and affiliation <xref>s are now added to the JATS XML during the transformation process.
Similarly, "author contributions" are marked up as a section at the end of the body in NPG XML. Our transforms now convert this to the correct <fn> with @fn-type="con" in the <author-notes> section.
We have also taken the opportunity to remove some complexity and duplication in the legacy DTD. For example, the NPG DTD has <weblink> and <uri> elements, which have been used in the same way. Rather than mapping them to the separate <ext-link> and <uri> JATS elements, we decided to map them both to <ext-link> with @ext-link-type="uri".
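A minimal XSLT sketch of that simplification is shown below; we assume, for illustration only, that the legacy elements carry their target URL in an xlink:href attribute, which may not match the actual NPG markup.

<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xlink="http://www.w3.org/1999/xlink">

  <!-- Map both legacy linking elements to the single JATS <ext-link> element -->
  <xsl:template match="weblink | uri">
    <ext-link ext-link-type="uri" xlink:href="{@xlink:href}">
      <xsl:apply-templates/>
    </ext-link>
  </xsl:template>

</xsl:stylesheet>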
"Dirty" data
As part of the mapping of our DTDs to JATS, significant analysis of our XML content was done to determine the actual usage of defined elements. During this and subsequent data analysis, many instances of "dirty data" were revealed. This is where the XML is invalid, or the XML is valid but an element is misused or contains data properly belonging in another element. Some examples include:
- Invalid nesting of a list within a list instead of a list item
- <sc> (script element) used instead of <scp> (small caps element)
- Both English and French abstracts held in the one element
- Linking elements with no valid link
- Author notes in affiliation elements
To avoid introducing complexity to the QA process, we have decided to fix these only when:
- The source XML is invalid
- The structure cannot be mapped to JATS
- The fix is simple to incorporate in the XSLT, or can be done prior to transformation
The remaining "dirty" data will be passed through to JATS for later triage.
The Next Generation
The development of our new publishing platform continues apace, with the evolving Content Hub at the center. The XML team are part of the team delivering the Content Hub and developing our data models and workflows. Currently under development are data structures and API calls handling article and asset versioning, to support the development of new production workflows. The new asset service API is also being developed to allow submission of XML and assets separately. After that, we will be assisting in developing a tool for dynamic delivery of our data to third parties such as PubMed and CrossRef, and a tool for improved analysis and enrichment of our citation references.
Meanwhile, we will continue apace with the conversion of our existing journal archive to JATS, and migration of journals onto the new publishing platform.
Conclusion
Macmillan is investing heavily in building a modern, scalable publishing platform, and JATS is the XML content model at its core. We have listed our reasons for choosing JATS and some of the challenges we had in mapping our existing DTDs to JATS. We have described the components of the platform that process or interact with our XML, and the improvements in our publishing workflow they confer. We have related some of the challenges we encountered, and what we have learned from them. As described in papers previously submitted to JATS-Con by other publishers, conversion to JATS cannot rely solely on mappings derived from in-house DTDs; a realistic mapping can only be compiled by analysing the mark-up contained within real-world samples. We have begun the enormous task of converting our archive content to JATS, and it is hoped that the time and investment spent in developing the tools described in this paper will allow this conversion to be rapid and smooth.
References
1. NISO JATS v1.0. http://jats.niso.org/.
2. The Scrum Guide. http://www.scrumguides.org/scrum-guide.html.
3. Test Driven Development. http://en.wikipedia.org/wiki/Test-driven_development.
4. XMLUnit.
5. Jenkins. http://jenkins-ci.org/.
6. Paper presented at the 13th International Semantic Web Conference.
7. JATS Tag Library. http://jats.nlm.nih.gov/publishing/tag-library/1.0/index.html.
8. PubMed Central Tagging Guidelines. http://www.ncbi.nlm.nih.gov/pmc/pmcdoc/tagging-guidelines/article/style.html.