Introducing Manuscripts.io: open source editing environment for structured, computationally reproducible research documents

Matias Piipari; Alf Eaton; Alberto Pepe

NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Journal Article Tag Suite Conference (JATS-Con) Proceedings 2019 [Internet]. Bethesda (MD): National Center for Biotechnology Information (US); 2019.

Manuscripts.io is an open source collaborative editing environment, tailored for scholarly and research papers. It features a structured editor which can validate content against the requirements of a target publication outlet (e.g. a specific article type in a journal), and it allows for embedding authorship metadata, bibliographic references, experimental data, code and interactive visualizations. It ties collaborative projects together with strong content semantics and project management capabilities, in one environment. The project originated in the macOS-based editor Manuscripts.app; Manuscripts.io is a full rebuild for collaboration on the web that retains the most-loved features of the desktop application.

Introduction

There are many rich-text editors that offer collaborative editing, ranging from Google Docs and Office 365 through LaTeX-based editors such as Overleaf and Authorea, to a wide universe of Markdown-based editors. To our knowledge, none combines collaboration with offline editing (with both manual and automatic conflict resolution), structured rich-text documents, and associated document templates that help authors meet the requirements of a wide variety of academic journals.

We introduce here Manuscripts.io with a specific focus on three areas:

We showcase the document model, which combines offline use with online collaboration, and the advanced document conversion and metadata enrichment features that allow users to import manuscripts from legacy tools and enrich them with scholarly metadata.
We demonstrate how Manuscripts.io enables creation of rich scholarly media content – including data, code, and executable notebooks which have become integral components of scholarly documents – paving the way for reproducible research.
We show integrations with electronic editorial office systems (EEOs) which allow content to be transferred losslessly to platforms for peer review management via JATS XML, saving authors effort and time involved in formatting and reformatting their manuscripts at submission time.

Background

The Manuscripts.io project was motivated by shortcomings of unstructured rich text word processing tools in a scholarly context that affect both author and publisher productivity, namely:

Unstructured rich text editing with excessively flexible character-level formatting options encourages authors to create documents which mix content with presentational (often page-layout related) concerns, in turn leading to messy markup that limits use in scholarly publications.
Creating citations and bibliographies involves complex workflows for the author, and bibliographic metadata is lost or made incomplete and difficult to parse from rich text documents, in the end leading some publishers to resort to measures such as NLP techniques to find and complete citations from user submitted manuscripts.
Markup serialization formats (DOCX, ODF, RTF variants) carry decades of legacy dating back to pre-internet times; they are difficult to parse, practically impossible to edit and “roundtrip” successfully, and even harder to convert meaningfully to archival formats such as JATS.
Versioning is not generally a part of the writing workflow.
Provenance and attribution metadata is deficient or entirely absent from manuscript drafts.

The typesetting system and markup language LaTeX – used by about a tenth of scholarly authors in the hard sciences – similarly suffers from deficient metadata, the mixing of content with presentational concerns (TeX being a typesetting markup format), and being difficult to parse, analyse and transform (TeX being a Turing-complete programming language) in publication processes. High levels of complexity and document type specific idiosyncrasies are the norm in STEM publishing due to word processing and markup formats and tools that carry legacy that precedes the world wide web and Unicode.

To address these challenges, we made a number of design decisions:

A subset of HTML5 as content markup to benefit from the substantial efforts made by browser vendors and others in HTML-based rich text editing, and enabling flexible document conversion options.
A block-based content schema, with strict restrictions that can be validated computationally for both metadata and content markup, focused to meet scholarly author needs, with interoperability with Word, LaTeX and JATS as a requirement.
Content blocks and metadata stored in a versioned document database (HTML + linked data), with a full local copy of the document available at the edge in a versioned form. In all cases where a part of the document presented to the author is a result of formatting structured metadata, both the structured metadata and its presented form are stored.
Built-in versioning of data on the client, to support uninterrupted offline access to the document whilst also supporting online collaboration and document synchronisation with backend services; this naturally allows collaboration either through a central server or following a peer-to-peer model.

Whilst HTML5 was chosen as the content markup format, the metadata in some cases (such as the authors, affiliation and grant funder metadata) was inspired by JATS XML, and the ability to import and export JATS XML formatted full-text documents was an explicit goal. Many of these ideas were first put into action in the desktop-based scholarly writing tool Manuscripts for Mac, of which Manuscripts.io is a complete browser-based rewrite.

Editing

The main features of the Manuscripts.io document editor are presented in Figure 1. A document outliner view shows the tree structure of the documents (“manuscripts”) inside the “project” that contains them; its main purpose is fast navigation and reorganisation of the document’s sections and smaller content elements by drag and drop. The main editor view features a structured document editor based on the open source ProseMirror library, plus a toolbar and menu for performing actions within the document.

Fig. 1

The Manuscripts.io editing environment, featuring a) the toolbar, containing actions for changing the type of the current element, formatting the selection and inserting new elements, b) the project outline view, showcasing the ability to edit multiple manuscripts housed inside a project (in this case, the main JATS-Con article being written, as well as authors’ own notes) and c) the structured document editor view.

The document follows a structure corresponding to the model of JATS and HTML5: multiple sections, each with headings and subsections. Each section contains one or more of the fundamental Manuscripts “elements”: currently paragraph, list, figure, table, equation or code listing. These element types correspond to the model of JATS and HTML5 (with markup and metadata schema chosen to also allow for a reasonable degree of DOCX and LaTeX interoperability). For example, a figure wrapper can contain a caption and multiple figure panels, each with its own caption.

The schema of the document is described using ProseMirror’s “node spec” definitions: like a DTD, these define a node type’s attributes, the types and allowed order of a node’s children and various other properties of a node type that are relevant when editing. The output of a node as HTML is also defined, and this is used both when rendering a node in the editor (when a custom node view hasn’t been defined) and when a node is copied to the clipboard.

A sample ProseMirror node spec, describing the attributes, content and HTML representation of a “section” node.

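<code>import { NodeSpec } from 'prosemirror-model'

export const section: NodeSpec = {
  // A section holds a title, then paragraphs or other elements, then nested subsections
  content: 'section_title (paragraph | element)* section*',
  attrs: {
    id: { default: '' },
    titleSuppressed: { default: false },
  },
  group: 'block',
  // Any <section> tag in pasted or imported HTML is parsed as this node type
  parseDOM: [
    {
      tag: 'section',
    },
  ],
  // HTML output: the trailing 0 is the "hole" where child nodes are rendered
  toDOM: node => [
    'section',
    {
      id: node.attrs.id,
      class: node.attrs.titleSuppressed ? 'title-suppressed' : '',
    },
    0,
  ],
}</code>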

Equations are inserted and edited as LaTeX, using CodeMirror for linting and syntax highlighting, then rendered to SVG using MathJax. Citations can be inserted into the document via search and import from an in-app internal reference library, as well as directly from external sources such as Crossref or DataCite (with additional sources under development), and bibliography formatting is performed using CiteProc (citeproc-js). Bibliography metadata is stored in the CSL-JSON format, with a limited set of application-specific additional keys specified in the Manuscripts schema.
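To make the storage format concrete, the following is a minimal sketch of a CSL-JSON bibliography item as a TypeScript type. Only a small subset of the CSL-JSON fields is shown. The `libraryObjectID` key is a hypothetical stand-in for an application-specific key, since the actual additional keys of the Manuscripts schema are not listed here, and `shortLabel` is an illustrative helper, not part of citeproc-js.

```typescript
// A small subset of CSL-JSON fields; the real format defines many more.
interface CslName {
  family: string;
  given?: string;
}

interface CslItem {
  id: string;
  type: string; // a CSL item type, e.g. 'article-journal'
  title?: string;
  author?: CslName[];
  issued?: { 'date-parts': number[][] };
  DOI?: string;
  libraryObjectID?: string; // hypothetical application-specific key
}

// Rough author-year label, e.g. for a search result list. This is NOT a
// substitute for CSL style formatting, which the text delegates to citeproc-js.
function shortLabel(item: CslItem): string {
  const authors = item.author || [];
  let who = 'Anonymous';
  if (authors.length === 1) {
    who = authors[0].family;
  } else if (authors.length === 2) {
    who = authors[0].family + ' & ' + authors[1].family;
  } else if (authors.length > 2) {
    who = authors[0].family + ' et al.';
  }
  const parts = item.issued ? item.issued['date-parts'] : [];
  const year = parts.length > 0 && parts[0].length > 0 ? String(parts[0][0]) : 'n.d.';
  return who + ' (' + year + ')';
}

const exampleItem: CslItem = {
  id: 'item-1',
  type: 'paper-conference',
  title: 'Introducing Manuscripts.io',
  author: [
    { family: 'Piipari', given: 'Matias' },
    { family: 'Eaton', given: 'Alf' },
    { family: 'Pepe', given: 'Alberto' },
  ],
  issued: { 'date-parts': [[2019]] },
};

const exampleLabel = shortLabel(exampleItem);
```

Keeping the item as plain CSL-JSON means the same object can be handed unchanged to a CSL processor for full bibliography formatting.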

Fig. 2

The in-application citation insertion experience, allowing searching references from both the user’s own reference library, as well as external sources (in the case presented in the screenshot, Crossref).

Executable assets

Manuscripts.io documents are dynamic: figures (and soon tables) can include executable code – currently Python, Ruby or Julia – to allow for computational reproducibility, with code execution done using Jupyter kernels.

To aid reproducibility of the illustrations in a manuscript, the source data for a figure or table can be attached, as can the code used to generate the figure image or table data. By passing this data and code to a remote Jupyter kernel for execution, Manuscripts.io provides a full pathway for generating reproducible figures which can be regenerated with updated data by re-executing the same code in a known, controlled, isolated environment. Support for manuscript-embeddable Jupyter notebooks (.ipynb), with automated versioning of notebook code and associated datasets, is also under development.
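How the client hands figure code to the kernel is not detailed here. As a sketch, assuming the standard Jupyter messaging protocol over a Jupyter server's WebSocket channel, an `execute_request` message for a figure's code could be built as follows. The function name, the `username` value and the sample Python snippet are illustrative assumptions.

```typescript
// Shape of a Jupyter "execute_request" message as sent over a kernel WebSocket.
interface ExecuteRequestMessage {
  header: {
    msg_id: string;
    session: string;
    username: string;
    date: string;
    msg_type: 'execute_request';
    version: string;
  };
  parent_header: Record<string, unknown>;
  metadata: Record<string, unknown>;
  content: {
    code: string;
    silent: boolean;
    store_history: boolean;
    user_expressions: Record<string, unknown>;
    allow_stdin: boolean;
    stop_on_error: boolean;
  };
  channel: 'shell';
}

// Hypothetical helper: wraps source code in an execute_request envelope.
function buildExecuteRequest(
  code: string,
  session: string,
  msgId: string,
  sentAt: Date
): ExecuteRequestMessage {
  return {
    header: {
      msg_id: msgId,
      session: session,
      username: 'manuscripts-client', // assumed value, for illustration only
      date: sentAt.toISOString(),
      msg_type: 'execute_request',
      version: '5.3',
    },
    parent_header: {},
    metadata: {},
    content: {
      code: code,
      silent: false,
      store_history: false,
      user_expressions: {},
      allow_stdin: false, // a figure script cannot prompt the author for input
      stop_on_error: true,
    },
    channel: 'shell',
  };
}

const figureCode =
  "import matplotlib.pyplot as plt\nplt.plot([1, 2, 3])\nplt.savefig('figure.png')";
const executeMessage = buildExecuteRequest(figureCode, 'session-1', 'msg-1', new Date(0));
const wirePayload = JSON.stringify(executeMessage);
```

The kernel's replies carry the request's header as their `parent_header`, which is what lets a client match output images back to the figure that requested them.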

Fig. 3

A screenshot of a dynamic figure, showing the code editor, language selector and output image.

Document templating

When creating a manuscript, the user can choose from thousands of supported journals to use as a template. This encompasses the article’s section requirements and instructions, citation style, word/character counts and other limits for various parts of the article. Validation rules are run as the document is being edited, with warnings presented to the users for content restrictions that the document fails to validate against.
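A minimal sketch of this kind of template validation is given below, assuming a template can be reduced to a list of required sections with optional word limits. All type and field names are hypothetical and do not reflect the actual Manuscripts template schema.

```typescript
// Hypothetical template requirement: one entry per section the journal expects.
interface SectionRequirement {
  title: string;
  required: boolean;
  maxWords?: number;
}

interface ManuscriptSection {
  title: string;
  text: string;
}

interface ValidationWarning {
  section: string;
  message: string;
}

function countWords(text: string): number {
  const trimmed = text.trim();
  return trimmed === '' ? 0 : trimmed.split(/\s+/).length;
}

// Returns one warning per violated requirement; an empty array means the
// manuscript satisfies the template.
function validateAgainstTemplate(
  sections: ManuscriptSection[],
  requirements: SectionRequirement[]
): ValidationWarning[] {
  const warnings: ValidationWarning[] = [];
  for (const req of requirements) {
    const matching = sections.filter(
      s => s.title.toLowerCase() === req.title.toLowerCase()
    );
    if (matching.length === 0) {
      if (req.required) {
        warnings.push({ section: req.title, message: 'Required section is missing' });
      }
      continue;
    }
    for (const section of matching) {
      const words = countWords(section.text);
      if (req.maxWords !== undefined && words > req.maxWords) {
        warnings.push({
          section: req.title,
          message: words + ' words exceeds the limit of ' + req.maxWords,
        });
      }
    }
  }
  return warnings;
}

const draftSections: ManuscriptSection[] = [
  { title: 'Abstract', text: 'one two three four five' },
];
const journalRequirements: SectionRequirement[] = [
  { title: 'Abstract', required: true, maxWords: 3 },
  { title: 'Methods', required: true },
];
const draftWarnings = validateAgainstTemplate(draftSections, journalRequirements);
```

Because the check is a pure function of the document and the template, it can be re-run on every edit and its result rendered as inline warnings.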

Fig. 4

The Manuscripts.io template selector, which allows a user to create a new manuscript based on an article type specific template for a supported journal.

The inspector panel in the sidebar of Manuscripts.io provides an interface for editing manuscript/section/element styles and settings, such as choosing a citation style from the extensive library maintained by the CSL project. This is also where paragraph styles, figure layouts, and other parts of the manuscript can be customised.

Journal submission integration

Documents created in the Manuscripts.io editing environment include author-level metadata, such as name, contact information and affiliation information, and article-level metadata, such as title, abstract and keywords. Authors can add these metadata to documents via a set of modular components built on top of the editing application. The article-level metadata (title, abstract, keywords) can contain Unicode characters and LaTeX-based mathematical notation. The author-level metadata includes, for example:

Author first name and last name
Corresponding author indication with email address
Indication of joint authorship
Author order
Multiple affiliations with individual affiliations specified by institution name, department, street address.
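The fields listed above map naturally onto JATS contributor markup. The sketch below serializes a hypothetical author record to a JATS `<contrib>` element; the `AuthorMeta` field names are assumptions made for illustration, while the element and attribute names (`contrib-group`, `contrib`, `corresp`, `equal-contrib`, `xref`) come from JATS. Author order is represented simply by array order.

```typescript
// Hypothetical author record; not the actual Manuscripts schema.
interface AuthorMeta {
  givenName: string;
  familyName: string;
  email?: string;
  corresponding: boolean;
  jointContribution: boolean;
  affiliationIDs: string[]; // IDs of <aff> elements defined elsewhere
}

function escapeXml(value: string): string {
  return value
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

function contribToJats(author: AuthorMeta): string {
  let attrs = ' contrib-type="author"';
  if (author.corresponding) attrs += ' corresp="yes"';
  if (author.jointContribution) attrs += ' equal-contrib="yes"';
  let xml = '<contrib' + attrs + '>';
  xml += '<name><surname>' + escapeXml(author.familyName) + '</surname>';
  xml += '<given-names>' + escapeXml(author.givenName) + '</given-names></name>';
  if (author.email) xml += '<email>' + escapeXml(author.email) + '</email>';
  for (const rid of author.affiliationIDs) {
    // Each affiliation is referenced by ID rather than repeated inline.
    xml += '<xref ref-type="aff" rid="' + escapeXml(rid) + '"/>';
  }
  return xml + '</contrib>';
}

// Array order is the author order.
function contribGroupToJats(authors: AuthorMeta[]): string {
  return '<contrib-group>' + authors.map(a => contribToJats(a)).join('') + '</contrib-group>';
}

const sampleAuthor: AuthorMeta = {
  givenName: 'Jane',
  familyName: 'Doe',
  email: 'jane@example.org',
  corresponding: true,
  jointContribution: false,
  affiliationIDs: ['aff1'],
};
const sampleContrib = contribToJats(sampleAuthor);
```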

For imported documents, the Pressroom document conversion service uses GROBID, an open source machine learning library, to extract these author-level and article-level metadata automatically. Extraction results can be corrected by authors using the aforementioned modular metadata editing components, also shown in Figure 5.

The motivation for a JATS exportable metadata model and document body was to allow for author and publisher workflows to meet: to create a seamless experience for submitting work for peer review and later publication stages. An in-application integration with journal submission systems is in the works (technically a rewrite, since such an interface was already developed for the purposes of Manuscripts for Mac).

Fig. 5

Author metadata editing components in use in Atypon’s Submission Desk application built using open source Manuscripts.io modules. This demonstrates the modularity of the environment and its JATS XML encodable attribution metadata model.

Importing and exporting

Manuscripts.io can import documents from Word (DOCX) or LaTeX (as a ZIP archive), via a purpose-built web service (Pressroom) that uses Pandoc and custom templates for conversion to the “Manuscripts Project Bundle” format: a ZIP archive containing the project components as JSON and their associated files (figure images, etc.) in a data folder. The JSON file contains a collection of objects, each with its own ID, that together make up the Manuscripts metadata (extracted automatically using GROBID, where possible) and the document tree.
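As an illustration of how a flat collection of ID-addressed objects can make up a document tree, the sketch below rebuilds an outline from such a collection. The field names (`id`, `objectType`, `parentID`, `priority`) are hypothetical and chosen for readability; the actual bundle schema may link and order objects differently.

```typescript
// Hypothetical bundle object; field names are illustrative assumptions.
interface BundleObject {
  id: string;
  objectType: string;
  parentID?: string; // absent for top-level objects
  priority?: number; // sort position among siblings
  title?: string;
}

interface OutlineNode {
  item: BundleObject;
  children: OutlineNode[];
}

function sortByPriority(nodes: OutlineNode[]): void {
  nodes.sort((a, b) => (a.item.priority || 0) - (b.item.priority || 0));
  nodes.forEach(n => sortByPriority(n.children));
}

// Builds a tree from the flat list. Objects whose parent is missing from the
// collection are kept as top-level nodes rather than dropped.
function buildOutline(objects: BundleObject[]): OutlineNode[] {
  const byID: Record<string, OutlineNode> = {};
  for (const item of objects) {
    byID[item.id] = { item: item, children: [] };
  }
  const roots: OutlineNode[] = [];
  for (const item of objects) {
    const node = byID[item.id];
    const parent = item.parentID !== undefined ? byID[item.parentID] : undefined;
    if (parent) {
      parent.children.push(node);
    } else {
      roots.push(node);
    }
  }
  sortByPriority(roots);
  return roots;
}

const bundleObjects: BundleObject[] = [
  { id: 'sec-2', objectType: 'Section', priority: 2, title: 'Methods' },
  { id: 'sec-1', objectType: 'Section', priority: 1, title: 'Introduction' },
  { id: 'p-1', objectType: 'Paragraph', parentID: 'sec-1', priority: 1 },
];
const outline = buildOutline(bundleObjects);
```

Storing the tree as separate objects rather than one nested document is what allows each object to be versioned and synchronized independently.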

Once imported, the document objects are stored in a Couchbase database and synchronized to the web client via Couchbase Sync Gateway, which validates data and permissions for reading and writing at a per-object level, in a role-based manner. At the time of writing, the roles implemented in the system are “readers”, “writers” and “owners” (who can manage sharing). This per-object granularity of synchronisation allows multiple authors to edit different parts of the document concurrently while avoiding many of the conflicts that would be caused by multiple writes to a single, large document. Conflicts do still happen in this model, especially given the system’s offline-friendly functionality, and a user-visible workflow is provided for resolving them.
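The effect of per-object granularity can be seen in a simplified model. The sketch below is not Sync Gateway code: it uses an in-memory store and integer revisions in place of a real sync backend's revision identifiers, and all names are hypothetical. It shows a role check followed by a stale-revision check applied to one object at a time.

```typescript
type ProjectRole = 'reader' | 'writer' | 'owner';

interface StoredObject {
  id: string;
  rev: number; // simplified integer revision counter
  body: string;
}

interface WriteRequest {
  id: string;
  baseRev: number; // revision the author's edit was based on
  body: string;
}

type WriteResult = 'applied' | 'forbidden' | 'conflict';

function canWrite(role: ProjectRole): boolean {
  return role === 'writer' || role === 'owner';
}

// Applies a write to a single object. A conflict is reported only when the
// same object changed since the author last saw it.
function applyWrite(
  store: Record<string, StoredObject>,
  role: ProjectRole,
  request: WriteRequest
): WriteResult {
  if (!canWrite(role)) return 'forbidden';
  const current = store[request.id];
  const currentRev = current ? current.rev : 0;
  if (request.baseRev !== currentRev) return 'conflict';
  store[request.id] = { id: request.id, rev: currentRev + 1, body: request.body };
  return 'applied';
}

const projectStore: Record<string, StoredObject> = {
  'p-1': { id: 'p-1', rev: 1, body: 'First paragraph' },
  'p-2': { id: 'p-2', rev: 1, body: 'Second paragraph' },
};

// Two writers edit different paragraphs: both writes succeed.
const firstWrite = applyWrite(projectStore, 'writer', { id: 'p-1', baseRev: 1, body: 'Edited by A' });
const secondWrite = applyWrite(projectStore, 'writer', { id: 'p-2', baseRev: 1, body: 'Edited by B' });
// A write based on the old revision of p-1 (e.g. made offline) conflicts.
const staleWrite = applyWrite(projectStore, 'owner', { id: 'p-1', baseRev: 1, body: 'Offline edit' });
const readerWrite = applyWrite(projectStore, 'reader', { id: 'p-2', baseRev: 2, body: 'Not allowed' });
```

Had the whole manuscript been a single object, the second write would have conflicted with the first even though the two authors touched different paragraphs.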

The schema definition used by ProseMirror allows node types to be mapped to output formats other than HTML, e.g. JATS XML. As JATS XML is roughly similar to HTML5, the mapping from ProseMirror’s internal data structure to JATS XML is fairly straightforward, with just a few post-processing steps to add the metadata and to ensure the appropriate order and attributes of some node types. This export happens entirely in the browser client, which produces a ZIP archive of the JATS XML and associated data files that can be downloaded when working offline.
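A minimal sketch of such a mapping is shown below. It uses a simplified stand-in for the editor's node tree rather than ProseMirror's own classes, and covers only three node types; the `DocNode` shape and the mapping table are assumptions for illustration, while `sec`, `title` and `p` are JATS element names.

```typescript
// Simplified stand-in for an editor document node (not the ProseMirror Node class).
interface DocNode {
  type: string;
  id?: string;
  text?: string; // only used when type === 'text'
  children?: DocNode[];
}

// Node type name -> JATS element name.
const JATS_ELEMENT: Record<string, string> = {
  section: 'sec',
  section_title: 'title',
  paragraph: 'p',
};

function escapeXml(value: string): string {
  return value
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Recursively serializes a node; unmapped node types fail loudly so that
// content is never silently dropped from the export.
function toJats(node: DocNode): string {
  if (node.type === 'text') return escapeXml(node.text || '');
  const element = JATS_ELEMENT[node.type];
  if (element === undefined) {
    throw new Error('No JATS mapping for node type: ' + node.type);
  }
  const idAttr = node.id ? ' id="' + escapeXml(node.id) + '"' : '';
  const inner = (node.children || []).map(child => toJats(child)).join('');
  return '<' + element + idAttr + '>' + inner + '</' + element + '>';
}

const sampleSection: DocNode = {
  type: 'section',
  id: 's1',
  children: [
    { type: 'section_title', children: [{ type: 'text', text: 'Results' }] },
    { type: 'paragraph', children: [{ type: 'text', text: 'a < b' }] },
  ],
};
const sampleJats = toJats(sampleSection);
```

The post-processing steps mentioned above, such as adding metadata and reordering some node types, would run on the result of a pass like this one.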

For exporting to other formats, Manuscripts.io uses the Pressroom web service which wraps Pandoc and purpose-built XSL templates for conversion to DOCX, LaTeX and PDF. Pandoc’s recent addition of JATS XML as an input format opens up the exciting possibility of a single output pipeline starting with JATS XML and leading to all the other supported output formats. However, Pandoc’s legacy means that its internal data model is limited to that of Markdown (with a few extensions), which is not sufficient to represent all the structure used in a Manuscripts document, so a custom conversion pipeline is used to produce each output format from the JSON objects in the Manuscripts Project Bundle.

Collaboration

Authors can be invited to a project (which contains a group of manuscripts, e.g. a draft for a journal article and a cover letter) by sharing a link or sending an email. Once the invited author has signed in and accepted the invitation they will be able to either view or edit the document, depending on the permissions given to them by the owner. They can also leave comments on parts of the document, to which other authors can reply. This can all happen while all authors are offline, with changes being synchronized via the central database as the authors re-connect.

Manuscripts.io on desktop and mobile

Manuscripts.io can be installed on a user’s local computer, from where it can be launched like a regular, separately windowed application, using the browser as its security-sandboxed runtime environment. This is possible because the application is implemented as a progressive web application (PWA), with a ServiceWorker that caches all the resources for offline use and a manifest file describing application metadata and icons. The installation feature works with modern versions of Google Chrome, MS Edge and iOS Safari.

Modularity & licensing

The editing environment, its backend web services, conversion tools, and document model (schema) constitute a large collection of separate source code repositories and modules available under https://gitlab.com/mpapp-public. The great majority of the source code is licensed under the Apache 2.0 license.

Future Directions

Manuscripts.io is an open source project to create a collaborative environment for a modern research author; ease of use combined with a versioned, structured document model, rich in metadata, with computational reproducibility of results and JATS XML encoding compatibility is at its core. It is modular in nature, offering both browser and desktop clients. We are working towards the 1.0 release, with source code for client and server-side systems available, as well as a working test environment (in which this document itself was drafted).

Past the 1.0 release, we intend to focus further on new ways of enhancing the authoring experience for scholarly documents, and we welcome partners and third-party contributors to the codebase. We aim to enhance the project management features included in Manuscripts to make it effortless to work with large writing projects, to deepen the extent of computational reproducibility of Manuscripts-authored content, and to meet the needs of research discipline specific customisations to the writing environment by means of integrations with third-party reference databases, data repositories and ontologies.

The copyright holders grant the U.S. National Library of Medicine permission to archive and post a copy of this paper on the Journal Article Tag Suite Conference proceedings website.

Bookshelf ID: NBK540955
