
Journal Article Tag Suite Conference (JATS-Con) Proceedings 2017 [Internet]. Bethesda (MD): National Center for Biotechnology Information (US); 2017.


Beware of the laughing horse: Managing a back-catalogue conversion


Background

The International Organization for Standardization (ISO) is an independent, member-based non-governmental organization comprising 161 national standards bodies. ISO brings together experts to develop voluntary, consensus-based International Standards, which are then disseminated through the ISO national membership.

In 2011, ISO embarked on their XML journey, with the following aims:

1.

Create a central repository of standards

2.

Improve speed to market, including for national adoptions

3.

Broaden readership

4.

Reduce or avoid duplication of costs

5.

Streamline ISO production processes

The base DTD chosen was JATS, and customizations were made to capture standards-specific metadata and content. This became known as ISOSTS (the ISO Standards Tag Set).

The first acid test of the DTD was to create the central repository with content in a common form, i.e., to convert ISO's legacy content from Word, character-based PDF (cPDF) and scanned PDF to ISOSTS-compliant XML.

How ISO went about this task is the subject of this paper.

Aim

To convert over 30 000 standards (some 750 000 pages, in English, French, Russian and Spanish) into XML in two years.

Method

An RFP was launched in early 2011 for potential providers of XML conversion. Two providers were shortlisted, and after site visits by the then project manager, the director of IT, the director of standards development and the Secretary-General of ISO, one provider was chosen: Innodata, with the conversion team based in Sri Lanka.

Theory

The contract and pricing had already been agreed upon; the agreement provided for a set-up period of three months, with mass conversion commencing on 1 January 2012 and ending in December 2013.

Contract

The contractual work took some time, as there were two non-negotiable factors:

1.

The conversion should take no longer than two years

2.

The quality of the output requested was higher than usual industry practice

Estimates

Although our database had a fairly accurate record of the number of pages in our legacy content, we had no idea as to the actual content make-up of those pages. The number of pages to be converted was in the region of 750 000.

Indeed, the contract listed different prices according to the type of page to be converted (Word, cPDF, scanned PDF) and the actual content of those pages; some of the categories are shown below (prices are not shown for obvious reasons):

Table 1– Price categorization

Service(s) to be billed                                 Price(s)    Unit
Conversion Processing (categorized by Source Type)
    PDF Scanned                                         $xx         Per page
    PDF Character-based                                 $xx         Per page
    MS Word                                             $xx         Per page
    SGML                                                $xx         Per page
Tables to CALS or XHTML (categorized by Source Type)
    PDF Scanned                                         $xx         Per Kbyte
    PDF Character-based                                 $xx         Per Kbyte
    MS Word / SGML                                      $xx         Per Kbyte
Tables to PNG                                           $xx         Per image
Equations to MathML (categorized by Complexity level)
    Simple                                              $xx         Per equation
    Moderate                                            $xx         Per equation
    Complex                                             $xx         Per equation
Equations to PNG                                        $xx         Per image
Images for Conversion                                   $xx         Per image

We had to come up with some estimates for the numbers of tables, equations and images in our content to set our budget. The following estimates were made:

Table 2– Estimates of content type

Content type    Number of unique instances
Tables          40 000
Equations       15 000
Images          76 000

After calculation of the global cost, we added a 25% contingency sum to the budget.

Performance standards (Service Level Agreements)

For this project, the accuracy requirements were defined as follows:

1.

For the text: 99.995% accuracy, i.e., no more than 5 character errors in every 100 000 characters;

2.

For the tags: 99.95% accuracy, i.e., no more than 5 paired-tag errors in every 10 000 paired tags, inclusive of tags in formulae and tables.

3.

All material marked up according to coding instructions.

4.

Processed data is 100% fully-parsed to zero errors against ISOSTS and the supplementary schematron validation rules.
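The two numeric thresholds above amount to a simple acceptance check per delivered batch. A minimal sketch in Python (the function name and the error/character counts fed to it are illustrative, not part of ISO's actual QA tooling):

```python
def batch_meets_sla(char_errors, total_chars, tag_errors, total_paired_tags):
    """Check a delivered batch against the contractual accuracy thresholds:
    text accuracy >= 99.995% (max 5 character errors per 100 000 characters),
    tag accuracy  >= 99.95%  (max 5 paired-tag errors per 10 000 paired tags)."""
    text_ok = char_errors / total_chars <= 5 / 100_000
    tags_ok = tag_errors / total_paired_tags <= 5 / 10_000
    return text_ok and tags_ok

# Example: 4 character errors in 120 000 characters and
# 2 tag errors in 8 000 paired tags would pass both thresholds.
print(batch_meets_sla(4, 120_000, 2, 8_000))
```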

It is important to note that the vendor was not requested to mark up the metadata of the individual files; the metadata was injected as a post-process after the files were received.

The vendor set up a separate QA team using batch sampling according to the ANSI sampling tables to check for errors and calculate the error rate. It was interesting to note that the processing team and the QA team were physically separated. Part of the QA process included deliberately seeded ('mined') errors, used to assess the work of the quality assessors.

After delivery of a batch, we had one month to validate its content, or it could be rejected. The unit of rejection was the batch: even if only one or two documents were felt to be of poor quality, the whole batch would be rejected and reprocessed in full.

Set-up period

A selection of over 200 standards was to be used during the set-up period. These were sent in smaller batches (of around 10 documents, or approximately 250 pages), marked up according to the coding instructions, delivered and reviewed by ISO. In total, around 12 000 pages were selected for this initial phase, which was deemed representative of the 750 000 pages.

ISO recorded all errors in an Excel sheet and returned it to the vendor. All feedback was incorporated into the subsequent batches.

At the end of the set-up period, it was agreed that:

1.

ISO would confirm the final DTD and coding instructions.

2.

The vendor would demonstrate that it was capable of delivering the quantity and quality of documents required by redelivering the final version of each pilot document as a single quality validation batch.

And at the end of the set-up period, and prior to mass conversion, ISO could terminate the agreement if:

1.

The vendor had not demonstrated that it was capable of achieving the agreed upon quality requirements;

2.

The vendor and ISO were not able to reach mutual agreement on the operational arrangements with respect to the quality assurance (QA) methodologies to be applied during live production.

Milestones

The project was planned according to the following timeline:

1.

October-December 2011 – Set-up period (coding instructions and DTD finalized)

2.

January-December 2012 – Mass conversion with 30% of back-catalogue converted by year end

3.

January-December 2013 – Mass conversion with full catalogue converted by year end

Practice

The main author of this paper was appointed on 1 November 2011 as project manager to lead the work, which meant that the contractual three-month set-up period was already down to two months. After reviewing the contract and getting familiar with the whole project, one sentence seemed out of place:

'The extracted text content shall not require proofreading or editing after extraction.'

It was by then too late to change this and to request proofing from the vendor, as the budget had been approved and no increases would be granted. The contingency sum was not meant to cover this.

It had been felt that, as the content going in was digital and the output also digital, nothing could go wrong. Equally, no resources had been set aside to carry out internal proofing. The author, having performed a conversion 10 years earlier, knew this to be a false assumption, one that would result in poor-quality XML and content. Thankfully, some funds were set aside for an additional internal resource to assist the project manager with the quality control/assurance task of proofing.

Set-up period

The set-up period consisted of an iterative process of marking up the same set of content, reviewing and sending feedback to the providers to then re-markup and return.

Source files were MS Word, cPDF and scanned PDF. These were batched up from our existing repository in batches of about 10 documents (or 250 pages) per batch, sent to the vendor via FTP and, once processed, returned to ISO via the same FTP. The batches would be downloaded and unpacked, with all related project information managed through an MS Access database.

During this period, the coding instructions were drafted and refined using the feedback from the internal proofing as well as technical queries from the vendor when it came across exotic structures. In addition, Schematron rules were developed to perform additional validation on the content; see http://www.iso.org/schema/isosts/.
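For readers unfamiliar with Schematron, a rule of the kind described takes roughly the following shape. This is a hypothetical sketch, not one of the published ISOSTS rules: the element names follow JATS/ISOSTS conventions, but the specific check is invented for illustration.

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron">
  <pattern>
    <!-- Hypothetical check: every figure should carry a label -->
    <rule context="fig">
      <assert test="label">A fig element must contain a label.</assert>
    </rule>
  </pattern>
</schema>
```

Rules like these run alongside DTD validation, catching business-level constraints that a DTD cannot express.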

Finally, to help with the proofing, a CSS stylesheet was created to render the XML as HTML, colour-coding some of the main elements for ease of checking.

The proofing proved invaluable and generated very specific feedback that helped in the elaboration of solid coding instructions and of a DTD relaxed enough to accommodate those exotic structures. Pragmatism is key. Once all the files of the set-up period had been checked, we ran the process again. And just to make sure we had everything as right as we could before pushing the big red button of mass conversion, we performed a third and final iteration. We felt ready.

Mass conversion

Once the third iteration was over, I went on a site visit to finalize the coding instructions, meet the team and also have a go at converting a standard to understand the process (and where it could go wrong). This also proved invaluable. It was interesting to see that a document would be deconstructed according to its content: tables, for instance, would be extracted and sent to the table-coding team; math equations were sent to the math team, which rekeyed every equation in proprietary software that exported MathML 2.0; and the images were cropped by one man, who cropped images all day. Trying all of these processes was a real eye-opener and led to some suggestions for improving them, for instance a second pair of eyes on the cropped images.

During the visit, mutual agreement was reached to freeze the coding instructions and to start mass conversion on 1 May 2012. We had the added pressure that the then Secretary-General had announced in 2011 that, by the end of 2012, 30% of the back-catalogue would have been converted.

With this in mind, batches were initially prepared for the mass conversion by going back through the years chronologically. It was agreed that batches should be composed of documents totalling about 625 pages. In practice, however, this did not work out very well, as the first batches were composed of very large documents (often just one per batch), meaning that progress would be slow.

Hence, we had to rethink our batching strategy entirely and devised batching criteria to ensure that we would meet the 30% milestone while also letting the conversion team get used to the structure of standards. Short documents (fewer than 20 pages) were therefore batched up first. Equally, easier file formats were prioritized, so the initial batches were short MS Word documents. As short documents ran out, the batching criteria moved on to Word documents of between 21 and 40 pages, and so on. This made up roughly 10 000 documents, and as luck would have it, that represented 30% of the back-catalogue.

The same approach was taken when batching up cPDF and scanned PDF documents: short ones first, then gradually increasing page counts.
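The batching criteria described above boil down to sorting documents by format difficulty and page count, then greedily filling batches up to the agreed page total. A minimal sketch of the idea, assuming documents are represented as simple dictionaries (the data structure and format labels are illustrative; the 625-page target is from the contract):

```python
def make_batches(documents, max_pages=625):
    """Greedily group documents into batches of about max_pages pages,
    easiest sources first: Word before character PDFs before scanned
    PDFs, and short documents before long ones within each format."""
    format_rank = {"word": 0, "cpdf": 1, "scanned_pdf": 2}
    ordered = sorted(documents, key=lambda d: (format_rank[d["format"]], d["pages"]))
    batches, current, current_pages = [], [], 0
    for doc in ordered:
        # Start a new batch when adding this document would exceed the target.
        if current and current_pages + doc["pages"] > max_pages:
            batches.append(current)
            current, current_pages = [], 0
        current.append(doc)
        current_pages += doc["pages"]
    if current:
        batches.append(current)
    return batches
```

With this ordering, the first batches come out as short Word documents, reproducing the strategy that let the project bank the easy 30% early.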

Batch rejection!

As we entered the mass conversion phase, we did encounter some files in batches that needed correcting. Was it worth sending a whole batch back to be fully reprocessed and then fully rechecked? This is what the contract stipulated, but we felt it was simply not a workable approach if we were to finish the project on time. In discussion with the vendor, we agreed that the unit of rejection would be a document, not a batch. We also performed small corrections ourselves using Oxygen, still recording them as feedback for the conversion team, but at least we were moving forward.

Not all errors are equal

Again, the quality criteria in the contract put all errors in one basket, but could we really say that a missing non-breaking space was as severe an error as an incorrect operand in an equation? With pragmatism in mind, we developed a detailed document categorizing as many errors as we could think of and assigning a weight to each.

A snippet of the table is shown below.

Fig. 1 – Categorization of errors snippet.

This pragmatic and sensible approach made sure batches were not rejected because of 5 minor errors in 100 000 characters.
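A severity-weighted error score of this kind can be sketched as follows. The categories and weights here are invented for illustration; ISO's actual categorization document was far more detailed:

```python
# Hypothetical severity weights, not ISO's actual table.
ERROR_WEIGHTS = {
    "missing_nbsp": 0.1,    # minor: cosmetic spacing issue
    "typo_in_text": 1.0,    # standard character error
    "wrong_operand": 10.0,  # severe: changes the meaning of an equation
}

def weighted_error_score(errors):
    """Sum severity-weighted error counts so that five minor issues
    do not carry the same weight as five equation errors."""
    return sum(ERROR_WEIGHTS[kind] * count for kind, count in errors.items())
```

The score can then be compared against a single threshold per document, rather than treating every error as equal when deciding on rejection.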

Schematron

During the set-up period, Schematron rules were developed by ISO to check certain facets of the XML. At the outset, although we tried to think of as many checks as possible, we could not foresee everything that would be useful to check for. Therefore, additional rules were created and deployed during the lifecycle of the project. As painful as it was, we ran the new rules on all previously delivered XML files, and any errors were then corrected manually. This rework obviously took extra effort and time, as we had to do it all internally, but at least we knew that the whole corpus of content had been validated against the exact same set of rules.

Beware of the laughing horse

We were hitting our stride and dare I say getting into cruise control. All the systems were up and running, the feedback loop was effective, correcting small errors ourselves made us advance apace. Indeed, that somewhat unrealistic target of 30% converted before the end of 2012 was becoming achievable.

It was the end of September 2012, and we were thinking about moving to a sampling approach for the proofing. That was before my colleague came to me with a problem she had spotted in one of the deliverables: a plotted graph had been replaced by a picture of a laughing horse, see below.

Fig. 2 – The laughing horse made its way into our back-catalogue.

Of course, it was amusing at first, but not for long. The extracted content was supposed to be tamper-proof; I had been assured of this during my visit and had experienced it when marking up my own document. So what happened? This was the result of a dare between colleagues, in which one wanted to prove he could bypass the system by placing a photo, with a view to removing it straight away. Unfortunately, that did not happen. After discussion with the vendor's management team, it transpired that the team member in question was one of the strongest members of the team; regardless, he was removed from the project.

This made everyone more careful. The vendor tightened its internal processes; on our side, well, we did not move to sampling until some months later.

Even with a seemingly well-oiled machine, you cannot plan for human genius, or, in this case, folly.

Had we not insisted on proofing resources at the outset, we would certainly not have caught this, and of course would not have been able to correct or refine the process to ensure the quality of output we did receive.

Easy-peasy so far, enter the PDFs

With the bulk of the Word files processed by the end of 2012, we had batched up the cPDFs and scanned PDFs. During the set-up period, some of those file types had been submitted and the OCR process tried and tested. Also, at the end of October, we had sent a couple of PDF batches to make sure the vendor could handle them, and all seemed to work fine.

Now, faced with large volumes of PDF batches (character-based or scanned), the conversion team struggled with both quality and volume. The quality issues were mainly due to the very poor quality of the scanned PDFs, which therefore had to go through two separate OCR processes. Also, in our catalogue, the documents with cPDF as the only source are usually from other organizations and contain unusual levels of complexity.

Progress slowed as the vendor's monthly deliveries dropped from around 15 000-20 000 pages per month to under 10 000.

From April 2013, the vendor assigned more resources to the project, which led to a sharp increase in the number of pages delivered in an effort to finish on time.

The final straight

The last few months of the project were labour-intensive for all parties. The volumes received peaked at about 50 000 pages per month in the last quarter of 2013. We had started implementing a sampling approach to the checking, filtering out documents containing only text and figures. Documents with tables and equations would still be checked, albeit by focusing on those elements only.

Despite all of these efforts, we did not complete by the end of 2013. In terms of the percentage of documents converted, we were at over 98%. However, the last 2% represented nearly a fifth of the total number of pages and consisted of complex scanned or character-based PDFs. Those took another 3 months, from January to March 2014, to complete.

People and project management

People

During 2012, two people worked full time on the project: the PM and a proofreader/editor. The proofreader had no prior knowledge of XML but learnt on the job very quickly. The PM had survived two migrations to an XML-first workflow and had had a pretty rough experience of a small conversion performed in 2001. In 2013, the initial proofreader moved on to another position, and the project manager, having started another outsourcing project, was assisted by two proofreaders.

The vendor had a solid structure made up of a project manager, a technical lead, a production manager and a team of initially 40 people working in two shifts. This team grew to over 100 towards the end. The project was closely followed by the head of operations.

Project management

This was not an IT project. It was a cross-disciplinary project involving IT, standards development and sales and marketing. It had the full support of ISO's top management and was very closely followed by the ISO membership.

Daily contact between the teams was essential in the first phase of the project. In addition, a weekly call was set up to discuss issues further. Even later in the project, when less needed to be clarified or discussed, this meeting still took place to keep up the relationship, give encouragement and ensure we were all moving in the same direction. The need to go beyond 'it's just business' when dealing with an offshore vendor cannot be emphasized enough: cultural intelligence goes a long way towards building a successful and long-lasting relationship.

Budget and time

Well, we spent the whole budget and the 25% contingency. But there were good reasons for this, as the actual numbers of tables, images and equations were far higher than anticipated, see Table 3.

Table 3– Estimates vs actuals content-type

                      Estimates   Actuals
Number of tables      40 000      156 000
Number of equations   15 000      180 000 MathML instances (of which 72 296 equations)
Number of images      76 000      138 000

Finally, the project overran by 3 months into 2014. This was partly due to the longer set-up period (doubled from 3 to 6 months), and although we had foreseen issues and put extra resources in place, the sheer complexity of the content and the disparity between estimates and actuals meant that more time was needed to ensure delivery of the same quality of output.

Lessons learnt

RFP

During the RFP process, we heard statements such as 'we can provide 100% accuracy and 100% automation'. This is simply impossible. Innodata was one of the more open providers, initially offering 98% accuracy and being quite upfront that this would be a semi-automated process, i.e., a large part of the operation would require manual tagging.

Quality

We did not simply throw our content over the fence and blindly load the processed work back into our systems. Trust has to be earnt, and investing so much time, effort and money in the internal proofing was paramount to the success of this project.

XML for multiple purposes

We embarked on this project with a view to using our XML for the online browsing platform (www.iso.org/obp) and did not necessarily consider that we, or our members, might want to republish old scanned PDFs from the XML. For instance, all images were saved as .png only; with hindsight, saving them as .eps too would have helped us downstream. It is important not to forget about future uses of your XML.

Batching and sequence

For the reasons mentioned above, we first started with the most recent documents and worked backwards. Source files were invariably Word and easier to convert. Progress was quick, and the volumes delivered made us believe we could finish the project early. But once we started sending batches of cPDFs and scanned PDFs only, the process slowed right down. Although we had tested the process with some files, when faced with whole batches of these file types the vendor had to increase the team size to guarantee delivery volume, and the process was amended somewhat to ensure the quality was upheld. In hindsight, we should perhaps have sent one PDF-only batch for every two Word-only batches from the outset, so that the provider could have foreseen the issue early on.

Tags cost money

During the set-up period, we selected a number of documents that included our best sellers. Then we simply went chronologically backwards, without necessarily looking at sales or whether a document was currently under revision. I think that was a mistake: during the two-year period, a number of converted files were withdrawn, and we should have foreseen this. Also, the detailed monthly billing showed that some large documents had cost over USD 2 000 to convert. Perhaps these documents should have been looked at more carefully in terms of priority.

Legacy and guesswork

Finally, I believe we would have had far fewer surprises if we had spent perhaps 3 months, and some money, for the vendor to carry out an inventory of the whole catalogue in terms of the numbers of tables, images and equations. This would have helped with planning, prioritization and budgeting. Don't underestimate what is in your content!

Conclusion

The project was completed on budget but ran over by 3 months. The XML delivered met the high quality criteria defined at the start of the project, namely:

1.

For the text: 99.995% accuracy, i.e., no more than 5 character errors in every 100 000 characters;

2.

For the tags: 99.95% accuracy, i.e., no more than 5 paired-tag errors in every 10 000 paired tags, inclusive of tags in formulae and tables.

The ISOSTS passed the conversion acid test with flying colours. Although we did come across exotic document structures in older documents or documents from other organizations, we could always find an appropriate way to mark them up. We sometimes had to be pragmatic rather than entirely semantically correct, but these instances were rare.

We were lucky that the provider chosen lived up to our expectations, and although we encountered glitches along the way, both parties worked hard to find timely solutions. It cannot be emphasized enough how important it is to foster a partnership in order to do such a project properly; considering it as just a service contract, without investing yourself personally in the human relationships, would not result in success.

Finally, this conversion was one of the cornerstones of the ISO XML project. In the knowledge continuum one can consider PDFs as islands of knowledge, conversion to XML builds bridges across these islands. This linked knowledge begins to form patterns and from there, if one categorizes or organizes those patterns, one can derive wisdom (or so they say!). XML makes this possible and opens the door to new opportunities, ideas and solutions.

Copyright Notice

The copyright holder grants the U.S. National Library of Medicine permission to archive and post a copy of this paper on the Journal Article Tag Suite Conference proceedings website.

Bookshelf ID: NBK425542
