Creating JATS XML from Japanese language articles and automatic typesetting using XSLT

Hidehiko Nakanishi; Toshiyuki Naganawa; Soichi Tokizane; Tsuyoshi Yamamoto

NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Journal Article Tag Suite Conference (JATS-Con) Proceedings 2015 [Internet]. Bethesda (MD): National Center for Biotechnology Information (US); 2015.

A Japanese-language journal has been converted into the JATS XML format and typeset automatically via XSL-FO to produce both the printed issues and the online journal, which is published on the J-STAGE e-journal platform in full-text HTML. As there are no established XML workflow tools for Japanese-language journals, Nakanishi Printing Company has developed its own workflow using AH (Antenna House) Formatter.

As STM journals by and large follow international standards even when written in Japanese, typesetting is fairly straightforward. Still, there are several challenges in processing agglutinative languages such as Japanese, which are common in Asian countries: for example, identifying family names and given names in a name string, or inserting a “Zero Width Joiner” to avoid unfavorable line breaks. We also had to develop individual XSLT for each article to position tables and figures correctly. As we go on to work with humanities journals, we expect to face more challenges.

Introduction

Not all research articles are written in English. In non-English-speaking countries, higher education and scientific research are conducted in the native language, and thus articles are submitted in languages other than English. Many such articles do not even use the Latin alphabet, but rather Chinese characters, Korean Hangul, or the Thai script, for example.

According to a study conducted by the National Institute of Science and Technology Policy (NISTEP), the proportion of STM articles written in Japanese was 25.6%. J-STAGE, an e-journal platform operated by the Japan Science and Technology Agency (JST), published 29,813 Japanese-language journal articles versus 17,182 English-language ones in 2013, i.e., 63.7% were in Japanese. In addition, most humanities and social science research articles, which are typically published in university journals, are naturally in Japanese rather than in English. A search of NDL-OPAC, which covers a wide range of articles published in Japan, revealed 47,888 university journal articles in Japanese in 2013 versus 5,048 in English, i.e., 90.4% were in Japanese¹⁾.

As JATS 0.4 (formerly NLM DTD 3.1) introduced so-called multi-language capability in early 2011²⁾, it has been possible to tag such Japanese-language research articles using JATS. J-STAGE now officially supports JATS 0.4 and encourages publishers to load their papers in JATS.

Multi-language articles on J-STAGE

The first such journal in JATS to appear on J-STAGE was The Japanese Journal of Gastroenterological Surgery (JJGS)³⁾. Figures 1 and 2 show the top pages of a sample article in Japanese and in English. J-STAGE has a toggle feature that lets readers switch between the Japanese page and the English page.

Fig. 1

A sample article page of The Japanese Journal of Gastroenterological Surgery (JJGS) on J-STAGE in JATS (https://www.jstage.jst.go.jp/browse/jjgs/45/7/_contents/-char/ja/)

Fig. 2

The same information as in Figure 1 in English.

Figure 3 shows the body text page of this article. Although the body text of this article is in Japanese (Kanji and Kana), figure captions are presented in English, e.g., as “Fig. 1” and “Fig. 2”, to help international readers get the idea.

Fig. 3

The body text page (in Japanese) of the same article as in Figure 1

Article titles, author names and affiliations, abstracts, and keywords are also prepared in both Japanese and English. Such multi-language presentation of article metadata is coded using the corresponding “alternatives” tags of JATS, such as <name-alternatives> (Figure 4). The NLM DTD allowed the <name> tag to be repeated, for example, so it was possible to code multiple expressions of a single name in different languages. But that practice did not make clear whether the multiple expressions belonged to a single person or to different persons. A wrapper such as the <name-alternatives> tag finally allowed us to distinguish such cases.

Fig. 4

A sample multi-language expression using <name-alternatives>

In the example of Figure 4, an author name is expressed twice: once in Japanese as “中西” and “秀彦”, and once in English as “Nakanishi” and “Hidehiko”.

The language of an element's value is declared using “xml:lang”. J-STAGE asks publishers to use the values “en” and “ja-Jpan” for “xml:lang”. The “alternatives” tags we use are listed in Table 1. For elements that do not need such disambiguation, such as <abstract> and <kwd-group>, simply repeating the element with different language attributes is sufficient. As <article-title> and <subtitle> have to be unique within an article, <trans-title> and <trans-subtitle> are used to express the data in the alternate language.

Table 1

Tags for multi-language expression in JATS

Fig. 5

Tagged author names of the article in Figures 1 and 2
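
The multi-language name markup discussed in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the markup shown in Figure 4 or Figure 5: the helper name `build_name_alternatives` is hypothetical, while the element names (`name-alternatives`, `name`, `surname`, `given-names`) and the `xml:lang` values “ja-Jpan” and “en” are those mentioned in the text.

```python
import xml.etree.ElementTree as ET

# Clark-notation key for the predefined xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_name_alternatives(names):
    """Wrap several language variants of ONE person's name (hypothetical helper).

    `names` maps an xml:lang value to a (surname, given names) pair.
    """
    wrapper = ET.Element("name-alternatives")
    for lang, (surname, given_names) in names.items():
        name = ET.SubElement(wrapper, "name", {XML_LANG: lang})
        ET.SubElement(name, "surname").text = surname
        ET.SubElement(name, "given-names").text = given_names
    return wrapper


element = build_name_alternatives({
    "ja-Jpan": ("中西", "秀彦"),
    "en": ("Nakanishi", "Hidehiko"),
})
xml_text = ET.tostring(element, encoding="unicode")
print(xml_text)
```

Because both <name> elements sit inside one wrapper, a consumer can tell that they describe the same person rather than two different authors.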

Workflow of creating Japanese XML articles in JATS

It is a challenge to create XML data from author manuscripts, which are typically written in Microsoft Word. For English-language articles, eXtyles, provided by Inera Inc., is a standard tool that many publishers use to convert a Word file into a JATS XML file. Others use offshore vendors to convert Word/PDF files to XML.

Unfortunately, eXtyles is not convenient enough for Japanese-language articles, and there is no other readily available system for Japanese texts. Publishers and typesetters have therefore had to cope with this challenge on their own.

Several approaches were implemented in Japan.

Output MS Word XML and convert it to JATS XML
Use eXtyles and then manually edit the resulting XML
Paste text into FrameMaker, export XML, and convert it to JATS XML
Ask offshore vendors to create XML

In the case of JJGS, the typesetter, Nakanishi Printing Company, has developed its own workflow to create XML as follows.

Converting Microsoft Word to Microsoft Office Open XML
Converting Microsoft Office Open XML to JATS XML
Validating XML

Converting Microsoft Word to Microsoft Office Open XML

Microsoft Office Open XML is an XML-based file format developed by Microsoft to represent word-processing documents, spreadsheets, and presentations, and an MS Word file can be converted into XML in this format⁴⁾. The Word file is styled in advance to support correct XML tagging. Because the Office Open XML tag set is very generic, charts and tables (spreadsheets) can also be exported into XML as containers.

Fig. 6

An example of Microsoft Office Open XML tags
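
As a rough illustration of why styling the Word file in advance helps, the following Python sketch reads the paragraph style and text of each paragraph from Office Open XML. It assumes the input is an ordinary .docx package (a ZIP archive containing `word/document.xml`); it is not the converter used in the workflow described here. The function name `styled_paragraphs` and the style names “ArticleTitle” and “Author” are hypothetical, and the in-memory sample is a toy stand-in that Word itself would not open.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def styled_paragraphs(docx_bytes):
    """Return (paragraph style, text) pairs for each paragraph of a .docx package."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    result = []
    for para in root.iterfind(".//w:p", NS):
        style = para.find("w:pPr/w:pStyle", NS)
        style_name = style.get(f"{{{W}}}val") if style is not None else None
        text = "".join(t.text or "" for t in para.iterfind(".//w:t", NS))
        result.append((style_name, text))
    return result


# Toy package holding only the main document part (hypothetical style names).
SAMPLE_DOCUMENT = f"""<w:document xmlns:w="{W}"><w:body>
<w:p><w:pPr><w:pStyle w:val="ArticleTitle"/></w:pPr><w:r><w:t>論文タイトル</w:t></w:r></w:p>
<w:p><w:pPr><w:pStyle w:val="Author"/></w:pPr><w:r><w:t>中西@秀彦</w:t></w:r></w:p>
</w:body></w:document>"""

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as package:
    package.writestr("word/document.xml", SAMPLE_DOCUMENT)

paragraphs = styled_paragraphs(buffer.getvalue())
print(paragraphs)
```

A later step could map each style name to a JATS element, which is only possible when the manuscript has been styled consistently.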

Converting Microsoft Office Open XML to JATS XML

The output XML file then goes through XSLT to remove unnecessary tags introduced by the Open XML converter. The resulting file is further processed by a Perl program to insert tags as defined by JATS. For English-language articles, objects such as author names or journal titles can be identified fairly reliably by looking at typeface, such as boldface or italics, or at punctuation, such as colons or periods. For Japanese-language articles, we have to insert word separators manually, especially for author names.

Agglutinative languages, such as Japanese or Korean, are characterized by the attaching of stems and affixes to form longer words that express conjugation. In Japanese and Korean, this results in completely “agglutinated” sentences with no word separators such as spaces. In Japanese, words can be separated by identifying nouns, which are written in Chinese characters (Kanji) most of the time, by using dictionaries, or simply by hand.

To identify elements for article metadata, we insert separators manually. This is especially the case for author names and affiliations. Japanese author names are often expressed as a combined string, where a surname, e.g., “中西”, and a given name, e.g., “秀彦”, are attached as “中西秀彦”. To tag such a name string, we need to insert a separator manually, e.g., “中西@秀彦”, because it could also be a combination of “中” and “西秀彦”, or of “中西秀” and “彦”, and there is no algorithm to determine the split correctly. We only know this by experience, or by asking the author himself/herself. Figure 7 shows an example of author names with separators.

Fig. 7

Example of inserted separators
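
Once the separator has been inserted by hand, splitting the name is mechanical. The following Python sketch shows the idea; it is not the Perl program used in the workflow, the function name `split_author_name` is hypothetical, and the “@” separator is taken from the example in the text.

```python
SEPARATOR = "@"  # separator character used in the example "中西@秀彦"


def split_author_name(marked_name, separator=SEPARATOR):
    """Split a manually marked name into (surname, given names).

    Unmarked or ambiguously marked strings are rejected rather than guessed,
    because the split cannot be determined automatically.
    """
    parts = marked_name.split(separator)
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"expected exactly one {separator!r} in {marked_name!r}")
    surname, given_names = parts
    return surname, given_names


print(split_author_name("中西@秀彦"))  # ('中西', '秀彦')
```

Raising an error for an unmarked string such as “中西秀彦” keeps the manual decision explicit instead of silently producing a wrong surname.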

Identifying elements is also an issue for citations. Family names and given names are almost never separated and have to be marked manually for separation. In addition, article titles and journal names have to be identified manually.

Validating XML

The resulting XML is then validated using the Oxygen XML editor, and the final JATS XML is obtained. It is uploaded to J-STAGE and published as full-text HTML. The quality of the article is checked using the preview feature of J-STAGE.
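
Some project-specific conventions can be checked with a small script in addition to validation in an XML editor. The Python sketch below is such a check under stated assumptions: it only tests well-formedness and the `xml:lang` values that J-STAGE asks for (“en” and “ja-Jpan”) on names inside <name-alternatives>. It is not a substitute for validation against the JATS DTD, and the function name `check_name_alternatives` is hypothetical.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
ALLOWED_LANGS = {"en", "ja-Jpan"}  # values requested by J-STAGE, per the text


def check_name_alternatives(xml_text):
    """Return a list of problems; an empty list means the checks passed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as error:
        return [f"not well-formed: {error}"]
    problems = []
    for index, group in enumerate(root.iter("name-alternatives"), start=1):
        langs = [name.get(XML_LANG) for name in group.findall("name")]
        for lang in langs:
            if lang not in ALLOWED_LANGS:
                problems.append(
                    f"name-alternatives #{index}: unexpected xml:lang {lang!r}")
        if len(langs) != len(set(langs)):
            problems.append(f"name-alternatives #{index}: repeated xml:lang value")
    return problems


# The second <name> deliberately lacks xml:lang, so one problem is reported.
sample = """<contrib><name-alternatives>
<name xml:lang="ja-Jpan"><surname>中西</surname><given-names>秀彦</given-names></name>
<name><surname>Nakanishi</surname><given-names>Hidehiko</given-names></name>
</name-alternatives></contrib>"""
problems = check_name_alternatives(sample)
print(problems)
```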

Creating PDF

Using AH Formatter

Although JJGS is not published in print, there is a strong need to view articles in PDF. Figures 8 and 9 show PDF images corresponding to the HTML in Figures 1 and 3, respectively.

Fig. 8

PDF image corresponding to Figure 1

Fig. 9

PDF image corresponding to Figure 3

Such PDFs are created using AH Formatter⁵⁾ from Antenna House, Inc. We have developed XSLT for this tool. An example is shown in Figure 10.

Fig. 10

XSLT used for AH Formatter

The XSLT converts a JATS file into XSL Formatting Objects (XSL-FO), which express the page model for the PDF. The XSL-FO is then converted to PDF using AH Formatter. The resulting PDF is used for proofreading by the editorial office and the authors. Any corrections are reflected by editing the original XML or by modifying the XSLT.
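
To make the JATS-to-XSL-FO step concrete, the following Python sketch builds a very small XSL-FO document from a JATS fragment. It stands in for the XSLT of Figure 10 only in spirit: the function name `jats_to_fo`, the A4 page geometry, and the font settings are assumptions chosen for illustration, and a real stylesheet would handle far more of the JATS vocabulary. The output would then be handed to an XSL-FO formatter to produce the PDF.

```python
import xml.etree.ElementTree as ET

FO = "http://www.w3.org/1999/XSL/Format"  # XSL-FO namespace
ET.register_namespace("fo", FO)


def fo(tag):
    """Qualify an XSL-FO element name with its namespace."""
    return f"{{{FO}}}{tag}"


def jats_to_fo(jats_xml):
    """Turn the article title and body paragraphs into a one-flow XSL-FO document."""
    article = ET.fromstring(jats_xml)
    root = ET.Element(fo("root"))
    masters = ET.SubElement(root, fo("layout-master-set"))
    page = ET.SubElement(masters, fo("simple-page-master"), {
        "master-name": "A4", "page-width": "210mm",
        "page-height": "297mm", "margin": "20mm"})
    ET.SubElement(page, fo("region-body"))
    sequence = ET.SubElement(root, fo("page-sequence"), {"master-reference": "A4"})
    flow = ET.SubElement(sequence, fo("flow"), {"flow-name": "xsl-region-body"})

    title = article.findtext(".//article-title", default="")
    heading = ET.SubElement(flow, fo("block"),
                            {"font-size": "14pt", "font-weight": "bold"})
    heading.text = title

    body = article.find("body")
    if body is not None:
        for para in body.iter("p"):
            block = ET.SubElement(flow, fo("block"), {"text-indent": "1em"})
            block.text = "".join(para.itertext())
    return ET.tostring(root, encoding="unicode")


sample = """<article><front><article-meta><title-group>
<article-title>サンプル論文</article-title></title-group></article-meta></front>
<body><sec><p>本文の第一段落。</p><p>本文の第二段落。</p></sec></body></article>"""
fo_text = jats_to_fo(sample)
print(fo_text)
```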

Special Care Needed

PDF files created in this way are mostly good as far as STM papers are concerned, as they are basically in the same or a similar format as the corresponding Western articles. UTF-8, which is the standard character encoding for XML, also makes it possible to express most Japanese characters correctly.

Still, we have the following problems.

Avoiding punctuation, geminate consonants, and dashes at the top of a line

Although Japanese texts do not hyphenate words, there are rules that apply to line breaks.

Rules of this type can be handled by a formatter such as AH Formatter.

Fig. 11

Avoiding line-top punctuation (“」”)
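
The line-start rule can be illustrated with a naive fixed-width line breaker in Python. This is only a sketch of the idea, not how AH Formatter works internally: the function name `break_lines` is hypothetical, the set of prohibited characters is an abbreviated assumption, and real formatters also handle line-end rules, character widths, and justification.

```python
# Characters that should not begin a line: closing punctuation,
# small kana (including the geminate "っ"), and dashes. Abbreviated list.
NO_LINE_START = set("、。，．」』）！？っゃゅょッャュョー―")


def break_lines(text, width):
    """Break text every `width` characters, moving a break earlier when the
    next line would otherwise start with a prohibited character."""
    lines = []
    start = 0
    while start < len(text):
        end = min(start + width, len(text))
        # Carry the preceding character down to the next line instead.
        while end < len(text) and end - start > 1 and text[end] in NO_LINE_START:
            end -= 1
        lines.append(text[start:end])
        start = end
    return lines


lines = break_lines("彼は「こんにちは」と言った", 8)
print(lines)  # ['彼は「こんにち', 'は」と言った']
```

A plain break after eight characters would have left “」” at the top of the second line; here the preceding “は” is moved down with it.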

Avoiding breaking up a word, especially a person’s name

Fig. 12

Avoiding breakups of certain words (“中西” is a person’s family name)

This can only be achieved by inserting the “Zero Width Joiner” character (U+200D, i.e., &#x200D; as an XML character reference) between the characters in advance, as in “中&#x200D;西”. The drawback of this practice is that a text search for “中西” fails.
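
Both the technique and its drawback can be reproduced in a few lines of Python. In this sketch the function names `protect_words` and `strip_joiners` are hypothetical; stripping the joiner before indexing or searching is one possible mitigation, not something the workflow described here is stated to do.

```python
ZWJ = "\u200d"  # ZERO WIDTH JOINER


def protect_words(text, words):
    """Insert ZWJ between the characters of each listed word."""
    for word in words:
        text = text.replace(word, ZWJ.join(word))
    return text


def strip_joiners(text):
    """Remove ZWJ again, e.g. before building a search index."""
    return text.replace(ZWJ, "")


original = "著者は中西秀彦である。"
protected = protect_words(original, ["中西"])

print("中西" in protected)                 # False: the search fails
print("中西" in strip_joiners(protected))  # True
```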

Positioning Figures and Tables

It is also necessary to develop separate XSLTs to process figures and tables in order to create acceptable PDF, which may be the case even for English-language articles published in Japan. This is because Japanese authors and publishers ask for figures and tables to be placed exactly where they want them, rather than where the formatter places them automatically. An example of such XSLT is shown in Figure 13.

Fig. 13

Sample XSLT for figures

This requires a lot of manual processing, which certainly raises costs. We typesetters have been trying to persuade authors and publishers to accept automatic placement⁶⁾, but have not been very successful so far.

What is to be done next

So far, what we have needed to process are STM articles written in the standard Western way, and the difficulties we face are limited. In the future, we will need to deal with social science and humanities literature, which is more traditional and has the following characteristics.

Vertical writing

Fig. 14

Horizontal vs. vertical writing

Although this itself does not require any special treatment in JATS tagging, automatic typesetting is not easy.

Vertical writing does not simply mean aligning characters vertically. For example, when Arabic numerals or Latin letters are written vertically, there are several orientation options: 1) to rotate them (left), 2) not to rotate them (center), and 3) to use Chinese numerals (right), as in Figure 15.

Fig. 15

Various patterns for vertical writing

This means we need to declare writing direction when we create an XML file, such as <writing-direction type-of-direction="vertical">. We do not have such a tag in JATS yet.

Emphasis or Kenten

Fig. 16

Examples of emphases

Emphasis (Kenten) is an extension of boldface or italics and is often seen in Japanese articles. It is not yet supported by JATS.

Warichu

Fig. 17

Examples of warichu

Warichu is a short note inserted within a sentence and set in two lines, typically within parentheses. It is often used in scholarly publications in the humanities and is supported by MS Word.

Conclusion

Writing is a culture. Historically, Japanese writing and typesetting, as well as those of China and Korea, have been extremely conscious of visual effect. This is probably because we use a pictographic/ideographic writing system. It also explains why calligraphy has been so popular and so advanced in these East Asian countries.

Thus authors and publishers care deeply about page layout, even if the page consists of text only. In describing text in XML, it is sometimes necessary to code layout information such as Warichu. Perhaps we should focus on the semantics of Warichu, that is, an inserted note, rather than on its style, but this requires further thought. As we go further into traditional Japanese-language papers, we will discover more issues, which may or may not be solved by extending JATS.

References

1. Terutaka Kuwabara. Combined data analysis using the KAKEN database and Web of Science. 2013/3/6. http://www.mext.go.jp/b_menu/shingi/gijyutu/gijyutu4/030/shiryo/__icsFiles/afieldfile/2013/03/19/1331868_03.pdf (accessed on February 1, 2015).
2. Deborah Aleyne Lapeyre and B. Tommie Usdin. Introduction to Multi-language Documents in NISO JATS. Journal Article Tag Suite Conference (JATS-Con) Proceedings 2011. http://www.ncbi.nlm.nih.gov/books/NBK62175/ (accessed on February 1, 2015).
3. Soichi Tokizane. Implementing XML for Japanese-language scholarly articles. Journal Article Tag Suite Conference (JATS-Con) Proceedings 2012. http://www.ncbi.nlm.nih.gov/books/NBK100380/ (accessed on February 1, 2015).
4. Introducing the Office (2007) Open XML File Formats. https://msdn.microsoft.com/en-us/library/aa338205%28v=office.12%29.aspx (accessed on February 1, 2015).
5. Antenna House AH Formatter V6. http://www.antennahouse.com/product/ahf60/ahf6top.htm (accessed on February 1, 2015).
6. Hidehiko Nakanishi. From human readability to machine readability: A proposal from a creator and publisher of an XML journal (in Japanese). Journal of Information Processing and Management, Vol. 57 (2014), No. 3, pp. 149–156. https://www.jstage.jst.go.jp/article/johokanri/57/3/57_149/_article/references/-char/ja/ (accessed on February 13, 2015).

The copyright holder grants the U.S. National Library of Medicine permission to archive and post a copy of this paper on the Journal Article Tag Suite Conference proceedings website.

This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License

Bookshelf ID: NBK279832
