From Microsoft Word to NISO STS with Inera eXtyles | Standards Symposium 2021

With Inera eXtyles, publishers of complex and structured content (journals, books, reports, and standards) can automate time-consuming editorial tasks and easily convert Word to XML.

In this session from the 2021 Typefi Standards Symposium, Robin Dunford, Senior Solution Architect at Inera, demonstrates how you can use eXtyles STS to create validated NISO STS XML files directly from Microsoft Word.

“ISO was our first standards customer—we configured eXtyles for them over 10 years ago to allow them to convert their Microsoft Word documents, first to ISO STS XML and then more recently to NISO STS.”


Transcript

00:00	Introduction: Word to NISO XML with Inera eXtyles
02:25	eXtyles STS
02:59	Demo: Preparing a Word document for export to NISO STS XML
	Auto-Redact
	Standards Reference Processing
	Inline Standard Citation Processing
	URL Checking
	Citation Matching
11:24	Demo: Exporting a Word document to NISO STS XML

Introduction: Word to NISO XML with Inera eXtyles (00:00)

Hi, my name is Robin Dunford. I’m a Senior Solution Architect with Inera, now part of Atypon, and I’m going to talk to you today about our solution for going from unstructured Microsoft Word content through to NISO STS XML.

The key to this is our tool, eXtyles, which some of you may have heard of before, which is a plug-in to Microsoft Word that we developed over the last 20 years, to help content providers structure author-generated content and get information into it that allows them to then generate structured outputs, whether that be a Microsoft Word file again, or SGML or XML.

And for more than 10 years, we’ve been working with standards publishers with this tool. ISO was our first standards customer, and we configured eXtyles for them over 10 years ago to allow them to convert their Microsoft Word documents, first to ISO STS XML and then more recently to NISO STS.

Since ISO came on board, we’ve worked with a number of other standard publishers, both inside and outside the ISO network, who have now adopted eXtyles.

So, as I said, eXtyles is a way of getting structure into Word documents so you can easily apply semantically meaningful paragraph styles that identify parts of the document.

You can apply business rules to normalise spelling, punctuation, wording, citation styles, reference styles, and so on. You can identify and parse out elements of cited standards and also validate URLs that are referenced in the document.

And eXtyles will allow you to generate links between cited objects, such as clauses, tables, figures, equations, and so on, and their citations in the document. eXtyles will automatically identify missing or uncited items that you can then resolve.

And finally, eXtyles allows you to make a one-button conversion of the Microsoft Word document to NISO STS XML output with full granularity, linkages between citations, and so on.

eXtyles STS (02:25)

So, this experience of working with ISO and a number of other standards publishers has allowed us to develop a tool called eXtyles STS.

eXtyles STS is an out-of-the-box solution that allows standards publishers to convert unstructured Word to NISO STS XML without the need for customisation, which makes it a highly affordable way of getting structured NISO STS XML from Microsoft Word.

Preparing a Word document for export to NISO STS XML (02:59)

This is a typical document as you might find it prepared by a committee. You can see from the paragraph styles that have been used in this case that we’ve got a mixture of things like Normal, which are just the built-in Word styles, and then we’ve also got some styles from a template that the committee had been supplied, but there’s no real overall pattern to how well the template has been applied.

In some cases it’s been applied properly, and in other cases, the authoring group have just used built-in styles.

So I’m going to skip ahead a little bit, past some of the preparation work that you would do in eXtyles. This document has been activated, which essentially means prepared for use with eXtyles. And then it’s also been styled using the eXtyles style palette, which is a tool that allows you to apply a constrained list of paragraph styles to the document quickly and easily, and make sure that the styles that are in use are the right ones for your content.
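As a rough illustration of what a constrained list of paragraph styles buys you, the sketch below audits a document against an allowed set. It is not eXtyles code: the style names, the `ALLOWED_STYLES` set, and the `audit_styles` helper are all assumptions made up for this example, and a document is modelled simply as a list of (style, text) pairs.

```python
from collections import Counter

# Hypothetical palette: these style names are illustrative, not eXtyles' own.
ALLOWED_STYLES = {
    "Heading 1",
    "Heading 2",
    "Body Text",
    "Normative Reference",
    "Bibliography Entry",
}


def audit_styles(paragraphs):
    """Count paragraphs per style and list those outside the palette.

    `paragraphs` is a list of (style_name, text) pairs.
    Returns (counts, off_palette), where off_palette holds
    (paragraph_index, style_name) for every off-palette paragraph.
    """
    counts = Counter(style for style, _ in paragraphs)
    off_palette = [
        (index, style)
        for index, (style, _) in enumerate(paragraphs)
        if style not in ALLOWED_STYLES
    ]
    return counts, off_palette


# A committee draft that mixes template styles with Word's built-in "Normal".
paragraphs = [
    ("Heading 1", "Scope"),
    ("Normal", "This document specifies ..."),
    ("Body Text", "It applies to ..."),
    ("Normal", "Sample reference paragraph"),
]
counts, off_palette = audit_styles(paragraphs)
```

Here `off_palette` points at the two paragraphs still in Normal, which are the ones an editor would restyle before moving on.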

Auto-Redact

The next step is Auto-Redact, which is a set of sophisticated find-and-replace dictionaries that allow you to apply business rules around spelling, around punctuation and spacing, and so on. And these can be highly tailored to your content if you need to.

And they can also be applied just to specific paragraph styles, so if you have a change that you only want to make to normative references, for example, then you can do that and it won’t change the same text if it’s found elsewhere.

So that’s running through a set of these rules at the moment, making the changes and making sure that the wording and the spelling obey your business rules. eXtyles STS comes with a set of predefined rules, but these can be expanded if you need rules to cover other situations or other constructs.
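The idea of find-and-replace rules that can be limited to particular paragraph styles can be sketched in a few lines. This is only an illustration under stated assumptions: the rule format, the style names, and the `apply_rules` helper are invented here and say nothing about how Auto-Redact dictionaries are actually stored or run.

```python
import re

# Each rule is (pattern, replacement, styles). `styles` is a set of paragraph
# style names the rule is limited to, or None to apply it everywhere.
# All rules and style names here are hypothetical.
RULES = [
    (re.compile(r"\be-mail\b"), "email", None),                # spelling, everywhere
    (re.compile(r" {2,}"), " ", None),                         # spacing, everywhere
    (re.compile(r"\biso\b"), "ISO", {"Normative Reference"}),  # scoped rule
]


def apply_rules(paragraphs, rules=RULES):
    """Apply each rule in order to every (style, text) paragraph it covers."""
    cleaned = []
    for style, text in paragraphs:
        for pattern, replacement, styles in rules:
            if styles is None or style in styles:
                text = pattern.sub(replacement, text)
        cleaned.append((style, text))
    return cleaned


doc = [
    ("Body Text", "Send an  e-mail citing iso guidance."),
    ("Normative Reference", "iso 12345-1, Example title"),
]
cleaned = apply_rules(doc)
```

The scoped rule capitalises "iso" in the normative reference but leaves the same text alone in the body paragraph, which is the behaviour described above.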

Standards Reference Processing

I’m going to move on to the next step on the menu, which is Standards Reference Processing. And what this is doing is locating normative references and also the bibliography, and parsing out detail from those—the publishing organisation, the title of the standard, and so on—and applying granular structure to that content.

So, we get an alert telling us there are two unknown references. There’s the normative reference here, which has been styled with different styles for the publisher, for the number of the standard and for the title.

And at the end here we’ve got, again, some standards that have been parsed out into the different entities within the standard, but then we’ve got a couple of pieces of what you might call grey literature that don’t follow that same pattern. We haven’t really applied much structure here apart from identifying the number of the reference and also tagging the year of publication.
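To make the parsing step concrete, here is a minimal sketch that splits a reference string into publisher, number, year, and title, and reports anything that does not fit the pattern as unknown. The regular expression, the `parse_reference` helper, and the sample strings are assumptions for illustration only; real standards designations are more varied than this pattern allows, and this is not how eXtyles itself parses references.

```python
import re

# Simplified pattern for designations such as "ISO/IEC 12345-1:2020, Title".
REF_PATTERN = re.compile(
    r"^(?P<publisher>[A-Z]+(?:/[A-Z]+)*)\s+"  # e.g. ISO, IEC, ISO/IEC
    r"(?P<number>\d+(?:-\d+)*)"               # e.g. 12345 or 12345-1
    r"(?::(?P<year>\d{4}))?"                  # optional :2020
    r",\s*(?P<title>.+)$"                     # title after the comma
)


def parse_reference(text):
    """Return a dict of reference parts, or None if the text is not recognised."""
    match = REF_PATTERN.match(text.strip())
    return match.groupdict() if match else None


# Made-up sample references: one dated, one undated, one grey literature.
references = [
    "ISO/IEC 12345-1:2020, Example title",
    "IEC 99999, Another example title",
    "Smith J. Example report on widgets. 2019.",
]
parsed = [parse_reference(ref) for ref in references]
unknown = [ref for ref, result in zip(references, parsed) if result is None]
```

The third entry ends up in `unknown`, much like the grey literature in the demo that does not follow the standards pattern.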

Inline Standard Citation Processing

The next step, Inline Standard Citation Processing, is going to look for citations of standards in the text and again, tag those with different character styles for the different elements—the publisher, the standard number and so on.

That process reports no problems. And we can see here in the text that citations have been picked up, and the publisher, the document number, year, have been applied. Even sections of the citation have also been picked up.
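Finding citations in running text, including the cited section, might be sketched as below. The pattern covers only a handful of publisher prefixes and a "Clause" or "Annex" suffix; those choices, the `find_citations` helper, and the sample sentence are assumptions for this example rather than a description of the eXtyles implementation.

```python
import re

# Simplified inline pattern: publisher, number, optional year, optional section.
CITATION = re.compile(
    r"\b(?P<publisher>ISO(?:/IEC)?|IEC|EN)\s"
    r"(?P<number>\d+(?:-\d+)*)"
    r"(?::(?P<year>\d{4}))?"
    r"(?:,\s(?P<section>(?:Clause|Annex)\s[A-Z0-9]+(?:\.\d+)*))?"
)


def find_citations(text):
    """Return one dict per standard citation found in the text, in order."""
    return [match.groupdict() for match in CITATION.finditer(text)]


# Made-up sentence with a dated citation plus clause, and an undated one.
text = (
    "Test methods follow ISO 12345-1:2020, Clause 4.2. "
    "See also IEC 99999 for terminology."
)
citations = find_citations(text)
```

Each dict plays the role of the character styles in the demo: it records which part of the matched text is the publisher, the number, the year, and the section.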

URL Checking

What URL Checking does is it looks for anything in the document that looks like a URL, and essentially tries to ping that URL across the internet and then sees what happens. So you might get URLs that have been redirected, or that simply don’t work because someone has put a mistake, a typo, in the text or something like that.

And you might get a URL where a username and password is required to access the content. So we would warn you about that because that might be important to know that the content is not freely available to all readers necessarily.

We’ve got a message here telling us that there’s one issue with URLs, and we can see here that the URL is actually in a footnote. The message we’re getting is “the internet name is not resolved”, and that’s because I deliberately introduced a misspelling into the URL. It’s actually the UK spelling of colour with a ‘u’.

So you can tell then that you need to go and look at that URL, figure out what the problem is, and resolve it, and that can be useful. Obviously you don’t want to be publishing URLs that don’t work.
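The outcomes described above (redirected, not resolving, login required) can be told apart with Python's standard library, as in the sketch below. This is an assumption-laden illustration, not the eXtyles check: `fetch` and `check_url` are hypothetical helpers, a HEAD request is used although some servers reject it, and the lookup function is passed in so the classification can be tried without network access.

```python
import urllib.error
import urllib.request


def fetch(url, timeout=10):
    """Request a URL and return (status_code, final_url) after any redirects."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.geturl()


def check_url(url, fetcher=fetch):
    """Classify a URL as ok, redirected, login required, or broken."""
    try:
        status, final_url = fetcher(url)
    except urllib.error.HTTPError as err:  # must come before URLError
        if err.code in (401, 403):
            return "login required"
        return f"HTTP error {err.code}"
    except urllib.error.URLError as err:   # e.g. the host name does not resolve
        return f"unreachable: {err.reason}"
    except OSError as err:                 # e.g. a timeout while reading
        return f"unreachable: {err}"
    if final_url != url:
        return f"redirected to {final_url}"
    return "ok"


# Stub standing in for a misspelled host name, so no network is needed here.
def fake_dns_failure(url):
    raise urllib.error.URLError("name not resolved")


result = check_url("https://example.invalid/colour", fake_dns_failure)
```

Calling `check_url(url)` without a second argument uses the real `fetch` and does go out over the network.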

Citation Matching

And finally, Citation Matching. So this is going to look for citations of the literature, so the bibliography in this case, but also of things like tables, figures, and sections in the text.

It’s asking us what citation style is in use. I can see here it’s numbers in brackets on the baseline, so I select that option and hit “OK”. Now Citation Matching is going to go through the document looking for citable items, and then also looking for citations of those items, warning if things are not cited, as we’ll see, and also applying a character style to cited items.

Ultimately, these links between cited items and their citations will be reflected in the XML. So eXtyles will be using these character styles that have been applied to construct linkages in the XML between the cited item and its citations.

Seven problems from citation checking, so we can just quickly look at those. You can see things like this citation of annex D here, citations of a section there and of a bibliographic reference there. And we can see Citation Matching is telling us that these figures are not cited in the annex. That may be fine according to your business rules, and they may not need to be, but you’re also getting the alert that annex B isn’t cited, and so on.
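The core of this kind of check is a comparison between two sets: the items a document defines and the items its text cites. The sketch below shows that comparison for figures, tables, annexes, and bracketed bibliography numbers. The style names, the patterns, and the `match_citations` helper are assumptions for illustration and are much cruder than what eXtyles does.

```python
import re

# Labels such as "Figure 1", "Table A.2", "Annex B", and citations like "[3]".
LABEL = re.compile(r"\b(Figure|Table|Annex)\s([A-Z](?:\.\d+)?|\d+)\b")
BIB_NUMBER = re.compile(r"\[(\d+)\]")


def match_citations(paragraphs):
    """Compare defined items with cited items in (style, text) paragraphs.

    Returns (uncited, missing): items defined but never cited, and items
    cited but never defined, each as a sorted list of labels.
    """
    defined, cited = set(), set()
    for style, text in paragraphs:
        if style in ("Caption", "Annex Title"):       # hypothetical style names
            match = LABEL.match(text)
            if match:
                defined.add(f"{match.group(1)} {match.group(2)}")
        elif style == "Bibliography Entry":
            match = BIB_NUMBER.match(text)
            if match:
                defined.add(f"[{match.group(1)}]")
        else:                                         # ordinary body text
            cited.update(f"{kind} {num}" for kind, num in LABEL.findall(text))
            cited.update(f"[{num}]" for num in BIB_NUMBER.findall(text))
    return sorted(defined - cited), sorted(cited - defined)


doc = [
    ("Body Text", "The layout is shown in Figure 1 and discussed in Annex A [1]."),
    ("Caption", "Figure 1 Overview"),
    ("Caption", "Figure 2 Detail"),
    ("Body Text", "Limits are given in Table 3."),
    ("Annex Title", "Annex A Test method"),
    ("Annex Title", "Annex B Examples"),
    ("Bibliography Entry", "[1] ISO 12345-1:2020, Example title"),
    ("Bibliography Entry", "[2] Example report"),
]
uncited, missing = match_citations(doc)
```

In this made-up document, `uncited` lists Annex B, Figure 2 and reference [2], while `missing` lists Table 3, which is cited but never defined. Whether an uncited item is a problem depends on your business rules, as noted above.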

Exporting a Word document to NISO STS XML (11:24)

And finally we now have this one button, Export to XML, and obviously you might have an editing step that goes ahead at some point here as well, once that structure is in the Word document, before exporting to XML.

This is using a combination of the paragraph styles that were applied and also these character styles, and then also internal intelligence about structures within the document, for example, list items and so on, as we’ll see from the XML shortly.

We’ve got a message telling us that the export has been successful, and here we have the XML. You can see that, for example, where in the Word file this is just a generic front matter heading, the intelligence of the export has applied a sec-type attribute, and similarly for scope and so on.

So there’s a lot of added intelligence on top of the paragraph and character styles that eXtyles has applied that helps you to generate that highly granular XML file.
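As a very reduced picture of that export step, the sketch below turns styled paragraphs into nested `sec`, `title`, and `p` elements and adds a `sec-type` attribute when a heading's text is recognised. The style names, the `SEC_TYPES` mapping and its values, and the `build_sections` helper are assumptions for this example. The output is nowhere near a complete or validated NISO STS file, and it does not represent the eXtyles export logic.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from heading text to sec-type values.
SEC_TYPES = {"foreword": "foreword", "introduction": "intro", "scope": "scope"}


def build_sections(paragraphs):
    """Group (style, text) paragraphs into <sec> elements under a <body>.

    Each "Heading 1" paragraph starts a new section. Paragraphs that appear
    before the first heading are ignored in this simplified sketch.
    """
    body = ET.Element("body")
    sec = None
    for style, text in paragraphs:
        if style == "Heading 1":
            sec = ET.SubElement(body, "sec")
            sec_type = SEC_TYPES.get(text.strip().lower())
            if sec_type:
                sec.set("sec-type", sec_type)  # inferred from the heading text
            ET.SubElement(sec, "title").text = text
        elif sec is not None:
            ET.SubElement(sec, "p").text = text
    return body


doc = [
    ("Heading 1", "Scope"),
    ("Body Text", "This document specifies ..."),
    ("Heading 1", "Test method"),
    ("Body Text", "The sample is ..."),
]
body = build_sections(doc)
xml_text = ET.tostring(body, encoding="unicode")
```

The "Scope" heading carries no special style in the input, yet the resulting `sec` element gets `sec-type="scope"`, while the "Test method" section gets no attribute at all.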

Thanks very much for your attention, and I hope you’ve enjoyed the presentation.

Robin Dunford

Senior Solution Architect | Inera, an Atypon company

All over the world, Inera customers use eXtyles and Edifix to standardise content in Word and create high-quality XML. Inera’s editorial and XML solutions drive multiformat publishing for scholarly journals and books, standards, government documents, and more.

Robin Dunford has worked with Inera since 2012. He is a member of Inera’s customer support team and is also involved in setting up new eXtyles configurations; eXtyles user, administrator, and developer training; eXtyles software development; and international publishing events and conferences. Robin has worked closely with standards organisations, including IEEE, ISO, CEN/CENELEC, and several national standards bodies, on eXtyles configuration, eXtyles training, and workflow consulting.

Robin holds a PhD in Plant Biochemistry and spent six years in academic research before moving into publishing.