Partial Galaxy ToolConfig to DocBook CmdSynopsis conversion with XSLT RegEx

Jul 21, 2011

As blogged about before, I was interested in the differences between the Galaxy ToolConfig format and the DocBook cmdsynopsis format, for the purpose of automatically generating wizards (see an example that I screencasted here) that fill in the required parameters of command-line tools. To quickly get some hands-on experience with the formats, I started writing an XSLT transformation from the Galaxy ToolConfig format to the DocBook cmdsynopsis format.

I quickly realized some important differences, such as that cmdsynopsis lacks a way to specify a list of possible/valid options for a parameter, which could be used for creating drop-downs in the wizards. But apart from that, the little work on the transformation I had already done when I realized this was actually a nice little exercise in using regex with XSLT. Look at the content of the command tag in this excerpt of a Galaxy ToolConfig XML file:

<tool id="sam_to_bam" name="SAM-to-BAM" version="1.1.1">
  <description>converts SAM format to BAM format</description>
  <requirements>
    <requirement type="package">samtools</requirement>
  </requirements>
  <command interpreter="python">
    sam_to_bam.py
      --input1=$source.input1
      --dbkey=${input1.metadata.dbkey} 
      #if $source.index_source == "history":
        --ref_file=$source.ref_file
      #else
        --ref_file="None"
      #end if
      --output1=$output1
      --index_dir=${GALAXY_DATA_INDEX_DIR}
  </command>
  <inputs>
    <conditional name="source">
      <param name="index_source" type="select" label="Choose the source for the reference list">
        <option value="cached">Locally cached</option>
        <option value="history">History</option>
      </param>
      <when value="cached">
        <param name="input1" type="data" format="sam" label="SAM File to Convert">
           <validator type="unspecified_build" />
           <validator type="dataset_metadata_in_file" filename="sam_fa_indices.loc" metadata_name="dbkey" metadata_column="1" message="Sequences are not currently available for the specified build." line_startswith="index" />
        </param>
      </when>
      <when value="history">
        <param name="input1" type="data" format="sam" label="Convert SAM file" />
        <param name="ref_file" type="data" format="fasta" label="Using reference file" />
      </when>
    </conditional>
  </inputs>
  <outputs>
    <data format="bam" name="output1" label="${tool.name} on ${on_string}: converted BAM" />
  </outputs>
</tool>

... you see that in the command tag, the actual syntax of the command is specified in a kind of "free text" format ... Parsing this might not be exactly what one would think of using XSLT transformations for, but with the regex functionality in XSLT 2.0 you definitely have this option too. Helped by this article on xml.com, I put together this little XSLT stylesheet for parsing the free-text content of that command tag (I haven't gotten to the more detailed config inside the inputs tag of the Galaxy format, but might not need to either, if staying with the Galaxy format anyway):

<?xml version="1.0"?>
 
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
 
    <xsl:output method="xml" indent="yes" encoding="UTF-8" />
 
    <xsl:template match="/">
        <cmdsynopsis>
            <xsl:apply-templates select="tool/command" />
        </cmdsynopsis>
    </xsl:template>
 
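    <xsl:template match="tool/command">
        <!-- The interpreter attribute (e.g. "python") becomes the command name -->
        <command>
            <xsl:value-of select="@interpreter" />
        </command>
        <!-- Clean up the free text, innermost replace() first:
             1. remove all spaces, so each token sits alone on its own line
             2. remove the "#if" / "#else" / "#end if" directive lines
             3. turn runs of newlines into single spaces
             4. trim leading and trailing whitespace
             and finally split what remains on whitespace -->
        <xsl:for-each select='tokenize(
                                    replace(
                                        replace(
                                            replace(
                                                replace(
                                                    .,
                                                    "[ ]+",
                                                    ""),
                                                "\n#[^\s]+",
                                                ""),
                                            "\n+",
                                            " "),
                                        "(^\s+|\s+$)",
                                        ""),
                                    "\s")'>
            <!-- Skip tokens that contain a Galaxy-specific "${...}" expression -->
            <xsl:if test='not(matches(., "\{"))'>
                <arg>
                    <!-- The option name is everything before the "=" -->
                    <xsl:value-of select='replace(., "=.*", "")' />
                    <xsl:if test='matches(., "=")'>
                        <xsl:text> </xsl:text>
                        <replaceable>
                            <!-- The value is everything after the "=", minus a leading "$" -->
                            <xsl:value-of select='replace(., ".*=\s*\$?", "")' />
                        </replaceable>
                    </xsl:if>
                </arg>
            </xsl:if>
        </xsl:for-each>
    </xsl:template>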
</xsl:stylesheet>

... a bit crazy with all these nested regex replace function calls, no? :) ... but I can tell you, it actually works very well! I found it easier to work with than many other regex implementations (e.g. newlines can be matched with "\n", which I think you can't do by default in some other ones).
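For anyone without an XSLT 2.0 processor at hand, the nested calls can be unrolled into one step per line. This is just a rough Python sketch of the same idea, not part of the stylesheet: the names COMMAND_TEXT, clean_command and tokens are made up for illustration, and Python's re module is not the same regex flavour as the one in XPath 2.0, although these particular patterns should behave alike in both.

```python
import re

# Free-text content of the <command> element from the sam_to_bam excerpt.
COMMAND_TEXT = """
    sam_to_bam.py
      --input1=$source.input1
      --dbkey=${input1.metadata.dbkey}
      #if $source.index_source == "history":
        --ref_file=$source.ref_file
      #else
        --ref_file="None"
      #end if
      --output1=$output1
      --index_dir=${GALAXY_DATA_INDEX_DIR}
  """


def clean_command(text):
    """Apply the stylesheet's four replacements, innermost first."""
    # 1. Drop all spaces, so every token sits alone on its own line.
    text = re.sub(r"[ ]+", "", text)
    # 2. Drop the directive lines ("#end if" has lost its space by now).
    text = re.sub(r"\n#[^\s]+", "", text)
    # 3. Turn runs of newlines into single spaces.
    text = re.sub(r"\n+", " ", text)
    # 4. Trim leading and trailing whitespace.
    text = re.sub(r"(^\s+|\s+$)", "", text)
    return text


def tokens(text):
    """Split the cleaned text on whitespace and skip "${...}" tokens."""
    return [t for t in re.split(r"\s", clean_command(text)) if "{" not in t]


for token in tokens(COMMAND_TEXT):
    print(token)
```

Each printed token corresponds to one arg element produced by the stylesheet, before the split at the "=" sign.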

I can also mention that the tokenize function splits a string into an "array" of the parts between the parts that are matched by the expression given to tokenize (similar to "split" in some other languages, like Python).
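To make the comparison concrete, here is a small Python sketch of a regex split (the variable names are made up for illustration). Like tokenize, a regex split returns empty strings when a separator sits at the start or end of the input, or when two separators are adjacent, which is a good reason to trim and collapse the whitespace first, as the stylesheet does.

```python
import re

line = "--input1=$source.input1 --output1=$output1"

# The pieces between the matches of the pattern, as with tokenize().
parts = re.split(r"\s", line)
print(parts)   # ['--input1=$source.input1', '--output1=$output1']

# Leading, trailing and doubled separators produce empty strings.
padded = re.split(r"\s", " a  b ")
print(padded)  # ['', 'a', '', 'b', '']
```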

The result of the transformation? Here it goes:

<?xml version="1.0" encoding="UTF-8"?>
<cmdsynopsis>
   <command>python</command>
   <arg>sam_to_bam.py</arg>
   <arg>--input1 <replaceable>source.input1</replaceable>
   </arg>
   <arg>--ref_file <replaceable>source.ref_file</replaceable>
   </arg>
   <arg>--ref_file <replaceable>"None"</replaceable>
   </arg>
   <arg>--output1 <replaceable>output1</replaceable>
   </arg>
</cmdsynopsis>

Not perfect (there are still duplicate "--ref_file" arguments), but at least it has parsed out the different arguments and removed some Galaxy-specific stuff (the parts enclosed by "{}") and the conditional statements. At least I think it shows that XSLT + regex is actually an option, don't you think? :)
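One way the duplicate "--ref_file" entries could be dealt with afterwards is to group the values by option name, so that each option shows up once with all its alternatives. The following is only a sketch in Python, not something the stylesheet does: merge_args is a hypothetical helper, and PAIRS is simply the output above typed in by hand as (name, value) pairs.

```python
# The (name, value) pairs from the transformation output, typed in by hand.
# None marks an argument without a value.
PAIRS = [
    ("sam_to_bam.py", None),
    ("--input1", "source.input1"),
    ("--ref_file", "source.ref_file"),
    ("--ref_file", '"None"'),
    ("--output1", "output1"),
]


def merge_args(pairs):
    """Group values by option name, keeping first-seen order (hypothetical helper)."""
    merged = {}
    for name, value in pairs:
        values = merged.setdefault(name, [])
        # Record each distinct value once; valueless arguments keep an empty list.
        if value is not None and value not in values:
            values.append(value)
    return merged


for name, values in merge_args(PAIRS).items():
    print(name, values)
```

With the pairs above, "--ref_file" ends up as a single entry holding both of its alternative values.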

A caveat here though: I found out that most of the XSLT processors available for Ubuntu (xsltproc, Xalan, the one built into PHP 5) don't accept XSLT 2.0 features such as regex, so I ended up using the Java-based Saxon processor.

To run a transformation with it, you simply go (when using the open source "home edition"):

java -jar saxon9he.jar [xml-file] [xslt-file] > [output-file]

Works well! (It does a good job of formatting the XML too.)