This page is for the discussion of XML libraries, modules, facilities, and utilities in Ruby, with an eye toward what might eventually be included in the standard distribution (as well as useful things which might not).
It sounds like we certainly should add one, since there are so many parsers available. I vote for REXML: it's mature and all Ruby.
PragDave: I'm thinking we can use reflection and method_missing to do a far better job of a DOM-like interface. No need for explicit methods to access the structure, just
document = XML.parse($stdin)
puts document.name.first_name
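A minimal sketch of how such an interface might be layered on REXML. The XML.parse call above is the poster's hypothetical API; here a made-up wrapper class, DynamicNode (not part of REXML), forwards unknown method names to child-element lookups. It assumes the rexml library is installed (it is a bundled gem in recent Ruby versions) and ignores attributes, repeated children, and namespaces.

```ruby
require "rexml/document"

# Hypothetical wrapper, not part of REXML: any unknown method name is
# treated as the name of a child element.
class DynamicNode
  def initialize(element)
    @element = element
  end

  # document.name looks up the first <name> child of the wrapped element.
  def method_missing(name, *args)
    child = @element.elements[name.to_s]
    return super if child.nil? || !args.empty?
    DynamicNode.new(child)
  end

  def respond_to_missing?(name, include_private = false)
    !@element.elements[name.to_s].nil? || super
  end

  # The text content of the wrapped element ("" if it has none).
  def to_s
    @element.text.to_s
  end
end

# Sample input invented for illustration.
xml = "<document><name><first_name>Dave</first_name></name></document>"
document = DynamicNode.new(REXML::Document.new(xml).root)
puts document.name.first_name
```

A tag with no matching child falls through to the normal NoMethodError, so typos still fail loudly.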
RossShaw: I have seen and used this type of interface in a couple of non-Ruby XML systems (IBM MQSI and CommerceQuest XMObject). I came across a couple of issues. The first is that an XML tag name can include periods and colons, among other characters. So the first problem is whether a statement will parse at all (in the case of colons), and the next is whether a period is part of a tag name or a step in the tree hierarchy. MQSI fixes this by optionally quoting part of the tag name, like this:
SET document."single.tag".foo = "bar";
<document>
  <single.tag>
    <foo>bar</foo>
  </single.tag>
</document>
Another problem with method_missing: what if you have a root tag whose name happens to match one of the methods in Object, such as Object#class? Do you then have to go and disable all the standard Object methods?
Although this interface would appear to be a nice Ruby alternative I don't think it would hold up when hit with some of the real world XML standards out there.
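One possible way around both issues, sketched under assumptions: inherit from BasicObject, which defines almost no methods, so a tag called "class" reaches method_missing, and offer an explicit [] lookup for names that are not valid Ruby identifiers. BareNode and its methods are hypothetical names, and REXML is used only as the underlying parser.

```ruby
require "rexml/document"

# Hypothetical wrapper. BasicObject lacks Object's methods (class, send,
# display, ...), so tags with those names fall through to method_missing.
class BareNode < BasicObject
  def initialize(element)
    @element = element
  end

  # Explicit lookup for tag names such as "single.tag" or "ns:tag" that
  # cannot be written as a plain Ruby method call.
  def [](tag)
    child = @element.elements.to_a.find { |e| e.expanded_name == tag }
    child && ::BareNode.new(child)
  end

  # Text content of the wrapped element.
  def text
    @element.text
  end

  def method_missing(name, *args)
    self[name.to_s] || super
  end
end

# Sample input invented for illustration.
xml = "<document><class>Widget</class>" \
      "<single.tag><foo>bar</foo></single.tag></document>"
doc = BareNode.new(REXML::Document.new(xml).root)
puts doc.class.text
puts doc["single.tag"].foo.text
```

The collision does not disappear entirely: a tag named "text" would still be shadowed by the wrapper's own method, which is part of why an explicit lookup (or XPath) remains necessary.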
I vote for something which uses XPath.
AviBryant: I played with an interface like that at one point - it allowed access as above, and also had a lot of smart/convoluted ways of building up trees. For example:
Node.foo #produces <foo/>
Node.foo("size"=>1) #produces <foo size="1"/>
Node.bar("hello world") #produces <bar>hello world</bar>
Node.baz("goodnight moon", "size"=>2) #produces <baz size="2">goodnight moon</baz>
As well as, for example:
Node.silly.project("name"=>"xml").goal = "A clean api"
#produces:
# <silly>
#   <project name="xml">
#     <goal>A clean api</goal>
#   </project>
# </silly>
I'll put it up at http://beta4.com/node.rb if anyone wants to take a look.
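For readers without access to that file, here is a rough reconstruction of the four simple forms shown above. It is a sketch guessed from the examples, not the code from node.rb; the chained form with assignment is left out, and the tag_name? helper and ESCAPES table are invented for illustration.

```ruby
# Hypothetical reconstruction: a class-level method_missing turns the
# method name into a tag, a trailing Hash into attributes, and any other
# arguments into text content.
class Node
  ESCAPES = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;", '"' => "&quot;" }.freeze

  def initialize(tag, *args)
    @tag = tag
    @attributes = args.last.is_a?(Hash) ? args.pop : {}
    @children = args
  end

  # Only plain lowercase identifiers become tags; conversion hooks such as
  # to_ary are left alone so the class itself keeps behaving normally.
  def self.tag_name?(name)
    text = name.to_s
    text.match?(/\A[a-z]\w*\z/) && !text.start_with?("to_")
  end

  def self.method_missing(name, *args)
    tag_name?(name) ? new(name.to_s, *args) : super
  end

  def self.respond_to_missing?(name, include_private = false)
    tag_name?(name) || super
  end

  def to_s
    attrs = @attributes.map { |key, value| " #{key}=\"#{escape(value)}\"" }.join
    return "<#{@tag}#{attrs}/>" if @children.empty?
    "<#{@tag}#{attrs}>#{@children.map { |child| escape(child) }.join}</#{@tag}>"
  end

  private

  def escape(value)
    value.to_s.gsub(/[&<>"]/, ESCAPES)
  end
end

puts Node.foo
puts Node.foo("size" => 1)
puts Node.bar("hello world")
puts Node.baz("goodnight moon", "size" => 2)
```

Methods that Class already defines still win, so Node.name returns the class name rather than building a name element. This is the same collision raised earlier in the discussion.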
TobiasReif produced a summary of requirements from mailing list discussions, which I copied here and reformatted.
End of summary of Tobias's page
(version @ my page: http://www.pinkjuice.com/ruby/xml_stdlib_proposal.txt )
Most important of all, why aren't more people expressing their thoughts and preferences here? If you do not make an explicit statement, you will be lumped into the "most people ..." group whether you feel that way or not. ;) The mailing list seems to be dominated by a handful of people simply rephrasing their ideas in each new message. We need more voices.
James Britt (firstname.lastname@example.org)
DOM is not the only standard way of manipulating XML defined by the W3C. XSLT is a much simpler way of transforming XML and is not defined in terms of the DOM. It is becoming increasingly popular and, IMHO, will eventually supplant most uses of the DOM apart from scripting of client-side interactivity.
So, supporting the DOM to make existing XML users feel at home will (a) not make all XML users feel at home, and (b) be increasingly less useful in that respect over time.
I take this back. The DOM, with all its faults, is still easier to use than XSLT. XSLT is one of the worst languages I've ever had the misfortune to use (and I've used COBOL!). On the other hand, XPath is a good way of navigating XML documents. The XPath implementation in REXML is very convenient.
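For anyone who has not tried it, a short example of XPath navigation with REXML. The sample document is made up; REXML::XPath.match, REXML::XPath.first, and REXML::XPath.each are part of REXML's API, which is assumed to be installed.

```ruby
require "rexml/document"

# Sample data invented for illustration.
xml = <<~XML
  <library>
    <book year="1999"><title>First</title></book>
    <book year="2004"><title>Second</title></book>
  </library>
XML
doc = REXML::Document.new(xml)

# All matching nodes, returned as an array.
titles = REXML::XPath.match(doc, "//book/title").map(&:text)

# The first node matching a predicate on an attribute.
title_1999 = REXML::XPath.first(doc, "//book[@year='1999']/title").text

puts titles.join(", ")
puts title_1999

# Iterate over matches without building an intermediate array.
REXML::XPath.each(doc, "//book") { |book| puts book.attributes["year"] }
```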
DOM and XSLT aren't in competition. DOM is about getting XML into a resident object model for whatever use and XSLT is about transformation, usually from one DOM to another based on the rules in an XSL 'stylesheet' --RustyF
XSLT's declarative syntax and functional programming style is an alien paradigm for many programmers in OOP-land (I find it quite intuitive, but that's just the way my mind works). It's also verbose and oddly restricted (although EXSLT, language-specific extensions, and XSLT2.0 address some of these restrictions). Compilation of XSLT stylesheets to native "transformer" objects, and the ability to construct, configure and modify these objects in a more Ruby-friendly way (i.e. bypassing the need to write raw XSLT), would be powerful tools to add to an XML toolkit.
JasonArhart: IMHO DOM is a really good solution only for generic manipulation of XML documents (e.g. XML editors). For modelling specific types of XML documents in-memory I think XML binding is a better way to go. The object model reflects the intent much more accurately and validation can be performed on the tree in memory.
I am working on an XML binding system along the lines of JAXB (i.e. translate an XML Schema document into Ruby classes) for my own purposes, which I plan to make open source if nobody else writes one before I get mine working.
As far as stream parsing goes, I would rather see Ruby standardize on a pull parser than a SAX parser. I find them easier to use, and providing the best of both worlds by wrapping a SAX parser around a pull parser would be trivial.
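A sketch of the wrapping idea, using REXML's existing pull parser as a stand-in for whatever pull API might be standardized. The push_events helper and RecordingHandler class are hypothetical; the callback names merely mirror those of REXML's stream listener. Text is passed through raw, without entity expansion.

```ruby
require "rexml/parsers/pullparser"

# Hypothetical helper: drives a pull parser and pushes each event to a
# SAX-style handler object.
def push_events(xml, handler)
  parser = REXML::Parsers::PullParser.new(xml)
  while parser.has_next?
    event = parser.pull
    if event.start_element?
      handler.tag_start(event[0], event[1]) # tag name, attribute hash
    elsif event.end_element?
      handler.tag_end(event[0])
    elsif event.text?
      handler.text(event[0])                # raw text
    end
  end
end

# Hypothetical handler that records what it is told.
class RecordingHandler
  attr_reader :events

  def initialize
    @events = []
  end

  def tag_start(name, attributes)
    @events << [:start, name, attributes]
  end

  def tag_end(name)
    @events << [:end, name]
  end

  # Whitespace-only text is skipped to keep the record readable.
  def text(value)
    @events << [:text, value] unless value.strip.empty?
  end
end

handler = RecordingHandler.new
push_events("<a><b x='1'>hi</b></a>", handler)
handler.events.each { |event| p event }
```

Going the other way, turning pushed callbacks into a pull loop, would need buffering or a separate thread of control, which is one argument for making the pull parser the primitive.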
For the record, if someone proposes and describes a good Ruby-way SAX API, I'll replace REXML's streaming parser API with it in short order. I'll add any reasonable API (pull parsing, event parsing, etc) that can be effectively described and argued.
Any chance that XML in Ruby will be flexible enough to parse even HTML (not XHTML, which should work flawlessly anyway)? Making the largest database in the world, the WWW, seamlessly available in Ruby would be just great.
Re: XML parsers and HTML. The W3C XML Recommendation is explicit that parsers must fail on non-XML input. The reasoning was to avoid the kind of inconsistent behavior among parsers that we have among web browsers. Better to see about hooking a tool like 'tidy' into a process pipeline to convert sad, ragged HTML into smiling, happy XML.
Will we see Tidy for Ruby soon, then? That would make me smiling and happy...
DossyShiobara: Is there a pure-Ruby implementation of an XSLT engine?
Matz told me that support for Unicode in Ruby is planned for Ruby 1.9.
Berndt Jung: Re:XML Parsers and HTML
I'm currently working on a project which requires me to parse HTML, modify it a bit, and send it back out as HTML. I'm building this thing in Java, but would much rather be using Ruby and Rails for the back-end stuff. The reason I'm using Java is the Xerces XML package and NekoHTML, which is an HTML parser. Having tried a bunch of other HTML parsers (tidy, tagsoup, etc.), I can say that writing a parser that works on all HTML is not trivial work. It would be nice though...
Not sure if any of you have used the XML::Twig library for Perl, but it provides the ease of use of a DOM/XPath model while being stream based, so that you can deal with large files easily. Being a Perl API, it is not very much the "Ruby-way" as is, but it might provide some good ideas. XML::Twig can be found at [XML::Twig]