On Wed, 28 Dec 2005 04:37:33 -0800, asaf.lahav wrote:
> Hi all,
>
[quoted text clipped - 5 lines]
> I would appreciate any pointers...
> Thanks in advance,
This problem is decidedly non-trivial. What do you want to do? Do your
documents necessarily have something in common? Do they allow only some
modifications to be applied (for example, only add / delete subtrees)?
I am not currently aware of an "out of the box" solution for the problem,
but the more specific you are, the more likely someone will be able to
point you in the right direction.
Consider:
<A>
<B foo="bar"/>
<C/>
</A>
------------------------------
<N>
<B foo="bar"/>
<D number="0"/>
</N>
What would the differences be?
* Delete root node, all children, and create new tree.
* Change root node from A to N, delete child C, create child D
* Change root node from A to N, change child C to D, add attribute "number"
Something else still?
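
To make the ambiguity concrete, here is a minimal Python sketch of a naive
diff over the two documents above. It is not an existing tool: the function
`diff`, its `allow_rename` flag, and the textual format of the operations are
all made up for illustration. It assumes children are matched purely by
position and it ignores text content, so it is only one of many possible
matching strategies.

```python
import xml.etree.ElementTree as ET
from itertools import zip_longest


def diff(old, new, path="", allow_rename=True):
    """Return a list of edit operations turning `old` into `new`.

    Assumptions of this sketch: children are paired by position only,
    and element text is ignored.
    """
    here = f"{path}/{old.tag}"
    ops = []
    if old.tag != new.tag:
        if not allow_rename:
            # Without a rename operation the whole subtree is replaced.
            return [f"delete {here}", f"insert {path}/{new.tag}"]
        ops.append(f"rename {here} -> {new.tag}")

    # Compare attributes by name; document order does not matter.
    for name in sorted(old.attrib.keys() | new.attrib.keys()):
        before, after = old.get(name), new.get(name)
        if before is None:
            ops.append(f"add attribute {here}/@{name}={after!r}")
        elif after is None:
            ops.append(f"remove attribute {here}/@{name}")
        elif before != after:
            ops.append(f"change attribute {here}/@{name} to {after!r}")

    # Pair children positionally; leftovers become inserts or deletes.
    for o, n in zip_longest(list(old), list(new)):
        if o is None:
            ops.append(f"insert {here}/{n.tag}")
        elif n is None:
            ops.append(f"delete {here}/{o.tag}")
        else:
            ops.extend(diff(o, n, here, allow_rename))
    return ops


old_doc = ET.fromstring('<A><B foo="bar"/><C/></A>')
new_doc = ET.fromstring('<N><B foo="bar"/><D number="0"/></N>')

print(diff(old_doc, new_doc))
# ['rename /A -> N', 'rename /A/C -> D', "add attribute /A/C/@number='0'"]
print(diff(old_doc, new_doc, allow_rename=False))
# ['delete /A', 'insert /N']
```

The same pair of documents yields two quite different edit scripts depending
on whether a rename operation is allowed, which is why the set of permitted
modifications has to be pinned down first.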

EricF - 31 Dec 2005 06:48 GMT
>On Wed, 28 Dec 2005 04:37:33 -0800, asaf.lahav wrote:
>
[quoted text clipped - 34 lines]
>
>Something else still?
Google for XML Diff
I did some research on this a while back for a project I was working on
and came across a graduate student who was working on this for his
thesis. The more general a solution you need, the harder the problem
becomes. (The grad student had a proof that the variant allowing a
subtree from one document to appear in the second document, possibly
altered, possibly multiple times, is NP-complete.) If you have a
particular schema that both docs are known to be valid against, I
think you're most likely to get the best solution by writing this one
yourself, where you can optimize when appropriate. The XML I was
parsing was actually custom built to serialize a Java object in a
database, so I ended up finding it easier to simply build two objects
and write code that would find the differences between the objects,
since that is a more concrete problem space. But that also depends on
what you're using it for. If your diff is going to be performed a lot in
a performance-intensive environment, parsing and constructing the
objects might end up being slower. Everything's a trade-off.
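
A rough sketch of the object-based approach, written in Python rather than
Java for brevity. Everything here is hypothetical: the `Customer` type, its
fields, the `<customer .../>` layout, and the helper names are invented for
illustration and are not the code from my project. The idea is only that once
both documents are turned into typed objects, the diff reduces to a
field-by-field comparison.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, fields


@dataclass
class Customer:  # hypothetical type standing in for the serialized object
    name: str
    email: str
    credit_limit: int


def customer_from_xml(text):
    """Build a Customer from <customer name=".." email=".." credit_limit=".."/>."""
    node = ET.fromstring(text)
    return Customer(
        name=node.get("name"),
        email=node.get("email"),
        credit_limit=int(node.get("credit_limit")),
    )


def diff_objects(old, new):
    """Map each changed field name to its (old, new) pair of values."""
    changes = {}
    for f in fields(old):
        before, after = getattr(old, f.name), getattr(new, f.name)
        if before != after:
            changes[f.name] = (before, after)
    return changes


old = customer_from_xml(
    '<customer name="Ann" email="ann@example.com" credit_limit="500"/>'
)
# Same data with attributes in a different order and one changed value.
new = customer_from_xml(
    '<customer email="ann@example.com" credit_limit="750" name="Ann"/>'
)

changes = diff_objects(old, new)
print(changes)  # {'credit_limit': (500, 750)}
```

Because the comparison happens on the objects, purely syntactic differences
in the XML (attribute order, whitespace) never show up as changes. The cost
is the extra parse-and-construct step mentioned above.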