Hi,
I have two text files, text1.xml and text2.xml. text1.xml is the
benchmark file and text2.xml is generated by my Java program. In my JUnit
test, I want to compare the generated text2.xml against the benchmark
file text1.xml. The benchmark file text1.xml is much bigger than
text2.xml, because it has other lines in it. All the lines in text2.xml,
except a few lines containing timing information, should find a match in
text1.xml.
In other words, lines 2 to 1101 in text2.xml should be identical with
lines 850 to 1949 in text1.xml; lines 1103 all the way to the end in
text2.xml should be identical with lines from 2799 to the end in text1.xml.
One way to compare is a for loop over these hard-coded line
numbers, comparing line by line. But I hate hard-coded line numbers.
I tried the approach below, but it is too slow:
1) append all the lines in text1.xml into a single huge String, e.g.
benchLine;
2) append lines 2 to 1101 in text2.xml into another huge String,
e.g. patternLine;
3) Pattern pattern = Pattern.compile(".*" + patternLine + ".*"); // use
the huge line to generate a regular expression pattern
pattern.matcher(benchLine).find(); // hope it is true
4) repeat step 2) with lines 1103 to the end in text2.xml, then step 3).
But this is tooooo slow.
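For reference, steps 1) to 4) only ask whether one block of lines occurs contiguously inside another, so a literal substring search can do the same job without the regex engine (unquoted XML text in a regex would also have characters such as "." and "?" treated as metacharacters). A minimal sketch, assuming lines are joined with '\n'; the class and method names are made up for illustration:

```java
import java.util.List;

// Hypothetical helper, not from the original post.
final class LiteralContains {
    /** Joins lines into one String, terminating each line with '\n'. */
    static String join(List<String> lines) {
        StringBuilder sb = new StringBuilder();
        for (String line : lines) {
            sb.append(line).append('\n');
        }
        return sb.toString();
    }

    /** True if blockLines occur as consecutive whole lines in benchLines. */
    static boolean containsBlock(List<String> benchLines, List<String> blockLines) {
        String bench = join(benchLines);
        String block = join(blockLines);
        // Require the match to start at a line boundary, so that the tail of
        // one benchmark line cannot pass for the first line of the block.
        return bench.startsWith(block) || bench.contains("\n" + block);
    }
}
```

String.contains performs a plain character search, so no pattern has to be compiled from the 1100-line block.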
Thank you for your help.
Oliver Wong - 20 Jun 2007 14:38 GMT
[...]
> In other words, lines 2 to 1101 in text2.xml should be identical with
> lines 850 to 1949 in text1.xml; lines 1103 all the way to the end in
[quoted text clipped - 16 lines]
>
> But this is tooooo slow.
If the requirements are "lines 2 to 1101 in text2 should be identical
with lines 850 to 1949 in text1; etc.", then I recommend you hardcode the
line numbers.
If the requirements are "lines 2 to 1101 should be identical with some
portion of text1, but I don't want to specify where exactly", then you
STILL have to hardcode the line numbers "2" and "1101", if not the line
numbers of text1.
So I recommend you hardcode the line numbers.
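A minimal sketch of what the hardcoded comparison could look like, assuming both files have already been read into lists of lines. The class name, constant names and method are invented for illustration; the numbers are the ones quoted in the original post:

```java
import java.util.List;

// Hypothetical helper: compares 1-based, inclusive line ranges.
final class LineRanges {
    // Ranges from the original post, kept as named constants.
    static final int GEN_FIRST = 2;
    static final int GEN_LAST = 1101;
    static final int BENCH_FIRST = 850;

    /**
     * True if lines aFirst..aLast of a equal the same number of lines
     * of b, starting at line bFirst.
     */
    static boolean sameBlock(List<String> a, int aFirst, int aLast,
                             List<String> b, int bFirst) {
        int count = aLast - aFirst + 1;
        int bLast = bFirst + count - 1;
        if (aFirst < 1 || bFirst < 1 || count < 0
                || aLast > a.size() || bLast > b.size()) {
            return false; // a range falls outside one of the files
        }
        // subList takes a 0-based inclusive start and an exclusive end.
        return a.subList(aFirst - 1, aLast).equals(b.subList(bFirst - 1, bLast));
    }
}
```

In a JUnit test this might be called as assertTrue(LineRanges.sameBlock(generated, LineRanges.GEN_FIRST, LineRanges.GEN_LAST, bench, LineRanges.BENCH_FIRST)), with the two lists obtained from something like java.nio.file.Files.readAllLines.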
- Oliver
Christopher Benson-Manica - 20 Jun 2007 15:39 GMT
> One way to compare is a for loop over these hard-coded line
> numbers, comparing line by line. But I hate hard-coded line numbers.
Why not specify them in a properties file, or possibly allow them to be
specified by command-line arguments? That should be fairly easy.
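As a rough illustration of the properties-file idea, the sketch below reads the line numbers with java.util.Properties. The key names (generated.first, generated.last, bench.first) and the class name are assumptions made up for this example, not an established convention:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical holder for line numbers read from a properties file such as:
//   generated.first=2
//   generated.last=1101
//   bench.first=850
final class RangeConfig {
    final int generatedFirst;
    final int generatedLast;
    final int benchFirst;

    RangeConfig(Properties p) {
        generatedFirst = intValue(p, "generated.first");
        generatedLast = intValue(p, "generated.last");
        benchFirst = intValue(p, "bench.first");
    }

    /** Loads the configuration from any Reader, e.g. a FileReader. */
    static RangeConfig load(Reader in) {
        Properties p = new Properties();
        try {
            p.load(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new RangeConfig(p);
    }

    private static int intValue(Properties p, String key) {
        String value = p.getProperty(key);
        if (value == null) {
            throw new IllegalArgumentException("missing property: " + key);
        }
        return Integer.parseInt(value.trim());
    }
}
```

The test then refers to config.generatedFirst and so on, and the numbers can change without recompiling.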

--
C. Benson Manica | I *should* know what I'm talking about - if I
cbmanica(at)gmail.com | don't, I need to know. Flames welcome.
Roedy Green - 29 Jun 2007 15:03 GMT
>One way to compare is a for loop over these hard-coded line
>numbers, comparing line by line. But I hate hard-coded line numbers.
If you don't burn in a line number, you need to search for some
pattern.
Here are some ways you can proceed. This is a Chinese menu:
1. Use the SPLIT utility. Embed commands in your files for SPLIT to
work on to split out the useful sections. Then have it extract the two
pieces that should be identical. SPLIT is very fast. See
http://mindprod.com/products.html#SPLIT
2. Count lines and squirt out the extracted juice to two files. Make
the line counts named constants so you won't feel so guilty, or
perhaps calculated named constants to assuage your guilt even further.
3. Scan for patterns using indexOf or regex and split the file.
See http://mindprod.com/jgloss/string.html
http://mindprod.com/jgloss/regex.html
4. Calculate a checksum of just the interesting parts and compare the
checksums. See http://mindprod.com/products1.html#UNTOUCH for code to
calculate a fast Adlerian checksum. See
http://mindprod.com/jgloss/adler.html
http://mindprod.com/jgloss/md5.html
5. Compare the two extracts with the Microsoft FC (File Compare) utility.
6. Compare the two parts by first comparing length. Then read and
compare chunk by chunk, reading raw bytes. You just want a boolean,
not the offset or the text differences. See
http://mindprod.com/products1.html#HUNKIO
7. Feed the two extracts into a DIFF program.
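Options 3 and 4 can be roughed out with the standard library alone. The sketch below is only an illustration, not the code behind the links above: it uses Collections.indexOfSubList to locate a block of lines without knowing its line number in the benchmark file, and java.util.zip.Adler32 to checksum an extract. The class and method names are invented, and UTF-8 is an assumed encoding:

```java
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.List;
import java.util.zip.Adler32;

// Hypothetical helper, written for this example only.
final class BlockSearch {
    /**
     * Returns the 1-based line number in bench where block first occurs
     * as consecutive lines, or -1 if it does not occur.
     */
    static int findBlock(List<String> bench, List<String> block) {
        int index = Collections.indexOfSubList(bench, block);
        return index < 0 ? -1 : index + 1;
    }

    /** Adler-32 checksum over the lines, each terminated by '\n'. */
    static long checksum(List<String> lines) {
        Adler32 sum = new Adler32();
        for (String line : lines) {
            sum.update(line.getBytes(StandardCharsets.UTF_8));
            sum.update('\n'); // keeps ["ab", "c"] distinct from ["a", "bc"]
        }
        return sum.getValue();
    }
}
```

Differing checksums prove the extracts differ; equal checksums only make a match very likely, so a line-by-line comparison is still the stricter test.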
--
Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com