Consider a simple finite state automaton to parse property files.
They look like this:
# a comment
keyword=value
I want to categorise each fragment of text as either comment, keyword
or value. Now throw in a complication. Inside any of those three
things might be literals of the form \uffff
I find myself creating all kinds of rinky dink mechanisms to handle
the literals. I wondered if there is a clean way to do it.
There are two problems.
1) It is clumsy to invent three literal states, one for in-comment, one
for in-keyword and one for in-value, just so the automaton can remember
what it was doing. Yet the whole idea of a finite state automaton is
that the memory of the system is supposed to be encapsulated in the state.
2) You leave the literal state based on a count, not the presence of
some delimiter. I could create 5 states to mark progress down the
literal, but this seems a bit nuts.

Signature
Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.
Roedy,
Why not run the property file through a pre-processor to handle escape
sequences, similar to what javac does? After all, the standard property
file format supports \\ and \ followed by a line break for line
continuation and who knows what else....
HTH,
Ray

Signature
XML is the programmer's duct tape.
Stefan Ram - 31 Dec 2005 21:11 GMT
>Why not run the property file through a pre-processor to handle
>escape sequences, similar to what javac does?
You mean a preprocessor like this?
native2ascii -reverse
See
http://download.java.net/jdk6/docs/tooldocs/windows/native2ascii.html
Stefan Ram - 31 Dec 2005 21:22 GMT
Raymond DeCampo <nospam@twcny.rr.com> was quoting:
>>I want to categorise each fragment of text as either comment, keyword
>>or value. Now throw in a complication. Inside any of those three
>>things might be literals of the form \uffff
>>I find myself creating all kinds of rinky dink mechanisms to handle
>>the literals. I wondered if there is a clean way to do it.
The clean way is a scanner with two layers:
The first layer converts each \u-sequence to a code point.
The second layer then reads code points supplied by the first
layer and does not have to care about the \u-sequences
anymore.
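
A minimal sketch of the two-layer idea, not anyone's actual code from
this thread. The class and method names (TwoLayerScanner, decodeEscapes,
classify, Fragment) are made up for illustration. It assumes '#' comments
and '=' separators only, and it ignores \\ escaping and line continuation,
which a real property file parser would have to handle.

```java
import java.util.ArrayList;
import java.util.List;

public class TwoLayerScanner {
    enum Kind { COMMENT, KEYWORD, VALUE }

    /** One categorised fragment of text (hypothetical helper type). */
    static final class Fragment {
        final Kind kind;
        final String text;
        Fragment(Kind kind, String text) { this.kind = kind; this.text = text; }
        @Override public String toString() { return kind + ":" + text; }
    }

    // Layer 1: replace each complete backslash-u escape (four hex digits)
    // with the char it denotes. Anything else passes through unchanged.
    static String decodeEscapes(String in) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < in.length()) {
            if (in.startsWith("\\u", i) && i + 6 <= in.length() && isHex(in, i + 2, i + 6)) {
                out.append((char) Integer.parseInt(in.substring(i + 2, i + 6), 16));
                i += 6;
            } else {
                out.append(in.charAt(i++));
            }
        }
        return out.toString();
    }

    static boolean isHex(String s, int from, int to) {
        for (int i = from; i < to; i++) {
            if (Character.digit(s.charAt(i), 16) < 0) return false;
        }
        return true;
    }

    // Layer 2: a plain state machine over already-decoded chars.
    // It needs no literal states because layer 1 removed the escapes.
    static List<Fragment> classify(String decoded) {
        List<Fragment> out = new ArrayList<>();
        StringBuilder buf = new StringBuilder();
        Kind state = null; // null means "at the start of a line"
        for (int i = 0; i < decoded.length(); i++) {
            char c = decoded.charAt(i);
            if (c == '\n') {
                if (state != null) out.add(new Fragment(state, buf.toString()));
                buf.setLength(0);
                state = null;
            } else if (state == null) {
                state = (c == '#') ? Kind.COMMENT : Kind.KEYWORD;
                buf.append(c);
            } else if (state == Kind.KEYWORD && c == '=') {
                out.add(new Fragment(Kind.KEYWORD, buf.toString()));
                buf.setLength(0);
                state = Kind.VALUE;
            } else {
                buf.append(c);
            }
        }
        if (state != null) out.add(new Fragment(state, buf.toString()));
        return out;
    }

    public static void main(String[] args) {
        String source = "# caf\\u00e9\nna\\u006de=v\\u0041l\n";
        for (Fragment f : classify(decodeEscapes(source))) {
            System.out.println(f);
        }
    }
}
```

For the sample in main, the second layer sees one comment line, then the
keyword "name" and the value "vAl". One caveat of decoding first: an
escape that decodes to a newline, '#' or '=' would change how the second
layer splits the text.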
Raymond DeCampo - 01 Jan 2006 01:21 GMT
Gee, thanks for replying to my post, removing my contribution, removing
the OP's name making it seem as if I wrote what the OP did to the casual
observer, and then re-stating my idea. That was really helpful.
Ray

Roedy Green - 02 Jan 2006 16:26 GMT
On Sat, 31 Dec 2005 20:46:42 GMT, Raymond DeCampo
<nospam@twcny.rr.com> wrote, quoted or indirectly quoted someone who
said :
>Why not run the property file through a pre-processor to handle escape
>sequences, similar to what javac does? After all, the standard property
>file format supports \\ and \ followed by a line break for line
>continuation and who knows what else....
I considered that, but I wanted to display the file literally. If the
file contained characters embedded in raw binary form, I wanted to
display them differently from ones properly encoded as \uxxxx.
I have since solved the problem with a kludge: a lookahead that handles
the entire sequence as if it were a single char from the overall state
machine's point of view.
You can see it working at http://mindprod.com/jgloss/properties.html
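
For readers who want to see the shape of such a lookahead, here is a
rough sketch. It is a guess at the general technique, not the code behind
the page linked above. The names (LookaheadScanner, escapeAt, scan, Token)
are invented for illustration, and it assumes '#' comments, '=' separators
and escapes of exactly four hex digits.

```java
import java.util.ArrayList;
import java.util.List;

public class LookaheadScanner {
    enum Kind { COMMENT, KEYWORD, VALUE, ESCAPE }

    /** One token of the literal source text (hypothetical helper type). */
    static final class Token {
        final Kind kind;
        final String text;
        Token(Kind kind, String text) { this.kind = kind; this.text = text; }
        @Override public String toString() { return kind + ":" + text; }
    }

    // Lookahead: does a complete escape (backslash, 'u', four hex digits)
    // start at index i?
    static boolean escapeAt(String s, int i) {
        if (i + 6 > s.length() || s.charAt(i) != '\\' || s.charAt(i + 1) != 'u') return false;
        for (int k = i + 2; k < i + 6; k++) {
            if (Character.digit(s.charAt(k), 16) < 0) return false;
        }
        return true;
    }

    static void flush(List<Token> out, Kind kind, StringBuilder buf) {
        if (kind != null && buf.length() > 0) out.add(new Token(kind, buf.toString()));
        buf.setLength(0);
    }

    static List<Token> scan(String s) {
        List<Token> out = new ArrayList<>();
        StringBuilder buf = new StringBuilder();
        Kind state = null; // null means "at the start of a line"
        int i = 0;
        while (i < s.length()) {
            if (escapeAt(s, i)) {
                // Consume the whole escape as one unit. The main state is
                // left untouched, so no per-context literal states and no
                // counting states are needed.
                if (state == null) state = Kind.KEYWORD;
                flush(out, state, buf);
                out.add(new Token(Kind.ESCAPE, s.substring(i, i + 6)));
                i += 6;
                continue;
            }
            char c = s.charAt(i);
            if (c == '\n') {
                flush(out, state, buf);
                state = null;
            } else if (state == null) {
                state = (c == '#') ? Kind.COMMENT : Kind.KEYWORD;
                buf.append(c);
            } else if (state == Kind.KEYWORD && c == '=') {
                flush(out, state, buf);
                state = Kind.VALUE;
            } else {
                buf.append(c);
            }
            i++;
        }
        flush(out, state, buf);
        return out;
    }

    public static void main(String[] args) {
        for (Token t : scan("# caf\\u00e9\nna\\u006de=v\\u0041l\n")) {
            System.out.println(t);
        }
    }
}
```

Each escape comes out as its own ESCAPE token holding the literal six
characters, so it can be displayed differently from the surrounding
comment, keyword or value text, while a raw character in the file simply
stays inside its enclosing token.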
