On Mar 26, 5:41 pm, "Aryeh M. Friedman" <Aryeh.Fried...@gmail.com>
wrote:
> > > now the question is how can I make it so sym[...] is T or State<T> the
> > > only thing I can think of is something like the following syntex
[quoted text clipped - 90 lines]
>
> --Aryeh
I just looked at my copy of GoG and I do not see how a builder solves
the problem since I would still need to do the instanceof in the inner
State.consume(T sym,Queue<T> q). Besides if I am right that all the
states I listed+sink+mem via the queue is sufficient to make a UTM I
do not see any need for sub classes. Namely any subclasses would
actually be tokenizers (i.e. define some composite state machine
[normally a SequenceState where the members of the sequence are
various states]) and then pass the raw output to it (allow the output
sink to collect the token sequence). For example for the sake of
arguement I can define Java (at least at the top level)
class:package,(import),classProper;
package: "package", dotIdentifier
dorIdentifier: identifier,('.',idenetifier)*;
etc. (am I sorry if I am using adhoc EBNF but I don't know EBNF well
enough to do it from memory)
which means I do something like this in the tokenizer for class:
pubklic ClassMachine extends StateMachine
{
public classMachine()
{
machine=new SequenceMachine<Character>(); // T =
Character since that denotes the input and output sink it will create
machine.addSymbol(new PackageMachine(),true); // 2nd arg
states is if is manidtory
machine.addSymbol(new ImportMachine(),false);
machine.addSymbol(new ClassProperMachine(),true);
}
}
and to parse any Java source code I do:
public class StateMachine
{
...
public void consume(List<Character> text)
throws BadStateException
{
machine.consume(text); // will through an
BadStateException on syntax error (i.e. we arrive at a null state
before the end of input)
}
}
and I just continue recursivelly to define the state machine until I
get to a machine that can understand indivual characters (such as
DigitMachine or LetterMachine for example)
I know if I am only dealing with one formal grammer it is easier to
hard code for it and in general ANTLR or something like it is used to
build parsers.
The "big picture" motivation of wanting to parse this way:
Both approachs have one problem they completely loose the power of OO
in parsing... for example 90% of C/C++/Java/C# are identical so
instead of reinventing the wheel why not have them
inherit some C family parser and then just customize a few state
machines for the particular lang (expression (ptr vs. ptr], method
def, keyword list, etc.). If you take this to the logical extreme the
C family parser inherits the Algol family parser which inherits the
formal grammer parser which inherits the ascii parser which inherits
the unicode parser (which for completeness inherits the null parser).
Thus you can build a single parser frame work for all parsable langs
(all non-natural ones for the most part). One intresting side effect
of this "super" parser is I can intermix langs in the same logical
unit (file) and if the concrete parser knows how handle a subset of
the langs it can ignore input it doesn't understand. Now if it
understands 100% of the langs used in a logical unit and has a shared
symbol table between them we get a compilor/interpreter front end that
allows you to use multiple langs in the same class and/or compilation
unit depending on "master" lang orientation. The other thing it
let's us do is if we have a system wide rule that says all data has to
be in some data rep lang like XML then the partial understanding
hinted above can be extented all the way to the end-user OS level
(i.e. redirection, network streams and pipes where not all the input
data need be parsable by the recieving process). This combined with
some other ideas (persistent objects instead of files and using what
Mozart [a strange little lang] calls delayed binding values to control
process scheduling) that are not related to this directly is the
organizing prinicable behind a research OS I am working on.
--Aryeh
Aryeh M. Friedman - 27 Mar 2007 05:08 GMT
Oops forgot to mention if anyone wants to see the general idea (but
hard coded for a given lang) look at the template processing in
parser.tokenizer.Statement.doTemplate in RITE (a toy lang I wrote to
teach my self how to write languages and teach programming to brand
new CS majors at the Jr. College level). http://www.flosoft-systems.com/rite
(sorry for the plainness of the site) also since all materials
(including the attached text book) are still vary much in draft (RITE
it self is in beta) form and under a Sustainable License (see
http://www.miai-siw.org) any contributions will be compensated in some
way or an other.
--Aryeh