Java Forum / General / February 2007
Parens Matching
Ken Kast - 16 Feb 2007 03:34 GMT Does anyone have a regular expression pattern that would include a test for balanced parens of arbitrary nestedness?
IchBin - 16 Feb 2007 04:01 GMT > Does anyone have a regular expression pattern that would include a test for > balanced parens of arbitrary nestedness? Is arbitrary 'nestedness' some new type of epistemological terminology?
Sorry, just could not stop myself.
 Signature Thanks in Advance... http://weconsultants.prophp.org IchBin, Pocono Lake, Pa, USA http://ichbinquotations.awardspace.com ______________________________________________________________________ 'If there is one, Knowledge is the "Fountain of Youth"' -William E. Taylor, Regular Guy (1952-)
Ken Kast - 16 Feb 2007 04:54 GMT Nope. Something I made up in order to make the message not so dry.
>> Does anyone have a regular expression pattern that would include a test >> for balanced parens of arbitrary nestedness? > > Is arbitrary 'nestedness' some new type of epistemological terminology? > > Sorry, just could not stop myself. Michael - 16 Feb 2007 07:12 GMT > Does anyone have a regular expression pattern that would include a test for > balanced parens of arbitrary nestedness? No. No one does.
There's a well known proof in Computer Science theory that it is not possible to create a regular expression for balanced parentheses.
Look up "the pumping lemma" for details.
Michael
Mark Jeffcoat - 16 Feb 2007 12:41 GMT >> Does anyone have a regular expression pattern that would include a test for >> balanced parens of arbitrary nestedness? [quoted text clipped - 3 lines] > There's a well known proof in Computer Science theory that it is not > possible to create a regular expression for balanced parentheses. Michael is quite right. Of course, that proof is silent on the subject of creating a regular expression that accepts balanced strings with at most Long.MAX_VALUE open parens. (And really, most texts will have fewer than half that many.)
While you're waiting for the program you wrote to write that regular expression to finish, you can pass the time with a counter that increments on '(', and decrements on ')'.
 Signature Mark Jeffcoat Austin, TX
Alex Hunsley - 17 Feb 2007 12:21 GMT >>> Does anyone have a regular expression pattern that would include a test for >>> balanced parens of arbitrary nestedness? [quoted text clipped - 6 lines] > on the subject of creating a regular expression that accepts > balanced strings with at most Long.MAX_VALUE open parens. Although in theory you could actually count this many with a single long:
Long.MAX_VALUE - Long.MIN_VALUE + 1
I have a feeling this expression is equal to:
- (2 * Long.MIN_VALUE)
Tim Slattery - 16 Feb 2007 14:15 GMT >> Does anyone have a regular expression pattern that would include a test for >> balanced parens of arbitrary nestedness? [quoted text clipped - 3 lines] >There's a well known proof in Computer Science theory that it is not >possible to create a regular expression for balanced parentheses. OTOH, it would be pretty simple to write a loop that cycles through a string and keeps track of left and right parens. Add 1 to the counter when you find a left paren, subtract one for a right paren. If the counter ever goes negative, throw an error. If it's not zero at the end of the string, throw an error.
-- Tim Slattery Slattery_T@bls.gov http://members.cox.net/slatteryt
Michael - 16 Feb 2007 19:14 GMT > >There's a well known proof in Computer Science theory that it is not > >possible to create a regular expression for balanced parentheses. [quoted text clipped - 4 lines] > counter ever goes negative, throw an error. If it's not zero at the > end of the string, throw an error. I guess I should have spelled it out further - you can't match balanced parentheses *with a regular expression.* Of course, there are other ways to do it. Either write the kind or program above, or write a recursive decent parser, (or use a regular expression that only matches up to N balanced parentheses) etc. Basically, balanced parentheses aren't a regular grammar, they're a context free grammar, so you need a more powerful tool. In particular, they're a grammar somthing like this: Expression -> nothing | ( Expression )
Sorry if I caused more confusion that I removed. It was late.
Michael
Alex Hunsley - 17 Feb 2007 01:27 GMT >>> Does anyone have a regular expression pattern that would include a test for >>> balanced parens of arbitrary nestedness? [quoted text clipped - 8 lines] > counter ever goes negative, throw an error. If it's not zero at the > end of the string, throw an error. Although even this isn't correct, in theory... due to the finite memory of the computer, it is possible to give it a sequence that should pass (the brackets match), but won't pass, because the computer runs out of memory. Ok, a little silly, but...
> -- > Tim Slattery > Slattery_T@bls.gov > http://members.cox.net/slatteryt Lew - 17 Feb 2007 04:36 GMT > Although even this isn't correct, in theory... due to the finite memory > of the computer, it is possible to give it a sequence that should pass > (the brackets match), but won't pass, because the computer runs out of > memory. Ok, a little silly, but... Not so silly, really. It points to a valuable principle of workaday programming, nay, engineering overall. When the theory says there is no perfect algorithm, the practice suggests to use an imperfect one. You code to handle the realistic cases and discontinuously reject inputs that exceed the algorithm's ability, like the memory-hog example.
This way of thinking also leads one to jump out of "What regular expression?" into "What algorithm?"
- Lew
Alex Hunsley - 17 Feb 2007 12:19 GMT >> Although even this isn't correct, in theory... due to the finite memory >> of the computer, it is possible to give it a sequence that should pass [quoted text clipped - 9 lines] > This way of thinking also leads one to jump out of "What regular > expression?" into "What algorithm?" Good points. I have also touched on this subject with a reply I just wrote to a post from Daniel. In Comp Sci. proper, when you're talking about bracket matching, you want a machine that solves it for *any* input. So in that sense, it can't be done by a normal real world machine...
lex
> - Lew Chris Uppal - 17 Feb 2007 17:15 GMT > Although even this isn't correct, in theory... due to the finite memory > of the computer, it is possible to give it a sequence that should pass > (the brackets match), but won't pass, because the computer runs out of > memory. Ok, a little silly, but... You can do it in constant space for arbitrary length inputs /if/ you are allowed to use the input as working space too -- just replace '()' by '' until there's nothing left.
A more elegant method would be to keep the unclosed-bracket count in unbounded integer representation (BigInteger-style), stored in (overwriting) the portion of the input sequence which has already been scanned.
-- chris
Alex Hunsley - 18 Feb 2007 20:08 GMT >> Although even this isn't correct, in theory... due to the finite memory >> of the computer, it is possible to give it a sequence that should pass [quoted text clipped - 4 lines] > allowed to use the input as working space too -- just replace '()' by '' until > there's nothing left. Yay! That's the Turing Machine way. Overwrite the tape... There was something perversely satisfying about 'writing' Turing machines at uni to solve problems like multiplication of two numbers (in binary). The first and original "virtual machine"!
I love the idea that it wouldn't be hard to build an actual (finite!) Turing machine out of bits of something or other, and have toyed with that idea at various times.
> A more elegant method would be to keep the unclosed-bracket count in unbounded > integer representation (BigInteger-style), stored in (overwriting) the portion > of the input sequence which has already been scanned. Alex Hunsley - 17 Feb 2007 01:26 GMT >> Does anyone have a regular expression pattern that would include a test for >> balanced parens of arbitrary nestedness? [quoted text clipped - 5 lines] > > Look up "the pumping lemma" for details. That thang always made me think of either pumping lemons or pumping lemmings.
The matched parenthesis problem can be done with a non-deterministic state machine, but not with a deterministic one.... hence no regular expression can do it.
Daniel Pitts - 17 Feb 2007 01:44 GMT > >> Does anyone have a regular expression pattern that would include a test for > >> balanced parens of arbitrary nestedness? [quoted text clipped - 12 lines] > state machine, but not with a deterministic one.... hence no regular > expression can do it. Hmm, I don't think your definition is correct. Regex isn't always implementedwith a DSM, it might use NDA, which still doesn't solve it.
In either case, the following is a deterministic state machine solves the problem.
private static boolean isBalancedCount(String string) { int parens = 0; for (char c : string.toCharArray()) { if (c == '(') { ++parens; } if (c == ')' && --parens < 0) { return false; } } return parens == 0; }
Alex Hunsley - 17 Feb 2007 12:16 GMT >>>> Does anyone have a regular expression pattern that would include a test for >>>> balanced parens of arbitrary nestedness? [quoted text clipped - 12 lines] > Regex isn't always implementedwith a DSM, it might use NDA, which > still doesn't solve it. What do you mean by NDA? Non-deterministic automaton?
> In either case, the following is a deterministic state machine solves > the problem. [quoted text clipped - 11 lines] > return parens == 0; > } Nope, it doesn't solve the problem. The int type in Java has finite capacity - see Integer.MAX_VALUE - so more accurately this a finite deterministic state machine. I'm actually trying to remember if 'deterministic state machine' usually implies the 'finite' part in common usage. A look at wikipedia seems to suggest that, but I wasn't looking at it for long...
So, anyway, I can think of an input that causes the above code to fail - and hence problem not solved.
To give a simplified example of this, suppose we have a Java where Integer.MAX_VALUE = 1, and .MIN_VALUE = -2. So an int can have 4 distinct values for int. Assuming this, would would your code save about the following inputs?
(((( - this would be a false positive (((((((( - another false positive ((())) - this would be a false negative
Similarly, your code above in the world of real Java would give a false positive for the input:
'(' * [(Integer.MAX_VALUE - Integer.MIN_VALUE + 1) * n]
where n is any positive integer.
Basically, true parens matching is scuppered, because of the need for infinite memory!
lex
Daniel Pitts - 18 Feb 2007 18:58 GMT > >>>> Does anyone have a regular expression pattern that would include a test for > >>>> balanced parens of arbitrary nestedness? [quoted text clipped - 61 lines] > > lex (((( and (((((((( both give the correct output. ((())) also results in the correct output. Hmm, are you sure you analyzed my algorithm correctly?
As for the case of: '(' * [(Integer.MAX_VALUE - Integer.MIN_VALUE + 1) * n]
You're talking about at least a 4 gigabyte string. As a mater of practicality, I find that type of input to be unreasonable, but it would be simple enough to change the code to hand off to a different algorithm:
private static boolean isBalancedCount(String string) { int parens = 0; for (char c : string.toCharArray()) { if (c == '(' && parens++ == Integer.MAX_INT) { // We can't handle the trueth, but isBalancedRe can! return isBalancedRe(string); } if (c == ')' && --parens < 0) { return false; } } return parens == 0; }
Alex Hunsley - 18 Feb 2007 19:52 GMT >> To give a simplified example of this, suppose we have a Java where >> Integer.MAX_VALUE = 1, and .MIN_VALUE = -2. So an int can have 4 [quoted text clipped - 20 lines] > ((())) also results in the correct output. Hmm, are you sure you > analyzed my algorithm correctly? I wasn't talking about your code running under normal Java, I was talking about a pretend one with very small Integer.MAX_VALUE etc. To quote myself: "To give a simplified example of this, suppose we have a Java where ... [stuff] ... assuming this, what would your code [say] about the following inputs?" I was making a simplification to make a more accessible toy example where it was easy to play with the cases where it would produce the wrong answers for 'extreme' cases.
> As for the case of: > '(' * [(Integer.MAX_VALUE - Integer.MIN_VALUE + 1) * n] [quoted text clipped - 3 lines] > would be simple enough to change the code to hand off to a different > algorithm: I think we're coming back to the issue mentioned before about theory versus practicality: I do know that your code works for all practical purposes - it's correct for all practical purposes - as opposed to the computer science theoretic sense in which bracket matching can't be done (by a regular expression or by a FSA or by a finite computer).
I'm just being too theoretic for my own good, is all....
lex
Oliver Wong - 21 Feb 2007 17:21 GMT >>> The matched parenthesis problem can be done with a non-deterministic >>> state machine, but not with a deterministic one.... hence no regular >>> expression can do it. I disagree, because a proof (by construction) exists which shows that any non-deterministic state machine can be converted to a determinisitc state machine. If the NDSM has K states, then DSM will have 2^K states. If you allow for infinite NDSM, it only seems fair that you also allow for infinite DSMs.
>> Hmm, I don't think your definition is correct. >> Regex isn't always implementedwith a DSM, it might use NDA, which >> still doesn't solve it. > > What do you mean by NDA? Non-deterministic automaton? Probably. I'll come back to this in a moment later in this post.
>> In either case, the following is a deterministic state machine solves >> the problem. [quoted text clipped - 18 lines] > usage. A look at wikipedia seems to suggest that, but I wasn't looking at > it for long... The usually, when someone talks about deterministic automatons and non-deterministic automatons, they mean finite ones. So the usual abbreviations are DFA and NDFA. We can, of course, talk about infinite automatons, but there seems to be less literature on that topic (perhaps because it's equivalent to, but more complex than, some other machine, e.g. a push down stack machine or a Turing Machine or something, I don't know).
So if you're talking about finite DAs and Nodes, then they cannot solve the "balance parentheses" problem. If you're talking about infinite ones, then if a NDA can solve it, so can a DA. And if a NDA cannot solve it, neither can a DA. The two are equivalent in power.
The interests in NDAs is that they take up less "code" to represent (recall that a K state NDA translates into a 2^K state DA), *and* (and here's the part that makes them interesting to study nowadays) a quantum computer can run NDAs "natively", whereas a classical computer would have to emulate the NDA by converting it (either implicitly or explicitly) into a DA, and gain that 2^K performance penalty factor.
- Oliver
Daniel Pitts - 21 Feb 2007 18:03 GMT > >>> The matched parenthesis problem can be done with a non-deterministic > >>> state machine, but not with a deterministic one.... hence no regular [quoted text clipped - 57 lines] > > - Oliver Its important to say that an NDA with K states has an equivalent DA with at most 2^K states. The set of all NDA contains the set of all Deterministic Automota. Determinism is a property of the stateful automata.
For example, if you have the simple regex "abcd", the NDA and DA produce the same graph, since it is only possible to be in one state at a time. (s)->a->b->c->d->(e)
Whereas "a[bc]d" produces the NDA (s)->a / \ b c \ / d->(e) and the DA (s)->a->b and c->d->(e)
The DA actually has fewer "states" than the NDA inthis case (as b and c are combined)
Christian - 21 Feb 2007 19:03 GMT Daniel Pitts schrieb:
>>>>> The matched parenthesis problem can be done with a non-deterministic >>>>> state machine, but not with a deterministic one.... hence no regular [quoted text clipped - 72 lines] > The DA actually has fewer "states" than the NDA inthis case (as b and > c are combined) for me important would be to say that java's regexp seem not to be equivalent to regexp in science (which have the same power as DFA or NFA or epsilon-FA)
javas regexp allow you to backreference something you mached earlier which might make them as powerful as a ContexFree-grammar/PDA (may be I don't know, just a guess)
so it seems for me currently not proven that you can't do this parens matching... but this seems to be rather scientific intrest...there are better ways to do it...
Oliver Wong - 21 Feb 2007 19:09 GMT [long discussion snipped]
> Daniel Pitts schrieb: >> >> Its important to say that an NDA with K states has an equivalent DA >> with at most 2^K states. Right.
>> The set of all NDA contains the set of all >> Deterministic Automota. Determinism is a property of the stateful >> automata. Right, that's what I intended to convey.
[...]
> for me important would be to say that java's regexp seem not to be > equivalent to regexp in science (which have the same power as DFA or NFA > or epsilon-FA) Most modern languages's regexp implementation seem to be derived from Perl's regexp implementation, which was already more powerful than "classical" regular expressions, DFAs and NFAs.
- Oliver
Gordon Beaton - 22 Feb 2007 09:16 GMT > Most modern languages's regexp implementation seem to be derived > from Perl's regexp implementation, which was already more powerful > than "classical" regular expressions, DFAs and NFAs. Most Perl regular expressions can be rewritten using "classical" regexps, IIUC it's only backreferences that can't.
On the other hand, the Perl (Java, Python, Ruby, etc) regexp implementation is extremely *slow*:
http://swtch.com/~rsc/regexp/regexp1.html
/gordon
 Signature [ don't email me support questions or followups ] g o r d o n + n e w s @ b a l d e r 1 3 . s e
Chris Uppal - 22 Feb 2007 13:22 GMT > On the other hand, the Perl (Java, Python, Ruby, etc) regexp > implementation is extremely *slow*: > > http://swtch.com/~rsc/regexp/regexp1.html Thank you! I had come across that page recently, and was just wondering how I was going to find it again to mention here...
(One caveat: he says that the NFA-based approaches are "little-known" -- that seems something of an overstatement to me. I doubt if there are many people with an interest in the /implementation/ of regexps who don't know about NFA/DFA construction, at least in general terms. If all he means is that most /users/ of Perl-style regexps don't realise that there's a performance price for the complicated hacks, then I have no disagreement.)
-- chris
Gordon Beaton - 22 Feb 2007 13:34 GMT > Thank you! I had come across that page recently, and was just > wondering how I was going to find it again to mention here... For me it was exactly the opposite - I saw it on the erlang-questions list just yesterday and had to figure out where I had seen *this* thread in order to post it...
/gordon
 Signature [ don't email me support questions or followups ] g o r d o n + n e w s @ b a l d e r 1 3 . s e
Oliver Wong - 22 Feb 2007 15:09 GMT > For example, if you have the simple regex "abcd", the NDA and DA > produce the same graph, since it is only possible to be in one state [quoted text clipped - 11 lines] > The DA actually has fewer "states" than the NDA inthis case (as b and > c are combined) I was thinking about these comments a bit more, and I realized I don't really understand your notation. In particular, I have no idea how to read your DA. What you call an NDA looks like a DA to me. That is,
(s)->a / \ b c \ / d->(e)
looks like a *deterministic* finite automaton, in that there's only ever at most 1 possible transition you can take given the next input token and the current state. To make it an NDA, you'd probably need something like:
(s)->a /|\ / | \ b b c | | | h i j \ | / \|/ d->(e)
(which implements the regexp: "a(bh|bi|cj)d")
where if you're at the top state, and you see a b, you don't know whether to take the left one or the middle one.
But even then, I think (but cannot recall precisely) that the way the terms NFA and DFA were defined to me, every DFA is considered to also be an NFA (and so perhaps the N should stand for "not necessarily deterministic" rather than "non-determistic"), but not every NFA is also a DFA.
- Oliver
Chris Uppal - 22 Feb 2007 16:18 GMT > I was thinking about these comments a bit more, and I realized I don't > really understand your notation. In particular, I have no idea how to read > your DA. I think Daniel's notation is non-standard, but it does make seem to make sense. What he's doing is removing the states themselves from the picture, leaving only the trails of input words that have been recognised.
So in:
> Whereas "a[bc]d" produces the NDA > (s)->a > / \ > b c > \ / > d->(e) the first 'a' is not a state label, but marks the fact that we have accepted an 'a'. The actual state we are in has no name (in this notation) but is sort of hidden "just after" the label 'a'. From that state we can transition by accepting 'b' to an anonymous state hidden behind the label 'b', or can transition by accepting 'c' to the anonymous state hidden behind the 'c' label. And from either of those states we can accept 'd' to go to the corresponding state, and since that is a final state, we have matched the input. And the NFA does indeed recognise a(b|c)d.
I don't think the notation covers all possible cases, for all it is nice and clear for the cases it /does/ cover. The problems comes when one state has two or more incoming edges with different labels (including epsilon transitions):
(S0)---x---> (S1) | | a y | | v | (S2)---b--->[S3]
(I hope my ASCII art is legible). S0 is the start state, from which we can transition via 'x' to S1, or via 'a' to S2. From S1 and S2 we can transition via 'b' or 'y' (respectively) to S3, the accepting state. Thus it recognises (ab|xy). I don't think that can be represented in Daniel's notation without changing the definition of the NFA itself.
I don't know whether every regexp can be represented as /some/ NFA in Daniel's notation. I haven't been able to think of a counter example yet, although the NFA is sometimes more awkward than one which can be written in the classic diagram format.
> But even then, I think (but cannot recall precisely) that the way the > terms NFA and DFA were defined to me, every DFA is considered to also be > an NFA (and so perhaps the N should stand for "not necessarily > deterministic" rather than "non-determistic"), but not every NFA is also > a DFA. It depends on how you see it. Every DFA corresponds trivially to an NFA with the same structure, so in that sense the DFAs are a subset of the NFAs. But the formal definition of a DFA is different from that of a NFA, since it has a /single/' state at any one time rather than a set of states. And of course type Set<State> is not assignable to, or from, type State.
-- chris
Daniel Pitts - 22 Feb 2007 18:49 GMT On Feb 22, 8:18 am, "Chris Uppal" <chris.up...@metagnostic.REMOVE- THIS.org> wrote:
> > I was thinking about these comments a bit more, and I realized I don't > > really understand your notation. In particular, I have no idea how to read [quoted text clipped - 3 lines] > What he's doing is removing the states themselves from the picture, leaving > only the trails of input words that have been recognised. It is indeed non-standard. I was trying to represent transitions more than states, and was trying to do so concisely.
> So in: > [quoted text clipped - 35 lines] > NFA is sometimes more awkward than one which can be written in the classic > diagram format. I wasn't trying to create an all-purpose notation, I doubt it works for all regexp, but it might.
> > But even then, I think (but cannot recall precisely) that the way the > > terms NFA and DFA were defined to me, every DFA is considered to also be [quoted text clipped - 7 lines] > /single/' state at any one time rather than a set of states. And of course > type Set<State> is not assignable to, or from, type State. In Java, that may be true, but in implementing a NFA-to-DFA converter, you may choose to create a class StateSet implements State, since every State in the DFA results from a set of states in the NDA.
Daniel.
Chris Uppal - 23 Feb 2007 17:28 GMT > The interests in NDAs is that they take up less "code" to represent > (recall that a K state NDA translates into a 2^K state DA), *and* (and > here's the part that makes them interesting to study nowadays) a quantum > computer can run NDAs "natively", whereas a classical computer would have > to emulate the NDA by converting it (either implicitly or explicitly) > into a DA, and gain that 2^K performance penalty factor. BTW: As far as I can tell, the quantum theorists are claiming that a quantum FSA can recognise more than just the regular languages.
(Personally, I think that if that claim is true, and if it has physical meaning, then all it means is that the things claimed to be quantum analogues of FSAs are not actually analogous.)
-- chris
Oliver Wong - 23 Feb 2007 19:57 GMT >> The interests in NDAs is that they take up less "code" to represent >> (recall that a K state NDA translates into a 2^K state DA), *and* (and [quoted text clipped - 13 lines] > analogues > of FSAs are not actually analogous.) I've never heard of these quantum FSAs, but they're probably not what I was referring to in my quantum-computer-natively-running-NDA statement.
Usually, when a classical computer tries to run an NDA, it either converts it to an equivalent DFA (and thus risks getting an explosion of states), or it keeps a set of all possible states the NDA could be in. For example, if the classical computer thinks that the current possible set of states that the NDA could be in is {s6}, and the NDA graph looks something like...
,--a--->(s7)-->... / ...--->(s6)----a--->(s8)-->... \ `--b--->(s9)-->...
... then upon encountering the token "a", the classical single-processor computer will consume that token, and then update the set of possible states to be {s7, s8}. Then it needs to look at the next token, and compare that to the successors of s7, and iteratively, look at the successors of s8. And you need to do this for each token in the input string you're trying to accept. So the overall running time is something like O(n*l), where n is the number of states, and l is the length of the input. With multi-processors, you can start to do these computations in parallel, but you're limited by the number of processors: a P-processor classical computer could only analyze P states at a time, and in the worst case NDA, you might not have any idea what state you're in at all, and so you'd need as many processors as there are nodes the graph. P is constant with respect to the problem size, and so the running time is still O(n*l/P) == O(n*l).
Quantum physics is nowhere near my expertise, but my understanding is that with a quantum computer, there's some sort of natural fit between "having a set of possible states the NDA could be in" and "have a set of possible states a qubit could be in". You would just go through each token of the input, transition by transition, until you reach the end, and then you "observe" the qubit, revealing finally what state you ended up in (and whether or not it is an accepting state). The running time becomes O(l), directly proportional to the size of the input string, and unaffected by the size of the graph.
- Oliver
Chris Uppal - 17 Feb 2007 17:19 GMT > The matched parenthesis problem can be done with a non-deterministic > state machine, but not with a deterministic one.... hence no regular > expression can do it. I would understand "non-deterministic state machine" to mean the non-deterministic variant of finite state automaton). And the deterministic and non-deterministic versions of FSAs are equivalent in power. Do you use the term to mean something other form of automaton ?
-- chris
Alex Hunsley - 18 Feb 2007 19:59 GMT >> The matched parenthesis problem can be done with a non-deterministic >> state machine, but not with a deterministic one.... hence no regular [quoted text clipped - 3 lines] > non-deterministic variant of finite state automaton). And the deterministic > and non-deterministic versions of FSAs are equivalent in power. Are you sure about that? As far as I can recall, a non-deterministic FSA (as that is what I am meaning) can solve the bracket matching problem, but a deterministic one can't. Put it this way - how would you design a deterministic FSA that did bracket matching on arbitrary length input? You can't, if it is going to remain finite... And it's easy enough to do with a non-deterministic FSA.
Is it that *some* non-deterministic FSA can be 'unfolded' into deterministic ones, but not all, perhaps?
lex
> Do you use the > term to mean something other form of automaton ? No, FSA is right, that's basically what I'm thinking of...
Patricia Shanahan - 18 Feb 2007 21:34 GMT >>> The matched parenthesis problem can be done with a non-deterministic >>> state machine, but not with a deterministic one.... hence no regular [quoted text clipped - 9 lines] > You can't, if it is going to remain finite... And it's easy enough to do > with a non-deterministic FSA. How do you do it with a non-deterministic FSA?
> Is it that *some* non-deterministic FSA can be 'unfolded' into > deterministic ones, but not all, perhaps? No, ANY non-deterministic FSA can be unfolded into a deterministic one.
The simple, brute force, approach is to create a DFSA whose states are the sets of states of the NDFSA. It has a transition from x to y on input symbol s if, and only if, y is the set of possible states of the NDFSA after input s when in some state in x.
The DFSA's start state is the set of start states of the NDFSA. Its accept states are those states that contain at least one of the NDFSA's accept states.
The DFSA tracks the set of states the NDFSA could reach, from any start state, given the input so far.
Patricia
Alex Hunsley - 19 Feb 2007 10:19 GMT >>>> The matched parenthesis problem can be done with a non-deterministic >>>> state machine, but not with a deterministic one.... hence no regular [quoted text clipped - 12 lines] > > How do you do it with a non-deterministic FSA? Here's one I just sketched. Does it look right?
http://farm1.static.flickr.com/136/395180058_ea4d66f451.jpg?v=0
But I can't see how this can be made into a DFSA because from what I sketched for that you end up with a big 'ladder' of states that stretches off to infinity (hence non finite).
>> Is it that *some* non-deterministic FSA can be 'unfolded' into >> deterministic ones, but not all, perhaps? [quoted text clipped - 12 lines] > The DFSA tracks the set of states the NDFSA could reach, from any start > state, given the input so far. Thanks for the reminder of all that! lex
Chris Uppal - 19 Feb 2007 13:26 GMT > Here's one I just sketched. Does it look right? > > http://farm1.static.flickr.com/136/395180058_ea4d66f451.jpg?v=0 I don't think that does what you intend. For instance, if it sees input string ")" then it'll end up in states {A,B}. The ')' will take it from the start state, A, to state A, and also (non-deterministically) to state B. Since A is accepting that means it will accept the input ")" -- which is rather noticeably not a string of balanced, properly nested, brackets ;-)
> But I can't see how this can be made into a DFSA because from what I > sketched for that you end up with a big 'ladder' of states that > stretches off to infinity (hence non finite). It is equivalent to a DFA with two states, I'll call 'em S0 and S1. S0 is the start state and is accepting. There are four transitions as follows (I hope the notation is self-evident): S0 -- ) --> S0 S0 -- ( --> S1 S1 -- ( --> S1 S1 -- ) --> S0
That DFA was generated and minimised mechanically, btw. It is always possible to transform an NFA into an equivalent DFA, and what's more it can be done by a fairly simple algorithm -- no human insight required ;-) The key point is that, since the NFA has -- by definition -- a finite set of states, the set of /sets/ of states is also finite. And by constructing a DFA with a state corresponding to each (non-empty) set of NFA states, you can translate the NFA into (typically rather bigger) DFA. If you'd like more details then you could start with the Wikipeadia article: http://en.wikipedia.org/wiki/Powerset_construction (I haven't read it, so I don't know how clear, or even correct, it is, but it does have convincing-looking pictures ;-) Or try Google: nfa dfa construction
The algorithm yields a 3-state FSA with 6 transitions, and then minimising that produces the above.
-- chris
Patricia Shanahan - 19 Feb 2007 14:32 GMT >>>>> The matched parenthesis problem can be done with a non-deterministic >>>>> state machine, but not with a deterministic one.... hence no regular [quoted text clipped - 16 lines] > > http://farm1.static.flickr.com/136/395180058_ea4d66f451.jpg?v=0 It accepts any string whose last character is ")", regardless of whether the parentheses are balanced or not. There is an edge for ")" leading to the accepting state A from both itself and B.
> But I can't see how this can be made into a DFSA because from what I > sketched for that you end up with a big 'ladder' of states that > stretches off to infinity (hence non finite). Each rung on the ladder can be represented by a set of states of the NDFSA, and the next set of states depends only on the current set of states and the input.
Patricia
Christian - 19 Feb 2007 15:32 GMT Patricia Shanahan schrieb:
>>>>>> The matched parenthesis problem can be done with a non-deterministic >>>>>> state machine, but not with a deterministic one.... hence no regular [quoted text clipped - 26 lines] >> sketched for that you end up with a big 'ladder' of states that >> stretches off to infinity (hence non finite). the power set of any finite set is finite! http://en.wikipedia.org/wiki/Powerset_construction basically the same for any automat or machine ... non deterministic and deterministic allways have the same power .. even on a touring machine ... non-deterministic may just be faster...
> Each rung on the ladder can be represented by a set of states of the > NDFSA, and the next set of states depends only on the current set of > states and the input. > > Patricia Daniel Pitts - 17 Feb 2007 01:14 GMT > Does anyone have a regular expression pattern that would include a test for > balanced parens of arbitrary nestedness? Two ways:
public class NestedParens { public static void main(String[] args) { String[] strings = { "Test case 1", "(Test case 2)", "((Test) case ) 3", ")test case 4 (", }; for (String string: strings) { System.out.println(string + " is " + (isBalancedRe(string) ? "Balanced" : "Mis-matched")); }
for (String string: strings) { System.out.println(string + " is " + (isBalancedCount(string) ? "Balanced" : "Mis-matched")); } }
public static boolean isBalancedRe(String string) { String last; do { last = string; string = string.replaceAll("\\([^()]*\\)", ""); } while (!last.equals(string)); return !string.matches(".*[()].*"); }
private static boolean isBalancedCount(String string) { int parens = 0; for (char c : string.toCharArray()) { switch (c) { case ')': if (--parens < 0) { return false; } break; case '(': ++parens; } } return parens == 0; } }
Daniel Pitts - 17 Feb 2007 01:28 GMT On Feb 16, 5:14 pm, "Daniel Pitts" <googlegrou...@coloraura.com> wrote:
> > Does anyone have a regular expression pattern that would include a test for > > balanced parens of arbitrary nestedness? [quoted text clipped - 50 lines] > > } Or, a little more concisely: public static boolean isBalancedRe(String string) { while (string.matches(".*\\([^()]*\\).*")) { string = string.replaceAll("\\([^()]*\\)", ""); } return !string.matches(".*[()].*"); }
private static boolean isBalancedCount(String string) { int parens = 0; for (char c : string.toCharArray()) { if (c == '(') { ++parens; } if (c == ')' && --parens < 0) { return false; } } return parens == 0; } ...
Or, to get down right esoteric: private static boolean isBalancedCount(String string) { int parens = 0; for (char c : string.toCharArray()) { if ((c == '(' && ++ parens < 0) || c==')' && --parens < 0) { return false; } } return parens == 0; }
CafeBabe points to those who know why that works the way it does.
Chris Uppal - 17 Feb 2007 17:21 GMT > private static boolean isBalancedCount(String string) { > int parens = 0; [quoted text clipped - 6 lines] > return parens == 0; > } I have rarely seen code I liked less. Well done ;-)
-- chris
Alex Hunsley - 19 Feb 2007 10:41 GMT > Or, to get down right esoteric: > private static boolean isBalancedCount(String string) { [quoted text clipped - 9 lines] > > CafeBabe points to those who know why that works the way it does. I saw CafeBabe scavenging some fields for DeadBeef last night! Poor babe.... lex
Ken Kast - 17 Feb 2007 03:48 GMT Absolutely never thought I would generate so much conversation. I need to sit down with a glass of wine and digest this.
I probably should have been more transparent in my original post. I'm not trying to test a string for balanced parens. I need it to be a match termination condition. In other words, I have a fairly complex pattern. So it tells when a match is done. I need to say the match is done if the character pattern is played out or the match will play out before I balance parens.
I first wrote this in C#. .Net has extended RegExp that lets you swap named groups on condition. This effectively lets you model a stack. So you push left parens on the stack and pop them with right ones. Now I want to implement the same app in Java. I can certainly assume a low level nesting, on the order of 3 or 4, so running out of memory or running out of time aren't concerns. I haven't thought it through, but I could probably do it brute force by iterating thru a set of patterns. That would solve 6 sigma's worth of real world situations. I was hoping for something more elegant.
Thanks to everyone for your thoughts. I certainly learned how lemmings have influenced computer science.
Ken
> Does anyone have a regular expression pattern that would include a test > for balanced parens of arbitrary nestedness? Chris Uppal - 17 Feb 2007 17:27 GMT > I probably should have been more transparent in my original post. I'm not > trying to test a string for balanced parens. I need it to be a match > termination condition. In other words, I have a fairly complex pattern. > So it tells when a match is done. I need to say the match is done if the > character pattern is played out or the match will play out before I > balance parens. It sounds to me as if this is a mis-application of regular expressions (IMO nearly all applications of regular expressions in "real" programming languages /are/ mis-applications).
Would it be easier to use bracket matching to delimit the sub-sequence you want with real code and then (if regexes still seem like a good idea) use regexes to pick it apart ?
Alternatively, use regexes to "tokenise" the input (identifying (, ), and the other stuff you are interested in) and then use real code to traverse the tokens (words) using bracket-counting to keep track of where you are in the input stream.
> I first wrote this in C#. .Net has extended RegExp that lets you swap > named groups on condition. <shudder/>
-- chris
Free MagazinesGet these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...
|
|
|