Google-like query tokenizer

aaronfude@gmail.com - 05 Jul 2006 10:53 GMT
Hi,

Is there a Java utility that can tokenize a Google-like query?
Meaning that tokens are separated by spaces unless grouped with
parentheses. Can the StringTokenizer do this?

Very many thanks in advance!

Aaron Fude
Bart Cremers - 05 Jul 2006 11:26 GMT
aaronfude@gmail.com wrote:

> Hi,
>
[quoted text clipped - 5 lines]
>
> Aaron Fude

This can be easily achieved using regular expressions:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuerySplit {
   private static String query = "test \"one two three\" more testing \"one two\" done";

   // A double-quoted phrase, or a run of non-whitespace characters.
   private static String regex = "\"[^\"]*\"|[^\\s]+";

   public static void main(String[] args) {
       Pattern pattern = Pattern.compile(regex);

       Matcher matcher = pattern.matcher(query);

       while (matcher.find()) {
           String toSearch = matcher.group();
           // Strip the surrounding quotes from a phrase. The length check
           // keeps a lone '"' from causing a StringIndexOutOfBoundsException.
           if (toSearch.length() >= 2 && toSearch.startsWith("\"") && toSearch.endsWith("\"")) {
               toSearch = toSearch.substring(1, toSearch.length() - 1);
           }
           System.out.println(toSearch);
       }
   }
}

Regards,

Bart
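
The example above groups on double quotes, while the original question asks about parentheses. A minimal sketch of the same regex approach adapted to parentheses follows, assuming groups are never nested. The class name ParenQuerySplit and the method tokenize are made up for illustration, and an unbalanced parenthesis is silently skipped rather than reported.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ParenQuerySplit {
    // Either a parenthesized group without nested parentheses (contents
    // captured in group 1), or a run of characters that are neither
    // whitespace nor parentheses.
    private static final Pattern TOKEN =
            Pattern.compile("\\(([^()]*)\\)|[^\\s()]+");

    public static List<String> tokenize(String query) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(query);
        while (matcher.find()) {
            // Group 1 is non-null only when the parenthesized alternative matched.
            String group = matcher.group(1);
            tokens.add(group != null ? group.trim() : matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [test, one two three, more, one two, done]
        System.out.println(tokenize("test (one two three) more (one two) done"));
    }
}
```

Because a regular expression of this kind cannot count, nested groups need a different approach, such as a scanner that tracks depth.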
bugbear - 05 Jul 2006 13:13 GMT
> Hi,
>
> Is there a Java utility that can tokenize a Google-like query?
> Meaning that tokens are separated by spaces unless grouped with
> parentheses. Can the StringTokenizer do this?

If you also wish to handle stuff like '|' (OR),
quotes (for phrases), and the '+' and '-'
prefixes, you'll need quite a "complete" little
parser implementation.

 BugBear
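
As a rough illustration of the operators mentioned above, the sketch below covers only the first half of such a parser: it classifies tokens as OR, phrase, or word and keeps any '+' or '-' prefix. It does not build a query tree or apply precedence. The class name OperatorTokens and the output labels WORD(...), PHRASE(...) and OR are invented for this example, not taken from any library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OperatorTokens {
    // Alternatives, tried left to right:
    //   group 1: the '|' operator
    //   group 2: an optional '+' or '-' prefix, followed by either
    //   group 3: the contents of a quoted phrase, or
    //   group 4: a bare word
    private static final Pattern TOKEN = Pattern.compile(
            "(\\|)|([+-])?(?:\"([^\"]*)\"|([^\\s\"|]+))");

    public static List<String> tokenize(String query) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(query);
        while (m.find()) {
            if (m.group(1) != null) {
                out.add("OR");
                continue;
            }
            String prefix = m.group(2) != null ? m.group(2) : "";
            if (m.group(3) != null) {
                out.add(prefix + "PHRASE(" + m.group(3) + ")");
            } else {
                out.add(prefix + "WORD(" + m.group(4) + ")");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [WORD(java), +PHRASE(query parser), -WORD(yacc), WORD(lex), OR, WORD(flex)]
        System.out.println(tokenize("java +\"query parser\" -yacc lex | flex"));
    }
}
```

Turning this token list into something that can be evaluated (deciding what the OR binds to, for instance) is the part that needs the parser proper.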
Oliver Wong - 05 Jul 2006 15:50 GMT
> Hi,
>
> Is there a Java utility that can tokenize a Google-like query?
> Meaning that tokens are separated by spaces unless grouped with
> parentheses. Can the StringTokenizer do this?

   I believe that StringTokenizer on its own can't do it, though you could
use StringTokenizer as part of an implementation of a state machine to
achieve what you want.

   Can the parentheses be nested? E.g. is this legal: "a ( b c ( d e ) f
g ) h i"?

   - Oliver
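
A minimal sketch of the state-machine idea, assuming nesting is legal and that a top-level group should come back as a single token with any inner parentheses kept verbatim. The only state is the current nesting depth. StringTokenizer is used with returnDelims set to true so that spaces and parentheses are returned as tokens too. The class name NestedQueryTokenizer is hypothetical, and only the space character is treated as whitespace.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class NestedQueryTokenizer {
    public static List<String> tokenize(String query) {
        List<String> tokens = new ArrayList<>();
        StringBuilder group = new StringBuilder();
        int depth = 0; // state: number of currently open parentheses

        // returnDelims = true: each space and parenthesis is returned as
        // its own one-character token.
        StringTokenizer st = new StringTokenizer(query, " ()", true);
        while (st.hasMoreTokens()) {
            String t = st.nextToken();
            if (t.equals("(")) {
                if (depth > 0) {
                    group.append(t); // keep inner parentheses verbatim
                }
                depth++;
            } else if (t.equals(")")) {
                if (depth == 0) {
                    throw new IllegalArgumentException("unbalanced ')'");
                }
                depth--;
                if (depth > 0) {
                    group.append(t);
                } else {
                    // The outermost group just closed: emit it as one token.
                    tokens.add(group.toString().trim());
                    group.setLength(0);
                }
            } else if (depth > 0) {
                group.append(t); // inside a group, spaces belong to the token
            } else if (!t.equals(" ")) {
                tokens.add(t); // plain word outside any group
            }
        }
        if (depth != 0) {
            throw new IllegalArgumentException("unbalanced '('");
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [a, b c ( d e ) f g, h, i]
        System.out.println(tokenize("a ( b c ( d e ) f g ) h i"));
    }
}
```

If the inner groups need to be tokens in their own right, the same method could be called again on each group's contents.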
Alan Meyer - 06 Jul 2006 01:14 GMT
> Hi,
>
[quoted text clipped - 5 lines]
>
> Aaron Fude

There are a number of lex/yacc-like implementations in
Java.  Google for "lex yacc java" to find them.

Lex and yacc are ancient UNIX compiler construction tools.

Lex is a lexical analyzer that breaks a string into tokens.
Yacc is a parser generator that recognizes production
rules defining a syntax.

If you've never used them or heard of them, you'll find
they have a steep learning curve.  But once mastered,
they allow you to build very complicated parsers for
all kinds of different syntactical rules using very
little code.

   Alan
Martin Gregorie - 06 Jul 2006 21:01 GMT
>> Hi,
>>
[quoted text clipped - 20 lines]
> parsers for all kinds of different syntactical rules using
> very little code.

I'd suggest you look at Coco/R, which you can find at:

 http://www.ssw.uni-linz.ac.at/Research/Projects/Coco/

Coco/R is available for several languages including Java. I've used the
Java version to develop a parser for POSIX C code generation
conditionals (#if and friends for the C speakers) and found it worked
well and is somewhat easier to use than lex and yacc.

Its biggest benefit is that it's a single code generator that generates
both the tokeniser and the parser. The documentation, supplied in PDF
format, is pretty good too.

Another benefit is that the code skeletons for its generated tokeniser
and parser classes can be easily modified if you're reasonably
competent. The standard generated code assumes input is from a file, but
I needed to be able to process a string. Making that change was trivial.

martin@   | Martin Gregorie
gregorie. | Essex, UK
org       |

Andrew Lampert - 08 Jul 2006 11:12 GMT
> >> Hi,
> >>
[quoted text clipped - 20 lines]
> > parsers for all kinds of different syntactical rules using
> > very little code.

An alternative to Coco/R, which has already been suggested, is JavaCC - see
http://javacc.dev.java.net. It is well implemented and supported, with a
large community of users. I used it in the distant past (about 5
years ago) and it suited my needs perfectly for building a reasonably
complex parser.

Cheers,
Andrew