Home | Contact Us | FAQ | Search & Site Map | Link to Us
Sign In | Join | Other 45 Sites in Network
HomeAnnouncementsWhite Papers
Discussion GroupsFirst AidDatabasesJavaBeansGUIJava 3DVirtual MachineCORBASecurityToolsGeneral
Java DirectoryOpen Source ProjectsSample Book ChaptersUser GroupsWeb Resources
Related Topics
Databases.NETMore Topics ...

Java Forum / General / February 2007

Tip: Looking for answers? Try searching our database.

Algorithm considerations

Thread view: 
Ike - 26 Feb 2007 18:05 GMT
I am trying to determine an efficient algorithm to discern if a preposition
occurs in a sentence. Prepositions are runs of given words which are
anywhere from one to three words long. I already have a list of these
propositions (for example, 'with' is in the list as a preposition, as well
as 'with regards to').

I want to find if a sentence contains one of these propositions, and if it
does, use the longest preposition (in other words, if 'with regards to'
appears in the sentence, I want that to be the preposition that is chosen
there, not 'with').

The only caveat is that the sentence itself is already parsed, word-by-word,
into an array of Strings. In other words "The cat is in the hat" is a
String[0]="The", String[1]="cat"...String[5]="hat".

As for how I store my list of prepositions themselves, I can store them in
any type of collection(s). What I am trying to discern is what is an
efficient means for finding what preopositions occur in this String[],
taking into account that if more than one preposition traverses the same
word, the longest preposition is selected.

Does anyone have any ideas on what might be an efficient means to do this?
Thanks, Ike
Daniel Pitts - 26 Feb 2007 18:37 GMT
> I am trying to determine an efficient algorithm to discern if a preposition
> occurs in a sentence. Prepositions are runs of given words which are
[quoted text clipped - 19 lines]
> Does anyone have any ideas on what might be an efficient means to do this?
> Thanks, Ike

Considering your input size seems small enough, I would start with the
brute force approach.
In other words:

for every position in my word array
  find the longest proposition at that location.

This is an O(n^2) algorithm, but since sentences are generally < 10
words, it should be good enough.

Only after you make your application do what you want should you worry
about efficiency.  If you find it is running too slow, use a profiler
to find out why and where.

If it happens to be that logic above, I would suggest using a
deterministic-finate-state-automata (DFA)  It will run in linear time,
but its more complex to implement.
Basically, you'd be implementing a special-case regexp with tokens
that are words instead of letters.
Eric Sosman - 26 Feb 2007 21:11 GMT
Daniel Pitts wrote On 02/26/07 13:37,:
> [...]
>
> This is an O(n^2) algorithm, but since sentences are generally < 10
> words, it should be good enough.

   At least fifteen words, more depending on how you count
the mathematical notations.

> Only after you make your application do what you want should you worry
> about efficiency.

   Fifteen words.

> If you find it is running too slow, use a profiler
> to find out why and where.

   Seventeen words.

> If it happens to be that logic above, I would suggest using a
> deterministic-finate-state-automata (DFA)

   At least sixteen words.

> It will run in linear time,
> but its more complex to implement.

   Twelve words.

> Basically, you'd be implementing a special-case regexp with tokens
> that are words instead of letters.

   Fifteen or sixteen words.

   Total: Ninety or more words in six sentences, for an
average length of at least fifteen words per sentence.  An
O(n^2) algorithm will therefore run about 125% longer than
you might have anticipated.  Do not, on pain of tedium, use
this method on nineteenth-century German philosophical
treatises!  ;-)

Signature

Eric.Sosman@sun.com

Daniel Pitts - 26 Feb 2007 22:54 GMT
> Daniel Pitts wrote On 02/26/07 13:37,:
>
[quoted text clipped - 40 lines]
> --
> Eric.Sos...@sun.com

My estimates were off, but my advice is sound.
Lew - 26 Feb 2007 23:06 GMT
Ike said:
> (for example, 'with' is in the list as a preposition, as well
> as 'with regards to')

I would not diagram "with regards [sic] to" as a preposition. I wouldn't even
diagram "with regard to" as a preposition. A prepositional phrase, perhaps.

There are two prepositions in that snippet, "with" and "to". The object of
"with" is "regard", and the object of "to" is some noun that will follow.

- Lew
Mark Jeffcoat - 27 Feb 2007 01:03 GMT
> Considering your input size seems small enough, I would start with the
> brute force approach.
[quoted text clipped - 5 lines]
> This is an O(n^2) algorithm, but since sentences are generally < 10
> words, it should be good enough.

Since we have a list of all the possible prepositions, and
the longest preposition is of (small) constant length, this
is a O(n) algorithm.

You're not going to be able to beat that for asymptotic
efficiency. [Bad proof: someone might actually end a sentence
with a preposition].

As far as actual efficiency is concerned, I'd go with Daniel's
DFA idea.

Signature

Mark Jeffcoat
Austin, TX

Ike - 27 Feb 2007 00:22 GMT
I took Daniel's advice -- and it works fine. Considering there are, at the
outside, 150-160 preposition(al phrases of 1-3 words). Code enclosed in case
someone upon searching this archive has a similar problem:

/** typically you would set startpoint to 0, but I have amended the
* code here such that you can start at any arbitrary point in the sentence
*/
    private void startwithprep(String [] ta, int startpoint){
       StringBuffer sb=null;
       sb=new StringBuffer();
       String tester=null;
       for(int i=0;i<3;i++){
           if(ta.length>startpoint+i){
               sb.append(ta[startpoint+i]);
               tester=sb.toString().toLowerCase();
               if(prepositions.contains(tester)){  //prepositions is an
ArrayList of Strings
                    //here's the match, do as you like
               }
               sb.append(" ");
           }
       }
   }
Mark Jeffcoat - 27 Feb 2007 01:46 GMT
> /** typically you would set startpoint to 0, but I have amended the
> * code here such that you can start at any arbitrary point in the sentence
[quoted text clipped - 15 lines]
>         }
>     }

Cool. Can you make your preposition list available? Your
problem is likely solved, but I'd be interested looking at
the performance numbers for a couple of different approachs;
I'm curious about how much difference micro-optimizations
could really make in a case like this.

On a tangent, I'd always write this function as returning a
String, returning where you have "//here's the match" (and
probably null on no match.) I think that style gives you
better functional decomposition, and makes it much simpler
to test startWithPrep in isolation.

(If pressed further, I'd mumble something about Bertrand
Meyer's Command/Query separation, and hint that I'd maybe
read his 1300 page book on OOP, even though that is, in
fact, quite false.)

Signature

Mark Jeffcoat
Austin, TX

Ike - 27 Feb 2007 14:02 GMT
----- Original Message -----
From: "Mark Jeffcoat" <jeffcoat@alumni.rice.edu>
Newsgroups: comp.lang.java.programmer
Sent: Monday, February 26, 2007 8:46 PM
Subject: Re: Algorithm considerations

> Cool. Can you make your preposition list available? Your
> problem is likely solved, but I'd be interested looking at
> the performance numbers for a couple of different approachs;
> I'm curious about how much difference micro-optimizations
> could really make in a case like this.

I"VE ENCLOSED IT AT THE BOTTOM AS SQL COMMANDS IN CASE SOMEONE WANTS TO USE
IT IN MYSQL AS I AM DOING.

> On a tangent, I'd always write this function as returning a
> String, returning where you have "//here's the match" (and
> probably null on no match.) I think that style gives you
> better functional decomposition, and makes it much simpler
> to test startWithPrep in isolation.

I"M ACTUALLY USING IT TO RETAG PARTS-OF-SPEECH, WITH BRILL POS TAGS. SO IN
MY INSTANTIATION OF IT, I AM PASSING THE FUNCTION ANOTHER ARRAY WHICH IS THE
TAGS CORRESPONDING TO THE WORDS, AND SWAPPING THE TAGS WITH 'IN' IN LIEU OF
WHAT THEY ARE (UNLESS THEY ARE ALREADY TAGGED 'CS').

HERE'S MY LIST. I AM NOT INCLUDING POSTPREPOSITIONS (LIKE 'ago') .
ADDITIONALLY, MY alsoCS FIELD MEANS THAT THIS WORD IS ALSO OFTEN USED AS A
SUBORDINATING CONJUNCTION:

# --------------------------------------------------------

#
# Table structure for table `prepositions`
#

CREATE TABLE `prepositions` (
 `id` int(3) NOT NULL auto_increment,
 `preposition` varchar(40) NOT NULL default '',
 `alsoCS` int(3) NOT NULL default '0',
 PRIMARY KEY  (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

#
# Dumping data for table `prepositions`
#

INSERT INTO `prepositions` VALUES (1, 'abaft', 0);
INSERT INTO `prepositions` VALUES (2, 'about', 0);
INSERT INTO `prepositions` VALUES (3, 'above', 0);
INSERT INTO `prepositions` VALUES (4, 'according to', 0);
INSERT INTO `prepositions` VALUES (5, 'across', 0);
INSERT INTO `prepositions` VALUES (6, 'after', 1);
INSERT INTO `prepositions` VALUES (7, 'against', 0);
INSERT INTO `prepositions` VALUES (8, 'ahead of', 0);
INSERT INTO `prepositions` VALUES (9, 'along', 0);
INSERT INTO `prepositions` VALUES (10, 'along with', 0);
INSERT INTO `prepositions` VALUES (11, 'alongside', 0);
INSERT INTO `prepositions` VALUES (12, 'amid', 0);
INSERT INTO `prepositions` VALUES (13, 'among', 0);
INSERT INTO `prepositions` VALUES (14, 'amongst', 0);
INSERT INTO `prepositions` VALUES (15, 'apart from', 0);
INSERT INTO `prepositions` VALUES (16, 'around', 0);
INSERT INTO `prepositions` VALUES (17, 'as', 1);
INSERT INTO `prepositions` VALUES (18, 'as far as', 0);
INSERT INTO `prepositions` VALUES (19, 'as well as', 0);
INSERT INTO `prepositions` VALUES (20, 'at', 0);
INSERT INTO `prepositions` VALUES (21, 'back of', 0);
INSERT INTO `prepositions` VALUES (22, 'before', 1);
INSERT INTO `prepositions` VALUES (23, 'behind', 0);
INSERT INTO `prepositions` VALUES (24, 'below', 0);
INSERT INTO `prepositions` VALUES (25, 'beneath', 0);
INSERT INTO `prepositions` VALUES (26, 'beside', 0);
INSERT INTO `prepositions` VALUES (27, 'between', 0);
INSERT INTO `prepositions` VALUES (28, 'beyond', 0);
INSERT INTO `prepositions` VALUES (29, 'but', 0);
INSERT INTO `prepositions` VALUES (30, 'by', 0);
INSERT INTO `prepositions` VALUES (31, 'concerning', 0);
INSERT INTO `prepositions` VALUES (32, 'contrary to', 0);
INSERT INTO `prepositions` VALUES (33, 'despite', 0);
INSERT INTO `prepositions` VALUES (34, 'down', 0);
INSERT INTO `prepositions` VALUES (35, 'during', 0);
INSERT INTO `prepositions` VALUES (36, 'except', 0);
INSERT INTO `prepositions` VALUES (37, 'excepting', 0);
INSERT INTO `prepositions` VALUES (38, 'for', 0);
INSERT INTO `prepositions` VALUES (39, 'from', 0);
INSERT INTO `prepositions` VALUES (40, 'in', 0);
INSERT INTO `prepositions` VALUES (41, 'in addition to', 0);
INSERT INTO `prepositions` VALUES (42, 'in back of', 0);
INSERT INTO `prepositions` VALUES (43, 'in front of', 0);
INSERT INTO `prepositions` VALUES (44, 'in lieu of', 0);
INSERT INTO `prepositions` VALUES (45, 'in place of', 0);
INSERT INTO `prepositions` VALUES (46, 'in regard to', 0);
INSERT INTO `prepositions` VALUES (47, 'in spite of', 0);
INSERT INTO `prepositions` VALUES (48, 'in view of', 0);
INSERT INTO `prepositions` VALUES (49, 'inside', 0);
INSERT INTO `prepositions` VALUES (50, 'instead of', 0);
INSERT INTO `prepositions` VALUES (51, 'into', 0);
INSERT INTO `prepositions` VALUES (52, 'like', 0);
INSERT INTO `prepositions` VALUES (53, 'near', 0);
INSERT INTO `prepositions` VALUES (54, 'of', 0);
INSERT INTO `prepositions` VALUES (55, 'off', 0);
INSERT INTO `prepositions` VALUES (56, 'on', 0);
INSERT INTO `prepositions` VALUES (57, 'on account of', 0);
INSERT INTO `prepositions` VALUES (58, 'out', 0);
INSERT INTO `prepositions` VALUES (59, 'out of', 0);
INSERT INTO `prepositions` VALUES (60, 'outside', 0);
INSERT INTO `prepositions` VALUES (61, 'over', 0);
INSERT INTO `prepositions` VALUES (62, 'past', 0);
INSERT INTO `prepositions` VALUES (63, 'rather than', 0);
INSERT INTO `prepositions` VALUES (64, 'regarding', 0);
INSERT INTO `prepositions` VALUES (65, 'round', 0);
INSERT INTO `prepositions` VALUES (66, 'since', 1);
INSERT INTO `prepositions` VALUES (67, 'through', 0);
INSERT INTO `prepositions` VALUES (68, 'throughout', 0);
INSERT INTO `prepositions` VALUES (69, 'till', 0);
INSERT INTO `prepositions` VALUES (70, 'to', 0);
INSERT INTO `prepositions` VALUES (71, 'together with', 0);
INSERT INTO `prepositions` VALUES (72, 'toward', 0);
INSERT INTO `prepositions` VALUES (73, 'towards', 0);
INSERT INTO `prepositions` VALUES (74, 'under', 0);
INSERT INTO `prepositions` VALUES (75, 'underneath', 0);
INSERT INTO `prepositions` VALUES (76, 'until', 1);
INSERT INTO `prepositions` VALUES (77, 'unto', 0);
INSERT INTO `prepositions` VALUES (78, 'up', 0);
INSERT INTO `prepositions` VALUES (79, 'up to', 0);
INSERT INTO `prepositions` VALUES (80, 'upon', 0);
INSERT INTO `prepositions` VALUES (81, 'versus', 0);
INSERT INTO `prepositions` VALUES (82, 'via', 0);
INSERT INTO `prepositions` VALUES (83, 'with', 0);
INSERT INTO `prepositions` VALUES (84, 'with regard to', 0);
INSERT INTO `prepositions` VALUES (85, 'within', 0);
INSERT INTO `prepositions` VALUES (86, 'without', 0);
INSERT INTO `prepositions` VALUES (87, 'worth', 0);
INSERT INTO `prepositions` VALUES (88, 'with regards to', 0);
INSERT INTO `prepositions` VALUES (89, 'aboard', 0);
INSERT INTO `prepositions` VALUES (90, 'absent', 0);
INSERT INTO `prepositions` VALUES (91, 'amidst', 0);
INSERT INTO `prepositions` VALUES (92, 'astride', 0);
INSERT INTO `prepositions` VALUES (93, 'atop', 0);
INSERT INTO `prepositions` VALUES (94, 'besides', 0);
INSERT INTO `prepositions` VALUES (95, 'following', 0);
INSERT INTO `prepositions` VALUES (96, 'notwithstanding', 0);
INSERT INTO `prepositions` VALUES (97, 'mid', 0);
INSERT INTO `prepositions` VALUES (98, 'minus', 0);
INSERT INTO `prepositions` VALUES (99, 'onto', 0);
INSERT INTO `prepositions` VALUES (100, 'opposite', 0);
INSERT INTO `prepositions` VALUES (101, 're', 0);
INSERT INTO `prepositions` VALUES (102, 'subsequent to', 0);
INSERT INTO `prepositions` VALUES (103, 'prior to', 0);
INSERT INTO `prepositions` VALUES (104, 'next to', 0);
INSERT INTO `prepositions` VALUES (105, 'near to', 0);
INSERT INTO `prepositions` VALUES (106, 'owing to', 0);
INSERT INTO `prepositions` VALUES (107, 'outside of', 0);
INSERT INTO `prepositions` VALUES (108, 'on to', 0);
INSERT INTO `prepositions` VALUES (109, 'in to', 0);
INSERT INTO `prepositions` VALUES (110, 'inside of', 0);
INSERT INTO `prepositions` VALUES (111, 'far from', 0);
INSERT INTO `prepositions` VALUES (112, 'as to', 0);
INSERT INTO `prepositions` VALUES (113, 'aside from', 0);
INSERT INTO `prepositions` VALUES (114, 'because of', 0);
INSERT INTO `prepositions` VALUES (115, 'close to', 0);
INSERT INTO `prepositions` VALUES (116, 'due to', 0);
INSERT INTO `prepositions` VALUES (117, 'by means of', 0);
INSERT INTO `prepositions` VALUES (118, 'in accordance with', 0);
INSERT INTO `prepositions` VALUES (119, 'on behalf of', 0);
INSERT INTO `prepositions` VALUES (120, 'on top of', 0);
INSERT INTO `prepositions` VALUES (121, 'in case of', 0);
INSERT INTO `prepositions` VALUES (122, 'betwixt', 0);
INSERT INTO `prepositions` VALUES (123, 'circa', 0);
INSERT INTO `prepositions` VALUES (124, 'anti', 0);
INSERT INTO `prepositions` VALUES (125, 'cum', 0);
INSERT INTO `prepositions` VALUES (126, 'per', 0);
INSERT INTO `prepositions` VALUES (127, 'qua', 0);
INSERT INTO `prepositions` VALUES (128, 'sans', 0);
INSERT INTO `prepositions` VALUES (129, 'vis-a-vis', 0);
INSERT INTO `prepositions` VALUES (130, 'vis a vis', 0);
Oliver Wong - 27 Feb 2007 15:18 GMT
[context is: picking out keywords -- "prepositions" -- out of
already-parsed sentences.]

> #
> # Table structure for table `prepositions`
[quoted text clipped - 15 lines]
> INSERT INTO `prepositions` VALUES (3, 'above', 0);
> INSERT INTO `prepositions` VALUES (4, 'according to', 0);
[and so on...]

You might want to make your preposition column indexable, as you will
probably be searching through that column pretty frequently. And if the
values in the preposition column are unique, you might want to drop the
`id` column altogether and make your preposition column your primary key.

   - Oliver
Ike - 27 Feb 2007 16:32 GMT
> [context is: picking out keywords -- "prepositions" -- out of
> already-parsed sentences.]
[quoted text clipped - 5 lines]
>
>    - Oliver

As you can see from the code I enclosed, I actually just read it directly
into an ArrayList and work with the ArrayList itself.


Free Magazines

Get these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...

Oracle MagazineNetwork ComputingComputer WorldBio-IT WorldeWeekInformation WeekInfosecurity
 
Sign In
Join
My Latest Posts
My Monitored Threads
My Blog
My Photo Gallery
My Profile
My Homepage

Start New Thread
Enable EMail Alerts
Rate this Thread



©2008 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.