Java Forum / General / February 2007
Algorithm considerations
Ike - 26 Feb 2007 18:05 GMT I am trying to determine an efficient algorithm to discern if a preposition occurs in a sentence. Prepositions are runs of given words which are anywhere from one to three words long. I already have a list of these propositions (for example, 'with' is in the list as a preposition, as well as 'with regards to').
I want to find if a sentence contains one of these propositions, and if it does, use the longest preposition (in other words, if 'with regards to' appears in the sentence, I want that to be the preposition that is chosen there, not 'with').
The only caveat is that the sentence itself is already parsed, word-by-word, into an array of Strings. In other words "The cat is in the hat" is a String[0]="The", String[1]="cat"...String[5]="hat".
As for how I store my list of prepositions themselves, I can store them in any type of collection(s). What I am trying to discern is what is an efficient means for finding what preopositions occur in this String[], taking into account that if more than one preposition traverses the same word, the longest preposition is selected.
Does anyone have any ideas on what might be an efficient means to do this? Thanks, Ike
Daniel Pitts - 26 Feb 2007 18:37 GMT > I am trying to determine an efficient algorithm to discern if a preposition > occurs in a sentence. Prepositions are runs of given words which are [quoted text clipped - 19 lines] > Does anyone have any ideas on what might be an efficient means to do this? > Thanks, Ike Considering your input size seems small enough, I would start with the brute force approach. In other words:
for every position in my word array find the longest proposition at that location.
This is an O(n^2) algorithm, but since sentences are generally < 10 words, it should be good enough.
Only after you make your application do what you want should you worry about efficiency. If you find it is running too slow, use a profiler to find out why and where.
If it happens to be that logic above, I would suggest using a deterministic-finate-state-automata (DFA) It will run in linear time, but its more complex to implement. Basically, you'd be implementing a special-case regexp with tokens that are words instead of letters.
Eric Sosman - 26 Feb 2007 21:11 GMT Daniel Pitts wrote On 02/26/07 13:37,:
> [...] > > This is an O(n^2) algorithm, but since sentences are generally < 10 > words, it should be good enough. At least fifteen words, more depending on how you count the mathematical notations.
> Only after you make your application do what you want should you worry > about efficiency. Fifteen words.
> If you find it is running too slow, use a profiler > to find out why and where. Seventeen words.
> If it happens to be that logic above, I would suggest using a > deterministic-finate-state-automata (DFA) At least sixteen words.
> It will run in linear time, > but its more complex to implement. Twelve words.
> Basically, you'd be implementing a special-case regexp with tokens > that are words instead of letters. Fifteen or sixteen words.
Total: Ninety or more words in six sentences, for an average length of at least fifteen words per sentence. An O(n^2) algorithm will therefore run about 125% longer than you might have anticipated. Do not, on pain of tedium, use this method on nineteenth-century German philosophical treatises! ;-)
 Signature Eric.Sosman@sun.com
Daniel Pitts - 26 Feb 2007 22:54 GMT > Daniel Pitts wrote On 02/26/07 13:37,: > [quoted text clipped - 40 lines] > -- > Eric.Sos...@sun.com My estimates were off, but my advice is sound.
Lew - 26 Feb 2007 23:06 GMT Ike said:
> (for example, 'with' is in the list as a preposition, as well > as 'with regards to') I would not diagram "with regards [sic] to" as a preposition. I wouldn't even diagram "with regard to" as a preposition. A prepositional phrase, perhaps.
There are two prepositions in that snippet, "with" and "to". The object of "with" is "regard", and the object of "to" is some noun that will follow.
- Lew
Mark Jeffcoat - 27 Feb 2007 01:03 GMT > Considering your input size seems small enough, I would start with the > brute force approach. [quoted text clipped - 5 lines] > This is an O(n^2) algorithm, but since sentences are generally < 10 > words, it should be good enough. Since we have a list of all the possible prepositions, and the longest preposition is of (small) constant length, this is a O(n) algorithm.
You're not going to be able to beat that for asymptotic efficiency. [Bad proof: someone might actually end a sentence with a preposition].
As far as actual efficiency is concerned, I'd go with Daniel's DFA idea.
 Signature Mark Jeffcoat Austin, TX
Ike - 27 Feb 2007 00:22 GMT I took Daniel's advice -- and it works fine. Considering there are, at the outside, 150-160 preposition(al phrases of 1-3 words). Code enclosed in case someone upon searching this archive has a similar problem:
/** typically you would set startpoint to 0, but I have amended the * code here such that you can start at any arbitrary point in the sentence */ private void startwithprep(String [] ta, int startpoint){ StringBuffer sb=null; sb=new StringBuffer(); String tester=null; for(int i=0;i<3;i++){ if(ta.length>startpoint+i){ sb.append(ta[startpoint+i]); tester=sb.toString().toLowerCase(); if(prepositions.contains(tester)){ //prepositions is an ArrayList of Strings //here's the match, do as you like } sb.append(" "); } } }
Mark Jeffcoat - 27 Feb 2007 01:46 GMT > /** typically you would set startpoint to 0, but I have amended the > * code here such that you can start at any arbitrary point in the sentence [quoted text clipped - 15 lines] > } > } Cool. Can you make your preposition list available? Your problem is likely solved, but I'd be interested looking at the performance numbers for a couple of different approachs; I'm curious about how much difference micro-optimizations could really make in a case like this.
On a tangent, I'd always write this function as returning a String, returning where you have "//here's the match" (and probably null on no match.) I think that style gives you better functional decomposition, and makes it much simpler to test startWithPrep in isolation.
(If pressed further, I'd mumble something about Bertrand Meyer's Command/Query separation, and hint that I'd maybe read his 1300 page book on OOP, even though that is, in fact, quite false.)
 Signature Mark Jeffcoat Austin, TX
Ike - 27 Feb 2007 14:02 GMT ----- Original Message ----- From: "Mark Jeffcoat" <jeffcoat@alumni.rice.edu> Newsgroups: comp.lang.java.programmer Sent: Monday, February 26, 2007 8:46 PM Subject: Re: Algorithm considerations
> Cool. Can you make your preposition list available? Your > problem is likely solved, but I'd be interested looking at > the performance numbers for a couple of different approachs; > I'm curious about how much difference micro-optimizations > could really make in a case like this. I"VE ENCLOSED IT AT THE BOTTOM AS SQL COMMANDS IN CASE SOMEONE WANTS TO USE IT IN MYSQL AS I AM DOING.
> On a tangent, I'd always write this function as returning a > String, returning where you have "//here's the match" (and > probably null on no match.) I think that style gives you > better functional decomposition, and makes it much simpler > to test startWithPrep in isolation. I"M ACTUALLY USING IT TO RETAG PARTS-OF-SPEECH, WITH BRILL POS TAGS. SO IN MY INSTANTIATION OF IT, I AM PASSING THE FUNCTION ANOTHER ARRAY WHICH IS THE TAGS CORRESPONDING TO THE WORDS, AND SWAPPING THE TAGS WITH 'IN' IN LIEU OF WHAT THEY ARE (UNLESS THEY ARE ALREADY TAGGED 'CS').
HERE'S MY LIST. I AM NOT INCLUDING POSTPREPOSITIONS (LIKE 'ago') . ADDITIONALLY, MY alsoCS FIELD MEANS THAT THIS WORD IS ALSO OFTEN USED AS A SUBORDINATING CONJUNCTION:
# --------------------------------------------------------
# # Table structure for table `prepositions` #
CREATE TABLE `prepositions` ( `id` int(3) NOT NULL auto_increment, `preposition` varchar(40) NOT NULL default '', `alsoCS` int(3) NOT NULL default '0', PRIMARY KEY (`id`) ) ENGINE=InnoDB DEFAULT CHARSET=latin1;
# # Dumping data for table `prepositions` #
INSERT INTO `prepositions` VALUES (1, 'abaft', 0); INSERT INTO `prepositions` VALUES (2, 'about', 0); INSERT INTO `prepositions` VALUES (3, 'above', 0); INSERT INTO `prepositions` VALUES (4, 'according to', 0); INSERT INTO `prepositions` VALUES (5, 'across', 0); INSERT INTO `prepositions` VALUES (6, 'after', 1); INSERT INTO `prepositions` VALUES (7, 'against', 0); INSERT INTO `prepositions` VALUES (8, 'ahead of', 0); INSERT INTO `prepositions` VALUES (9, 'along', 0); INSERT INTO `prepositions` VALUES (10, 'along with', 0); INSERT INTO `prepositions` VALUES (11, 'alongside', 0); INSERT INTO `prepositions` VALUES (12, 'amid', 0); INSERT INTO `prepositions` VALUES (13, 'among', 0); INSERT INTO `prepositions` VALUES (14, 'amongst', 0); INSERT INTO `prepositions` VALUES (15, 'apart from', 0); INSERT INTO `prepositions` VALUES (16, 'around', 0); INSERT INTO `prepositions` VALUES (17, 'as', 1); INSERT INTO `prepositions` VALUES (18, 'as far as', 0); INSERT INTO `prepositions` VALUES (19, 'as well as', 0); INSERT INTO `prepositions` VALUES (20, 'at', 0); INSERT INTO `prepositions` VALUES (21, 'back of', 0); INSERT INTO `prepositions` VALUES (22, 'before', 1); INSERT INTO `prepositions` VALUES (23, 'behind', 0); INSERT INTO `prepositions` VALUES (24, 'below', 0); INSERT INTO `prepositions` VALUES (25, 'beneath', 0); INSERT INTO `prepositions` VALUES (26, 'beside', 0); INSERT INTO `prepositions` VALUES (27, 'between', 0); INSERT INTO `prepositions` VALUES (28, 'beyond', 0); INSERT INTO `prepositions` VALUES (29, 'but', 0); INSERT INTO `prepositions` VALUES (30, 'by', 0); INSERT INTO `prepositions` VALUES (31, 'concerning', 0); INSERT INTO `prepositions` VALUES (32, 'contrary to', 0); INSERT INTO `prepositions` VALUES (33, 'despite', 0); INSERT INTO `prepositions` VALUES (34, 'down', 0); INSERT INTO `prepositions` VALUES (35, 'during', 0); INSERT INTO `prepositions` VALUES (36, 'except', 0); INSERT INTO `prepositions` VALUES (37, 'excepting', 0); INSERT INTO `prepositions` VALUES (38, 'for', 0); INSERT INTO `prepositions` VALUES (39, 'from', 0); INSERT INTO `prepositions` VALUES (40, 'in', 0); INSERT INTO `prepositions` VALUES (41, 'in addition to', 0); INSERT INTO `prepositions` VALUES (42, 'in back of', 0); INSERT INTO `prepositions` VALUES (43, 'in front of', 0); INSERT INTO `prepositions` VALUES (44, 'in lieu of', 0); INSERT INTO `prepositions` VALUES (45, 'in place of', 0); INSERT INTO `prepositions` VALUES (46, 'in regard to', 0); INSERT INTO `prepositions` VALUES (47, 'in spite of', 0); INSERT INTO `prepositions` VALUES (48, 'in view of', 0); INSERT INTO `prepositions` VALUES (49, 'inside', 0); INSERT INTO `prepositions` VALUES (50, 'instead of', 0); INSERT INTO `prepositions` VALUES (51, 'into', 0); INSERT INTO `prepositions` VALUES (52, 'like', 0); INSERT INTO `prepositions` VALUES (53, 'near', 0); INSERT INTO `prepositions` VALUES (54, 'of', 0); INSERT INTO `prepositions` VALUES (55, 'off', 0); INSERT INTO `prepositions` VALUES (56, 'on', 0); INSERT INTO `prepositions` VALUES (57, 'on account of', 0); INSERT INTO `prepositions` VALUES (58, 'out', 0); INSERT INTO `prepositions` VALUES (59, 'out of', 0); INSERT INTO `prepositions` VALUES (60, 'outside', 0); INSERT INTO `prepositions` VALUES (61, 'over', 0); INSERT INTO `prepositions` VALUES (62, 'past', 0); INSERT INTO `prepositions` VALUES (63, 'rather than', 0); INSERT INTO `prepositions` VALUES (64, 'regarding', 0); INSERT INTO `prepositions` VALUES (65, 'round', 0); INSERT INTO `prepositions` VALUES (66, 'since', 1); INSERT INTO `prepositions` VALUES (67, 'through', 0); INSERT INTO `prepositions` VALUES (68, 'throughout', 0); INSERT INTO `prepositions` VALUES (69, 'till', 0); INSERT INTO `prepositions` VALUES (70, 'to', 0); INSERT INTO `prepositions` VALUES (71, 'together with', 0); INSERT INTO `prepositions` VALUES (72, 'toward', 0); INSERT INTO `prepositions` VALUES (73, 'towards', 0); INSERT INTO `prepositions` VALUES (74, 'under', 0); INSERT INTO `prepositions` VALUES (75, 'underneath', 0); INSERT INTO `prepositions` VALUES (76, 'until', 1); INSERT INTO `prepositions` VALUES (77, 'unto', 0); INSERT INTO `prepositions` VALUES (78, 'up', 0); INSERT INTO `prepositions` VALUES (79, 'up to', 0); INSERT INTO `prepositions` VALUES (80, 'upon', 0); INSERT INTO `prepositions` VALUES (81, 'versus', 0); INSERT INTO `prepositions` VALUES (82, 'via', 0); INSERT INTO `prepositions` VALUES (83, 'with', 0); INSERT INTO `prepositions` VALUES (84, 'with regard to', 0); INSERT INTO `prepositions` VALUES (85, 'within', 0); INSERT INTO `prepositions` VALUES (86, 'without', 0); INSERT INTO `prepositions` VALUES (87, 'worth', 0); INSERT INTO `prepositions` VALUES (88, 'with regards to', 0); INSERT INTO `prepositions` VALUES (89, 'aboard', 0); INSERT INTO `prepositions` VALUES (90, 'absent', 0); INSERT INTO `prepositions` VALUES (91, 'amidst', 0); INSERT INTO `prepositions` VALUES (92, 'astride', 0); INSERT INTO `prepositions` VALUES (93, 'atop', 0); INSERT INTO `prepositions` VALUES (94, 'besides', 0); INSERT INTO `prepositions` VALUES (95, 'following', 0); INSERT INTO `prepositions` VALUES (96, 'notwithstanding', 0); INSERT INTO `prepositions` VALUES (97, 'mid', 0); INSERT INTO `prepositions` VALUES (98, 'minus', 0); INSERT INTO `prepositions` VALUES (99, 'onto', 0); INSERT INTO `prepositions` VALUES (100, 'opposite', 0); INSERT INTO `prepositions` VALUES (101, 're', 0); INSERT INTO `prepositions` VALUES (102, 'subsequent to', 0); INSERT INTO `prepositions` VALUES (103, 'prior to', 0); INSERT INTO `prepositions` VALUES (104, 'next to', 0); INSERT INTO `prepositions` VALUES (105, 'near to', 0); INSERT INTO `prepositions` VALUES (106, 'owing to', 0); INSERT INTO `prepositions` VALUES (107, 'outside of', 0); INSERT INTO `prepositions` VALUES (108, 'on to', 0); INSERT INTO `prepositions` VALUES (109, 'in to', 0); INSERT INTO `prepositions` VALUES (110, 'inside of', 0); INSERT INTO `prepositions` VALUES (111, 'far from', 0); INSERT INTO `prepositions` VALUES (112, 'as to', 0); INSERT INTO `prepositions` VALUES (113, 'aside from', 0); INSERT INTO `prepositions` VALUES (114, 'because of', 0); INSERT INTO `prepositions` VALUES (115, 'close to', 0); INSERT INTO `prepositions` VALUES (116, 'due to', 0); INSERT INTO `prepositions` VALUES (117, 'by means of', 0); INSERT INTO `prepositions` VALUES (118, 'in accordance with', 0); INSERT INTO `prepositions` VALUES (119, 'on behalf of', 0); INSERT INTO `prepositions` VALUES (120, 'on top of', 0); INSERT INTO `prepositions` VALUES (121, 'in case of', 0); INSERT INTO `prepositions` VALUES (122, 'betwixt', 0); INSERT INTO `prepositions` VALUES (123, 'circa', 0); INSERT INTO `prepositions` VALUES (124, 'anti', 0); INSERT INTO `prepositions` VALUES (125, 'cum', 0); INSERT INTO `prepositions` VALUES (126, 'per', 0); INSERT INTO `prepositions` VALUES (127, 'qua', 0); INSERT INTO `prepositions` VALUES (128, 'sans', 0); INSERT INTO `prepositions` VALUES (129, 'vis-a-vis', 0); INSERT INTO `prepositions` VALUES (130, 'vis a vis', 0);
Oliver Wong - 27 Feb 2007 15:18 GMT [context is: picking out keywords -- "prepositions" -- out of already-parsed sentences.]
> # > # Table structure for table `prepositions` [quoted text clipped - 15 lines] > INSERT INTO `prepositions` VALUES (3, 'above', 0); > INSERT INTO `prepositions` VALUES (4, 'according to', 0); [and so on...]
You might want to make your preposition column indexable, as you will probably be searching through that column pretty frequently. And if the values in the preposition column are unique, you might want to drop the `id` column altogether and make your preposition column your primary key.
- Oliver
Ike - 27 Feb 2007 16:32 GMT > [context is: picking out keywords -- "prepositions" -- out of > already-parsed sentences.] [quoted text clipped - 5 lines] > > - Oliver As you can see from the code I enclosed, I actually just read it directly into an ArrayList and work with the ArrayList itself.
Free MagazinesGet these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...
|
|
|