Hi,
I need a regular expression that will match a word stem to that stem
PLUS all common suffixes.
'make' should match makes, maker, makings etc.
A slight twist is that I am using the Porter stemmer so some of the
stems are not real words.
An example is "say" which has the stem of 'sai'.
Here's what I'm using right now:
Pattern p =
Pattern.compile("\\b("+item+"{1,2}|(("+item+"{0})[eiy]?|\1))(es|er|e?d|en|y|ing||ness|ional)?s?\\b",
Pattern.CASE_INSENSITIVE);
"item" is the stem which I am trying to match in some unstemmed text.
match the item with the last letter possibly doubled
or
the item minus its last letter and optionally ending in i,y or e
but in that case only match the item{0} in \1
and
with any of the optional endings: es, er, ed, etc.
and
possibly ending in s (for makings, makers and the like.)
The trouble is that my regex matches things I don't want it to match.
The stem 'sai' (say) matches 'sad', for example, because the
("+item+"{0})[eiy]?|\1 part strips the i off the end and then finds
that sa(e?d) is a match.
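A minimal sketch that reproduces the false positive in isolation. It assumes only the "stem minus its last letter" branch of the pattern above; the class and method names are made up for illustration. In Java a quantifier binds to the preceding character only, so "sai{0}" behaves like plain "sa", which is written out explicitly here.

```java
import java.util.regex.Pattern;

class StemPatternDemo {
    // Hypothetical helper: mirrors only the truncated-stem branch of the posted
    // pattern (stem minus last letter, optional e/i/y, optional suffix, optional s).
    static Pattern truncatedStemPattern(String stem) {
        // Pattern.quote guards against regex metacharacters inside the stem.
        String head = Pattern.quote(stem.substring(0, stem.length() - 1));
        return Pattern.compile(
            "\\b" + head + "[eiy]?(es|er|e?d|en|y|ing|ness|ional)?s?\\b",
            Pattern.CASE_INSENSITIVE);
    }

    // True when the whole word is matched by the truncated-stem pattern.
    static boolean matchesWord(String stem, String word) {
        return truncatedStemPattern(stem).matcher(word).matches();
    }

    public static void main(String[] args) {
        for (String word : new String[] {"say", "saying", "sad", "sand"}) {
            System.out.println(word + " -> " + matchesWord("sai", word));
        }
    }
}
```

With the stem 'sai', the head is "sa", the optional [eiy] matches nothing, and the e?d suffix consumes the "d", so "sad" is accepted alongside "say" and "saying".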
Thanks
P.
hiwa - 25 Aug 2006 23:46 GMT
duffy.paul@comcast.net wrote:
> Hi,
>
[quoted text clipped - 32 lines]
> Thanks
> P.
If you have a hammer, you might see everything as nails.
But I'm afraid the stem/suffix parsing problem is too complex to be a
nail.
The principal weakness of regular expressions is that they can't handle
conditionals.
Paul D - 26 Aug 2006 12:01 GMT
> duffy.paul@comcast.net wrote:
>
[quoted text clipped - 39 lines]
> The principal weakness of regular expressions is that they can't handle
> conditionals.
Thanks, I added conditionals for a few of the more common cases and it
works MUCH better. Still have a few unintentional matches: 'moth'
matches 'mother'. Breaking up the big regex into smaller pieces also
made it faster.
Paul D - 26 Aug 2006 16:52 GMT
>> duffy.paul@comcast.net wrote:
>>
[quoted text clipped - 45 lines]
> matches 'mother'. Breaking up the big regex into smaller pieces also
> made it faster.
Actually, just testing that the search term and the candidate target
have the same Porter stem gives me what I need. The stems of mother and
moth are not the same.
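A rough sketch of that stem comparison, for anyone reading along. Java has no Porter stemmer in its standard library, so the stemmer is passed in as a function; SameStemFilter, sameStemWords and the toy stemmer in main are illustrative names and stand-ins, not part of any real library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

class SameStemFilter {
    // Returns the words in text whose stem equals the stem of the search term.
    // The stemmer argument is an assumption: plug in a real Porter stemmer here.
    static List<String> sameStemWords(String term, String text, UnaryOperator<String> stemmer) {
        String wanted = stemmer.apply(term.toLowerCase());
        List<String> hits = new ArrayList<>();
        // Split on anything that is not an ASCII letter to get candidate words.
        for (String word : text.split("[^A-Za-z]+")) {
            if (!word.isEmpty() && stemmer.apply(word.toLowerCase()).equals(wanted)) {
                hits.add(word);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Toy stand-in for a Porter stemmer: it only strips a trailing "s".
        UnaryOperator<String> toyStemmer =
            w -> w.endsWith("s") ? w.substring(0, w.length() - 1) : w;
        System.out.println(
            sameStemWords("moth", "A moth and its mother saw two moths.", toyStemmer));
    }
}
```

Because the decision rests on stem equality rather than on a suffix list, 'mother' is rejected for the term 'moth' as long as the stemmer gives the two words different stems.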
Then the long regex can be replaced by a much fuzzier one:
"\\b"+item+"{0,1}[a-z]{0,10}\\b"
Chris Uppal - 26 Aug 2006 10:00 GMT
> I need a regular expression that will match a word stem to that stem
> PLUS all common suffixes.
Then you are out of luck...
Stemming is a complex, and highly heuristic, algorithm and is not a suitable
application for regexps. (Indeed, very little /is/ a suitable application for
regexps -- I wish they had never been added to the standard library).
-- chris