I've got a problem with Lucene and hopefully somebody can help.
I've indexed a whole HTML site using Lucene and the htmlparser from
SourceForge. The problem is that some terms that appear in a page do
not bring that page back in the search results. For example, if I search
for "simon" I get 3 results back, but when searching for "simon goldwyn"
I get 5 results back. Yet the word "simon" is in each of these 5 pages.
I can search for different terms that appear in the pages and they
are brought back in the results. It just appears to be certain terms
that are not brought back in the search. Has anybody any ideas? Or
seen this before? Is it likely to be a problem with Lucene or a
problem with the htmlparser that is parsing the content?
If anybody has any ideas about what the problem could be, or a way to
narrow down where the fault could lie, it would be much appreciated.
MTIA
jim
This is just an educated guess, I'm afraid (I've worked with search engines
before but not with Lucene).
Is there some kind of relevancy rank cutoff in effect? For instance,
searching for "simon goldwyn" would probably give a matching page a higher
relevancy score than just searching for the word "simon". This might explain
why some pages don't come back for "simon" on its own. Even if you're not
specifying a minimum score, it may be that there is a default value somewhere.
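
To make the cutoff idea concrete, here is a toy sketch in Python. It is not Lucene and does not use Lucene's scoring; the scoring function, the page contents and the cutoff value are all made up for illustration. It only shows how a minimum score could produce the symptom described: fewer hits for one word than for two.

```python
from collections import Counter

# Hypothetical cutoff; nothing here says Lucene applies one by default.
HYPOTHETICAL_CUTOFF = 2


def toy_score(query, text):
    """Toy relevance: total occurrences of the query terms in the text."""
    counts = Counter(text.lower().split())
    return sum(counts[term] for term in query.lower().split())


def search(query, docs, min_score):
    """Return the names of documents whose toy score reaches min_score."""
    return [name for name, text in docs.items()
            if toy_score(query, text) >= min_score]


# Made-up pages, purely for illustration.
docs = {
    "a.html": "Simon Goldwyn wrote this page",
    "b.html": "Simon says hello to Simon",
    "c.html": "nothing relevant here",
}

simon_hits = search("simon", docs, HYPOTHETICAL_CUTOFF)
both_hits = search("simon goldwyn", docs, HYPOTHETICAL_CUTOFF)
print(simon_hits, both_hits)
```

In this toy, a.html contains "simon" but only scores high enough once "goldwyn" is added to the query. If the real search shows the same pattern, printing the raw scores for both queries would be a quick way to confirm or rule out a cutoff.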
Something else that springs to mind is a feature that Autonomy had (or has).
If there are two words next to each other that start with a capital letter
(in the original text, not the search term), it can be configured to index
the pair as a single term, "simongoldwyn". This gives very high ranking
to proper names when searching for both parts of the name.
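
A rough sketch of that kind of pair-merging at index time, again in plain Python rather than any real Autonomy or Lucene API. The function name and the flag are hypothetical, and I am assuming the merged term replaces the two individual words (it may well be indexed in addition to them, in which case a search for "simon" alone would still match).

```python
def index_terms(text, merge_capitalised_pairs=True):
    """Split text into lowercase index terms.

    Assumption: two adjacent words that both start with a capital letter
    are joined into one term, and the separate words are NOT indexed.
    """
    words = text.split()
    terms = []
    i = 0
    while i < len(words):
        is_pair = (
            merge_capitalised_pairs
            and i + 1 < len(words)
            and words[i][:1].isupper()
            and words[i + 1][:1].isupper()
        )
        if is_pair:
            # Join the pair into a single term and skip both words.
            terms.append((words[i] + words[i + 1]).lower())
            i += 2
        else:
            terms.append(words[i].lower())
            i += 1
    return terms


terms = index_terms("Written by Simon Goldwyn last year")
print(terms)
print("simon" in terms)
```

Under that assumption the page is indexed under "simongoldwyn" but not under "simon", so a one-word search would miss it. Dumping the terms that actually ended up in the index for one of the missing pages would show whether anything like this is going on.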
Andy