Lucene search problem

jim - 10 Mar 2004 12:46 GMT
I've got a problem with Lucene and hopefully somebody can help.  

I've indexed a whole HTML site using Lucene and the htmlparser from
SourceForge.  The problem is that some terms that appear in a page do
not bring that page back in the search results.  For example, if I
search for "simon" I get 3 results back, but when searching for
"simon goldwyn" I get 5 results back.  Yet the word "simon" is in each
of these 5 pages.  I can search for other terms that appear in the
pages and those pages are brought back in the results.  It just appears
to be certain terms that are not brought back in the search.  Has
anybody any ideas, or seen this before?  Is it likely to be a problem
with Lucene or a problem with the htmlparser that is parsing the content?

If anybody has got any ideas of what the problem could be, or a way to
narrow down where the fault could lie, it would be much appreciated.

MTIA

jim
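
One way to narrow this down, offered as a sketch rather than a diagnosis: look at the text the parser extracted for each page before it reaches the index. If a page's extracted text lacks the word, the parsing step is the likelier culprit; if the word is there but the search still misses the page, the indexing or query side deserves the attention. The example below assumes the extracted text is already available as plain strings keyed by page name. The class name, method name and sample data are hypothetical, and nothing here calls Lucene or htmlparser.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Lists the pages whose extracted text contains a given word (hypothetical helper). */
class ExtractedTextCheck {

    // Returns the names of pages whose text contains `word` as a whole token,
    // ignoring case. Tokens are runs of letters or digits.
    static List<String> pagesContaining(Map<String, String> extractedText, String word) {
        String wanted = word.toLowerCase(Locale.ROOT);
        List<String> pages = new ArrayList<>();
        for (Map.Entry<String, String> entry : extractedText.entrySet()) {
            String lowered = entry.getValue().toLowerCase(Locale.ROOT);
            for (String token : lowered.split("[^\\p{L}\\p{N}]+")) {
                if (token.equals(wanted)) {
                    pages.add(entry.getKey());
                    break; // one match per page is enough
                }
            }
        }
        return pages;
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for the parser's output.
        Map<String, String> extracted = new LinkedHashMap<>();
        extracted.put("a.html", "Interview with Simon Goldwyn");
        extracted.put("b.html", "Contact: simon@example.com");
        extracted.put("c.html", "No match here");
        System.out.println(pagesContaining(extracted, "simon"));
    }
}
```

Comparing this list with the pages the search returns for the same word shows which pages go missing, and at which stage.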

Andy Fish - 10 Mar 2004 14:32 GMT
This is just an educated guess, I'm afraid (I've worked with search engines
before, but not with Lucene).

Is there some kind of relevancy rank cutoff in effect? For instance,
searching for simon goldwyn would probably give those pages a higher relevancy
than just searching for the word simon. This might explain why pages drop out
when you search for simon on its own. Even if you're not specifying a minimum
score, it may be that there is a default value somewhere.
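
To make the cutoff idea concrete, here is a minimal sketch of how a minimum score would silently drop low-ranked pages. Everything in it is an assumption for illustration: the `Hit` class, the scores and the threshold are made up, and it does not use or describe Lucene's API. Whether Lucene applies any such default is exactly the open question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrates a relevancy cutoff on a list of scored hits (hypothetical types). */
class ScoreCutoffDemo {

    /** A search hit: a page name plus a relevancy score. */
    static final class Hit {
        final String page;
        final double score;

        Hit(String page, double score) {
            this.page = page;
            this.score = score;
        }
    }

    // Keeps only the pages whose score is at least minScore.
    static List<String> pagesAbove(List<Hit> hits, double minScore) {
        List<String> kept = new ArrayList<>();
        for (Hit hit : hits) {
            if (hit.score >= minScore) {
                kept.add(hit.page);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up scores for a single-word query.
        List<Hit> hits = Arrays.asList(
                new Hit("a.html", 0.62),
                new Hit("b.html", 0.41),
                new Hit("c.html", 0.08));
        System.out.println("no cutoff:   " + pagesAbove(hits, 0.0));
        System.out.println("cutoff 0.30: " + pagesAbove(hits, 0.30));
    }
}
```

If something like this were happening, printing the raw score of every hit for both queries would show pages falling just under a threshold.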

Something else that springs to mind is a feature that Autonomy had (or has).
If there are two words next to each other that start with a capital letter
(in the original text, not the search term), it can be configured to index
the pair as a single term, "simongoldwyn". This gives very high ranking
to proper names when searching for both parts of the name.
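
A rough sketch of that kind of tokenizing, to show what it would do to the index terms. This is not Autonomy's or Lucene's implementation; the class and method names are hypothetical. It also assumes the joined term replaces the two separate words, whereas a real product might index both forms.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Joins adjacent capitalized words into one lowercase term (illustrative only). */
class CapitalizedPairTokens {

    // Splits text into words; when two neighbouring words both start with an
    // uppercase letter, emits them as a single joined term instead of two.
    static List<String> tokens(String text) {
        String[] words = text.split("[^\\p{L}]+");
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < words.length) {
            String word = words[i];
            if (word.isEmpty()) { // split() can yield a leading empty string
                i++;
                continue;
            }
            boolean pair = i + 1 < words.length
                    && startsUpper(word)
                    && startsUpper(words[i + 1]);
            if (pair) {
                out.add((word + words[i + 1]).toLowerCase(Locale.ROOT));
                i += 2;
            } else {
                out.add(word.toLowerCase(Locale.ROOT));
                i++;
            }
        }
        return out;
    }

    static boolean startsUpper(String word) {
        return !word.isEmpty() && Character.isUpperCase(word.charAt(0));
    }

    public static void main(String[] args) {
        System.out.println(tokens("Interview with Simon Goldwyn today"));
    }
}
```

Under the replace-the-pair assumption, a page containing "Simon Goldwyn" would carry the term "simongoldwyn" but not "simon", so a search for "simon" alone would miss it.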

Andy


