Java Forum / General / December 2006

Java Basic Search engine

ibrahimover@gmail.com - 04 Dec 2006 14:42 GMT
Hi

I have almost 100 HTML pages on my local disk. First I have to index
them, then I have to run a simple word search on that index to find
which pages include a given word.

I don't know how to index: which algorithm to use (B+ tree or something
else) and what to index {word, page URL, ...}.

Does anybody have simple Java code that shows what to index, how to
index it, and how to search?

Any help would be great.

Thank You
Luc The Perverse - 04 Dec 2006 15:19 GMT
> Hi
>
[quoted text clipped - 9 lines]
>
> any help would be great for me

There are existing tools out there that are better than anything you are
going to be able to write.

If this is a learning exercise, then any kind of balanced tree is going to
be fine, unless you want to try some kind of hashing to save some time.

If you are hashing and sorting, you will probably benefit from converting
your search terms into numbers early on through some kind of secure
hashing algorithm (SHA-1, for example). Then you could choose an arbitrary
number of bits to divide the entries into separate hash groups (if you
choose 8 bits, you would have 256 groups), each of which contains a tree
that is searched for your 160-bit entry (a SHA-1 digest is 160 bits, not
128). Every search term points back to zero or more HTML pages, and each
hit corresponds to one or more HTML pages.

Repeat for every item in your list, and then compute your "best" results as
you see fit.
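
A minimal sketch of the bucketed layout described above, under stated assumptions: the class name BucketedIndex and its methods are made up for illustration, the first byte of the SHA-1 digest picks one of 256 groups, and each group is a java.util.TreeMap (a balanced tree) keyed by the full 160-bit digest.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration: 256 hash groups, each holding a balanced tree.
class BucketedIndex {
    private static final int BUCKETS = 256; // 8 bits of the digest select the group

    private final List<TreeMap<BigInteger, Set<String>>> buckets = new ArrayList<>();

    BucketedIndex() {
        for (int i = 0; i < BUCKETS; i++) {
            buckets.add(new TreeMap<>());
        }
    }

    // Returns the 20-byte (160-bit) SHA-1 digest of the lower-cased word.
    static byte[] sha1(String word) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            return md.digest(word.toLowerCase(Locale.ROOT).getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // Not expected on a standard JDK, which ships SHA-1.
            throw new IllegalStateException(e);
        }
    }

    // Records that the word occurs on the page.
    void add(String word, String page) {
        byte[] digest = sha1(word);
        int bucket = digest[0] & 0xFF; // first 8 bits as an unsigned value
        buckets.get(bucket)
               .computeIfAbsent(new BigInteger(1, digest), k -> new TreeSet<>())
               .add(page);
    }

    // Returns the pages containing the word, or an empty set.
    Set<String> lookup(String word) {
        byte[] digest = sha1(word);
        Set<String> pages = buckets.get(digest[0] & 0xFF).get(new BigInteger(1, digest));
        return pages == null ? Collections.<String>emptySet() : pages;
    }
}
```

For roughly 100 pages, a plain java.util.HashMap from word to set of pages would answer the same lookups with much less machinery; the layout above is mainly useful for seeing how hashing and trees can be combined.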

--
LTP

:)
Chris Uppal - 04 Dec 2006 16:01 GMT
> I have almost 100 html pages on my local disk and  first i have to
> index them then  i have to make simple word search on that index to
> find which pages include that word

If you aren't doing this as some sort of exercise, then you would probably find
it easier to use a pre-packaged search/indexing engine such as Apache Lucene.

If you /are/ doing it as an exercise, then text searching and indexing is a big
topic and I don't really know where to start.  Perhaps you should ask your
teacher (if you have one) for guidance and more detail about what you are
supposed to do.

   -- chris
ibrahimover@gmail.com - 04 Dec 2006 17:07 GMT
> > I have almost 100 html pages on my local disk and  first i have to
> > index them then  i have to make simple word search on that index to
[quoted text clipped - 9 lines]
>
>     -- chris

Thanks for answer

I'm doing this as an exercise. In fact, it isn't my exercise, but if I
succeed with this, most of the work will be done; the other parts are
just the usual reports, etc.

I hope I can find someone to guide me here.
Luc The Perverse - 05 Dec 2006 02:07 GMT
>> > I have almost 100 html pages on my local disk and  first i have to
>> > index them then  i have to make simple word search on that index to
[quoted text clipped - 20 lines]
>
> i hope i can find someone to guide me here

He did . . . He suggested Apache Lucene

--
LTP

:)
Julian Treadwell - 05 Dec 2006 04:17 GMT
>>>> I have almost 100 html pages on my local disk and  first i have to
>>>> index them then  i have to make simple word search on that index to
[quoted text clipped - 26 lines]
>
> :)

If you're doing it as an exercise, then you will need to follow these
basic steps:

(1) Write a program that will extract all appropriate words from your
web pages (exclude HTML tags and short words like "a") and build a
cross-reference table of these words against the pages they reside in.
This table should be in some database, doesn't much matter whether it's
XML or MySQL or whatever.  In the real world this program should be run
at least nightly to keep the database up-to-date.

(2) Create a search function on your main page to this database which
will allow the user to check this table for a particular word and then
allow him to link to any pages found.
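
The two steps above can be sketched in memory as follows. This is only an illustration under assumptions: WordIndex, addPage and search are hypothetical names, tags are stripped with a crude regular expression rather than a real HTML parser, "short" means fewer than three characters, and only ASCII letters and digits count as word characters.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical cross-reference table: word -> pages that contain it.
class WordIndex {
    private static final int MIN_WORD_LENGTH = 3; // assumption: skips "a", "of", ...

    private final Map<String, Set<String>> pagesByWord = new TreeMap<>();

    // Step 1: extract the words of one page and record them in the table.
    void addPage(String pageName, String html) {
        // Crude tag removal; script/style bodies and entities are not handled.
        String text = html.replaceAll("<[^>]*>", " ");
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.length() >= MIN_WORD_LENGTH) {
                pagesByWord.computeIfAbsent(token, k -> new TreeSet<>()).add(pageName);
            }
        }
    }

    // Step 2: look a single word up in the table.
    Set<String> search(String word) {
        Set<String> pages = pagesByWord.get(word.toLowerCase(Locale.ROOT));
        return pages == null
                ? Collections.<String>emptySet()
                : Collections.unmodifiableSet(pages);
    }

    public static void main(String[] args) {
        WordIndex index = new WordIndex();
        index.addPage("a.html", "<html><body><h1>Simple</h1> investigation of a case</body></html>");
        index.addPage("b.html", "<p>Another investigation</p>");
        System.out.println(index.search("investigation")); // [a.html, b.html]
        System.out.println(index.search("simple"));        // [a.html]
    }
}
```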

Good luck,

Julian
ibrahimover@gmail.com - 05 Dec 2006 18:25 GMT
Hi all..

Thanks for all the answers. First of all, I looked at the tools you
suggested. Swish-e looks great. I'm supposed to build a very simple one
like Swish-e, so these tools are a good example for me.

As Julian said, I have started on those steps:

First parse the HTML, then exclude tags and unwanted words, then index
them.

The question is how to index. There is one more point that I don't know
how to explain; maybe this example helps. If a page has a word like
"investigation", I have a tool that separates that word into
investigation-investigate-investigate, and I will index the page's link
under these words. I'm planning a structure such that if someone
searches for "investigation", the first results will be for
"investigation", then "investigate", and so on.

I hope that's clear. Until now everything is OK. The question is which
indexing algorithm to use and what else to index. It shouldn't be too
complicated. There is maybe one more important thing I should index;
again, I will give an example.

If we search for "simple investigation", the pages that contain
"simple investigation" should come first in the results, and then the
pages that contain "simple ... investigation" with other words in
between.

So only these two criteria are important to me. That's why I need to
find a suitable indexing algorithm.

Thank You

> >>>> I have almost 100 html pages on my local disk and  first i have to
> >>>> index them then  i have to make simple word search on that index to
[quoted text clipped - 44 lines]
>
> Julian
Julian Treadwell - 06 Dec 2006 00:37 GMT
> Hi all..
>
[quoted text clipped - 74 lines]
>>
>> Julian

One way to allow phrase searching would be to include the word position
in your index table.

So the table structure would be:

field1: word (key)
field2: page # (multi-value)
field3: position (multi-value,linked to page)

So if the user searches for "simple investigation" and your search
program found "simple" on page 100 at position 32 and "investigation" at
page 100 at position 33 it could decide there's a phrase match and list
page 100 at the top of the list.
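
A rough in-memory sketch of that table and the phrase check, assuming plain text has already been extracted from each page. The names PositionalIndex, addPage, hasPhrase and search are hypothetical. Positions count every token, so excluded short words would still need to keep their position numbers for adjacency to work.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical positional index: word -> page -> ascending word positions.
class PositionalIndex {
    private final Map<String, Map<String, List<Integer>>> postings = new TreeMap<>();

    // Records the position of every word on the page (0-based).
    void addPage(String page, String plainText) {
        int position = 0;
        for (String token : plainText.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty()) {
                continue; // split() can yield an empty leading token
            }
            postings.computeIfAbsent(token, k -> new TreeMap<>())
                    .computeIfAbsent(page, k -> new ArrayList<>())
                    .add(position++);
        }
    }

    private Map<String, List<Integer>> pagesFor(String word) {
        Map<String, List<Integer>> pages = postings.get(word.toLowerCase(Locale.ROOT));
        return pages == null ? Collections.<String, List<Integer>>emptyMap() : pages;
    }

    private List<Integer> positions(String word, String page) {
        List<Integer> found = pagesFor(word).get(page);
        return found == null ? Collections.<Integer>emptyList() : found;
    }

    // True if the words occur on the page at consecutive positions.
    boolean hasPhrase(String page, String... words) {
        if (words.length == 0) {
            return false;
        }
        for (int start : positions(words[0], page)) {
            boolean match = true;
            for (int i = 1; i < words.length && match; i++) {
                // Position lists are ascending, so binary search is valid.
                match = Collections.binarySearch(positions(words[i], page), start + i) >= 0;
            }
            if (match) {
                return true;
            }
        }
        return false;
    }

    // Pages containing all words: phrase matches first, then the rest.
    List<String> search(String... words) {
        List<String> exact = new ArrayList<>();
        List<String> loose = new ArrayList<>();
        if (words.length == 0) {
            return exact;
        }
        for (String page : pagesFor(words[0]).keySet()) {
            boolean hasAll = true;
            for (String word : words) {
                hasAll &= pagesFor(word).containsKey(page);
            }
            if (hasAll) {
                (hasPhrase(page, words) ? exact : loose).add(page);
            }
        }
        exact.addAll(loose);
        return exact;
    }
}
```

With pages "a simple investigation" and "an investigation that is simple", search("simple", "investigation") lists the first page ahead of the second, which matches the two ranking criteria asked for earlier in the thread.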
ibrahimover@gmail.com - 06 Dec 2006 12:40 GMT
Hi, I forgot to say there is a problem.

I'm not allowed to use any DB, so I have to solve this with text files.

I'm planning to make an index file that has something like

investigate|5
investigation|7
field|56
...

I should keep these words in some order, like a B-tree, then search for
"investigation" in that index; when I find it, I will get the pointer
"7". But the problem is that I don't want to build the B-tree every
time, so I guess I have to know how to implement a B-tree over a text
file (how to add/delete/search) instead of in memory. I'm not sure,
though; this is just what I came up with from my limited knowledge.

Then, in another object file, the data about "investigation" (page #,
position, ...) will be in the 7th object, so that with one search I can
go directly to the 7th object and get the information about it.

Another way that I thought of is a dictionary listing. I don't know much
about it, but I think it's something like

invest
       ---igate(5)
       ---igation(7)
       ---igator(88)

etc. Building the index like this the first time would be hard, but
later it's easy to search. This time, though, I don't know how to save
that index to a file.

I guess I'm confused :(

> One way to allow phrase searching would be to include the word position
> in your index table.
[quoted text clipped - 9 lines]
> page 100 at position 33 it could decide there's a phrase match and list
> page 100 at the top of the list.
Andrew Thompson - 06 Dec 2006 13:12 GMT
> Hi i forget to say  there is a problem
>
> im not alloved to use any DB so i have to solev this issue by text
> files

Huh?

Do you mean your boss said "Don't use a database!"
Why would the boss care, so long as it does
not cost anything?

Or, is it that you are teaching yourself Java, and
set the (arbitrary) rule that this code would not
use a database?

OTOH, if this is a college assignment, just how
much do you expect to learn by asking...
"does anybody have simple code to understand
what to index and how to index , how to search
writen with java "?

Something else, stranger altogether??

Andrew T.
ibrahimover@gmail.com - 06 Dec 2006 14:36 GMT
Hi, thanks for the answer, even if it doesn't seem helpful to me.

As I said before:

"I'm doing this as an exercise. In fact, it isn't my exercise, but if I
succeed with this, most of the work will be done; the other parts are
just the usual reports, etc."

As you can guess, I'm a student, and if you read my last post, I have
some ideas about what to do. But I'm not an expert, and I don't want to
waste a lot of time trying useless or poor algorithms. I'm not asking
for B-tree code or anything else; I just want to find the best way to do
this, and I guess asking helps me find it.

If that's not how it works here, I'm very sorry. I just thought I might
get some ideas and some guidance.

> > Hi i forget to say  there is a problem
> >
[quoted text clipped - 20 lines]
>
> Andrew T.
Julian Treadwell - 07 Dec 2006 01:10 GMT
> Hi thanx for answer even doesnt seems helpfull to me
>
[quoted text clipped - 36 lines]
>>
>> Andrew T.

You can use an alphabetically ordered text file instead of a database to
store your word index but you'll have to read through it sequentially
each time you do a search instead of doing a direct read.  But with
modern computer speeds that won't be noticeable.  You'll need to
have a line for each occurrence of each word.
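
A small sketch of such a file, with assumptions spelled out: each line has the made-up layout word|page|position (borrowing the "|" separator from the earlier post), the class TextFileIndex and its methods are hypothetical, and the map passed to write uses natural String ordering so that lines come out sorted by word. Because the file is sorted, the sequential scan can stop as soon as it passes the word.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;

// Hypothetical text-file index: one "word|page|position" line per occurrence.
class TextFileIndex {

    // Writes the index; the map must use natural String order (e.g. a default TreeMap).
    static void write(Path file, SortedMap<String, List<String>> locationsByWord) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : locationsByWord.entrySet()) {
            for (String location : entry.getValue()) {
                lines.add(entry.getKey() + "|" + location); // location is "page|position"
            }
        }
        try {
            Files.write(file, lines, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Scans the file from the top and returns every "page|position" for the word.
    static List<String> find(Path file, String word) {
        List<String> hits = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                int bar = line.indexOf('|');
                if (bar < 0) {
                    continue; // skip malformed lines
                }
                int cmp = line.substring(0, bar).compareTo(word);
                if (cmp == 0) {
                    hits.add(line.substring(bar + 1));
                } else if (cmp > 0) {
                    break; // sorted by word: no later line can match
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return hits;
    }
}
```

Sorting the words themselves, rather than the finished lines, matters here: "|" sorts after lower-case letters, so sorting whole lines would put "inn|..." before "in|..." and break the early exit.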
Julian Treadwell - 05 Dec 2006 04:01 GMT
You could look at this:

http://swish-e.org/

Might be what you're after.

> Hi
>
[quoted text clipped - 11 lines]
>
> Thank You


©2008 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.