Home | Contact Us | FAQ | Search & Site Map | Link to Us
Sign In | Join | Other 45 Sites in Network
HomeAnnouncementsWhite Papers
Discussion GroupsFirst AidDatabasesJavaBeansGUIJava 3DVirtual MachineCORBASecurityToolsGeneral
Java DirectoryOpen Source ProjectsSample Book ChaptersUser GroupsWeb Resources
Related Topics
Databases.NETMore Topics ...

Java Forum / General / February 2007

Tip: Looking for answers? Try searching our database.

comparing files without using arrays

Thread view: 
saiji.raman@gmail.com - 16 Feb 2007 06:45 GMT
hi all,

    got a problem with comparing two large files(A file with 8000
words and another file of size 8 MB with some content in it.. the
frequency of the 8000 words in the other file must be found)

   we tried it using arrays..

   but we get array out of bounds exception even after increasing the
heap space(-Xmx512m) in run time..

   can someone help us out with feasible methods for solving this
problem..

   Thanks..
denJs - 16 Feb 2007 09:09 GMT
You can use two arrays of any size as buffers for both files,
fill the buffers and compare the content in a loop.
Alex Hunsley - 16 Feb 2007 21:09 GMT
> You can use two arrays of any size as buffers for both files,
> fill the buffers and compare the content in a loop.

It's not a direct file comparison the OP is after ....
Gordon Beaton - 16 Feb 2007 09:30 GMT
> we get array out of bounds exception even after increasing the heap
> space(-Xmx512m) in run time..

If you're getting an array bounds exception then your problems aren't
due to lack of memory, they're caused by attempting to access beyond
the end of the array (regardless of its size).

An array of size N has indices 0 to N-1. Attempting to use an index
outside this range will cause an array bounds error.

/gordon

Signature

[ don't email me support questions or followups ]
g o r d o n  +  n e w s  @  b a l d e r 1 3 . s e

Gordon Beaton - 16 Feb 2007 09:49 GMT
> got a problem with comparing two large files(A file with 8000 words
> and another file of size 8 MB with some content in it.. the
> frequency of the 8000 words in the other file must be found)

I've addressed your array bounds exception in another post, but
thought I'd suggest a couple of things anyway...

Only the word list needs to be kept in memory, for example using a
Hashmap or other structure that lets you easily map words to counters.

The other file can be read and processed one line at a time. You never
need to hold more than one line's worth of information. From your
problem description, I can't see a need to read the entire file into
any structure if that's what you're doing now.

Read one line, tokenize it into words, then look up each word to
update the corresponding counter. Repeat with the next line, etc.

/gordon

Signature

[ don't email me support questions or followups ]
g o r d o n  +  n e w s  @  b a l d e r 1 3 . s e

Chris Dollin - 16 Feb 2007 09:51 GMT
>      got a problem with comparing two large files(A file with 8000
> words and another file of size 8 MB with some content in it.. the
[quoted text clipped - 4 lines]
>     but we get array out of bounds exception even after increasing the
> heap space(-Xmx512m) in run time..

(a) The array-indexing-exception should tell you /where/ it happened.
   Look at that piece of code.

(b) The exception happens because you're trying to us an index i that
   isn't 0 <= i < array.length. (Note: /less-than/ array.length.)
   Make sure the index makes sense.

(c) This has nothing to do with how much heap space you've allocated.
   If there wasn't enough room for the arrays, you would have got an
   OutOfMemory error.

(d) Your summary of what you want to do suggests /strongly/ that arrays
   are not the data structure you want, since you need to associate
   with each word from the word-file another value viz the count of
   times it occurs in the other-file. This to me suggests a different
   data type.    

Signature

Chris "electric hedgehog" Dollin
"It's just the beginning we've seen" - Colosseum, /Tomorrow's Blues/

Eric Sosman - 16 Feb 2007 13:04 GMT
> hi all,
>
[quoted text clipped - 6 lines]
>     but we get array out of bounds exception even after increasing the
> heap space(-Xmx512m) in run time..

    You are chasing the wrong fox.  ArrayIndexOutOfBoundsException
means that somewhere in your program there is an array of N things
(ints, Strings, whatever) and that you have tried to use it with
an index that is <0 or >=N.  It does not mean that your program is
running low on memory; that one produces OutOfMemoryError.

>     can someone help us out with feasible methods for solving this
> problem..

    Read the 8000-word file and enter each of its words as
the key in a Map<String,Integer> with the Integer value zero.
Then read the other file, word by word.  For each word, do
a get(word) on the Map.  If get(word) returns null, the word
is not one of the 8000 you care about, so ignore it.  On the
other hand if get(word) returns an Integer, add one to that
value and update the Map: put(word, biggerInteger).  At the
end, traverse the entire Map to retrieve the results.

Signature

Eric Sosman
esosman@acm-dot-org.invalid

Patricia Shanahan - 16 Feb 2007 16:51 GMT
...
>     Read the 8000-word file and enter each of its words as
> the key in a Map<String,Integer> with the Integer value zero.
[quoted text clipped - 4 lines]
> value and update the Map: put(word, biggerInteger).  At the
> end, traverse the entire Map to retrieve the results.

As an alternative to using Integer, I sometimes use a small Counter
class that wraps an int with increment() and get() operations. get() is
defined to return the number of times the Counter's increment() method
has been called.

Patricia
Mark Rafn - 16 Feb 2007 17:09 GMT
>    but we get array out of bounds exception even after increasing the
>heap space(-Xmx512m) in run time..

ArrayIndexOutOfBoundsException won't be fixed by increasing heap space.  It's
a bug in your code.
--
Mark Rafn    dagon@dagon.net    <http://www.dagon.net/>
Alex Hunsley - 16 Feb 2007 21:14 GMT
> hi all,
>
[quoted text clipped - 9 lines]
>     can someone help us out with feasible methods for solving this
> problem..

As the others have said, you have a bug in your program - you're trying
to access beyond the end of the array, at a guess.

As for a very basic strategy for your problem, read the 'word' file into
memory.  Put the words into a Set (e.g. HashSet)[1]. Then parse the 8meg
file - read 10k chunks at a time, for example (more efficient this way),
splitting each chunk into words, and check if each word in a chunk is in
the HashSet. If so, increase its count. (So you may want a HashMap, not
a HashSet, because then it would be easy to map from each word to its
frequency count.)

[1] Using HashSet this way gives you much better efficiency than the
naive way of storing the words in a simple array, and comparing every
word in the array each time!

lex
saiji.raman@gmail.com - 17 Feb 2007 07:44 GMT
thnx for ur suggestions... even i had the plan of opting for hashset
before.. Actually after finding the freqency array i got to apply
SVD(singular value decomposition), which needs a 2D array as an
input.

and what is that mapset all about??can i convert that into a 2d array
later by any means???

for now i hav used the array list concept..and converted that into 2d
array and hav applied svd(but i cant do this for large files)...
Boaz.Jan@gmail.com - 17 Feb 2007 13:09 GMT
sorry but i would like to know more about what you are trying to do,
you mentioned a compersion and a word freq. count
by compersion do you mean a lexigraphic compersion (char by char)? or
just compering the freq. of occurence of each word?

for checking the freq. of each word a HashTable(or any other
datastructure which provide you with the ability to set your own index
type an value with a corresponding value object) would be the best for
you
why? cause you can set the Index of each "Cell" as the Word itself
(and by that earning efficency)while the value is an Integer which
represents the number of occurnces (i recommend making a class that
contains an int memeber and have a method increment() instead of
creating an Integer instance for each incremetation. as said here
before).

inserting  -  put(Object key, Object value)
    Hashtable numbers = new Hashtable();
    numbers.put("one", new Integer(1));
    numbers.put("two", new Integer(2));
    numbers.put("three", new Integer(3));

retriving  - get(Objectt key)

    Integer n = (Integer)numbers.get("two");

for a help with the compersion itself i would require more info about
what you are trying to do and what are your limitations (unless all
the compersion is of the freq. of the words in both your files)

here is a ref to hashtable class doc
http://java.sun.com/j2se/1.5.0/docs/api/java/util/Hashtable.html
Lew - 17 Feb 2007 15:56 GMT
> for checking the freq. of each word a HashTable

if you are in J2ME, HashMap for JSE.

- Lew
Lew - 17 Feb 2007 16:07 GMT
> retriving  - get(Objectt key)
>
>      Integer n = (Integer)numbers.get("two");

Good exposition. Suggestion: use generics.

  Map<String, Integer> freqs = new HashMap<String, Integer>();
  // or use Map<String, Wrapper> as Boaz.Jan suggested
  ...
  int n = freqs.get( "two" ); // Wrapper w = freqs.get("two");

With the Wrapper you can use clever idioms like

  Wrapper w;
  if ( (w = freqs.get( word )) == null )
  {
    freqs.put( word, new Wrapper() );
  }
  else
  {
    w.increment();
  }

- Lew
Boaz.Jan@gmail.com - 17 Feb 2007 16:35 GMT
> Boaz....@gmail.com wrote:
> > retriving  - get(Objectt key)
[quoted text clipped - 21 lines]
>
> - Lew

didnt want to enter the concept of generics on a post clearly abit
less advanced
so copied the non generic example from the jse api docs about
hashtable
and sorry about the Hashtable - old habbit :) (the Dictionary class is
now obsolete and therefor so is hashtable... sorry again :)http://
java.sun.com/j2se/1.5.0/docs/api/java/util/Dictionary.html)
Lew - 17 Feb 2007 17:02 GMT
> didnt want to enter the concept of generics on a post clearly abit
> less advanced
> so copied the non generic example from the jse api docs about

One could argue that generics are just as easy and less error-prone compared
to the type casts needed for raw types.

- Lew
Chris Uppal - 17 Feb 2007 17:31 GMT
> > didnt want to enter the concept of generics on a post clearly abit
> > less advanced
> > so copied the non generic example from the jse api docs about
>
> One could argue that generics are just as easy and less error-prone
> compared to the type casts needed for raw types.

It seems reasonable to assume that someone (the OP) who doesn't yet know about
HashMap and its friends will not yet know about Java generics either.

   -- chris


Free Magazines

Get these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...

Oracle MagazineNetwork ComputingComputer WorldBio-IT WorldeWeekInformation WeekInfosecurity
 
Sign In
Join
My Latest Posts
My Monitored Threads
My Blog
My Photo Gallery
My Profile
My Homepage

Start New Thread
Enable EMail Alerts
Rate this Thread



©2008 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.