Home | Contact Us | FAQ | Search & Site Map | Link to Us
Sign In | Join | Other 45 Sites in Network
HomeAnnouncementsWhite Papers
Discussion GroupsFirst AidDatabasesJavaBeansGUIJava 3DVirtual MachineCORBASecurityToolsGeneral
Java DirectoryOpen Source ProjectsSample Book ChaptersUser GroupsWeb Resources
Related Topics
Databases.NETMore Topics ...

Java Forum / General / February 2007

Tip: Looking for answers? Try searching our database.

Java heap and big database queries

Thread view: 
Patricia Shanahan - 10 Feb 2007 17:59 GMT
I'm using Java to query a MySQL database through com.mysql.jdbc.Driver.

My Java program calculates a relatively small summary of the query
result, and I'm sure my Java structure would fit in the default heap memory.

However, I run out of heap memory while running with -Xmx700m.

The program runs perfectly on a million line subset of the database, but
fails on the full database, over 88 million lines.

My query is:

String getData = "select count(*), "
+ "sum(net_amt), "
+ "exists(select * from stores where stores.store_id =
transactions.store_id) "
+ "from transactions group by txn_id;";

...

Statement st = con.createStatement();
ResultSet result = st.executeQuery(getData);

There are about 10,160,053 distinct values of txn_id.

Any ideas for what I may be doing wrong? Suggestions for how to obtain
and scan a large query result without Java heap problems?

Thanks,

Patricia
Chris Uppal - 10 Feb 2007 19:22 GMT
> The program runs perfectly on a million line subset of the database, but
> fails on the full database, over 88 million lines.

See this page:

http://dev.mysql.com/doc/refman/5.0/en/connector-j-reference-implementation-note
s.html


and scan down about 2/3 to a sub-section titled "ResultSet".

(Made my hair stand on end...  There is no way, just /no way/, I am ever going
to choose MySQL for /anything/.)

   -- chris
Patricia Shanahan - 10 Feb 2007 19:44 GMT
>> The program runs perfectly on a million line subset of the database, but
>> fails on the full database, over 88 million lines.
[quoted text clipped - 4 lines]
>
> and scan down about 2/3 to a sub-section titled "ResultSet".

Thanks. That certainly matches the symptoms I'm seeing.

> (Made my hair stand on end...  There is no way, just /no way/, I am ever going
> to choose MySQL for /anything/.)

I picked MySQL because I had worked on a project that used it, but
for far smaller tables, and the DB comparison web pages I found seemed
to indicate it was reasonable performance for large tables.

I'm very open to suggestions for better choices, especially after this
experience. I need free or cheap for academic and personal use. No need
for commercial use.

Patricia
Paul Tomblin - 10 Feb 2007 19:52 GMT
In a previous article, Patricia Shanahan <pats@acm.org> said:
>I'm very open to suggestions for better choices, especially after this
>experience. I need free or cheap for academic and personal use. No need
>for commercial use.

Postgresql implemented more of the SQL standard (although MySQL has caught
up in the last couple of releases as long as you use InnoDB) and is really
free, even for commercial use

Signature

Paul Tomblin <ptomblin@xcski.com> http://blog.xcski.com/
Last I checked, it wasn't the power cord for the Clue Generator
that was sticking up your a.s.
             -- John Novak

Chris Smith - 10 Feb 2007 20:31 GMT
> I'm very open to suggestions for better choices, especially after this
> experience. I need free or cheap for academic and personal use. No need
> for commercial use.

I second the recommendation for PostgreSQL.  I have used it extensively
from Java applications with large amounts of data, and have run into few
problems.  I did run into a quirk in the 7.x release cycle, but I was
able to easily patch the JDBC driver, and the problem is completely
solved with the JDBC drivers for the 8.x versions.

PostgreSQL still does not implement all of JDBC perfectly.  In
particular, PostgreSQL does fall back to pre-fetching more complex
result sets (e.g., scrollable result sets).  A full list of limitations
is available at http://jdbc.postgresql.org/todo.html

That said, I am not aware of any low-cost database that does better than
PostgreSQL.  It is far better than anything else I've tried.

Signature

Chris Smith

Arne Vajhøj - 10 Feb 2007 20:59 GMT
> I'm very open to suggestions for better choices, especially after this
> experience. I need free or cheap for academic and personal use. No need
> for commercial use.

MySQL, PostgreSQL and FireBird are the usual suspects for a
free database server.

Arne
David Segall - 11 Feb 2007 04:52 GMT
>>> The program runs perfectly on a million line subset of the database, but
>>> fails on the full database, over 88 million lines.
[quoted text clipped - 17 lines]
>experience. I need free or cheap for academic and personal use. No need
>for commercial use.
I can't compare them but I have what I believe to be a complete list
of freely distributable "heavy duty" databases at
<http://database.profectus.com.au/>. Of these, MySql has the most
restrictive license and that includes IBM, Oracle and Microsoft.
Arne Vajhøj - 12 Feb 2007 00:21 GMT
> I can't compare them but I have what I believe to be a complete list
> of freely distributable "heavy duty" databases at
> <http://database.profectus.com.au/>. Of these, MySql has the most
> restrictive license and that includes IBM, Oracle and Microsoft.

The text for MySQL is rather imprecise.

You can use MySQL all you want with closed source.

You can use the MySQL connectors all you want with closed source.

As long as you do not distribute.

If you distribute your stuff and it is "linked" with MySQL
code then then you need to buy a commercial license or
make your stuff GPL (or at least FLOSS when it is MySQL).

How to define "linked" can be a bit fuzzy.

But it is not something special for MySQL - this is
how GPL works.

It is rather unusual to use GPL for a library - often
LGPL (or an Apache/BSD style license) is chosen.

The true GPL believers will not consider that a problem.

But there are a reason that Apache/BSD sometimes is called
business friendly open source license.

Arne
David Segall - 12 Feb 2007 13:09 GMT
>If you distribute your stuff and it is "linked" with MySQL
>code then then you need to buy a commercial license or
[quoted text clipped - 4 lines]
>But it is not something special for MySQL - this is
>how GPL works.
I am not qualified to interpret the GPL license but MySQL AB's
interpretation contrasts with the thousands of commercial applications
that are "linked" to Linux and dozens of other GPLd programs. Since,
as far as I am aware, these problems don't arise with the other freely
distributable databases I list at <http://database.profectus.com.au>,
including those from Oracle and Microsoft, it seems foolish to choose
MySQL.
Arne Vajhøj - 13 Feb 2007 01:58 GMT
>> If you distribute your stuff and it is "linked" with MySQL
>> code then then you need to buy a commercial license or
[quoted text clipped - 11 lines]
> including those from Oracle and Microsoft, it seems foolish to choose
> MySQL.

PostgreSQL uses BSD license.

Firebird uses a derivative of MPL.

Both are "business friendly".

Linux is not a problem for two reasons:
  - applications usually does not interact directly
    with Linux but uses GLIBC which is LGPL
  - Linus Torvalds has a definition of linking that
    is a bit narrower than MySQL's

No huge contrast - just a small difference.

If your page only is for those that produce closed
source to distribute to others, then you are correct,
but should probably state so on the page so other
people does not waste their time reading your advice.

Arne
Arne Vajhøj - 13 Feb 2007 02:02 GMT
> If your page only is for those that produce closed
> source to distribute to others, then you are correct,
> but should probably state so on the page so other
> people does not waste their time reading your advice.

Well actually you do so.

So the above is not correct.

But I still do not like it - all the relevant
keywords: GPL, linking and distribution is missing.

Arne
David Segall - 13 Feb 2007 14:09 GMT
>> If your page only is for those that produce closed
>> source to distribute to others, then you are correct,
[quoted text clipped - 9 lines]
>
>Arne
David Segall - 13 Feb 2007 15:18 GMT
>> If your page only is for those that produce closed
>> source to distribute to others,
I cannot accept "only" in that sentence. My intention was that
_everyone_ could use the software I listed in any way they chose.
>>then you are correct,
>> but should probably state so on the page so other
[quoted text clipped - 6 lines]
>But I still do not like it - all the relevant
>keywords: GPL, linking and distribution is missing.

Arne, I want you to like it! If I can achieve this by adding some
brief, relevant information to the web page I would be pleased to do
so and would appreciate it if you would post the information here or
email it to david at segall dot net. I am reluctant to include any
detailed licensing information because each of the products listed
comes with several pages of legal restrictions that I do not wish to
maintain.

I added the licensing qualifications to the MySQL entry in my list
because I was concerned by a previous post from Chris Smith that
contained the information that he has posted in this thread. Because
MySQL is so popular and many people, encouraged by MySQL AB, consider
it to be freely distributable I chose to qualify the entry in my list
rather than simply delete it.
Lew - 13 Feb 2007 21:34 GMT
> Because
> MySQL is so popular and many people, encouraged by MySQL AB, consider
> it to be freely distributable I chose to qualify the entry in my list
> rather than simply delete it.

MySQL has two licenses, one of which is the GPL. It is no more nor less
restricted than other GPLed code.

What people "consider" does not determine the license. The license terms
published by MySQL AB determine the license. You could go straight to them for
the definitive information instead of relying on rumor.

- Lew
Arne Vajhøj - 14 Feb 2007 00:51 GMT
> MySQL has two licenses, one of which is the GPL. It is no more nor less
> restricted than other GPLed code.
>
> What people "consider" does not determine the license. The license terms
> published by MySQL AB determine the license. You could go straight to
> them for the definitive information instead of relying on rumor.

Which is not true.

MySQL choose the GPL license.

It is not up to them to interpret the GPL license.

FSF must be the authoritative source on how the GPL
license should be interpreted.

Especially since MySQL has an economic interest in
FUD'ing a bit.

Arne
Chris Smith - 14 Feb 2007 15:51 GMT
> MySQL has two licenses, one of which is the GPL. It is no more nor less
> restricted than other GPLed code.

In a legal sense, that's true.  I absolutely agree that the GPL doesn't
say what MySQL says that it does.  Their interpretation is hard to
believe in terms of the text of the agreement (IANAL), and common sense
tells us they are completely loony.

Nevertheless, do you want to defend yourself in court if they sue you?  
That's my main concern.  There are a lot of crazy people with whom I
avoid any kind of relationship.  Even if they can't legally do anything
to hurt me, they can cause quite a few problems along the way.

(My other concern is that their web page misleads people on what the
GPL, in general, does.  That seems to be rather widespread, though.  The
whole free software world often seems to subscribe to the "wishful
thinking" approach to interpreting legal agreements.)

Signature

Chris Smith

David Segall - 15 Feb 2007 14:48 GMT
>What people "consider" does not determine the license. The license terms
>published by MySQL AB determine the license. You could go straight to them for
>the definitive information instead of relying on rumor.
Why bother? Chris Smith has pointed out that MySQL AB are deliberately
vague about their license and Arne Vajhøj has put forward a good case
for interpreting their license as a "legally defined" GPL licence.

I have not seen _any_ argument for preferring MySQL over the other
databases I listed at http://database.profectus.com.au. Why risk a
legal battle when you have several alternatives that do not attempt to
restrict how you use the database?
Chris Uppal - 15 Feb 2007 19:17 GMT
> I have not seen _any_ argument for preferring MySQL over the other
> databases I listed at http://database.profectus.com.au. Why risk a
> legal battle when you have several alternatives that do not attempt to
> restrict how you use the database?

I have my own opinions on the legal matters, but I think there is -- or at
least may be -- a technical argument for MySQL (in this case).

MySQL is, as far as I can tell, alone amongst the heavyweight databases in that
it can run without full ACID support (and may even defaults to doing so!).
Normally I would say that was a huge argument /against/ MySQL, but for the
particular application under discussion (Patricia's), I think that might be an
advantage.  It ought to allow MySQL to run faster, perhaps significantly so.

Whether there is any real advantage, I have no idea.  But I have assumed (maybe
wrongly) for a long time that the frequently heard justification: "I use MySQL
'cos it's lightning fast", is (if it's not pure myth) a consequence of running
without a the safety net that sensible DBMSs provide.

   -- chris
Patricia Shanahan - 15 Feb 2007 23:11 GMT
>> I have not seen _any_ argument for preferring MySQL over the other
>> databases I listed at http://database.profectus.com.au. Why risk a
[quoted text clipped - 14 lines]
> 'cos it's lightning fast", is (if it's not pure myth) a consequence of running
> without a the safety net that sensible DBMSs provide.

Avoiding ACID overhead was one of my thoughts in picking MySQL originally.

However, the non-streaming of query results was very discouraging. It's
not just the immediate problem. It is the implication that the way I
want to work is not something MySQL is tuned for.

My Ph.D. adviser has also come down in favor of PostgreSQL, so I would
need a really strong reason to go the other way.

I had a few days delay while I did some other work, but now I'm in the
process of reconstructing the database. I'll soon find out how well it
works in practice.

Patricia
Mike  Schilling - 15 Feb 2007 23:23 GMT
> Avoiding ACID overhead was one of my thoughts in picking MySQL originally.
>
> However, the non-streaming of query results was very discouraging.

And the alternative being to transmit one row at a time isn't promising
either; for a large dataset, network overhead will dwarf any other
processing you might do.

> It's not just the immediate problem. It is the implication that the way I
> want to work is not something MySQL is tuned for.

Or that MySQL isn't tuned, period.
Mark Thornton - 16 Feb 2007 09:22 GMT
Mike Schilling wrote:

>>Avoiding ACID overhead was one of my thoughts in picking MySQL originally.
>>
[quoted text clipped - 3 lines]
> either; for a large dataset, network overhead will dwarf any other
> processing you might do.

Transmitting rows a batch at a time (say 100) avoids that problem and
also has only a modest heap impact.

Mark Thornton
Mike Schilling - 16 Feb 2007 17:23 GMT
> Mike Schilling wrote:
>>
[quoted text clipped - 9 lines]
> Transmitting rows a batch at a time (say 100) avoids that problem and also
> has only a modest heap impact.

Yes.  That MySQL doesn't batch rows is a sign of its primitiveness.
Mark Thornton - 16 Feb 2007 19:41 GMT
>>Transmitting rows a batch at a time (say 100) avoids that problem and also
>>has only a modest heap impact.
>
> Yes.  That MySQL doesn't batch rows is a sign of its primitiveness.

Indeed. However this behaviour is common to quite a lot of JDBC
driver/database combinations. Apparently it suits common enterprise
database activities. It is rather less suitable to
scientific/mathematical  use of databases. In these cases you would have
to be barking mad to perform the computation in SQL, but the alternative
involves retrieving a very substantial amount of data. There are also
other issues with this kind of use, notably in locking / transactions.

SQL Server with a suitable driver will stream results. For smallish
datasets you can use SQL Server Express for free (assuming you are
running on Windows).

FireBird is another option worth considering.

Mark Thornton
Patricia Shanahan - 17 Feb 2007 23:06 GMT
>>> Transmitting rows a batch at a time (say 100) avoids that problem and
>>> also has only a modest heap impact.
[quoted text clipped - 8 lines]
> involves retrieving a very substantial amount of data. There are also
> other issues with this kind of use, notably in locking / transactions.

Ultimately, I will be using this data e.g. as training data in my Ph.D.
research project, so I do feel more comfortable knowing how to get the
data to my Java program.

> SQL Server with a suitable driver will stream results. For smallish
> datasets you can use SQL Server Express for free (assuming you are
[quoted text clipped - 3 lines]
>
> Mark Thornton

SQL Server Express is limited to 4 GB total database size, and one of my
tables is about 10 GB, so it is not suitable for my particular situation.

Meanwhile, I've moved my database to PostgreSQL. I had some initial
performance problems with a large query, but I was able to cure it by
increasing the sort buffer size.

It streams data to my Java program, so I no longer run out of memory.

Thanks to everyone for all the advice.

Patricia
Arne Vajhøj - 16 Feb 2007 02:32 GMT
> MySQL is, as far as I can tell, alone amongst the heavyweight databases in that
> it can run without full ACID support (and may even defaults to doing so!).
[quoted text clipped - 6 lines]
> 'cos it's lightning fast", is (if it's not pure myth) a consequence of running
> without a the safety net that sensible DBMSs provide.

I think that is correct.

Arne
Arne Vajhøj - 19 Feb 2007 02:26 GMT
> Arne, I want you to like it! If I can achieve this by adding some
> brief, relevant information to the web page I would be pleased to do
[quoted text clipped - 10 lines]
> it to be freely distributable I chose to qualify the entry in my list
> rather than simply delete it.

Maybe something like:

You should be aware that the MySQL ODBC, JDBC and ADO.NET drivers
is only freely available under GPL license (not LGPL license) with
a FLOSS exception. The GPL license states that code "linked" to
GPL code must be GPL code itself. If your code is GPL or other
open source, then you do not have any problems. If your code
is closed source, then you should seek legal advice before
using those drivers.

Arne
Chris Smith - 13 Feb 2007 04:26 GMT
> The text for MySQL is rather imprecise.
>
[quoted text clipped - 3 lines]
>
> As long as you do not distribute.

See http://www.mysql.com/company/legal/licensing/commercial-license.html 
for information, though, on MySQL's definition of "distribute".  It may
surprise you.  If you write them for clarification, their answers may do
even more to surprise you.  MySQL feels that any time that anyone might
possibly use your application with MySQL, you should pay them money.  
They call this distribution, despite the obviously contrary common
meaning of that word.  I was told, at one point, that they believe this
even if all I do is write an application to conform to a standard API
such as JDBC.  In their view, if you write a portable database
application, and later someone else installs MySQL and uses your
application with it, you owe them money.  Apparently, they feel that
JDBC, ODBC, and other APIs are all incorporated into their product
because they provide an implementation of the APIs.

Could you fight them on this and win in court?  I don't know.  Most
likely, it depends on how much money you have, among other factors.  
Ultimately, though, you're ethically bound to respect their decisions
with regard to the distribution of their software.  You are also well-
advised from a practical perspective not to deal with anyone who is so
deluded and hires lawyers on a regular basis.

I just wish they would stop their campaign of misinformation about the
GPL license.  Everything else they do is easy to avoid quite neatly.  
PostgreSQL works fine, so use it.

Signature

Chris Smith

Arne Vajhøj - 14 Feb 2007 00:56 GMT
>> The text for MySQL is rather imprecise.
>>
[quoted text clipped - 22 lines]
> Ultimately, though, you're ethically bound to respect their decisions
> with regard to the distribution of their software.

No.

They decided to use the GPL license.

We need to respect the GPL license.

I think it will be considered extremely unethical by the GPL
software community to allow "extra conditions" on GPL'ed code.

If they want a MySQL Open Source License with a twist
compared to standard GPL, then they should make one.

Arne
Mike Schilling - 11 Feb 2007 19:54 GMT
> I'm very open to suggestions for better choices, especially after this
> experience. I need free or cheap for academic and personal use. No need
> for commercial use.

Piggybacking in Patricia's question, I'm curious if anyone has experience
with the newly open-sourced Ingres.  (The commerical version, not University
Ingres.)
aloha.kakuikanu - 15 Feb 2007 23:31 GMT
> I'm using Java to query a MySQL database through com.mysql.jdbc.Driver.
>
[quoted text clipped - 23 lines]
> Any ideas for what I may be doing wrong? Suggestions for how to obtain
> and scan a large query result without Java heap problems?

It can be argued that it is entirely possible to design your client
application is such a way that it never is required to fetch huge
result set. After all query results are presented to the end user, who
is capable digesting only rather limited amount of data. In your
example, why do you need to do partial aggregation and fetch
10,160,053 values, especially that you projected away txn_id column?
Patricia Shanahan - 16 Feb 2007 00:10 GMT
...
> It can be argued that it is entirely possible to design your client
> application is such a way that it never is required to fetch huge
> result set. After all query results are presented to the end user, who
> is capable digesting only rather limited amount of data. In your
> example, why do you need to do partial aggregation and fetch
> 10,160,053 values, especially that you projected away txn_id column?
...

I'm doing data mining, not transaction processing. I am the end user.

If I were more skilled at SQL than Java, I might do more of the
processing in SQL, and have smaller results sets. As it is, I've been
using SQL for a few days, and Java for years...

I know, for example, exactly how to generate histogram tables in Java,
in a form that can be read into a spreadsheet for display.

Patricia
Vadim Tropashko - 16 Feb 2007 01:26 GMT
> I'm doing data mining, not transaction processing. I am the end user.

More power to SQL, then.

> If I were more skilled at SQL than Java, I might do more of the
> processing in SQL, and have smaller results sets. As it is, I've been
> using SQL for a few days, and Java for years...
>
> I know, for example, exactly how to generate histogram tables in Java,
> in a form that can be read into a spreadsheet for display.

Here is sample book chapter with section on histograms http://
vadimtropashko.wordpress.com/files/2007/02/ch3.pdf
The whole SQL patterns book is here
http://www.rampant-books.com/book_2006_1_sql_coding_styles.htm
aloha.kakuikanu - 16 Feb 2007 03:00 GMT
> If I were more skilled at SQL than Java, I might do more of the
> processing in SQL, and have smaller results sets. As it is, I've been
> using SQL for a few days, and Java for years...

What people usually when they find a challenging query, is they post
it to a proper forum. You write a short description of a problem, a
sample data, and the expected output. Unlike typical java help
request, each SQL query is like a tiny math puzzle. This is why there
would be plenty of people who are eager to take on your challenge.

Unfortunately, the big 3 (and open source) communities has grown
apart, so that there is no single the most appropriate spot where to
post generic SQL problems. For Oracle the best place would be OTN, for
SQLServer public.microsoft.sqlserver.programming. I don't know what is
the best for MySQL.


Free Magazines

Get these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...

Oracle MagazineNetwork ComputingComputer WorldBio-IT WorldeWeekInformation WeekInfosecurity
 
Sign In
Join
My Latest Posts
My Monitored Threads
My Blog
My Photo Gallery
My Profile
My Homepage

Start New Thread
Enable EMail Alerts
Rate this Thread



©2009 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.