Home | Contact Us | FAQ | Search & Site Map | Link to Us
Sign In | Join | Other 45 Sites in Network
HomeAnnouncementsWhite Papers
Discussion GroupsFirst AidDatabasesJavaBeansGUIJava 3DVirtual MachineCORBASecurityToolsGeneral
Java DirectoryOpen Source ProjectsSample Book ChaptersUser GroupsWeb Resources
Related Topics
Databases.NETMore Topics ...

Java Forum / General / January 2006

Tip: Looking for answers? Try searching our database.

Determine datatype

Thread view: 
max_ny - 12 Jan 2006 22:11 GMT
Hii,
i am trying write some code to transfer data from csv file to a
database.
my problem is to determine the datatype of various column values in the
csv file. I read the csv file column wise and have to determine what
datatype each col. contains, so to map it to correct type for creating
tables in the DB.
Can someone provide me any reference for determining datatype of a col.
value read from csv file, (my google search couldn't yield exact
results); i can't figure out how should i determine the type of each
column data

thanks
Thomas Kellerer - 12 Jan 2006 22:30 GMT
max_ny wrote on 12.01.2006 23:11:
> Hii,
> i am trying write some code to transfer data from csv file to a
[quoted text clipped - 7 lines]
> results); i can't figure out how should i determine the type of each
> column data

There is no way you can do it 100%. Any character sequence you read
could be mapped to several data types.

Take the sequence "20060112". This could be a VARCHAR column that
happens to have no characters (only numbers), or it could be an integer
value, or it could be a float value that happens to have no decimal
digits, or it could be a date.

I don't see a different solution then to let your user provide the
necessary information.

Thomas
h4hrr - 12 Jan 2006 22:51 GMT
okk, very true

but suppose, i this csv data

21.10.2003;6 ;0.99;Feb 96;1
22.10.2003;5 ;0.99;Feb 66;1
23.10.2003;10;0.99;Feb 34;1
24.10.2003;3  ;1    ;02.Jun;1

def. i can't say anything about "20060112", what type it can be

but if its 0.99 (like in col 3, and fourth value is int in that col.);
and i read it each value from file as String[], how can i determine its
float and fourth value is int?

i mean, any java commands or way to determine it..

thanks.. (just wanna be sure what exactly i shud do and how to do or
let user decide it)
Thomas Kellerer - 12 Jan 2006 23:01 GMT
h4hrr wrote on 12.01.2006 23:51:
> 21.10.2003;6 ;0.99;Feb 96;1
> 22.10.2003;5 ;0.99;Feb 66;1
[quoted text clipped - 6 lines]
> and i read it each value from file as String[], how can i determine its
> float and fourth value is int?

Why would you want to have a different data type for the first and
fourth value? They all go into the same database column, right? So
you'll have to use float for all of them.

But this is precisely another situation why you cannot reliable
determine the data type from a plain text file. Suppose you analyze the
first row to find the datatype and the "1" is in there, so you'll go for
 an int, but later you'll hit 0.99 and your code throws an exception.

But if you really need to do it, you could try something like this:

String valueFromFile = ....;

Object dbValue = null;
try
{
   dbValue = Integer.valueOf((valueFromFile);
}
catch (NumberFormatException e)
{
   dbValue = null;

}

if (dbValue == null)
{
  try
  {
    dbValue = Double.valueOf(valueFromFile);
  }
  catch (NumberFormatException e)
  {
    dbValue = null;
  }
}

if (dbValue == null)
{
 .....
}

You get the idea.

Basically try with one type, if that fails try the next until the
conversion does not throw an Exception. Not the fastest way, but if you
have a limited set of types, this might work. Won't help with the
situation I described in my first post though.

The actual dbValue object can be written to the database using
setObject() on a PreparedStatement.

Thomas
h4hrr - 12 Jan 2006 23:47 GMT
ah, didn't look at this before posting the last reply..

thanks for code sample.. yeah looks like its not a easy way to go and
if i check 1000 rows or so, for all the different types.. numerics,
dates,

hmm..
Eric Sosman - 12 Jan 2006 23:00 GMT
max_ny wrote On 01/12/06 17:11,:
> Hii,
> i am trying write some code to transfer data from csv file to a
[quoted text clipped - 7 lines]
> results); i can't figure out how should i determine the type of each
> column data

   There is no foolproof way.  Fundamentally, all you
get from the file is a bunch of Strings that are the
external appearance of data fields.  You're trying to
deduce the original data by gazing at its appearance,
and your problem is that there may be many different
internal data items that could produce the same String.

   "4" - It looks like a number, but is it an integer
         or a floating-point value?  Is it a percentage
         that really represents 0.04?  Or maybe it's
         part of a date: do the nearby columns contain
         7 and 1776?

   "F" - It might be a short string, as in F Murray
         Abraham.  Then again, it might be the Boolean
         value "false."

   "1:10" - Very likely a time, but what kind?  If it
         denotes an interval, does it mean 70 seconds
         or 70 minutes?  If it denotes a specific time
         of day, does it mean early in the morning or
         early in the afternoon?  Then again, it might
         be the odds your bookie is quoting.

   "1/2" - If a date, is it January 2 or February 1,
         and in what year?  Or is it just 0.5 (does
         the next column say "tsp")?

   The very best way to resolve such ambiguities is to
get some outside authority (e.g., the user) to tell you
what the columns contain.  Failing that, a reasonable
approach might be to take a sample of fifty or a hundred
rows, and try several ways of "understanding" the set of
values you find in each column: If one of them works
more often than the others, guess that it's the right
interpretation.  (It may be a good idea to ignore the
first several rows; they might be column headings from
a spreadsheet or something of the kind.)

   For example, if the values found in column 3 are

    Quantity, "", 42, 29, 3001, 86, 107, 1336

you might well decide that the first two rows are titles
and the real data consists of integers in rows 3 and on.
(They could also be floating-point values or percentages,
of course, but a reasonable call might be to prefer the
"simpler" explanation.)  On the other hand, if you found

    Exchange, Rate, 1.0035, 0.9926, 1, 99.4217

it would be reasonable to assume you're looking at a
column of floating-point numbers plus two rows of titles.
Yes, one of the values would also work as an integer
(it's telling you how many gold splonders you can get for
one gold splonder), but floating-point successfully
"explains" more of the values and so is more likely to
characterize the whole column.

   Similarly, an entire column of T and F or of Y and N
is probably a column of Booleans; if A and R and L and
S are also mixed in, it's probably a column of strings.
A column consisting entirely of integers in the range 1..12
may well be month numbers; 1..31 might mean day numbers.
Maybe you're lucky and there's a leading "$" or trailing
"%" to give you a hint (careful of those dollars, though:
are they American, Canadian, Australian, or Doodledorfian?).
Start with a set of "value interpreters" of various kinds,
apply each of them to all the values in a column, and score
each by how many times it's able to make sense of what it
finds.

   It's going to be an imprecise business ...

Signature

Eric.Sosman@sun.com

h4hrr - 12 Jan 2006 23:42 GMT
Thanks for the nice explanation.

my main aim was to do the same as you mentioned,

>to take a sample of fifty or a hundred
>rows, and try several ways of "understanding" the set of
>values you find in each column: If one of them works
>more often than the others, guess that it's the right
>interpretation.

but how to determine, whether the string[] value is integer, double or
something else, i mean the java reference for it (as i said in my other
reply).. i am not sure how to do that.. the java reference or command
to determine the value type ..

what to finalize which is datatype is it at the end, may be one has to
do some heuristics for determining it out of the values read..

thanks
Eric Sosman - 13 Jan 2006 17:00 GMT
h4hrr wrote On 01/12/06 18:42,:
> Thanks for the nice explanation.
>
[quoted text clipped - 13 lines]
> what to finalize which is datatype is it at the end, may be one has to
> do some heuristics for determining it out of the values read..

   Exactly: your code needs to do some guesswork.  My
suggestion is that you come up with a limited set of
data types that you expect to find in the data: Let's
say they are Date, Integer, Float, and String.  Take
all (or a good-sized sample) of the text values in a
column and try to convert them to Dates; if none of the
texts successfully convert to Date then you conclude
that the column doesn't represent Dates.  Try again,
converting them all to Integers; if 30% of them convert
successfully and 70% don't, you can be reasonably sure
that the column doesn't represent Integers.  Try again,
this time converting every value to a Float; if they
all succeed, you can be pretty sure the column contains
floating-point numbers.  If nothing else seems to fill
the bill, declare that the column contains Strings (the
conversion of a String to a String nearly always succeeds).

   The "interesting" cases are when none of your candidate
data types successfully converts all the column's values,
but one converts "most" of the column.  You could fall
back on String (the ultimate "I don't know"), or you could
decide to reject the few un-convertable rows (especially
if other columns in the same rows prove recalcitrant).
Another "interesting" case is when two or more data types
successfully convert all or most of the data, but neither
is obviously "more specific" than the other.  These call
for judgment, a quality notably lacking in computers.

Signature

Eric.Sosman@sun.com

Roedy Green - 12 Jan 2006 23:31 GMT
>my problem is to determine the datatype of various column values in the
>csv file. I read the csv file column wise and have to determine what
>datatype each col. contains, so to map it to correct type for creating
>tables in the DB.

This is similar to the logic in my Align utility, and also in the HTML
table sorter,  that scan a file first to determine if each column is
alpha or numeric.  If all entries are numeric, I right align or sort
numerically.

So you do a pass through the file with CSVReader  (see
http://mindprod.com/jgloss/csv.html

There is an option to give you each entire line as an array of
strings.

Examine  each string, and adjust your types[col] array.

Now you know the types of each column.  You can then read the file
converting each entry to the column type and know it will work.
Signature

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.

Francesco Devittori - 13 Jan 2006 06:38 GMT
> Hii,
> i am trying write some code to transfer data from csv file to a
[quoted text clipped - 9 lines]
>
> thanks

You could use String.matches(r) with r being a regular expression.
For example

if (col.matches("[0-10]")) { type = integer }
else if (col.matches("[0-10]\.{0-10}+")) { type = float }
else if (col.matches("(true)(false)")) { type = bool }
else if (col.matches("[0-31] [(jan)(feb)...] [0-2034]")) { type = date }
and so on...

(not sure about the regexp syntax, just to show how you could do it)

Francesco


Free Magazines

Get these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...

Oracle MagazineNetwork ComputingComputer WorldBio-IT WorldeWeekInformation WeekInfosecurity
 
Sign In
Join
My Latest Posts
My Monitored Threads
My Blog
My Photo Gallery
My Profile
My Homepage

Start New Thread
Enable EMail Alerts
Rate this Thread



©2009 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.