Java Forum / General / January 2006
Determine datatype
max_ny - 12 Jan 2006 22:11 GMT Hii, i am trying write some code to transfer data from csv file to a database. my problem is to determine the datatype of various column values in the csv file. I read the csv file column wise and have to determine what datatype each col. contains, so to map it to correct type for creating tables in the DB. Can someone provide me any reference for determining datatype of a col. value read from csv file, (my google search couldn't yield exact results); i can't figure out how should i determine the type of each column data
thanks
Thomas Kellerer - 12 Jan 2006 22:30 GMT max_ny wrote on 12.01.2006 23:11:
> Hii, > i am trying write some code to transfer data from csv file to a [quoted text clipped - 7 lines] > results); i can't figure out how should i determine the type of each > column data There is no way you can do it 100%. Any character sequence you read could be mapped to several data types.
Take the sequence "20060112". This could be a VARCHAR column that happens to have no characters (only numbers), or it could be an integer value, or it could be a float value that happens to have no decimal digits, or it could be a date.
I don't see a different solution then to let your user provide the necessary information.
Thomas
h4hrr - 12 Jan 2006 22:51 GMT okk, very true
but suppose, i this csv data
21.10.2003;6 ;0.99;Feb 96;1 22.10.2003;5 ;0.99;Feb 66;1 23.10.2003;10;0.99;Feb 34;1 24.10.2003;3 ;1 ;02.Jun;1
def. i can't say anything about "20060112", what type it can be
but if its 0.99 (like in col 3, and fourth value is int in that col.); and i read it each value from file as String[], how can i determine its float and fourth value is int?
i mean, any java commands or way to determine it..
thanks.. (just wanna be sure what exactly i shud do and how to do or let user decide it)
Thomas Kellerer - 12 Jan 2006 23:01 GMT h4hrr wrote on 12.01.2006 23:51:
> 21.10.2003;6 ;0.99;Feb 96;1 > 22.10.2003;5 ;0.99;Feb 66;1 [quoted text clipped - 6 lines] > and i read it each value from file as String[], how can i determine its > float and fourth value is int? Why would you want to have a different data type for the first and fourth value? They all go into the same database column, right? So you'll have to use float for all of them.
But this is precisely another situation why you cannot reliable determine the data type from a plain text file. Suppose you analyze the first row to find the datatype and the "1" is in there, so you'll go for an int, but later you'll hit 0.99 and your code throws an exception.
But if you really need to do it, you could try something like this:
String valueFromFile = ....;
Object dbValue = null; try { dbValue = Integer.valueOf((valueFromFile); } catch (NumberFormatException e) { dbValue = null;
}
if (dbValue == null) { try { dbValue = Double.valueOf(valueFromFile); } catch (NumberFormatException e) { dbValue = null; } }
if (dbValue == null) { ..... }
You get the idea.
Basically try with one type, if that fails try the next until the conversion does not throw an Exception. Not the fastest way, but if you have a limited set of types, this might work. Won't help with the situation I described in my first post though.
The actual dbValue object can be written to the database using setObject() on a PreparedStatement.
Thomas
h4hrr - 12 Jan 2006 23:47 GMT ah, didn't look at this before posting the last reply..
thanks for code sample.. yeah looks like its not a easy way to go and if i check 1000 rows or so, for all the different types.. numerics, dates,
hmm..
Eric Sosman - 12 Jan 2006 23:00 GMT max_ny wrote On 01/12/06 17:11,:
> Hii, > i am trying write some code to transfer data from csv file to a [quoted text clipped - 7 lines] > results); i can't figure out how should i determine the type of each > column data There is no foolproof way. Fundamentally, all you get from the file is a bunch of Strings that are the external appearance of data fields. You're trying to deduce the original data by gazing at its appearance, and your problem is that there may be many different internal data items that could produce the same String.
"4" - It looks like a number, but is it an integer or a floating-point value? Is it a percentage that really represents 0.04? Or maybe it's part of a date: do the nearby columns contain 7 and 1776?
"F" - It might be a short string, as in F Murray Abraham. Then again, it might be the Boolean value "false."
"1:10" - Very likely a time, but what kind? If it denotes an interval, does it mean 70 seconds or 70 minutes? If it denotes a specific time of day, does it mean early in the morning or early in the afternoon? Then again, it might be the odds your bookie is quoting.
"1/2" - If a date, is it January 2 or February 1, and in what year? Or is it just 0.5 (does the next column say "tsp")?
The very best way to resolve such ambiguities is to get some outside authority (e.g., the user) to tell you what the columns contain. Failing that, a reasonable approach might be to take a sample of fifty or a hundred rows, and try several ways of "understanding" the set of values you find in each column: If one of them works more often than the others, guess that it's the right interpretation. (It may be a good idea to ignore the first several rows; they might be column headings from a spreadsheet or something of the kind.)
For example, if the values found in column 3 are
Quantity, "", 42, 29, 3001, 86, 107, 1336
you might well decide that the first two rows are titles and the real data consists of integers in rows 3 and on. (They could also be floating-point values or percentages, of course, but a reasonable call might be to prefer the "simpler" explanation.) On the other hand, if you found
Exchange, Rate, 1.0035, 0.9926, 1, 99.4217
it would be reasonable to assume you're looking at a column of floating-point numbers plus two rows of titles. Yes, one of the values would also work as an integer (it's telling you how many gold splonders you can get for one gold splonder), but floating-point successfully "explains" more of the values and so is more likely to characterize the whole column.
Similarly, an entire column of T and F or of Y and N is probably a column of Booleans; if A and R and L and S are also mixed in, it's probably a column of strings. A column consisting entirely of integers in the range 1..12 may well be month numbers; 1..31 might mean day numbers. Maybe you're lucky and there's a leading "$" or trailing "%" to give you a hint (careful of those dollars, though: are they American, Canadian, Australian, or Doodledorfian?). Start with a set of "value interpreters" of various kinds, apply each of them to all the values in a column, and score each by how many times it's able to make sense of what it finds.
It's going to be an imprecise business ...
 Signature Eric.Sosman@sun.com
h4hrr - 12 Jan 2006 23:42 GMT Thanks for the nice explanation.
my main aim was to do the same as you mentioned,
>to take a sample of fifty or a hundred >rows, and try several ways of "understanding" the set of >values you find in each column: If one of them works >more often than the others, guess that it's the right >interpretation. but how to determine, whether the string[] value is integer, double or something else, i mean the java reference for it (as i said in my other reply).. i am not sure how to do that.. the java reference or command to determine the value type ..
what to finalize which is datatype is it at the end, may be one has to do some heuristics for determining it out of the values read..
thanks
Eric Sosman - 13 Jan 2006 17:00 GMT h4hrr wrote On 01/12/06 18:42,:
> Thanks for the nice explanation. > [quoted text clipped - 13 lines] > what to finalize which is datatype is it at the end, may be one has to > do some heuristics for determining it out of the values read.. Exactly: your code needs to do some guesswork. My suggestion is that you come up with a limited set of data types that you expect to find in the data: Let's say they are Date, Integer, Float, and String. Take all (or a good-sized sample) of the text values in a column and try to convert them to Dates; if none of the texts successfully convert to Date then you conclude that the column doesn't represent Dates. Try again, converting them all to Integers; if 30% of them convert successfully and 70% don't, you can be reasonably sure that the column doesn't represent Integers. Try again, this time converting every value to a Float; if they all succeed, you can be pretty sure the column contains floating-point numbers. If nothing else seems to fill the bill, declare that the column contains Strings (the conversion of a String to a String nearly always succeeds).
The "interesting" cases are when none of your candidate data types successfully converts all the column's values, but one converts "most" of the column. You could fall back on String (the ultimate "I don't know"), or you could decide to reject the few un-convertable rows (especially if other columns in the same rows prove recalcitrant). Another "interesting" case is when two or more data types successfully convert all or most of the data, but neither is obviously "more specific" than the other. These call for judgment, a quality notably lacking in computers.
 Signature Eric.Sosman@sun.com
Roedy Green - 12 Jan 2006 23:31 GMT >my problem is to determine the datatype of various column values in the >csv file. I read the csv file column wise and have to determine what >datatype each col. contains, so to map it to correct type for creating >tables in the DB. This is similar to the logic in my Align utility, and also in the HTML table sorter, that scan a file first to determine if each column is alpha or numeric. If all entries are numeric, I right align or sort numerically.
So you do a pass through the file with CSVReader (see http://mindprod.com/jgloss/csv.html
There is an option to give you each entire line as an array of strings.
Examine each string, and adjust your types[col] array.
Now you know the types of each column. You can then read the file converting each entry to the column type and know it will work.
 Signature Canadian Mind Products, Roedy Green. http://mindprod.com Java custom programming, consulting and coaching.
Francesco Devittori - 13 Jan 2006 06:38 GMT > Hii, > i am trying write some code to transfer data from csv file to a [quoted text clipped - 9 lines] > > thanks You could use String.matches(r) with r being a regular expression. For example
if (col.matches("[0-10]")) { type = integer } else if (col.matches("[0-10]\.{0-10}+")) { type = float } else if (col.matches("(true)(false)")) { type = bool } else if (col.matches("[0-31] [(jan)(feb)...] [0-2034]")) { type = date } and so on...
(not sure about the regexp syntax, just to show how you could do it)
Francesco
Free MagazinesGet these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...
|
|
|