Home | Contact Us | FAQ | Search & Site Map | Link to Us
Sign In | Join | Other 45 Sites in Network
HomeAnnouncementsWhite Papers
Discussion GroupsFirst AidDatabasesJavaBeansGUIJava 3DVirtual MachineCORBASecurityToolsGeneral
Java DirectoryOpen Source ProjectsSample Book ChaptersUser GroupsWeb Resources
Related Topics
Databases.NETMore Topics ...

Java Forum / General / May 2006

Tip: Looking for answers? Try searching our database.

Integer to Floating-Point conversions

Thread view: 
Kai Koehne - 13 May 2006 09:30 GMT
Hi,

I need an int to real32 conversion with IEEE-754 round-towards-zero mode
... That means, a method x, so that

int i = 44554435
float f = x(i)

results in f = 3355432.0, and not f = 335546.0 .

Are you aware of any mathematical packages that provide such a function,
or a way to do this with pure Java?

Regards,

Kai Koehne
joseph_daniel_zukiger@yahoo.com - 13 May 2006 10:12 GMT
> Hi,
>
[quoted text clipped - 12 lines]
>
> Kai Koehne

Huh?

I was going to mumble something automatic integer to floating point
promotion should work as advertised, but your question does not seem to
make sense.

If you absolutely must work with 32 bit floats instead of 64 bit
doubles or java.math.BigDecimal, you will lose precision when promoting
large integers. There's no way to squeeze 32 bits of precision into
less than 32 bits of precision, expecially when a 32 bit float needs a
few bits for the exponent.

But if you really need a method that converts 44554435 stored in an int
to either 3355432.0 or 335546.0 in 32 bit floating point
representation, you can always write your own. I'd prefer you give it a
reasonable name, like "mungeNumber()" or "hashAFloat()" or
my.own.strange.Promotions.conversion_x() so that if it ever gets close
to any of my code my code doesn't get infected. But you can definitely
build it. I'd point you to the xxxToYyyBits() and yyyBitsToXxx()
methods of java.lang.Double and java.lang.Float, but I wouldn't want to
be accused of contributing to the crime.
Kai Koehne - 13 May 2006 10:52 GMT
joseph_daniel_zukiger@yahoo.com schrieb:
>> Hi,
>>
[quoted text clipped - 7 lines]
> promotion should work as advertised, but your question does not seem to
> make sense.

Just to clarify my motivation: I am building an Occam 2 Java compiler
... and as it happens, occam supports two ways of converting an integer
to a floating-point variable:

 REAL32 ROUND n     -- conversion with round-towards-middle
 REAL32 TRUNC n     -- conversion with round-towards-zero

The first has the same semantics as (float)n , but I couldn't find a way
to implement the second conversion.

> [...]
> But if you really need a method that converts 44554435 stored in an int
[quoted text clipped - 6 lines]
> methods of java.lang.Double and java.lang.Float, but I wouldn't want to
> be accused of contributing to the crime.

Huh, this is the tough way ... but I will try to find my way through the
implementation details, thank you for the idea :-)

Kind regards,

Kai Koehne
joseph_daniel_zukiger@yahoo.com - 13 May 2006 13:40 GMT
> joseph_daniel_zukiger@yahoo.com schrieb:
> >> Hi,
[quoted text clipped - 18 lines]
> The first has the same semantics as (float)n , but I couldn't find a way
> to implement the second conversion.

Ahh. I see.

> > [...]
> > But if you really need a method that converts 44554435 stored in an int
[quoted text clipped - 9 lines]
> Huh, this is the tough way ... but I will try to find my way through the
> implementation details, thank you for the idea :-)

If what Chris suggests doesn't get there fast enough for you, you might
want to consider implementing it borrowing from java.math.BigDecimal
and BigInteger. If it turns out too slow, it could at least give you a
good baseline.

> Kind regards,
>
> Kai Koehne
Patricia Shanahan - 13 May 2006 16:28 GMT
...
> Huh, this is the tough way ... but I will try to find my way through the
> implementation details, thank you for the idea :-)
...

Here's my attempt. The test cases shown in the main method
are ALL the testing I've done.

public class FloatConvert {

  public static void main(String[] args) {
    // Original example
    test(44554435, 44554432);
    // Negation of original example
    test(-44554435, -44554432);
    // No rounding needed
    test(44554432, 44554432);
    // Rounding goes in correct direction
    test(44554433, 44554432);
  }

  private static void test(int input, int expected) {
    double actual = truncateIntToFloat(input);
    if (actual == expected) {
      System.out.printf("Success: input %d actual %f\n", input, actual);
    } else {
      System.out.printf("Failure: input %d actual %f expected %d\n", input,
          actual, expected);
    }
  }

  /**
   * Convert int to float with truncation towards zero.
   *
   * @param input
   *          Number to convert
   * @return Largest magnitude float that has the same sign as input
   *         and no greater absolute value.
   */
  public static float truncateIntToFloat(int input) {
    float rounded = (float) input;
    if (Math.abs((double) rounded) <= Math.abs((double) input)) {
      return rounded;
    } else {
      int roundedBits = Float.floatToIntBits(rounded);
      float truncated = Float.intBitsToFloat(roundedBits - 1);
      return truncated;
    }
  }
}
Chris Uppal - 14 May 2006 13:50 GMT
> Here's my attempt. The test cases shown in the main method
> are ALL the testing I've done.

It works for the other 4294967292 inputs too.  I've tried 'em all ;-)

>        int roundedBits = Float.floatToIntBits(rounded);
>        float truncated = Float.intBitsToFloat(roundedBits - 1);
>        return truncated;

That's clever.  At least, I /hope/ it's clever because I don't understand why
it works.  Do you care to expand a bit ?  Please ?

I got interested in this myself and have tried out a number of approaches,
including an attempt to convert rounding into truncation by consideration of
ULPs (as I suggested to the OP earlier).  The best, IMO, is to do the
truncation in the integer domain before converting into a float.  Its a little
complicated to do, but it runs fast, and has the aesthetic advantage of not
requiring double-precision operations to produce a single-precision result
(which all the other things I tried do).

I've appended the code, including the test-harness, for anyone who's
interested.

  -- chris

=========== Test.java ===============
/*
* contains several static implementations of truncating a 32-bit integer
* to a 32-bit float with rounding "towards zero" rather that to the nearest
* float (as in the Java language spec for casting an int to a float).
* Also contains an exhaustive test which checks all the implementations
* against the simplest one (presumed to be correct).
*
* Approximate times when run on a 1.5.0_06-b05 JVM, in "client" mode on
* a 1.5 GHz celeron WinXP Pro box.  In each case the time is the time in
* nanoseconds to convert one value averaged over the full 32-bit range of
* integers.
*
* Slow:        130
*  ULP:          63.5
*  FastULP:      65
*  Twiddle:      19
*  Patricia:     74.5
*
* Approximate times for inputs in the resticted range [-10,000,000,
10,000,000),
* 83% of which is exactly representable as a float.
*
* Slow:         12
*  ULP:          12
*  FastULP:      20.5
*  Twiddle:      18.5
*  Patricia:     24
*/
public class Test
{
   public static void
   main(String[] args)
   {
       int i = 0;
       int errors = 0;
       for (;;)
       {
           if (i % 10000000 == 0)
           {
               // progress...
               System.out.printf("%11d...\r", i);
               System.out.flush();
           }
           float shouldBe = slowTruncate(i);
           float is;

           if ((is = ulpTruncate(i)) != shouldBe)
           {
               System.out.printf("Slow: %d: %f != %f%n", i, is, shouldBe);
               errors++;
           }
           if ((is = fastTruncate(i)) != shouldBe)
           {
               System.out.printf("Fast: %d: %f != %f%n", i, is, shouldBe);
               errors++;
           }
           if ((is = twiddleTruncate(i)) != shouldBe)
           {
               System.out.printf("Twiddle: %d: %f != %f%n", i, is, shouldBe);
               errors++;
           }
           if ((is = patriciaTruncate(i)) != shouldBe)
           {
               System.out.printf("Patricia: %d: %f != %f%n", i, is, shouldBe);
               errors++;
           }

           if (errors > 10)
               break;
           if (i == -1)
               break;    // we've wrapped all the way around
           i++;
       }
       System.out.printf("%ndone, %d errors%n", errors);
   }

   /**
    * Slow implementation of truncating an int to a 32-bit floating point.
    * This implementation is intended to be the "so simple it can't possibly
    * be wrong" benchmark against which other implementations can be compared
    */
   public static float
   slowTruncate(int i)
   {
       // since all +ve numbers have negations, but the reverse
       // isn't true, we work with -ve numbers
       if (i > 0)
           return -slowTruncate(-i);

       // loop up (towards zero) until we find the right answer
       for (;;)
       {
           float f = (float)i;
           if ((double)f >= (double)i)
               return f;
           i++;
       }
   }

   /**
    * Implementation of truncating an int to a 32-bit floating point based on
    * ULP manipulation
    */
   public static float
   ulpTruncate(int i)
   {
       float f = (float)i;
       double error = (double)f - (double)i;

       // no error ?
       if (error == 0)
           return f;

       // or rounded towards 0 anyway ?
       if ((error < 0) == (i > 0))
           return f;

       // adjust by one ulp towards 0
       // the double ulp() is because ulp is the distance to the next
       // number away from 0, we need the distance to the next float
       // nearest zero
       if (i > 0)
       {
           float ulp = Math.ulp(f-Math.ulp(f));
           f -= ulp;
       }
       else
       {
           float ulp = Math.ulp(f+Math.ulp(f));
           f += ulp;
       }

       return f;
   }

   /**
    * Faster implementation of truncating an int to a 32-bit floating point.
    * This implementation uses a simple test to catch a useful range of
    * "eaasy" cases.  If the input is likely to be in that range then
    * this provides a useful optimisation, otherwise it's marginally
    * slower.
    * NB1: this version uses ulpTruncate() for the difficult cases, but
    * it could just as easily use a different one -- even slowTruncate().
    * NB2: since this consists only of a simple test guarding simple
operations,
    * it seems reasonable to hope that the JIT will inline it in callers
    */
   public static float
   fastTruncate(int i)
   {
       // these are all exact
       if (i >= -8388608 && i <= 8388608)
           return (float)i;
       else
           return ulpTruncate(i);    // or any other xxxTruncate()
   }

   /**
    * Faster implementation of truncating an int to a 32-bit floating point
based
    * on bit-twiddling.
    * The idea is to take the number, discover how many high-bits it has which
    * don't fit into the 24-bit mantissa of a float, and to mask off that many
    * low bits before converting to a float.  I.e. we do the truncation in the
    * integer domain before converting to a float, so the conversion is always
    * is always exact.
    * The actual implementation is complicated by sign issues, but that's the
    * basic idea.
    * Note that we don't actually mess with the physical layout of an IEEE
float,
    * but only use the fact that it has exactly 24 bits in which to represent
the
    * mantissa
    */
   public static float
   twiddleTruncate(int i)
   {
       if (i >= 0)
       {
           // because i is +ve the high bit is always zero,
           // and so (i >>> 24) is always in the range [0, 128)
           return (float)(i & MASKS[i >>> 24]);
       }

       // nasty special case
       if (i == Integer.MIN_VALUE)
           return -2147483648.0f;

       return -twiddleTruncate(-i);
   }

   // helper data for twiddleTruncate().  A set of precomuted masks which
   // will be indexed by bits [24, 30] (zero-based) of the input integer.
   private static final int[] MASKS = new int[128];
   static
   {
       for (int i = 0; i < 128; i++)
           MASKS[i] = -1;
       for (int bit = 0; bit < 7; bit++)
       {
           int threshold = 1 << bit;
           int mask = 1 << bit;
           for (int i = threshold; i < 128; i++)
               MASKS[i] ^= mask;
       }
   }

   /**
    * Clever implementation of truncating an int to a 32-bit floating point
posted
    * to comp.lang.java.programmer by Patricia Shanahan on 13 May 2006
    * (Message-ID: <TWm9g.126$921.60@newsread4.news.pas.earthlink.net> )
    */
   public static float
   patriciaTruncate(int i)
   {
       float rounded = (float) i;

       if (Math.abs((double) rounded) <= Math.abs((double) i))
           return rounded;

       int roundedBits = Float.floatToIntBits(rounded);
       float truncated = Float.intBitsToFloat(roundedBits - 1);

       return truncated;
   }
}
==========================
Patricia Shanahan - 14 May 2006 14:37 GMT
>> Here's my attempt. The test cases shown in the main method are ALL
>> the testing I've done.
[quoted text clipped - 6 lines]
> That's clever.  At least, I /hope/ it's clever because I don't
> understand why it works.  Do you care to expand a bit ?  Please ?

Sure. First I split the problem into two cases. If the result did not
require rounding, or the rounding was towards zero, the rounded result
and truncated result are equal, so I just returned the rounded result.

The code you quote is for those cases that round away from zero. The
float we want is the one with the same sign as rounded, and the largest
magnitude that is less than rounded.

NaNs, infinities, and zeros would all be special cases in a general
attempt to make the smallest possible reduction in the magnitude of a
float, but can be ignored here. Zero can only result from converting int
zero, and getting an exact result. The others cannot result from int
conversion.

If rounded is not an exact power of two, the truncation has a mantissa
one less than rounded. If it is an exact power of two, it has an
exponent one less than rounded, and mantissa all ones. Either way,
subtracting one from the rounded bit pattern gets the bit pattern for
truncated.

> I got interested in this myself and have tried out a number of
> approaches, including an attempt to convert rounding into truncation
[quoted text clipped - 4 lines]
> double-precision operations to produce a single-precision result
> (which all the other things I tried do).

If you prefer to avoid double, just do the abs in int:

  public static float truncateIntToFloat(int input) {
    float rounded = (float) input;
    if (Math.abs((int) rounded) <= Math.abs(input)) {
      return rounded;
    } else {
      int roundedBits = Float.floatToIntBits(rounded);
      float truncated = Float.intBitsToFloat(roundedBits - 1);
      return truncated;
    }
  }

Both numbers whose absolute values are to be compared can be exactly
represented in either int or double, so either type should work.

Patricia
Chris Uppal - 15 May 2006 14:22 GMT
> If rounded is not an exact power of two, the truncation has a mantissa
> one less than rounded. If it is an exact power of two, it has an
> exponent one less than rounded, and mantissa all ones. Either way,
> subtracting one from the rounded bit pattern gets the bit pattern for
> truncated.

Ah, I see.  The IEEE fp layout is subtle.

Thank you.

> If you prefer to avoid double, just do the abs in int:
>
[quoted text clipped - 8 lines]
>      }
>    }

Unfotunately, the comparison doesn't work for Integer.MAX_VALUE, nor for values
near Integer.MIN_VALUE.  Two example failure cases are:

 2147483647
 -2147483638

Some test code (which uses a routine from yesterday's post):
=============
public class Test2
{
public static void
main(String[] args)
{
 test(2147483647);
 test(-2147483638);
}

static void
test(int input)
{
 System.out.printf("input: %d%n", input);
 System.out.printf("desired output: %f%n", Test.slowTruncate(input));

 float rounded = (float)input;
 System.out.printf("rounded: %f%n", rounded);

 System.out.printf("(int)rounded: %d%n", (int)rounded);
 System.out.printf("Math.abs((int)rounded): %d%n", Math.abs((int)rounded));
 System.out.printf("Math.abs(input): %d%n", Math.abs(input));
 System.out.println();
}
}
=============

Which produces output:

   input: 2147483647
   desired output: 2147483520.000000
   rounded: 2147483648.000000
   (int)rounded: 2147483647
   Math.abs((int)rounded): 2147483647
   Math.abs(input): 2147483647

   input: -2147483638
   desired output: -2147483520.000000
   rounded: -2147483648.000000
   (int)rounded: -2147483648
   Math.abs((int)rounded): -2147483648
   Math.abs(input): 2147483638

I find the negative output from Math.abs() particularly entertaining ;-)

   -- chris
Patricia Shanahan - 15 May 2006 16:21 GMT
>>If rounded is not an exact power of two, the truncation has a mantissa
>>one less than rounded. If it is an exact power of two, it has an
[quoted text clipped - 7 lines]
>
>>If you prefer to avoid double, just do the abs in int:
...
> Unfotunately, the comparison doesn't work for Integer.MAX_VALUE, nor for values
> near Integer.MIN_VALUE.  Two example failure cases are:
...
> I find the negative output from Math.abs() particularly entertaining ;-)

Good point. Shows the folly of not testing at least the maximum and
minimum value, as well as some ordinary values.

Patricia
Kai Koehne - 14 May 2006 15:12 GMT
Wow! Thank you all for this thorough discussion ... Two days ago I was
not even sure if truncating conversion can be done at all, and now I
have five working solutions + performance tests. I have the strong
feeling to owe you a beverage of your choice :-)

Thank you!

Kai Koehne
Chris Uppal - 13 May 2006 11:11 GMT
> I need an int to real32 conversion with IEEE-754 round-towards-zero mode

Java doesn't include a spectacularly complete set of IEEE-754 operations,
that's one of the ones that's missing.  I believe that more stuff will be added
in 1.6 (see http://download.java.net/jdk6/docs/api/java/lang/Math.html
) but I don't know if that will provide what you need.

If not, or if you can't wait, then I think you'll just have to program it
yourself.  Math.ulp(float) may help.

   -- chris


Free Magazines

Get these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...

Oracle MagazineNetwork ComputingComputer WorldBio-IT WorldeWeekInformation WeekInfosecurity
 
Sign In
Join
My Latest Posts
My Monitored Threads
My Blog
My Photo Gallery
My Profile
My Homepage

Start New Thread
Enable EMail Alerts
Rate this Thread



©2008 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.