I am using the matcher and pattern classes to implement some street
address parsing.
It is a very difficult situation because of the ambiguity of the data.
I am trying to parse a) the street name, e.g. main, prince albert; b) the
street type, e.g. road, rd, round about; c) the optional street direction,
e.g. N, NE.
Below are the definitions for each element:
a) letters, spaces, numbers, decimals
b) letters, one optional space, and more optional letters
c) one or two letters.
This seems to be a problem parsing because it doesn't know when to stop
parsing one element in, leaving everything at the end.
I was looking at greedy and the other kinds of quantifiers (reluctant,
possessive), and wondering whether I could say that, in order to pass, a
pattern must match all characters up to the end of the string.
Please help me out with this one. Thanks,
Dave.
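For the "must match all characters up to the end of the string" part, a minimal sketch follows. Matcher.matches() succeeds only if the whole input matches, unlike find(). A reluctant quantifier on the name then lets the name stop at the earliest point where the rest of the pattern still fits. The class name and the short list of street types are assumptions made for illustration. Note that the type is taken from a known list rather than from the looser definition b), because "letters, one optional space, more letters" also matches most street names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StreetRegexSketch {
    // Hypothetical, deliberately short list of street types; a real list would be far longer.
    private static final Pattern STREET = Pattern.compile(
        "(?<name>[A-Za-z0-9. ]+?)"                                 // a) name, reluctant
        + "\\s+(?<type>road|rd|street|st|avenue|ave|round about)"  // b) type from a known list
        + "(?:\\s+(?<dir>NE|NW|SE|SW|[NSEW]))?",                   // c) optional direction
        Pattern.CASE_INSENSITIVE);

    /** Returns {name, type, direction}, or null if the whole input does not match. */
    static String[] parse(String input) {
        Matcher m = STREET.matcher(input.trim());
        if (!m.matches()) {   // matches() requires the entire input to be consumed
            return null;
        }
        return new String[] { m.group("name"), m.group("type"), m.group("dir") };
    }

    public static void main(String[] args) {
        for (String s : new String[] { "Main Street N", "Prince Albert Rd NE", "Olive Street Road W" }) {
            System.out.println(s + " -> " + java.util.Arrays.toString(parse(s)));
        }
    }
}
```

Because the whole input must match, "Olive Street Road W" cannot stop at "Street": the regex engine backtracks until "Road" is the type and "Olive Street" is the name. This only works as far as the list of types is complete.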
John C. Bollinger - 27 Jan 2006 03:49 GMT
> I am using the matcher and pattern classes to implement some street
> address parsing.
[quoted text clipped - 17 lines]
> perhaps if i could say that in order to pass it must match all
> characters up to the end of the string.
Regular expressions are not an especially good choice for this problem
because of the ambiguity you have already noted. Take a step back and
consider how *you* resolve the ambiguity when you read an address. Try
these:
// Easy:
100 Main Street N
// Harder (a real street name where I grew up, but for the 'W'):
100 Olive Street Road W
// Which "Dr." means what?
100 Rev. Dr. Martin Luther King Jr. Dr. E
How do you tell where the street name ends and the type begins? Got it?
Can you make the computer do the same?
Making a computer emulate human thought processes is not always a good
approach, but in this case I think you know some things that the
computer needs to be taught in order to do the job.

Signature
John Bollinger
jobollin@indiana.edu
Roedy Green - 27 Jan 2006 06:44 GMT
On Thu, 26 Jan 2006 22:50:02 -0500, "John C. Bollinger"
<jobollin@indiana.edu> wrote, quoted or indirectly quoted someone who
said :
>Making a computer emulate human thought processes is not always a good
>approach, but in this case I think you know some things that the
>computer needs to be taught in order to do the job.
I wrote some code like this years ago in Abundance. The basic idea was
to recognize things and then remove them, thus simplifying the problem.
One thing you might consider is a list of states, cities, streets, and
surnames to help it along.
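A rough Java sketch of the recognize-then-remove idea, working from the end of the string: strip a trailing direction if one is recognized, then a trailing street type, and treat whatever remains as the name. The class name, method names, and the two small lookup sets are assumptions for illustration, not anything from the original Abundance code.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class StripFromEndSketch {
    // Hypothetical lookup tables; real ones would be much larger.
    static final Set<String> DIRECTIONS =
        new HashSet<>(Arrays.asList("N", "S", "E", "W", "NE", "NW", "SE", "SW"));
    static final Set<String> TYPES =
        new HashSet<>(Arrays.asList("ROAD", "RD", "STREET", "ST", "DRIVE", "DR", "AVENUE", "AVE"));

    /** Returns {name, type, direction}; type and direction are null when not recognized. */
    static String[] parse(String street) {
        Deque<String> words = new ArrayDeque<>(Arrays.asList(street.trim().split("\\s+")));
        String direction = null;
        String type = null;
        // Recognize a trailing direction and remove it, keeping at least one word for the name.
        if (words.size() > 1 && DIRECTIONS.contains(normalize(words.peekLast()))) {
            direction = words.removeLast();
        }
        // With the direction gone, the last word is the only candidate for the type.
        if (words.size() > 1 && TYPES.contains(normalize(words.peekLast()))) {
            type = words.removeLast();
        }
        return new String[] { String.join(" ", words), type, direction };
    }

    // Drops periods and upper-cases, so "Dr." and "dr" both look up as "DR".
    static String normalize(String word) {
        return word.replace(".", "").toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parse("Rev. Dr. Martin Luther King Jr. Dr. E")));
    }
}
```

Only the last one or two words are ever tested against the tables, so the earlier "Dr." in "Rev. Dr. Martin Luther King Jr. Dr. E" stays in the name. Multi-word types such as "round about" would need an extra step.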
I figure that someday there will be Java classes for global
verification of addresses that are internet-backed, so they can do
lookups into huge databases, with local caching. Address could become
as atomic as int.

Signature
Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.
Thomas Weidenfeller - 27 Jan 2006 08:19 GMT
> I am using the matcher and pattern classes to implement some street
> address parsing.
Keep them as a string, particularly if this is for some web site order
form or similar. Not all parts of the world even use the same order of
street number and street name.
/Thomas
PS: Same for zip codes (aka postal codes), states, etc. Just keep them
as strings, don't try to verify them against US standards or a database
of US states, and don't reorder them. Unless of course you don't want
international customers.

Signature
The comp.lang.java.gui FAQ:
ftp://ftp.cs.uu.nl/pub/NEWS.ANSWERS/computer-lang/java/gui/faq
http://www.uni-giessen.de/faq/archiv/computer-lang.java.gui.faq/
Roedy Green - 27 Jan 2006 11:13 GMT
>I am using the matcher and pattern classes to implement some street
>address parsing.
The way most businesses handle this is to key the data in the first
place into micro fields, with separate fields for apt#, number,
name, direction, street type, province, and postal code.
You have to do this in a quite different way for every country. In
Britain, houses have names, just like their owners.
In Japan you do an address by gradual narrowing of district to
subdistrict.
In rural areas in Canada you have RR#.
Everyone validates postal codes in a completely different way.
Roedy Green - 27 Jan 2006 11:18 GMT
>This seems to be a problem parsing because it doesn't know when to stop
>parsing one element in, leaving everything at the end.
A long time ago I wrote an international address parser, verifier,
and formatter in Abundance. It was quasi table-driven, with a lot of
nation-specific modules. It did simple validation of postal codes and
provinces/states, and likewise for bank account numbers and phone numbers.
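To give a flavour of what "table driven" could look like in Java, here is a small sketch: one table entry per country, each holding a simplified postal code pattern. The class name, the method name, and the choice to accept countries with no rule are assumptions for illustration. The two patterns check format only (the real Canadian rules also exclude certain letters).

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

class PostalCodeTableSketch {
    // Country code -> simplified format rule. Hypothetical table with only two entries.
    private static final Map<String, Pattern> RULES = new HashMap<>();
    static {
        RULES.put("US", Pattern.compile("\\d{5}(-\\d{4})?"));           // 12345 or 12345-6789
        RULES.put("CA", Pattern.compile("[A-Z]\\d[A-Z] ?\\d[A-Z]\\d")); // A1A 1A1 (format only)
    }

    /** Format check only; countries without a rule are accepted as plain strings. */
    static boolean isPlausible(String countryCode, String postalCode) {
        Pattern rule = RULES.get(countryCode.toUpperCase(Locale.ROOT));
        if (rule == null) {
            return true; // no rule known: do not reject what we cannot judge
        }
        return rule.matcher(postalCode.trim().toUpperCase(Locale.ROOT)).matches();
    }

    public static void main(String[] args) {
        System.out.println(isPlausible("CA", "V8W 1A1"));
        System.out.println(isPlausible("US", "4740"));
    }
}
```

Adding a country means adding a table row rather than new code, which is roughly the appeal of the table-driven design. Anything beyond format (does the code exist, does it fit the city) needs real data.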
I have often wondered why there is no such class in Java that is
served on the internet with the rules of the day. Perhaps it is just
too much work collecting, verifying and updating the rules, usually
not available in English.