Java Forum / General / March 2008

Emoticon text parser

Karsten Wutzke - 20 Mar 2008 18:40 GMT
Hello,

how do I write a text parser that will detect many ":-)", "]:->"
strings so that they can be replaced with small icons in a text
component? Can someone direct me to some classes which might be
useful? Pattern? Looks complicated... BTW there's no real pattern in
those codes as I also use custom codes for other symbols, e.g. (cig)
or :cig:, haven't decided that yet...

TIA
Karsten
Peter Duniho - 20 Mar 2008 19:01 GMT
> how do I write a text parser that will detect many ":-)", "]:->"
> strings so that they can be replaced with small icons in a text
> component?

That all depends on how you're formatting the text to include the icons.

But, for example, if you're using HTML in one of the Swing controls that
can display HTML, you might find the java.util.regex package useful.
Just use it to do a straight search-and-replace of emoticon strings with
the appropriate HTML "img" tag.
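
A minimal sketch of such a search-and-replace, assuming the text ends up in an HTML-capable Swing component. The class name, the two codes and the image file names are made up for illustration; Pattern.quote and Matcher.quoteReplacement keep the special characters literal. The text is not HTML-escaped here.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: the codes and image file names below are hypothetical.
final class EmoticonHtml {
    // LinkedHashMap keeps insertion order, so codes are replaced in the order listed.
    private static final Map<String, String> ICONS = new LinkedHashMap<>();
    static {
        ICONS.put(":-)", "smile.png");
        ICONS.put(";-)", "wink.png");
    }

    // Replaces every known emoticon code with an HTML img tag.
    static String toHtml(String text) {
        String result = text;
        for (Map.Entry<String, String> e : ICONS.entrySet()) {
            String tag = "<img src=\"" + e.getValue() + "\">";
            // quote() makes the code a literal pattern; quoteReplacement()
            // protects any '$' or '\' in the replacement text.
            result = Pattern.compile(Pattern.quote(e.getKey()))
                            .matcher(result)
                            .replaceAll(Matcher.quoteReplacement(tag));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(toHtml("Hi :-) and bye ;-)"));
    }
}
```

Replacing one code at a time only works as long as no code is a prefix of another and no replacement text contains a code.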

Other than that, your question is pretty broadly worded.  You're either  
going to get a lot of replies that aren't applicable, or none at all due  
to the vagueness of the problem description.

Pete
Karsten Wutzke - 20 Mar 2008 20:56 GMT
> > how do I write a text parser that will detect many ":-)", "]:->"
> > strings so that they can be replaced with small icons in a text
[quoted text clipped - 12 lines]
>
> Pete

The strings to be parsed are (at first) plain strings without any
control codes or any programming language constructs... but they
contain completely differing smiley/emoticon code, where some codes
might contain other, so I have to figure that out:

Example:

one might express *very sad* or *weeping* by writing the string ":-(("
this smiley code however starts with ":-(", the smiley code for *sad*
or *unhappy*...

An OK solution here would be to test the smileys in order of decreasing
code string length, here first check ":-((", then ":-("...
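
The longest-first idea could be sketched like this. The class and method names are hypothetical, and the code list is just an example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch only: names are made up for illustration.
final class LongestFirst {
    // Returns a copy of the codes sorted by descending length.
    static List<String> byLengthDescending(List<String> codes) {
        List<String> sorted = new ArrayList<>(codes);
        sorted.sort((a, b) -> Integer.compare(b.length(), a.length()));
        return sorted;
    }

    // Returns the first (and therefore longest) code found at the given
    // offset, or null if no code starts there.
    static String codeAt(String text, int offset, List<String> sortedCodes) {
        for (String code : sortedCodes) {
            if (text.startsWith(code, offset)) {
                return code;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<String> codes = byLengthDescending(Arrays.asList(":-(", ":-((", ";)"));
        System.out.println(codeAt("so sad :-((", 7, codes));
    }
}
```

Because ":-((" is tested before ":-(", the longer code wins whenever both would fit.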

But I need some general parsing approach. Regex really necessary?
Pattern? Looks very complicated to me... I doubt I can use a pattern
for codes that range from ":-)" over "}:->" over ";)" to (CIG)...

Karsten
Peter Duniho - 20 Mar 2008 21:09 GMT
> [...]
> one might express *very sad* or *weeping* by writing the string ":-(("
[quoted text clipped - 7 lines]
> Pattern? Looks very complicated to me... I doubt I can use a pattern
> for codes that range from ":-)" over "}:->" over ";)" to (CIG)...

While I'm not an expert in using regular expressions, I know for a fact  
they can in fact address the scenarios you're describing.  It's a fairly  
powerful language in its own right.  You may well need some help from an  
actual expert, or be prepared to spend a lot of time (days, at least,  
longer depending on your own skills and specific needs) learning it well  
enough to meet your needs.  But it can do it.

Inasmuch as the problem itself is complicated, so too will the solution  
be.  I'm not convinced there's any way around that.

Is regex "necessary"?  No, not at all.  But at first blush, your problem  
seems to be exactly what regular expressions were designed to do: find  
specific patterns in strings and, optionally, replace those patterns with  
new patterns (text).  In that sense, doesn't it make sense to explore that  
as a possible solution?

Pete
Karsten Wutzke - 20 Mar 2008 23:45 GMT
> > [...]
> > one might express *very sad* or *weeping* by writing the string ":-(("
[quoted text clipped - 25 lines]
>
> Pete

Yap it's worth a try. Hmm ok, I started out analyzing the structure of
the smiley codes.

They consist of:

  1       1         1         1       1-2       1       <- number of characters
[hair] - eyes - [subeyes] - [nose] - mouth - [beard]

OK. The parts in brackets (hair, subeyes, nose, beard) are OPTIONAL. It
seems only mouth has more than 1 char, but max 2. So this gives 6
positions/properties and max. 7 chars.

Here are the possible strings applying to each position:

hair    = {"o", "O", ">", "}", "]", ")"}  <-- hair optional!
eyes    = {":", ";", "8"}
subeyes = {"'", ","}                      <-- subeyes optional!
nose    = {"-"}                           <-- nose optional!
mouth   = {")", "(", "s", "S", "d", "D",
          "p", "P", "c", "C", "o", "O",
          "#", "@", "*", "$", "|",
          "))", "(("}
beard   = {"="}                           <-- beard optional!

It basically ought to ignore all UPPER and lower case for letters so
both are valid. As you can see there is almost every regex special
character involved so the resulting pattern will look awkward (at
least to me).

Other than that I might design the codes to contain exactly 1
char per position, if that would simplify things or make it possible
at all. It would not be a problem to introduce an optional [subnose]:

...     1          1         1       1
... - [nose] - [subnose] - mouth - beard

So the minimum smiley code length is 2, max is 7.

A fictitious (length 7) example pattern recognized could be: "};'O))="

This would be a bearded ("=") winking (";") weeping ("'") devil ("}")
very happy ("))") with an (UPPERCASE) pig's nose ("O") (-> makes sense
right? ;-) ). OK, now this would qualify as a potential pattern match.
If that matched the parsed string, I would check the actual map
TreeMap<String,ImageIcon> of smiley images actually available. If an
icon is found, I know it's time to replace the string with that
image. Sounds easy...
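
The map lookup step could be sketched as below, independent of how candidate strings are found. This is an illustration under assumptions: String file names stand in for ImageIcon, the class name and entries are made up, and String.CASE_INSENSITIVE_ORDER is used so that ":-D" and ":-d" hit the same entry.

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch only: file names stand in for ImageIcon values; entries are hypothetical.
final class IconLookup {
    // A TreeMap with a case-insensitive comparator treats ":-D" and ":-d" as one key.
    private static final Map<String, String> ICONS =
            new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
    static {
        ICONS.put(":-D", "grin.png");
        ICONS.put(":-(", "sad.png");
    }

    // Returns the icon registered for a candidate code, or null if there is none.
    static String iconFor(String candidate) {
        return ICONS.get(candidate);
    }

    public static void main(String[] args) {
        System.out.println(iconFor(":-d"));
        System.out.println(iconFor(":-)"));
    }
}
```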

However, I have no idea how to construct the regex for this. I probably
don't have that much time to learn it from scratch; I believe "Perl's"
pattern language can do a lot of very complicated things that might
not even narrowly touch what I need.

Maybe some "expert" here might be able to construct the pattern or at
least can direct me to the right paragraphs at

http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html

Can anyone assist me please?

Help very much appreciated!

Karsten
Jussi Piitulainen - 21 Mar 2008 10:21 GMT
> Here are the possible strings applying to each position:
>
[quoted text clipped - 7 lines]
>            "))", "(("}
> beard   = {"="}                           <-- beard optional!

That is very close to a regular expression already. It's as if you
are spelling out the meaning of such an expression here.

Most of these are character sets. The exceptions are the two
two-character mouths, so mouth must be partly an alternation.

hair    = [oO>}\])]?           "]" must be escaped
eyes    = [:;8]                no problem
subeyes = [',]?
nose    = -?

mouth   = (?:[sSdDpPcCoO#@*$|]|\)\)?|\(\(?)

         This is [...] | one or two of ) | one or two of (,
         parentheses need escaping, and I've wrapped it all
         in (?: ) to make it a non-capturing group.

beard   = =?

Put it all together, in a string, which requires doubling the escapes:
"[oO>}\\])]?[:;8][',]?-?(?:[sSdDpPcCoO#@*$|]|\\)\\)?|\\(\\(?)=?". Ouch.
It does look ugly.

We can ease the pain with the COMMENTS flag of Pattern; we must escape
the comment character # then, and comments end at the end of a line.
Let's make it CASE_INSENSITIVE too.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Test {
    public static void main(String[] args) {
        if (args.length == 0) {
            System.err.println("usage: java Test <text>");
            return;
        }
        // COMMENTS mode ignores whitespace and treats # as a comment start,
        // so the literal # in the mouth class is escaped.
        Pattern p = Pattern.compile(
            "[o>}\\])]?      # hair, optional    \n" +
            "[:;8]           # eyes              \n" +
            "[',]?           # subeyes, optional \n" +
            "-?              # nose, optional    \n" +
            "(?: [sdpco\\#@*$|]                    " +
            "  | \\)\\)?                           " +
            "  | \\(\\(? )   # mouth             \n" +
            "=?              # beard, optional   \n",
            Pattern.COMMENTS | Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(args[0]);
        // find() reports every match anywhere in the input.
        while (m.find()) {
            System.out.println("Found " + m.group() + " at " +
                               m.start() + " to " + m.end());
        }
    }
}

That's about the best I can do.
Christian - 21 Mar 2008 14:38 GMT
Jussi Piitulainen wrote:

>> Here are the possible strings applying to each position:
>>
[quoted text clipped - 59 lines]
>
> That's about the best I can do.
As a tip, I have often found it handy to put a pattern together from
subpatterns rather than writing it all in one big unreadable/commented
string. To do this, just surround each subpattern with a non-capturing
group (though they are not mandatory for single-character patterns),
e.g.
String hair = "(?:[oO>}\\])])";
String eyes = "(?:[:;8])";
...

then
Pattern p =
Pattern.compile(hair+"?"+eyes+subeyes+"?"+nose+"?"+mouth+beard+"?");
Especially with longer and more complex regexps, this helps a lot.
I am especially thinking of regexps like one for an IP address:

String BYTE     = "(?:(?:[01]?\\d\\d?)|(?:2[0-4]\\d)|(?:25[0-5]))";
String IP     = "(?:"+BYTE+"\\."+BYTE+"\\."+BYTE+"\\."+BYTE+")";
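
Applied to the emoticon parts listed earlier in the thread, the same composition idea might look like the sketch below. The class and constant names are made up; the character sets are the ones given in the thread.

```java
import java.util.regex.Pattern;

// Sketch only: class and constant names are hypothetical.
final class SmileyParts {
    // Each part is wrapped in a non-capturing group so a trailing "?" applies to the whole part.
    static final String HAIR    = "(?:[oO>}\\])])";
    static final String EYES    = "(?:[:;8])";
    static final String SUBEYES = "(?:[',])";
    static final String NOSE    = "(?:-)";
    // One listed mouth character, or one or two ")", or one or two "(".
    static final String MOUTH   = "(?:[sSdDpPcCoO#@*$|]|\\)\\)?|\\(\\(?)";
    static final String BEARD   = "(?:=)";

    static final Pattern SMILEY = Pattern.compile(
            HAIR + "?" + EYES + SUBEYES + "?" + NOSE + "?" + MOUTH + BEARD + "?");

    public static void main(String[] args) {
        System.out.println(SMILEY.matcher(":-((").matches());
        System.out.println(SMILEY.matcher(">;'-P=").matches());
    }
}
```

Without the COMMENTS flag, "#" needs no escape inside the character class.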

As I am already OT:

Is anyone other than me missing some kind of regexp database in the Java
API? A few hundred predefined Strings for common stuff like IP addresses,
URLs, URIs, and numbers (byte, short, int, long)?

Christian
Karsten Wutzke - 22 Mar 2008 21:37 GMT
On 21 Mar., 10:21, Jussi Piitulainen <jpiit...@ling.helsinki.fi>
wrote:
> > Here are the possible strings applying to each position:
>
[quoted text clipped - 60 lines]
>
> That's about the best I can do.

And it is great! It works like a charm and even seems to be fast as
lightning... I also split up the sub components into several strings
as Christian suggested instead of the commenting stuff. I suppose the
latter was made for loading (commented) patterns from files on disk.

One question that remains is:

The pattern really just addresses strings that are *exactly* 2-7 chars
long. Do I understand right, that there's no way to automatically
detect a pattern ":-)" in the string " :-)" or ":-) " or
"    :-)              " directly???

Do I always have to make a list of starting characters and then scan
for a 7 char string, a 6 length, a 5 length... until maybe one pattern
matches?

Karsten

PS: I'm really really happy :-D ATM
Karsten Wutzke - 24 Mar 2008 16:41 GMT
> On 21 Mar., 10:21, Jussi Piitulainen <jpiit...@ling.helsinki.fi>
> wrote:
[quoted text clipped - 83 lines]
>
> PS: I'm really really happy :-D ATM

It would be great if someone could look over that last question I have; I
suspect it got somewhat overlooked due to the Chinese spammer...

Karsten
Christian - 24 Mar 2008 17:47 GMT
Karsten Wutzke wrote:
>> On 21 Mar., 10:21, Jussi Piitulainen <jpiit...@ling.helsinki.fi>
>> wrote:
[quoted text clipped - 62 lines]
>> detect a pattern ":-)" in the string " :-)" or ":-) " or
>> "    :-)              " directly???

Use find() method on the matcher

>> Do I always have to make a list of starting characters and then scan
>> for a 7 char string, a 6 length, a 5 length... until maybe one pattern
>> matches?

you already have made a pattern that matches all 2-7 char smileys .. use
find() to find one after the other in a string with any length..
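
The difference between matches() and find() can be seen in a small sketch. The pattern here is a deliberately tiny stand-in (eyes, optional nose, one-character mouth), not the full pattern from this thread, and the class name is made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: a simplified stand-in pattern, not the thread's full one.
final class FindDemo {
    static final Pattern SMILEY = Pattern.compile("[:;8]-?[()DP]");

    // Collects every match found anywhere in the text, in order.
    static List<String> findAll(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = SMILEY.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String padded = "   :-)   ";
        // matches() requires the entire input to match, so the padding makes it false.
        System.out.println(SMILEY.matcher(padded).matches());
        // find() scans for the next match at any position.
        System.out.println(findAll(padded));
    }
}
```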

>> Karsten
>>
[quoted text clipped - 4 lines]
>
> Karsten
Karsten Wutzke - 25 Mar 2008 11:34 GMT
> Use find() method on the matcher
>
[quoted text clipped - 4 lines]
> you already have made a pattern that matches all 2-7 char smileys .. use
> find() to find one after the other in a string with any length..

Sorry, my fault: I used the matches() method, which returns true only if
the whole string matches the pattern, so it doesn't ignore any leading
or trailing characters.

Works beautifully now. Thanks for the great help on this!
Karsten
Roedy Green - 21 Mar 2008 00:39 GMT
On Thu, 20 Mar 2008 13:09:22 -0700, "Peter Duniho"
<NpOeStPeAdM@nnowslpianmk.com> wrote, quoted or indirectly quoted
someone who said :

>While I'm not an expert in using regular expressions, I know for a fact  
>they can in fact address the scenarios you're describing.  

Regexes are for looking for patterns.  Here you just have simple
Strings.  So regexes are overkill.

A masochist would try to encode the similar emoticons as regex
expressions.   However they would be extremely difficult to debug and
impossible to maintain.  You would have to generate the patterns with
a program that analysed your list of emoticons.

However, even if you did that, the regex mechanism does not tell you
WHICH pattern it found, so you have to redo the recognition logic once
the regex determined there was an emoticon present.

One problem to watch out for:

is ":-)-------" an emoticon?  In other words, do you parse to find the
end of emoticonny characters first and insist on a complete match?

Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com

Roedy Green - 21 Mar 2008 00:33 GMT
>how do I write a text parser that will detect many ":-)", "]:->"
>strings so that they can be replaced with small icons in a text
>component? Can someone direct me to some classes which might be
>useful? Pattern? Looks complicated... BTW there's no real pattern in
>those codes as I also use custom codes for other symbols, e.g. (cig)
>or :cig:, haven't decided that yet...

Here are 5 ways to tackle your problem:

1. see http://mindprod.com/jgloss/parser.html

2. You could do it crudely but simply with a table and  loop through
the table using indexOf until you find all the emoticons. Look for the
longest ones first.

3. A very fast, but hard-to-maintain technique would be to do it with
a case statement for the first char that then looks at the second char
etc.

4. The mathematically inclined might write a program to analyse the
list of emoticons and generate code for a finite state automaton.  See
http://mindprod.com/jgloss/finitestate.html

5. A practical solution might be to make a list of chars that start
emoticons.  Then for each char build a list of emoticons that start
with that char.  Now scan for emoticon-starting chars. When you find
one, compare the look-ahead with all the candidate emoticons that
start with that letter.  You could implement this as a case with if
for the emoticon-starting letters e.g.

switch ( nextChar )
{
// look holds the look-ahead text beginning at nextChar; startsWith
// is safe even when fewer than 3 chars remain
case ':':  return look.startsWith(":-)") || look.startsWith(":-(");

case '<': return look.startsWith("<:)");

default: return false;
}
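
Option 5 could also be driven by data instead of a hand-written switch. The sketch below is one possible reading of it; the class name and the example codes are hypothetical, and it assumes that no code is empty.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: groups codes by their first char, longest code first in each group.
final class FirstCharIndex {
    private final Map<Character, List<String>> byFirstChar = new HashMap<>();

    FirstCharIndex(List<String> codes) {
        for (String code : codes) {
            byFirstChar.computeIfAbsent(code.charAt(0), k -> new ArrayList<>()).add(code);
        }
        for (List<String> bucket : byFirstChar.values()) {
            bucket.sort((a, b) -> Integer.compare(b.length(), a.length()));
        }
    }

    // Returns the longest code that starts at offset, or null if none does.
    String matchAt(String text, int offset) {
        List<String> bucket = byFirstChar.get(text.charAt(offset));
        if (bucket == null) {
            return null; // this char never starts an emoticon
        }
        for (String code : bucket) {
            if (text.startsWith(code, offset)) {
                return code;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        FirstCharIndex index = new FirstCharIndex(Arrays.asList(":-)", ":-(", ":-((", "<:)"));
        System.out.println(index.matchAt("x :-(( y", 2));
    }
}
```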
Karsten Wutzke - 21 Mar 2008 02:59 GMT
On 21 Mar., 00:33, Roedy Green <see_webs...@mindprod.com.invalid>
wrote:

> >how do I write a text parser that will detect many ":-)", "]:->"
> >strings so that they can be replaced with small icons in a text
[quoted text clipped - 36 lines]
> Roedy Green Canadian Mind Products
> The Java Glossary http://mindprod.com

Hmmm, you really gave me an idea:

As I said, I already have a map of all emoticons, like
TreeMap<String,ImageIcon> key -> value "}:'-DD=" -> icon. I will look
for starting characters. When I find one, I'll take the longest
substring from there on (max len 7 = "}:'-DD=") and use that as the
key into the ImageIcon map. If the map returns null, I take a
substring one shorter (6 = "}:'-DD") and look in the map again; if not
found, take substring 5 ("}:'-D"); if not found, take substring 4
("}:'-"), etc. The first hit is the ImageIcon; otherwise move on to
the next starting char.
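
This shrinking-substring lookup could be sketched as follows. It is a simplified illustration: it tries every position instead of filtering on starting characters first, uses String icon names in place of ImageIcon, and the class name and limits are assumptions taken from the 2 to 7 character range mentioned in the thread.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: icon names stand in for ImageIcon; every position is tried.
final class SubstringScan {
    static final int MAX_LEN = 7;
    static final int MIN_LEN = 2;

    // Returns the icons for all codes found, in order of appearance.
    static List<String> scan(String text, Map<String, String> icons) {
        List<String> hits = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            String icon = null;
            // Start with the longest possible substring and shrink it.
            int len = Math.min(MAX_LEN, text.length() - i);
            for (; len >= MIN_LEN; len--) {
                icon = icons.get(text.substring(i, i + len));
                if (icon != null) {
                    break;
                }
            }
            if (icon != null) {
                hits.add(icon);
                i += len; // continue after the matched code
            } else {
                i++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, String> icons = new HashMap<>();
        icons.put(":-(", "sad");
        icons.put(":-((", "weep");
        icons.put(";)", "wink");
        System.out.println(scan("a :-(( b ;) c :-(", icons));
    }
}
```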

The map contains duplicate ImageIcons of course... several codes
(keys) for one and the same image icon.

Thanks for the head start. I hope I'm not wrong... let's see. Have to
get a new beer. ;-)

Karsten
Roedy Green - 21 Mar 2008 08:32 GMT
>The map contains duplicate ImageIcons of course... several codes
>(keys) for one and the same image icon.

A variant of that is to classify chars as emoticon and non-emoticon.

Then when you hit an emoticon char, you scan to the next non-emoticon.
Then you look up that piece in the middle in your HashMap.

In other words, you look up the longest possibility.
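
A rough sketch of this variant follows. The set of "emoticon chars" is an assumption made for the example (punctuation only, no letters), and the class name is made up. A run such as ":-)-------" is looked up as a whole, so it only counts if the map contains exactly that string.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: EMOTICON_CHARS is a hypothetical, letter-free character set.
final class RunScan {
    static final String EMOTICON_CHARS = ":;8-()'=,|#@*$>}]";

    private static boolean isEmoticonChar(char c) {
        return EMOTICON_CHARS.indexOf(c) >= 0;
    }

    // Looks up each maximal run of emoticon chars as one key.
    static List<String> scan(String text, Map<String, String> icons) {
        List<String> hits = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            if (!isEmoticonChar(text.charAt(i))) {
                i++;
                continue;
            }
            int end = i;
            while (end < text.length() && isEmoticonChar(text.charAt(end))) {
                end++;
            }
            String icon = icons.get(text.substring(i, end));
            if (icon != null) {
                hits.add(icon);
            }
            i = end; // skip the whole run, matched or not
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, String> icons = new HashMap<>();
        icons.put(":-)", "smile");
        System.out.println(scan("hi :-) there", icons));
        System.out.println(scan("hi :-)------- there", icons));
    }
}
```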