Java Forum / General / March 2008
Emoticon text parser
Karsten Wutzke - 20 Mar 2008 18:40 GMT Hello,
how do I write a text parser that will detect many ":-)", "]:->" strings so that they can be replaced with small icons in a text component? Can someone direct me to some classes which might be useful? Pattern? Looks complicated... BTW there's no real pattern in those codes as I also use custom codes for other symbols, e.g. (cig) or :cig:, haven't decided that yet...
TIA Karsten
Peter Duniho - 20 Mar 2008 19:01 GMT > how do I write a text parser that will detect many ":-)", "]:->" > strings so that they can be replaced with small icons a text > component? That all depends on how you're formatting the text to include the icons.
But, for example, if you're using HTML in one of the Swing controls that can display HTML, you might find the java.util.regex package useful. Just use it to do a straight search-and-replace of emoticon strings with the appropriate HTML "img" tag.
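A minimal sketch of that kind of search-and-replace, assuming a single emoticon code and a hypothetical icon file name (the class and method names are made up for illustration). Pattern.quote makes the code match literally, so characters like ")" need no hand-written escapes:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImgTagReplace {
    // Replaces every literal occurrence of one emoticon code with an HTML img tag.
    // iconFile is a hypothetical file name, not something from the thread.
    static String replaceCode(String text, String code, String iconFile) {
        // Pattern.quote turns the code into a literal pattern (no manual escaping).
        Pattern p = Pattern.compile(Pattern.quote(code));
        String tag = "<img src=\"" + iconFile + "\">";
        // quoteReplacement protects any "$" or "\" in the replacement text.
        return p.matcher(text).replaceAll(Matcher.quoteReplacement(tag));
    }

    public static void main(String[] args) {
        System.out.println(replaceCode("Hello :-) world", ":-)", "smile.png"));
    }
}
```

With many codes you would probably loop over a table of codes, or build one combined pattern, rather than call this once per code.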
Other than that, your question is pretty broadly worded. You're either going to get a lot of replies that aren't applicable, or none at all due to the vagueness of the problem description.
Pete
Karsten Wutzke - 20 Mar 2008 20:56 GMT > > how do I write a text parser that will detect many ":-)", "]:->" > > strings so that they can be replaced with small icons a text [quoted text clipped - 12 lines] > > Pete The strings to be parsed are (at first) plain strings without any control codes or any programming language constructs... but they contain completely differing smiley/emoticon codes, where some codes might contain others, so I have to figure that out:
Example:
one might express *very sad* or *weeping* by writing the string ":-((" this smiley code however starts with ":-(", the smiley code for *sad* or *unhappy*...
OK solution here would be to sort the order of smileys tested against by their code string length, here first check ":-((", then ":-("...
But I need some general parsing approach. Regex really necessary? Pattern? Looks very complicated to me... I doubt I can use a pattern for codes that range from ":-)" over "}:->" over ";)" to (CIG)...
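One possible way to combine the "longest code first" idea with a pattern, sketched under assumptions: the class and method names are hypothetical, and the code list is just a sample. Each code is quoted with Pattern.quote, so codes as different as ":-)" and "(CIG)" can sit in one alternation without hand-written escapes:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LongestFirst {
    // Builds one pattern from literal codes, longest code first, so that
    // ":-((" is tried before ":-(" at the same position.
    static Pattern build(List<String> codes) {
        List<String> sorted = new ArrayList<>(codes);
        sorted.sort((a, b) -> Integer.compare(b.length(), a.length()));
        StringBuilder sb = new StringBuilder();
        for (String code : sorted) {
            if (sb.length() > 0) {
                sb.append('|');
            }
            sb.append(Pattern.quote(code)); // match the code literally
        }
        return Pattern.compile(sb.toString(), Pattern.CASE_INSENSITIVE);
    }

    // Collects every code found in the text, in order of appearance.
    static List<String> findAll(Pattern p, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }
}
```

This relies on Java trying alternatives from left to right, which is why the sort order matters.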
Karsten
Peter Duniho - 20 Mar 2008 21:09 GMT > [...] > one might express *very sad* or *weeping* by writing the string ":-((" [quoted text clipped - 7 lines] > Pattern? Looks very complicated to me... I doubt I can use a pattern > for codes that range from ":-)" over "}:->" over ";)" to (CIG)... While I'm not an expert in using regular expressions, I know for a fact that they can address the scenarios you're describing. Regular expressions are a fairly powerful language in their own right. You may well need some help from an actual expert, or be prepared to spend a lot of time (days, at least, longer depending on your own skills and specific needs) learning it well enough to meet your needs. But it can do it.
Inasmuch as the problem itself is complicated, so too will the solution be. I'm not convinced there's any way around that.
Is regex "necessary"? No, not at all. But at first blush, your problem seems to be exactly what regular expressions were designed to do: find specific patterns in strings and, optionally, replace those patterns with new patterns (text). In that sense, doesn't it make sense to explore that as a possible solution?
Pete
Karsten Wutzke - 20 Mar 2008 23:45 GMT > > [...] > > one might express *very sad* or *weeping* by writing the string ":-((" [quoted text clipped - 25 lines] > > Pete Yap it's worth a try. Hmm ok, I started out analyzing the structure of the smiley codes.
They consist of:
  1       1        1          1       1-2       1       <- number of characters
[hair] - eyes - [subeyes] - [nose] - mouth - [beard]
OK. subeyes and nose are OPTIONAL. It seems only mouth has more than 1 char, but max 2. So this gives 6 positions/properties and max. 7 chars.
Here are the possible strings applying to each position:
hair = {"o", "O", ">", "}", "]", ")"} <-- hair optional!
eyes = {":", ";", "8"}
subeyes = {"'", ","} <-- subeyes optional!
nose = {"-"} <-- nose optional!
mouth = {")", "(", "s", "S", "d", "D", "p", "P", "c", "C", "o", "O", "#", "@", "*", "$", "|", "))", "(("}
beard = {"="} <-- beard optional!
It basically ought to ignore all UPPER and lower case for letters so both are valid. As you can see there is almost every regex special character involved so the resulting pattern will look awkward (at least to me).
Other than that, I might design the codes to contain exactly 1 char per position, if that would simplify things or make it possible at all. It would not be a problem to introduce an optional [subnose]:
...      1         1         1       1
... - [nose] - [subnose] - mouth - beard
So the minimum smiley code length is 2, max is 7.
A fictitious (length 7) example pattern recognized could be: "};'O))="
This would be a bearded ("=") winking (";") weeping ("'") devil ("}"), very happy ("))"), with an (UPPERCASE) pig's nose ("O") (-> makes sense, right? ;-) ). OK, now this would qualify as a potential pattern match. If that matched the parsed string, I would check the actual map TreeMap<String,ImageIcon> of smiley images actually available. If an icon was found, I would know it's time to replace the string with that image. Sounds easy...
However, I have no idea how to construct the regex for this. I probably don't have that much time to learn from scratch. I believe "Perl's" pattern language can do a lot of very complicated things that might not even narrowly touch what I need.
Maybe some "expert" here might be able to construct the pattern or at least can direct me to the right paragraphs at
http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html
Can anyone assist me please?
Help very much appreciated!
Karsten
Jussi Piitulainen - 21 Mar 2008 10:21 GMT > Here are the possible strings applying to each position: > [quoted text clipped - 7 lines] > "))", "(("} > beard = {"="} <-- beard optional! That is very close to a regular expression already. It's as if you are spelling out the meaning of such an expression here.
Most of these are character sets. The exceptions are the two two-character mouths, so mouth must be partly an alternation.
hair    = [oO>}\])]?     "]" must be escaped
eyes    = [:;8]          no problem
subeyes = [',]?
nose    = -?
mouth   = (?:[sSdDpPcCoO#@*$|]|\)\)?|\(\(?)

This is [...] | one or two of ) | one or two of (. Parentheses need escaping, and I've wrapped it all in (?: ) to make it a non-capturing group.

beard   = =?

Put it all together, in a string, which requires doubling the escapes: "[oO>}\\])]?[:;8][',]?-?(?:[sSdDpPcCoO#@*$|]|\\)\\)?|\\(\\(?)=?". Ouch. It does look ugly.
We can ease the pain with the COMMENTS flag of Pattern; we must then escape the comment character #, and comments end at the end of the line. Let's make it CASE_INSENSITIVE too.
import java.util.regex.Pattern;
import java.util.regex.Matcher;

class Test {
    public static void main(String [] args) {
        Pattern p = Pattern.compile
            ("[o>}\\])]?           # hair, optional     \n" +
             "[:;8]                # eyes               \n" +
             "[',]?                # subeyes, optional  \n" +
             "-?                   # nose, optional     \n" +
             "(?: [sdpco\\#@*$|]   " +
             "  | \\)\\)?          " +
             "  | \\(\\(? )        # mouth              \n" +
             "=?                   # beard, optional    \n",
             Pattern.COMMENTS | Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(args[0]);
        while (m.find()) {
            System.out.println("Found " + m.group() +
                               " at " + m.start() +
                               " to " + m.end());
        }
    }
}
That's about the best I can do.
Christian - 21 Mar 2008 14:38 GMT Jussi Piitulainen wrote:
>> Here are the possible strings applying to each position: >> [quoted text clipped - 59 lines] > > That's about the best I can do. As a tip, I have often found it handier to put a pattern together from subpatterns than to write it all in one big unreadable/commented string. To do this, just surround each subpattern with a non-capturing group (though that is not mandatory for single-character patterns), e.g.
String hair = "(?:[oO>}\\])])";
String eyes = "(?:[:;8])";
...
Then:
Pattern p = Pattern.compile(hair+"?"+eyes+subeyes+"?"+nose+"?"+mouth+beard+"?");
Especially with longer and more complex regexps, this helps a lot. I am thinking especially of regexps like one for an IP address:
String BYTE = "(?:(?:[01]?\\d\\d?)|(?:2[0-4]\\d)|(?:25[0-5]))";
String IP = "(?:"+BYTE+"\\."+BYTE+"\\."+BYTE+"\\."+BYTE+")";
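Applied to the emoticon pattern from earlier in the thread, the sub-pattern approach might look roughly like the sketch below. The class name, constant names, and the isSmiley helper are assumptions for illustration; the character sets are taken from Jussi's pattern, with CASE_INSENSITIVE covering the upper-case letters:

```java
import java.util.regex.Pattern;

class ComposedPattern {
    // Each piece is wrapped in a non-capturing group so a trailing "?"
    // applies to the whole piece.
    static final String HAIR    = "(?:[o>}\\])])";
    static final String EYES    = "(?:[:;8])";
    static final String SUBEYES = "(?:[',])";
    static final String NOSE    = "(?:-)";
    static final String MOUTH   = "(?:[sdpco#@*$|]|\\)\\)?|\\(\\(?)";
    static final String BEARD   = "(?:=)";

    static final Pattern SMILEY = Pattern.compile(
        HAIR + "?" + EYES + SUBEYES + "?" + NOSE + "?" + MOUTH + BEARD + "?",
        Pattern.CASE_INSENSITIVE);

    // True only if the whole string is one smiley code.
    static boolean isSmiley(String s) {
        return SMILEY.matcher(s).matches();
    }
}
```

Without the COMMENTS flag, the "#" inside the mouth character class needs no escape.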
As I am already OT:
Is anyone besides me missing some kind of regexp database in the Java API? A few hundred predefined Strings for common stuff like IP addresses, URLs, URIs, and numbers (byte, short, int, long).
Christian
Karsten Wutzke - 22 Mar 2008 21:37 GMT On 21 Mar., 10:21, Jussi Piitulainen <jpiit...@ling.helsinki.fi> wrote:
> > Here are the possible strings applying to each position: > [quoted text clipped - 60 lines] > > That's about the best I can do. And it is great! It works like a charm and even seems to be fast as lightning... I also split up the sub components into several strings as Christian suggested, instead of using the comments approach. I suppose that one was made for loading (commented) patterns from files on disk.
One question that remains is:
The pattern really just addresses strings that are *exactly* 2-7 chars long. Do I understand right, that there's no way to automatically detect a pattern ":-)" in the string " :-)" or ":-) " or " :-) " directly???
Do I always have to make a list of starting characters and then scan for a 7 char string, a 6 length, a 5 length... until maybe one pattern matched?
Karsten
PS: I'm really really happy :-D ATM
Karsten Wutzke - 24 Mar 2008 16:41 GMT > On 21 Mrz., 10:21, Jussi Piitulainen <jpiit...@ling.helsinki.fi> > wrote: [quoted text clipped - 83 lines] > > PS: I'm really really happy :-D ATM Would be great if someone could look over that last question I have, I suspect it got somewhat overlooked due to the chinese spammer...
Karsten
Christian - 24 Mar 2008 17:47 GMT Karsten Wutzke wrote:
>> On 21 Mrz., 10:21, Jussi Piitulainen <jpiit...@ling.helsinki.fi> >> wrote: [quoted text clipped - 62 lines] >> detect a pattern ":-)" in the string " :-)" or ":-) " or >> " :-) " directly??? Use find() method on the matcher
>> Do I always have to make a list of starting characters and then scan >> for a 7 char string, a 6 lenght, a 5 length... until maybe one pattern >> matched? you already have made a pattern that matches all 2-7 char smileys .. use find() to find one after the other in a string with any length..
>> Karsten >> [quoted text clipped - 4 lines] > > Karsten
Karsten Wutzke - 25 Mar 2008 11:34 GMT > Use find() method on the matcher > [quoted text clipped - 4 lines] > you already have made a pattern that matches all 2-7 char smileys .. use > find() to find one after the other in a string with any length.. Sorry, my fault: I used the matches() method, which returns true only when the whole string matches the pattern and does not ignore any leading or trailing characters.
Works beautifully now. Thanks for the great help on this! Karsten
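For anyone reading along, the difference between the two Matcher methods can be seen in a small sketch. The pattern here is a deliberately simplified stand-in (an assumption, not the full smiley pattern from this thread), and the class and method names are made up:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindVsMatches {
    // Simplified stand-in pattern: ":" then an optional "-" then ")" or "(".
    static final Pattern SMILEY = Pattern.compile(":-?[)(]");

    // matches() succeeds only if the ENTIRE input is one smiley.
    static boolean wholeInputIsSmiley(String s) {
        return SMILEY.matcher(s).matches();
    }

    // find() scans forward and reports each occurrence inside a longer string.
    static int countSmileys(String s) {
        Matcher m = SMILEY.matcher(s);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }
}
```

So " :-) " fails matches() because of the surrounding spaces, while find() still locates the smiley inside it.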
Roedy Green - 21 Mar 2008 00:39 GMT On Thu, 20 Mar 2008 13:09:22 -0700, "Peter Duniho" <NpOeStPeAdM@nnowslpianmk.com> wrote, quoted or indirectly quoted someone who said :
>While I'm not an expert in using regular expressions, I know for a fact >they can in fact address the scenarios you're describing. Regexes are for looking for patterns. Here you just have simple Strings. So regexes are overkill.
A masochist would try to encode the similar emoticons as regex expressions. However they would be extremely difficult to debug and impossible to maintain. You would have to generate the patterns with a program that analysed your list of emoticons.
However, even if you did that, the regex mechanism does not tell you WHICH pattern it found, so you have to redo the recognition logic once the regex determined there was an emoticon present.
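For reference, Matcher.group() does return the text that actually matched, so one possible way to learn which emoticon was found is to use that text as a key into a map. A minimal sketch, assuming a simplified pattern and hypothetical icon file names (none of these names come from the thread):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WhichEmoticon {
    // Hypothetical code -> icon file table; strings stand in for ImageIcons.
    static final Map<String, String> ICONS = new HashMap<>();
    static {
        ICONS.put(":-)", "happy.png");
        ICONS.put(":-(", "sad.png");
        ICONS.put(":-((", "weeping.png");
    }

    // Simplified pattern: ":-" followed by ")" or by one or two "(".
    static final Pattern SMILEY = Pattern.compile(":-(?:\\)|\\(\\(?)");

    // Returns the icon for each recognized emoticon, in order of appearance.
    static List<String> iconsIn(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = SMILEY.matcher(text);
        while (m.find()) {
            String icon = ICONS.get(m.group()); // matched text is the map key
            if (icon != null) {
                result.add(icon);
            }
        }
        return result;
    }
}
```

The null check covers the case where the pattern accepts a code for which no icon exists.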
One problem to watch out for:
is ":-)-------" an emoticon? In other words, do you parse to find the end of emoticonny characters first and insist on a complete match?
 Signature Roedy Green Canadian Mind Products The Java Glossary http://mindprod.com
Roedy Green - 21 Mar 2008 00:33 GMT >how do I write a text parser that will detect many ":-)", "]:->" >strings so that they can be replaced with small icons a text >component? Can someone direct me to some classes which might be >useful? Pattern? Looks complicated... BTW there's no real pattern in >those codes as I also use custom codes for other symbols, e.g. (cig) >oder :cig:, haven't decided that yet... Here are 5 ways to tackle your problem:
1. see http://mindprod.com/jgloss/parser.html
2. You could do it crudely but simply with a table and loop through the table using indexOf until you find all the emoticons. Look for the longest ones first.
3. A very fast, but hard-to-maintain technique would be to do it with a case statement for the first char that then looks at the second char etc.
4. The mathematically inclined might write a program to analyse the list of emoticons and generate code for a finite state automaton. See http://mindprod.com/jgloss/finitestate.html
5. A practical solution might be to make a list of chars that start emoticons. Then for each char build a list of emoticons that start with that char. Now scan for emoticon-starting chars. When you find one, compare the look-ahead with all the candidate emoticons that start with that char. You could implement this as a switch on the emoticon-starting chars with an if for each candidate, e.g.
// look is the text from the current position onward; nextChar is its first char
switch ( nextChar )
    {
    case ':':
        return look.startsWith( ":-)" ) || look.startsWith( ":-(" );
    case '<':
        return look.startsWith( "<:)" );
    default:
        return false;
    }
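Approach 5 could also be driven by a table instead of a hand-written switch. The following is only a sketch under assumptions: the class and method names are hypothetical, and it assumes every emoticon code is non-empty. Candidates for each starting char are kept longest first, so ":-((" wins over ":-(":

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class StartCharScanner {
    // Maps each starting char to the emoticon codes that begin with it.
    private final Map<Character, List<String>> byFirstChar = new HashMap<>();

    // Assumes every code has at least one character.
    StartCharScanner(List<String> codes) {
        for (String code : codes) {
            byFirstChar.computeIfAbsent(code.charAt(0), c -> new ArrayList<>()).add(code);
        }
        // Longest candidates first, so a longer code beats its own prefix.
        for (List<String> list : byFirstChar.values()) {
            list.sort((a, b) -> Integer.compare(b.length(), a.length()));
        }
    }

    // Returns the emoticon code starting at pos, or null if there is none.
    String matchAt(String text, int pos) {
        List<String> candidates = byFirstChar.get(text.charAt(pos));
        if (candidates == null) {
            return null;
        }
        for (String code : candidates) {
            if (text.startsWith(code, pos)) {
                return code;
            }
        }
        return null;
    }

    // Scans the whole text and returns the codes found, in order.
    List<String> scan(String text) {
        List<String> found = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            String code = matchAt(text, i);
            if (code == null) {
                i++;
            } else {
                found.add(code);
                i += code.length();
            }
        }
        return found;
    }
}
```

Adding a new emoticon then means adding one string to the list rather than editing a switch.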
Karsten Wutzke - 21 Mar 2008 02:59 GMT On 21 Mar., 00:33, Roedy Green <see_webs...@mindprod.com.invalid> wrote:
> >how do I write a text parser that will detect many ":-)", "]:->" > >strings so that they can be replaced with small icons a text [quoted text clipped - 36 lines] > Roedy Green Canadian Mind Products > The Java Glossaryhttp://mindprod.com Hmmm you really brought me to an idea:
As I said, I already have a map of all emoticons, like TreeMap<String,ImageIcon>, key -> value, e.g. "}:'-DD=" -> icon. I will look for starting characters. When I find one, I'll take the longest substring from there on (max len 7 = "}:'-DD=") and use that as the key into the ImageIcon map. If the map returns null, I take a substring one shorter (6 = "}:'-DD") and look in the map again; if not found, take substring 5 ("}:'-D"); if not found, take substring 4 ("}:'-"), etc. The first hit is the ImageIcon; otherwise move on to the next starting char.
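A rough sketch of that shrinking-substring lookup, under assumptions: the class and method names are hypothetical, plain strings stand in for the ImageIcon values so the example runs without Swing, and every position is tried instead of pre-filtering on starting characters:

```java
import java.util.Map;
import java.util.TreeMap;

class LongestPrefixLookup {
    static final int MAX_LEN = 7; // longest smiley code, per the thread

    // Sample table; icon names are placeholders for ImageIcon values.
    static Map<String, String> sampleIcons() {
        Map<String, String> icons = new TreeMap<>();
        icons.put(":-(", "sad");
        icons.put(":-((", "weeping");
        icons.put("}:'-DD=", "devil");
        return icons;
    }

    // Tries the longest substring at pos first, then shorter ones.
    // Returns the matching code, or null if no key starts at pos.
    static String longestCodeAt(Map<String, String> icons, String text, int pos) {
        int maxEnd = Math.min(text.length(), pos + MAX_LEN);
        for (int end = maxEnd; end > pos; end--) {
            String candidate = text.substring(pos, end);
            if (icons.containsKey(candidate)) {
                return candidate;
            }
        }
        return null;
    }

    // Replaces each code with "[iconName]" as a stand-in for inserting an icon.
    static String replaceAll(Map<String, String> icons, String text) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            String code = longestCodeAt(icons, text, i);
            if (code == null) {
                out.append(text.charAt(i));
                i++;
            } else {
                out.append('[').append(icons.get(code)).append(']');
                i += code.length();
            }
        }
        return out.toString();
    }
}
```

Checking for starting characters first, as described above, would skip most of the map lookups on ordinary text.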
The map contains duplicate ImageIcons of course... several codes (keys) for one and the same image icon.
Thanks for the head start. I hope I'm not wrong... let's see. Have to get a new beer. ;-)
Karsten
Roedy Green - 21 Mar 2008 08:32 GMT >The map contains duplicate ImageIcons of course... several codes >(keys) for one and the same image icon. a variant of that is to classify chars as emoticon and non-emoticon.
Then when you hit an emoticon char, you scan to the next non-emoticon. Then you look up that piece in the middle in your HashMap.
In other words, you look up the longest possibility.
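A sketch of this variant, with loud assumptions: the set of "emoticon chars" below is a guess for illustration, the class and method names are hypothetical, and strings stand in for icons. Note that a run such as ":-)---" is looked up as a whole, so it is not recognized unless the map contains exactly that key:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class RunScanner {
    // Hypothetical set of characters that may appear in an emoticon.
    static final String EMOTICON_CHARS = ":;8-()'=>}]";

    static boolean isEmoticonChar(char c) {
        return EMOTICON_CHARS.indexOf(c) >= 0;
    }

    // Finds each maximal run of emoticon chars and looks the run up in the map.
    static List<String> iconsIn(Map<String, String> icons, String text) {
        List<String> found = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            if (!isEmoticonChar(text.charAt(i))) {
                i++;
                continue;
            }
            int end = i;
            while (end < text.length() && isEmoticonChar(text.charAt(end))) {
                end++;
            }
            String icon = icons.get(text.substring(i, end)); // whole run is the key
            if (icon != null) {
                found.add(icon);
            }
            i = end;
        }
        return found;
    }
}
```

This insists on a complete match of the run, which is one answer to the ":-)-------" question raised earlier.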