08 April 2010

Removing all accents (and other diacritics) from a String

Valid since: op4j 1.0

Description
Remove the diacritics common in European languages from the characters in a String, converting the text into an ASCII-compatible String. This is a common operation in, for example, text search and comparison scenarios.

Scenario
Our conts variable is an array containing the names of the Earth's continents in Castilian Spanish:
//conts == ARRAY [ "África", "América", "Antártida", "Asia", "Europa", "Oceanía" ]
...and, knowing that our users might forget to type accents in our application, we need to strip all the accents from those texts so that searches are not affected by their bad orthography:
//conts == ARRAY [ "Africa", "America", "Antartida", "Asia", "Europa", "Oceania" ]

Recipe
Use the op4j asciify() function in the FnString function hub class, which transforms accented characters (and characters with other diacritics) into their non-accented equivalents.

Because this function has to be applied to each element of the array, a map(..) action is needed to execute it:

conts = Op.on(conts).map(FnString.asciify()).get();

This is, of course, less verbose than (but equivalent to):

conts = Op.on(conts).forEach().exec(FnString.asciify()).get();
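
For readers curious about what this kind of transformation involves, here is a minimal plain-Java sketch of diacritic removal. It is not op4j's implementation: the class AsciifySketch and its methods are hypothetical names made up for this illustration, and the approach (Unicode NFD decomposition followed by dropping combining marks) is an assumption about one reasonable way to do it. Characters without a canonical decomposition, such as ø or ß, pass through this sketch unchanged, so a library function may well cover more cases.

```java
import java.text.Normalizer;
import java.util.Arrays;

class AsciifySketch {

    // Hypothetical helper, not part of op4j: decomposes each accented character
    // into its base letter plus combining marks (NFD), then removes the marks.
    static String stripDiacritics(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    // Applies the helper to every element, which is the role map(..) plays in the recipe.
    static String[] stripAll(String[] texts) {
        String[] result = new String[texts.length];
        for (int i = 0; i < texts.length; i++) {
            result[i] = stripDiacritics(texts[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        // The continent names from the scenario, written with Unicode escapes
        // so that this source file stays pure ASCII.
        String[] conts = {
            "\u00C1frica", "Am\u00E9rica", "Ant\u00E1rtida", "Asia", "Europa", "Ocean\u00EDa"
        };
        // prints [Africa, America, Antartida, Asia, Europa, Oceania]
        System.out.println(Arrays.toString(stripAll(conts)));
    }
}
```

The explicit loop in stripAll is exactly the boilerplate that the one-line op4j expression spares us.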

Comments
But what if we also wanted uppercase output? Well, just throw in the FnString.toUpperCase() function:
conts = 
    Op.on(conts).forEach().exec(FnString.asciify()).exec(FnString.toUpperCase()).get();
Or equivalently, we could chain both functions into one:
conts = 
    Op.on(conts).map(FnFunc.chain(FnString.asciify(), FnString.toUpperCase())).get();
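
The idea behind chaining, feeding the output of one function into the next, can also be sketched without op4j. The example below uses java.util.function.Function.andThen from Java 8 onwards purely as an illustration. ChainSketch and its fields are hypothetical names, the two lambdas are simplified stand-ins for the op4j functions rather than their real implementations, and nothing here claims to show how FnFunc.chain works internally.

```java
import java.text.Normalizer;
import java.util.Arrays;
import java.util.Locale;
import java.util.function.Function;

class ChainSketch {

    // Hypothetical stand-in for an accent-stripping function:
    // NFD decomposition, then removal of the combining marks.
    static final Function<String, String> STRIP_DIACRITICS =
            s -> Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}+", "");

    // Hypothetical stand-in for an uppercasing function.
    static final Function<String, String> TO_UPPER = s -> s.toUpperCase(Locale.ROOT);

    // Composition: the result of the first function becomes the input of the second.
    static final Function<String, String> STRIP_THEN_UPPER = STRIP_DIACRITICS.andThen(TO_UPPER);

    // Applies a single function to every element of the array.
    static String[] applyToAll(String[] texts, Function<String, String> fn) {
        return Arrays.stream(texts).map(fn).toArray(String[]::new);
    }

    public static void main(String[] args) {
        // Continent names written with Unicode escapes so the source stays pure ASCII.
        String[] conts = {
            "\u00C1frica", "Am\u00E9rica", "Ant\u00E1rtida", "Asia", "Europa", "Ocean\u00EDa"
        };
        // prints [AFRICA, AMERICA, ANTARTIDA, ASIA, EUROPA, OCEANIA]
        System.out.println(Arrays.toString(applyToAll(conts, STRIP_THEN_UPPER)));
    }
}
```

Composing first and mapping once traverses the array a single time with one combined function, which mirrors the second op4j form above.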
