09 April 2010

Extract some text from a String using a regular expression

Valid since: op4j 1.0

Given a String, apply a regular expression (containing some group definitions) to it and, if it matches, extract one of the matched groups into a different String.

Our books variable is an array containing some data about books, extracted from a text file. Our data look like this:
// books == ARRAY [ "Title=The Origin of Species; Price=24.90EUR",
//                  "Title=Odyssey; Price=13.50EUR",
//                  "Title=A Midsummer Night's Dream; Price=18.20EUR" ]
But we are only interested in the titles of those books, and we would like to create a String[] titles variable containing:
// titles == ARRAY [ "The Origin of Species",
//                  "Odyssey",
//                  "A Midsummer Night's Dream" ]
For doing so, we define the following regular expression:
// regex == "Title=(.*?); Price(.*)"

We iterate over our books array and apply the FnString.matchAndExtract(...) function to each element. It applies our regular expression and lets us choose which group to extract (in this case, group number 1):

String[] titles = 
    Op.on(books).forEach().exec(FnString.matchAndExtract(regex, 1)).get();

...which is in fact equivalent to:

String[] titles = 
    Op.on(books).map(FnString.matchAndExtract(regex, 1)).get();
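For readers who want to see what happens to each element, here is a minimal sketch of the same extraction using only java.util.regex. It is not op4j's implementation: the class TitleExtractor and the methods extractGroup and extractTitles are hypothetical names made up for this illustration, and returning null when the input does not match is an assumption of the sketch, not documented op4j behaviour.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleExtractor {

    // Same regular expression as in the recipe.
    private static final Pattern REGEX = Pattern.compile("Title=(.*?); Price(.*)");

    // Returns the requested group if the whole input matches the pattern.
    // Returning null on a non-match is an assumption of this sketch.
    static String extractGroup(String input, Pattern pattern, int group) {
        Matcher matcher = pattern.matcher(input);
        return matcher.matches() ? matcher.group(group) : null;
    }

    // Applies the extraction to every element of the array (group 1 is the title).
    static String[] extractTitles(String[] books) {
        String[] titles = new String[books.length];
        for (int i = 0; i < books.length; i++) {
            titles[i] = extractGroup(books[i], REGEX, 1);
        }
        return titles;
    }

    public static void main(String[] args) {
        String[] books = {
            "Title=The Origin of Species; Price=24.90EUR",
            "Title=Odyssey; Price=13.50EUR",
            "Title=A Midsummer Night's Dream; Price=18.20EUR"
        };
        for (String title : extractTitles(books)) {
            System.out.println(title);
        }
    }
}
```

The lazy group (.*?) stops at the first "; Price", which is why group 1 holds only the title.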

Let's look instead at a much more complex (and powerful) example: imagine that we didn't need only the book titles, but instead wanted to create a map for each book containing an entry for each piece of data ("Title" and "Price"). Something like this:
// bookInfo == LIST [ 
//                    MAP [
//                          "Title"="The Origin of Species"
//                          "Price"="24.90EUR"
//                        ],
//                    MAP [
//                          "Title"="Odyssey"
//                          "Price"="13.50EUR"
//                        ],
//                    MAP [
//                          "Title"="A Midsummer Night's Dream"
//                          "Price"="18.20EUR"
//                        ]
//                  ]
What would we need to get this starting from our books variable? First, define a new regular expression able to extract both keys and values:
// regex == "(.*?)=(.*?); (.*?)=(.*?)"
And now let's see the steps:
  1. Convert the array into a List.
  2. Iterate it. For each element:
    1. Apply the regular expression and extract groups 1, 2, 3 and 4 into a List<String>.
    2. Couple the four elements of the resulting list into two map entries: element 1 is the key and element 2 the value of the first entry, element 3 is the key and element 4 the value of the second entry.
The functions involved will be:
// FnString.matchAndExtractAll(String regex, int... groups) : Function<String, List<String>>
// FnList.ofString().couple() : Function<List<String>, Map<String, String>>
Let's have a look at the resulting code:
List<Map<String,String>> bookInfo = 
    Op.on(books).toList().forEach().
        exec(FnString.matchAndExtractAll(regex, 1,2,3,4)).
        exec(FnList.ofString().couple()).get();
Much easier to write than to think of!
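
To make the two inner steps concrete, the following is a rough plain-JDK sketch of the same pipeline. It is only an illustration of the idea, not how op4j works internally: BookInfoBuilder, extractAll, couple and toBookInfo are hypothetical names, and both the empty list on a non-match and the use of LinkedHashMap (to keep "Title" before "Price") are assumptions of this sketch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BookInfoBuilder {

    // Same key/value regular expression as in the recipe.
    private static final Pattern REGEX = Pattern.compile("(.*?)=(.*?); (.*?)=(.*?)");

    // Step 2.1: extract every group (1 to 4) into a list.
    // An empty list on a non-match is an assumption of this sketch.
    static List<String> extractAll(String input) {
        List<String> groups = new ArrayList<>();
        Matcher matcher = REGEX.matcher(input);
        if (matcher.matches()) {
            for (int g = 1; g <= matcher.groupCount(); g++) {
                groups.add(matcher.group(g));
            }
        }
        return groups;
    }

    // Step 2.2: pair consecutive elements as key, value, key, value...
    static Map<String, String> couple(List<String> items) {
        Map<String, String> map = new LinkedHashMap<>();
        for (int i = 0; i + 1 < items.size(); i += 2) {
            map.put(items.get(i), items.get(i + 1));
        }
        return map;
    }

    // Steps 1 and 2: walk the array and build one map per book.
    static List<Map<String, String>> toBookInfo(String[] books) {
        List<Map<String, String>> bookInfo = new ArrayList<>();
        for (String book : books) {
            bookInfo.add(couple(extractAll(book)));
        }
        return bookInfo;
    }

    public static void main(String[] args) {
        String[] books = {
            "Title=The Origin of Species; Price=24.90EUR",
            "Title=Odyssey; Price=13.50EUR",
            "Title=A Midsummer Night's Dream; Price=18.20EUR"
        };
        System.out.println(toBookInfo(books));
    }
}
```

Because the whole string must match, the last lazy group is forced to run to the end of the input and so captures the full price.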
