Virtual OS/2 International Consumer Education
VOICE Home Page: http://www.os2voice.org
January 2004



DrDialog, or: How I learned to stop worrying and love REXX - Part 11

By Thomas Klein © January 2004

Welcome back to our series on programming with REXX and DrDialog. I had to take a break to deal with other things that had piled up. Sorry for making you wait, but here we are again. Since the last article dealt with loops, there is one addendum I would like to make on that subject:

Make sure to never mess with the loop (or counter) variable manually!
While this is sometimes done in other languages to force an early exit from a loop, such "tricks" should be avoided because they can lead to unpredictable behavior in some circumstances. Instead, use the LEAVE instruction provided by REXX (EXIT would end the whole program, not just the loop) or think about using a different structure for your loops.
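
As a minimal sketch of ending a loop early without touching the counter, the following uses REXX's LEAVE instruction, which ends only the loop it is in. The stem "item." and its contents are made up for illustration:

/* leave sample (item. and its values are hypothetical) */
item.0 = 4
item.1 = "red"
item.2 = "green"
item.3 = "stop"
item.4 = "blue"
do i = 1 to item.0
   if item.i = "stop" then leave   /* end the loop; i itself is not modified */
   say item.i
end
say "Loop ended at entry" i        /* displays Loop ended at entry 3 */

After the loop, the counter still holds the value it had when LEAVE was executed, so the program can tell where it stopped.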

Today we'll talk about REXX's wealth of functions for working with strings. We won't cover the subject completely, because some of the functions are so specific that you might hardly ever need them. As you go on writing your own stuff, you'll find that some of the functions will always be part of your code while others won't be. It largely depends upon your approach to solving a specific problem. In order to provide a structured overview, I grouped them into functions for obtaining information about strings, functions for substrings and searching, and functions for creating or transforming strings.

At the end of the article, you might wonder what happened to the PARSE keyword/function. Well, PARSE is worth an article of its own, I guess. It is one of the most powerful parts of REXX - in terms of both functionality and complexity. We'll have one article dealing with the basic use of PARSE at a later time. There is of course much more to PARSE than what we will be dealing with in our series, but as this series is intended to address beginners in REXX as well, I don't think it would be such a great idea to confuse you by going into details. Sure, knowing all of PARSE's behavior might provide you with powerful means to solve your programming issues, but most of the time you'll only have to deal with a basic subset of it, and this is what we'll be dealing with as well.

Obtaining information about strings

When dealing with strings, it's very useful to know something about them before messing around.

LENGTH will tell you the number of characters (bytes) that a string contains. Note that this also includes leading and trailing blanks:

/* length sample */
text = " I'm 97 years old.  "
say length(text)

If run, this script would print 20.

VERIFY is used to check whether a string contains specific characters or not. To accomplish this, you need to specify the string to be checked, a second string which holds the "comparison characters" and additional options. For example, you could check if a string is a valid phone number - that's to say it shall only contain the digits 0 through 9, blanks, dash and plus sign (for international dial prefix). The comparison string thus would look like

"0123456789 -+"

Now VERIFY can be used either to check if the string contains ONLY these characters or to check whether it contains NONE of them. Actually, VERIFY will return the position of the first character in the test string which does or doesn't match any of the characters in the comparison string. A little confusing at first, hm? Here we go:

/* VERIFY sample */
matchstring = "0123456789 -+"
PARSE PULL phone
if VERIFY(phone, matchstring, "NOMATCH") = 0 then
   say "The phone number is okay."
else
   say "Phone number contains invalid characters."

In the above example, if you entered +44-123 456 / 789, VERIFY would return 13 because character number 13 of the test string ("/") is not part of the comparison string. Thus, VERIFY used with the NOMATCH parameter will tell you which character does NOT MATCH the comparison string characters. If a 0 (zero) is returned, this means that there is no character that doesn't match, so the number is "okay", we might say.
If "MATCH" is used instead, VERIFY will tell you the position of the first character in the test string which MATCHES one of those in the comparison string. Which one to use depends on what's easier to code, but most of the time you might prefer the NOMATCH parameter because it makes the program logic easier to read and understand.
Some notes about VERIFY: The full syntax is:

result = VERIFY ( <test string>, <comparison> [, "MATCH" | "NOMATCH" ] [, START] )

Only the first letter of the "MATCH" or "NOMATCH" parameter is required and can be either upper or lower case. And there's an additional parameter, START, which gives the position in the test string from which the comparison starts. By default, comparison starts with the first character of the test string, but depending on how your test string is constructed, you might want to skip a certain number of leading characters. If your program uses a specific concatenation of strings for your address book, for example, this might result in something like "Doe,John/555-6780". In order to test if the number is valid, you should tell VERIFY to start at position 10 by coding

if VERIFY(phone, matchstring, "NOMATCH",10) = 0 then

or by using the abbreviated version

if VERIFY(phone, matchstring, "N", 10) = 0 then

As "Nomatch" is the default for the comparison type, you could even just omit it. In this case, if you want to supply the optional START parameter, you'll still have to write the additional comma so that the parser understands that you omitted the comparison type parameter:

if VERIFY(phone, matchstring, ,10) = 0 then

In case you just want to check right from the start using "Nomatch", you could omit the whole rest and type

if VERIFY(phone, matchstring) = 0 then
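
The MATCH option has not been shown in action so far. As a small sketch (the variable names and the sample text are made up), it can be used to find the position of the first digit in a string:

/* VERIFY "MATCH" sample (names and text are illustrative) */
text = "Room 12b"
firstdigit = VERIFY(text, "0123456789", "MATCH")
if firstdigit = 0 then
   say "No digit found."
else
   say "First digit at position" firstdigit   /* displays First digit at position 6 */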

For the example of John Doe you might wonder how to tell the position for START if the name changes. Good point. We'll use another function that'll be discussed in a few moments. But for now, let's have a look at the last informational string function.

WORDS is a very useful function. The whole concept of "words" in strings feels like heaven if you're used to programming in BASIC, for example. A WORD is a part of a string which is delimited by spaces or by the beginning or end of the entire string. Amongst the WORD-functions, WORDS is quite simple: It returns the number of words found in a string:

/* words example */
text = "This is a words() sample. "
say words(text)

WORDS would recognize the following substrings:  This   is   a   words()   sample.
Thus, WORDS in the above example would return a value of 5.
In order to explain what WORDS are about, let me put it this way: If YOU look at a string, there are words that you recognize, right? The WORDS() function works in almost exactly the same way. As long as there is at least one space between two strings, they will be recognized as two words. It doesn't matter if there are - let's say - twenty spaces between them. It's still two words. One exception is that you might not recognize a single full stop as a word, but WORDS() does if the dot is separated from the rest by spaces like in
"There   were  57 channels and   nothing  on  . " (note the separated "." at the end). WORDS() would recognize 8 words.

Substrings and search functions

Among string functions, these are the ones used most - at least for me. Let's start with a very basic one:
POS is used to find the starting position of a string within another string. IBM uses the "needle and haystack" metaphor to explain the syntax - that's quite a good way of memorizing the order of the parameters:

result = POS( <needle>, <haystack> [, START])
POS searches the <haystack> string for the first occurrence of <needle>. It either returns the starting position (the character number, starting with 1 for the first character) or ZERO if the <needle> wasn't found in the <haystack>. Optionally, you can tell POS not to start its search from the first character of <haystack> but from a different position. This is useful for identifying special substrings - although the WORD-functions described later do a much better job here.
As an example, imagine that you have a string named "record" containing contact data such as
"firstname=Peter, lastname=Jones, phone=555-12345"

and you want to retrieve the phone number. Assuming that 'phone number' is always the last entry of the contact data string, you would go like this:

phonestart = POS("phone=", record)

If phonestart contains something other than zero, it means that the string was found. Next, you skip 6 characters (the length of 'phone=') and you know the starting position of the actual number. Then you would determine the length of the number from the entire length of 'record' in order to retrieve the number from the string. But this requires an additional function (SUBSTR) described later.
Personally, I use this function most of the time to check whether a string actually is contained within another string or not - regardless of where it actually is like in:

IF POS("/?", parameterstring) \= 0 then call DisplayHelp

...which means "if 'parameterstring' contains a '/?' then call a certain function (used to display the command syntax)"
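
POS also answers the question left open in the VERIFY discussion above, namely how to find the START position when the name changes. A minimal sketch, assuming the "name/number" layout of the address book example and reusing its comparison string:

/* POS + VERIFY sample (entry layout taken from the address book example) */
matchstring = "0123456789 -+"
entry = "Doe,John/555-6780"
slash = POS("/", entry)              /* 9 for this entry */
if slash = 0 then
   say "No phone number in entry."
else if VERIFY(entry, matchstring, "N", slash + 1) = 0 then
   say "The phone number is okay."
else
   say "Phone number contains invalid characters."

For the entry shown, this displays "The phone number is okay." because every character after the slash is contained in the comparison string.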

LASTPOS does much the same, except that it searches the <haystack> backwards. It uses the same option (START) and the same kind of return value. LASTPOS is the convenient way to make sure you find the LAST occurrence of <needle> within the haystack. Of course, you could achieve the same by using a loop of POS calls, each one starting just after the previously found position, but hey: Why worry?
Personally I use LASTPOS mostly when dealing with file names that include drive and path information (so-called "fully qualified file names"). Once I know the position of the last "backslash" character, I know that everything else "behind" it must be the actual file name and - vice versa - the preceding part is drive and path. Yes, I could use the FILESPEC() function as well, but depending on the program needs, sometimes you might need to refer to such data...
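
A short sketch of that file name technique follows. The path is made up, and LEFT and SUBSTR are only explained further down in this article:

/* LASTPOS sample (the path is hypothetical) */
fullname = "C:\OS2\APPS\readme.txt"
slash = LASTPOS("\", fullname)            /* position of the last backslash: 12 */
say "Path:" LEFT(fullname, slash)         /* displays Path: C:\OS2\APPS\ */
say "File:" SUBSTR(fullname, slash + 1)   /* displays File: readme.txt */

If the name contains no backslash at all, LASTPOS returns 0, so the path part is empty and the file part is the whole string.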

The WORD-functions (word, wordpos, wordindex, wordlength and subword) are extremely useful when dealing with parts of strings that are separated by one or more blanks. If you ever tried to identify such parts "by hand" like in vintage BASIC dialects or other programming languages that lack such functions, you might agree that REXX feels like "programmer's heaven" ;)
As an example for the following set of functions, let's assume that you have a string named "input" containing an unknown number of parts (or "words") separated by an unknown number of blanks... for example "Mary  has    5  little lambs."

WORDS (as already discussed above) will tell you the amount of "parts" (or "words") that are contained in the string.

SAY WORDS(input)

would display 5
WORD is used to retrieve a single word from a string and "cleaning" it by removing both leading and trailing blanks. In order to achieve this, you must tell WORD which word to retrieve by specifying a "word number" (1 for the first, 2 for the second and so on...). Thus

SAY WORD(input, 2)

would display has.
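
Combined with WORDS, this makes it easy to walk through a string word by word. A minimal sketch using the sample string from above:

/* word loop sample */
input = "Mary  has    5  little lambs."
do n = 1 to WORDS(input)
   say n":" WORD(input, n)
end

This displays five lines, from "1: Mary" to "5: lambs.", each word without its surrounding blanks.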

WORDPOS works just like POS described above - except, that it doesn't deal with character positions but words: It searches <haystack> for the first occurrence of <needle> and returns the number of the word that matches <needle>. Just like POS, an optional parameter can be used to make WORDPOS start from a "later" position than the first word. Again, in WORDPOS this refers to a word number.
The syntax is

result = WORDPOS(<needle>, <haystack> [ ,START ] )

Note that <needle> and <haystack> must match exactly for WORDPOS to function correctly - that means, the case of characters must match as well.

SAY WORDPOS("HAS", input)

would give you 0
because "HAS" is not equal to "has", while

SAY WORDPOS("has", input)

would result in 2

Another fact worth mentioning is that you can use more than one "word" for the <needle>. In this case, WORDPOS treats the <needle> contents the same way as all WORD-functions treat the <haystack>: The contents are internally parsed into words. Thus

SAY WORDPOS("has 5", input)

will display 2
as well, although 'input' contains "Mary  has    5  little lambs." (which shows 4 spaces between 'has' and '5') while the '<needle>' uses only 1 space. By internally parsing both needle and haystack into separate words, the match applies...
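
If the case of the characters should not matter, one possible approach is to convert both strings to upper case before searching. The sketch below uses TRANSLATE for this, which is described at the end of this article; the variable "needle" is made up:

/* case-insensitive WORDPOS sample */
input = "Mary  has    5  little lambs."
needle = "HAS"
say WORDPOS(TRANSLATE(needle), TRANSLATE(input))   /* displays 2 */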

WORDINDEX is used to get the starting position of a certain word within the entire string - that's to say including all leading characters, even if they are blanks.

SAY WORDINDEX(input, 2)

would display 7

The SUBSTR function returns you a part of a string, specified by starting character number and length.

SAY SUBSTR(input, 3, 17)

for example will return you the part of 'input' that starts with the 3rd character and is 17 characters long - which results in ry  has    5  lit
being displayed. In case you're familiar with BASIC's "MID$"-function, note that SUBSTR cannot be used to set/change parts of a string, but only to retrieve them. Optionally, SUBSTR can be told to fill up "non-existent parts" of the substring with a specified character. "Non-existent" in this case refers to a substring that extends beyond the end of the actual string. Example? If your program retrieves characters number 3, 4 and 5 of a string and you accidentally pass it a string of only 3 characters, you won't get an error message. Instead, you will receive character number 3 along with two spaces - because by default (if no explicit padding character was specified) blanks are used. If you use the optional "padding" character, you'll get character number 3 and two padding characters returned:

SAY SUBSTR(input, 25, 10)

would display "ambs.     "
(without the quotes - they're only used by me to show the trailing blanks)

SAY SUBSTR(input, 25, 10, "-")

would display "ambs.-----"
A handy feature of SUBSTR is that if you don't specify a length operand, it'll return you the entire rest of the string starting from the specified position:

SAY SUBSTR(input, 17)

would give you little lambs.

The two string functions that I use in almost each program are left and right. They're used to retrieve a substring in a given length of characters from another string. This can be achieved by either starting from the right or the left boundary of the string - according to how the function is called respectively:

SAY LEFT(input, 7)
will display "Mary  h" whereas
SAY RIGHT(input, 7)
will display " lambs." respectively - both (again) without the quotes of course.
Just like SUBSTR, both left and right will use spaces for padding of non-existent parts (beyond the start/end of the string) if you don't explicitly specify another character for padding like in
SAY LEFT("abcdefg",10,"-")
This would display abcdefg---

SUBWORD acts in a similar way to SUBSTR. Besides the fact that it deals with words instead of characters, there are quite some more differences though: There is no padding for "exceeded" parts like in SUBSTR. Remember that input contains 5 "words". If we try to retrieve words 4, 5 and 6 from input by

SAY SUBWORD(input, 4, 3)
it would simply give us "little lambs."
(again, without the quotes - I just use them here to show that the returned value does not contain trailing blanks...)
Another fact worth mentioning is, that SUBWORD returns the separation blanks exactly the way they're contained in the original string - that's to say, there is no internal parse that removes additional separators:
SAY SUBWORD(input, 1, 3)
thus will display "Mary  has    5"
Just like with SUBSTR the entire rest of the string is passed if no length (amount of words) was specified.

WORDLENGTH finally tells you how many characters a word in a string is made up of:

SAY WORDLENGTH(input, 3)
would display 1
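
To show how the WORD-functions relate to the character-based ones, here is a small sketch that rebuilds the result of WORD from WORDINDEX, WORDLENGTH and SUBSTR. The variable names are made up:

/* WORDINDEX + WORDLENGTH sample */
input = "Mary  has    5  little lambs."
n = 4
start = WORDINDEX(input, n)       /* character position of word 4 */
len   = WORDLENGTH(input, n)      /* number of characters in word 4 */
say start len                     /* displays 17 6 */
say SUBSTR(input, start, len)     /* displays little, same as WORD(input, 4) */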

Creating or transforming strings

Besides creating strings from subparts of other strings, there are of course more ways to do so. Considering string "transformation" I must admit that we actually don't really "transform" strings but rather create new ones from existing ones. Sometimes, we might re-assign them directly back to the source string variable like in
mystring = left(mystring, 5)
but basically we don't transform a string. But this is not so important right now - let's get on with the article.

COPIES creates a string by concatenating multiple copies of a specified string:

SAY COPIES("bla", 3)
for example would display blablabla
Great. COPIES is quite useful for example when you might want to do separator lines in VIO mode that have to be of a specific length:
SAY COPIES("-", 18)
will give you ------------------

XRANGE is useful, for example, when preparing for character translation. As you might know, each character has an "index number" within the character table. We call that the "ASCII table". XRANGE makes use of these numbers and creates a string that consists of a consecutive run of characters (according to the table sequence) from the start character up to and including the end character:

myalphabet = XRANGE("a", "z")
SAY myalphabet
will display abcdefghijklmnopqrstuvwxyz
Note that the ASCII table contains 256 entries (from #0 to #255). If you want to display the whole table, you'll have to use hex notation because both #0 and #255 contain non-printable (thus non-"enterable") characters:
SAY XRANGE("00"x, "FF"x)
will display the entire ASCII table contents (as far as the entries are printable characters...).
Note as well, that if the end value is smaller than the start value (according to the table sequence), XRANGE will start with the start value, display every entry to 255, then restart with 0 and display every entry up to the end value. Thus, you won't get the "reverse" range but rather something you did not expect.
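
Because most of the characters involved are not printable, the wrap-around is easier to see in hex notation. The following sketch uses C2X, a conversion function that is not covered in this article, to display the character codes instead of the characters:

/* XRANGE wrap-around sample */
say C2X(XRANGE("FE"x, "02"x))      /* displays FEFF000102 */
say LENGTH(XRANGE("00"x, "FF"x))   /* displays 256 */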

We already know STRIP from a previous example: It removes leading and/or trailing characters from a string. Or, as I said above, it rather creates a new string from which those characters have been removed. By default, it removes spaces, but it can be used for other characters as well. Optionally, you can also specify which side to strip (leading, trailing or both). The default is both.

SAY STRIP("  Mary.  ")
will return (display) Mary.
This is because none of the optional parameters was specified which defaults to "remove leading and trailing spaces".
SAY STRIP("0012.850", "L", "0")
will give you 12.850 while
SAY STRIP("0012.850", , "0")
will display 12.85
Note that again we need to specify the comma in order to make sure that REXX's parser understands that we omitted the first optional parameter and that "0" is the character we want to remove. Writing
SAY STRIP("0012.850", "0")
would result in an error, because "0" will be interpreted as the leading/trailing parameter, for which only "L", "T" or "B" are valid.
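
For completeness, a short sketch of the remaining variants. The brackets are only there to make the blanks visible:

/* STRIP "T" and "L" sample */
say STRIP("0012.850", "T", "0")             /* displays 0012.85 */
say "[" || STRIP("   Mary.  ", "L") || "]"  /* displays [Mary.  ] */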

INSERT appears to be quite complex at first sight. The full syntax diagram is

result = INSERT ( <what> , <into> [, START ] [, LENGTH ] [, PAD ] )
Basically, it inserts a string into another string by using a specified character position:
SAY INSERT("123", "abcde", 3)
will display abc123de
Note that the START parameter defaults to ZERO which means, that <what> will be put "in front" of <into>:
SAY INSERT("123", "abcde")
will display 123abcde
As long as you're happy with the defaults, there's nothing to worry about. If you wish to have some more features, you'll need to know what LENGTH and PAD will do to the function's behavior... LENGTH is used to fill up the <what>-string to a given length before inserting it. By default, spaces will be used for filling (or "padding")...
SAY INSERT("123", "abcde", 3, 5)
will display abc123  de
However, if you specified a padding character in PAD, the <what> string will be filled with that character instead of spaces like in:
SAY INSERT("123", "abcde", 3, 5, "#")
will display abc123##de

DELSTR removes a substring from another string. It uses much the same parameters as SUBSTR - the starting character position and length:

SAY DELSTR("abcde", 3, 2)
would display abe
Again, just like SUBSTR, a missing length operand equals "all the rest":
SAY DELSTR("abcde", 3)
would thus display ab

DELWORD is the equivalent counterpart to DELSTR when dealing with words. We'll return to Mary and her lambs to show how it works:

SAY DELWORD(input, 2, 2)
will remove two words from input, starting with word number two: Mary  little lambs. (note the two blanks after "Mary", which are kept)
DELWORD does not internally parse the contents - this means that the spaces in front of the first word to remove are not touched. Only the words themselves, the spaces between them and all spaces "behind" the last word to remove are cut out. To show this in detail:
SAY DELWORD("abc   def  ghi   jkl", 2, 2)
will result in abc   jkl
Why is there a 3-blanks space between the words? Because "def  ghi" are the two words to remove, plus the three trailing blanks of "ghi" - thus, it's "def  ghi   " that's cut from the string. The remaining parts are "abc   " in front of it and "jkl" behind it. Glue them together and you'll have "abc   jkl".

CENTER is some kind of special "flavor" of INSERT. It'll center a string within a new empty string of a given length and can be told to fill up the boundary parts with a special character. This is great for doing headlines in VIO mode for example:

SAY CENTER("Mary's lambs", 20, "-")
will display ----Mary's lambs----
Funny thing. Here's a little program to show you some possible use of COPIES and CENTER:
/* the mary sample */
lambname.0 = 5
lambname.1 = "itchy"
lambname.2 = "sparky"
lambname.3 = "joey"
lambname.4 = "samantha"
lambname.5 = "lou"
say center("Mary's lambs list", 30, "-")
do i = 1 to lambname.0
     say center(lambname.i, 30)
end
say copies("-", 30)

this nifty little program will display...:
------Mary's lambs list-------
            itchy
            sparky
             joey
           samantha
             lou
------------------------------
Great, huh?
You might wonder what happens if you tell it to center a string into something smaller than the string itself right? Naah, no errors - just truncations:
SAY CENTER("This is not funny!", 7)
will display "is not " (seven characters including the trailing blank; the quotes are only there to show it)
Whenever the padding or truncation cannot be split evenly (so a "balanced" centering with equal boundaries is not feasible), the right boundary gains or loses the extra character in order to make the string fit the specified length.
Another fact worth mentioning is that this function can be called as "center" as well as "centre". Now, this is what I call "IBM quality".
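
The uneven cases can be tried out with a small sketch like the following. The brackets are only used to make the boundaries visible:

/* uneven CENTER sample */
say "[" || CENTER("abc", 6, "-") || "]"   /* displays [-abc--] : the right side gains the extra character */
say "[" || CENTRE("abcdef", 3) || "]"     /* displays [bcd]    : the right side loses the extra character */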

REVERSE is nothing tremendously abstract: It simply gives you the reverse notation of the string passed. If you ever wanted to know what your first name is looking "the other way round", give it a try with:

SAY REVERSE("thomas")
You must replace 'thomas' with your first name in order to make the program function correctly. ;) Unless, of course, your first name is Thomas too.

SPACE is great when dealing with words. It can be used to make words spaced with the same amount of characters. Did you ever try to first get the "words" out of a string, then put them together, separated by one space each? This can be done with a single command in REXX:

SAY SPACE(input)
will display Mary has 5 little lambs.
Of course, we did again let the defaults save us work. Actually, we would have to write
SAY SPACE(input, 1, " ")
to make it understand that we want the words to be separated by 1 blank each. Why not separate them by two underscores each:
SAY SPACE(input, 2, "_")
would display Mary__has__5__little__lambs.
As you might have understood already, SPACE uses internal parsing of course to "get the words right". This is a great function for "normalizing" user input or data from other programs if you need it to be in a special way... note that if you use 0 as the amount of separation, all blanks will be removed from the string:
SAY SPACE(input, 0)
will thus display Maryhas5littlelambs.

I must admit that until today, I didn't ever mess with OVERLAY. After looking into what it's used for... well, I might mess with it in the future.
What overlay actually does is working like INSERT (it even uses the same syntax and parameter list) except that - let me put it that way - it uses "overwrite mode" instead of "insert mode" while typing its text... know what I mean?

SAY OVERLAY("=XYZ=", "01234567890", 4)
will display 012=XYZ=890
If you make use of the optional parameters then let's look at the syntax scheme first:
result = OVERLAY ( <what> , <into> [, START ] [, LENGTH ] [, PAD ] )
If you specify a length parameter, <what> will be padded with PAD characters to the specified length. The default for PAD is blanks. Thus,
SAY OVERLAY("=XYZ=", "01234567890", 4, 6)
will display 012=XYZ= 90
The default value for START is 1, which means that <into> will be overwritten right from the first character.
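
To compare "overwrite mode" with "insert mode" directly, and to show the PAD parameter in use, here is a minimal sketch based on the same sample strings:

/* OVERLAY versus INSERT sample */
say OVERLAY("=XYZ=", "01234567890", 4)          /* displays 012=XYZ=890 */
say INSERT("=XYZ=", "01234567890", 3)           /* displays 012=XYZ=34567890 */
say OVERLAY("=XYZ=", "01234567890", 4, 7, "*")  /* displays 012=XYZ=**0 */

Note that position 4 for OVERLAY and position 3 for INSERT put the new text at the same place: OVERLAY starts writing AT the given character, whereas INSERT adds its text AFTER it.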

Finally, TRANSLATE is another cool function for messing with strings. With TRANSLATE, you set up two tables of characters which are used to replace the characters of a string. For each character in the string, TRANSLATE looks it up in the "input" table, then replaces it with the corresponding character in the "output" table. Both tables are just strings with characters, where the correspondence is derived from the character position within the string. That's to say that character #1 in the input table corresponds to character #1 in the output table and so on...
The syntax scheme looks like this:

output = TRANSLATE( <input> [, <output-table>] [, <input-table>] [, PAD] )
The PAD character will be used to fill up the <output-table> if its size is smaller than that of <input-table>. If no PAD is specified, blanks are used by default. This is to ensure that there is a match in <output-table> for each entry of <input-table>. You might wonder what happens if all optional parameters are omitted. What happens to <input> then? Quite simply: It will be translated to upper case only:
SAY TRANSLATE("hello")
thus displays HELLO
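
There is no such default for the opposite direction, but the two tables can be built with XRANGE. A minimal sketch, assuming an ASCII system where the letters A to Z and a to z each form a consecutive run; the variable names are made up:

/* lower-case sample (assumes ASCII, names are illustrative) */
upper = XRANGE("A", "Z")
lower = XRANGE("a", "z")
say TRANSLATE("Hello World", lower, upper)   /* displays hello world */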
Characters of <input> which are not found in <input-table> will be left as they are. This is a good way of getting rid of special characters: Simply replace them with spaces, then remove all spaces from the string using SPACE. For example, you might want to remove all vowels from a string:
outstring = TRANSLATE("hello", copies(" ", 5), "aeiou")
This will set up an <input-table> which contains all vowels and an <output-table> which contains five spaces. Thus, each vowel found will be replaced with a space. This will give us "h ll " as the output string. Next, we'll use the SPACE function with a separation amount of zero:
SAY SPACE(outstring, 0)
which would display hll
Or to put it in one line of code:
SAY SPACE(TRANSLATE("hello", copies(" ", 5), "aeiou"), 0)
This might not be the perfect example. Just imagine that you have a multi-line text entry field and want to count the lines in it. You only need to replace everything with spaces except for "0D"x (which is "CR", "carriage return"; OS/2 text lines end with CR followed by LF, "0A"x), then strip all spaces off with SPACE() and get the length of the remaining string. You're done. This method can even be used for counting lines in a text file, once you have read the entire file into a variable. I didn't believe how d**n fast this is compared to what I had coded so far... until I tried it on my own. Want to give it a try? Make sure your text files are not too large (I tried with up to approx. 52K).
/* line count sample */
fname="c:\temp\testfile.txt"  /* change to your needs */
tablein = xrange("00"x, "FF"x) /* entire ascii charset, 256 bytes */
tableout = copies(" ", 13) || "0D"x || copies(" ", 242) /* 13 spaces + CR + 242 spaces = 256 bytes */
fchars=charin(fname, 1, chars(fname)) /* read whole file into variable */
result = TRANSLATE(fchars, tableout, tablein) /* leave CRs only */
say length(space(result, 0)) + 1 /* number of CRs + 1, assuming the last line has no line end */

Another simple example is to make use of TRANSLATE and REVERSE to do a basic text "encryption":
/* text crypt sample */
/* warning: this is a stupid way of encryption and not safe! */
intab = xrange("a", "z")
outab = reverse(intab)
enc = translate("thomas", outab, intab)
say enc
dec = translate(enc, intab, outab)
say dec
What it does is use the lower-case alphabet as the input table and the reverse of it as the output table.
The decryption is done by using the reverse order of tables.
Have fun with it.

The next part will take us on a short tour about the most interesting (and useful) "helper" functions of REXX. Thanks for your patience; Stay tuned!

