Topic: APLX Help : Help on APL language : System Functions & Variables : ⎕SS String Search/Replace
[ Previous | Next | Contents | Index | APL Home ]

www.microapl.co.uk

⎕SS String Search/ReplaceUsing modifier flags to specify how the search should be carried outTechnical considerations


The system function ⎕SS finds one or more text patterns in a character vector, and optionally replaces them with different sub-strings. It can either do a simple search, matching characters one-by-one, or search for regular expressions, which allow you to specify sophisticated pattern-matching rules in a compact form.

The right argument is a nested vector of either two or three elements. The first element is the character vector in which you want to search (we refer to this as the string). The second element contains the pattern or patterns you want to search for. If you are searching for just one pattern, it is a character vector. If you are searching for more than one pattern, it is a vector of character vectors. (Character scalars are treated as length-one character vectors). The optional third element contains replacement sub-strings for search-and-replace operations. The optional left argument determines the type of search.

Monadic form: Simple search

If the right argument is of length two, then ⎕SS returns information about the occurrences of the pattern (or patterns) in the string. The exact information returned depends on the type of search you are doing. In its simplest form, looking for a single pattern in the string, it returns a vector of the positions where the pattern was found in the string:

      TEXT←'Potatoes are bad for you, very bad.'
      ⎕SS TEXT 'bad'
14 32

The pattern 'bad' has been found at character positions 14 and 32. Note: The result returned takes account of the index origin, ⎕IO. All the examples shown assume an index origin of 1.

If you supply more than one pattern, then the simple search returns an N by 2 integer matrix, where each row corresponds to one match. The first column is the position where the match was found. The second column is the pattern number which was found. (Both values take account of the index origin, ⎕IO) For example:

      ⎕SS TEXT ('bad' 'you')
14 1
22 2
32 1

This means that three matches were found. The first match was at position 14, where pattern 1 ('bad') was found. The second match was at position 22, where the second pattern ('you') was found. The third match was at position 32, where pattern 1 was found again.

When used in this monadic form, ⎕SS searches for a match, at a given position, by looking at each of the patterns in turn (in the order in which they appear in the list). When it finds a match, it does not consider any more patterns at that position. Instead, it increments the position by one character, and again looks at each pattern. You can modify the way in which this works by supplying a left argument with modifier flags - see below.

Monadic form: Simple search-and-replace

If the right argument is of length three, then ⎕SS carries out a string-search-and-replace operation, returning the original string but with each occurrence of one of the supplied patterns replaced by a new sub-string specified in the third element. Like the second element, the third element can either be a character vector (in which case all occurrences of any of the patterns are replaced by the same sub-string), or it can be a vector of character vectors. In the latter case, it should be of the same length as the second element, and it specifies a separate replacement sub-string for each of the patterns. For example:

       ⎕SS TEXT 'bad' 'good'
Potatoes are good for you, very good.
       ⎕SS TEXT ('bad' 'you') ('good' 'me')
Potatoes are good for me, very good.
       ⎕SS TEXT ('bad' 'you') '***'
Potatoes are *** for ***, very ***.

Subject to available workspace, the replacement sub-strings can be of any length. A zero-length replacement has the effect of deleting all occurrences of the matched pattern from the returned string. If the replacement sub-strings are longer than the patterns they replace, the returned string will be longer than the original string.

It is important to order the patterns and sub-strings correctly. When doing a search and replace, ⎕SS does a replacement as soon as it finds a match for any of the patterns, considered in the order in which they appear in the list of patterns. After the replacement, the string pointer is incremented past the replacement, and the search starts again.

Dyadic form

The optional left argument is either a single integer, or a pair of integers. The first element determines the type of search, as follows:

Value Search type Result (single search) Result (multiple search) Result (replace)
0 Simple search Integer vector of match positions Nx2 Integer matrix of match position, pattern number Character vector
1 Regular-expression search Nx2 Integer matrix of match position, match length Nx3 Integer matrix of match position, match length, pattern number Character vector
2 Regular-expression extract Nested vector of vectors of sub-strings Nested vector of vectors of sub-strings Not applicable

The second element is an integer which is the sum of a set of flags which modify how the search is carried out, as follows:

Flag Effect Applies to
1 Case-insensitive search All searches
2 Stop after the first match position has been found All searches
4 After finding a match, advance by 1, rather than the length of the match Any search, but no effect on search-and-replace
8 When doing a multiple search, return all matches at all positions, not just the first pattern that matches Any search, but no effect on search-and-replace
16 Treat string as multi-line, so ^ matches start of each line. rather than start of whole string only Regular-expressions only
32 Cause a '.' in the pattern to match new-line delimiters Regular-expressions only
64 Treat only CR as newline Regular-expressions only
128 Treat only LF as newlines Regular-expressions only
256 Treat only CR-LF pairs as newlines Regular-expressions only

For example, if the second element is 5, then the search is case-insensitive and returns all matches at each position, not just the first. These options are described in detail below.

If you omit the left argument, ⎕SS carries out a simple, case-sensitive search which advances the search position by 1 after each match. This is equivalent to a left argument of 0 4.

Specifying the search type using the dyadic form of ⎕SS

Search type 0: Simple search

If the first element of the left argument is zero, ⎕SS carries out a simple search, similar to that of the monadic case, except that you have control over the exact way the search works, by setting options in the second element of the left argument as described below.

Search type 1: Regular-expression search

If the first element of the left argument is 1, ⎕SS carries out the search by treating the pattern (or patterns) as regular expressions. Because regular expressions can match against arbitrary-length sequences in the string, ⎕SS returns the length of each match as well as its position. If you supply only one pattern, the result of a search is an N by 2 matrix, where the first column in the position and the second is the length of the match. For example:

    1  ⎕SS 'A pixel color (or colour)' 'colou?r'
 9 5
19 6

In this example, the '?' meta-character in the regular-expression makes the 'u' optional, so ⎕SS has found a match at position 9 of length 5 ('color'), and a match at position 19 of length 6 ('colour'). (See the separate page on regular expression syntax for details). Contrast this with the simple search, which for a single search pattern returns a vector.

You can also supply multiple regular expressions. In this case, ⎕SS returns an N by 3 matrix, with the pattern number as the third column:

    1  ⎕SS 'A pixel color (or colour), such as light grey' ('colou?r' 'gr[ea]y')
 9 5 1
19 6 1
42 4 2

In this case we have searched for two separate regular-expressions, one which can match 'color' or 'colour', and one which can match 'gray' or 'grey'. The first was found at positions 9 and 19, and the second at position 42.

Note that this is different from supplying a single regular expression which itself can match either set of alternatives:

    1  ⎕SS 'A pixel color (or colour), such as light grey' 'colou?r|gr[ea]y'
 9 5
19 6
42 4

Replacing matched patterns using regular expressions

As with simple searches, you can also do search-and-replace using regular expressions. As soon as a match is found against any of the regular-expression patterns supplied, it is replaced by the corresponding sub-string supplied in the third element of the right argument to ⎕SS. For example, the regular-expression '<[^>]+>' can be used to match against HTML tags (it finds any pattern starting with a left angle bracket, followed by any number of characters EXCEPT a right angle bracket, and then followed by a right angle bracket):

     1 ⎕SS '<P>This is <B>bold</B></P>' '<[^>]+>'
 1 3
12 3
19 4
23 4

The pattern has matched in four places. By providing a replacement string, we can replace any HTML tags found with our own text, or strip them out completely by replacing with an empty vector:

     1 ⎕SS '<P>This is <B>bold</B></P>' '<[^>]+>' '∆'
∆This is ∆bold∆∆
     1 ⎕SS '<P>This is <B>bold</B></P>' '<[^>]+>' ''
This is bold

When replacing regular expressions, you can also extract sub-strings which matched the pattern, and use those in your replacement pattern. Wherever the replacement contains a backslash character '\' followed by a digit 0 to 9, the text which gets replaced is extracted from the matched text in the original string. '\0' always represents the whole of the original part of the string which matched the pattern:

     1 ⎕SS '<P>This is <B>bold</B></P>' '<[^>]+>' '[HTML: \0]'
[HTML: <P>]This is [HTML: <B>]bold[HTML: </B>][HTML: </P>]

'\1' to '\9' are replaced by sub-strings in extracted from the matched patterns. These sub-strings are delimited (in the regular-expression pattern) by being placed in parentheses, and the corresponding part of the matched string is used to replace them. Thus, if '\1' appears in the replacement sub-string, it is replaced by that part of the original string which matched against the first parenthesized part of the pattern:

     1 ⎕SS '<P>This is <B>bold</B></P>' '<([^>]+)>' '[HTML: \1]'
[HTML: P]This is [HTML: B]bold[HTML: /B][HTML: /P]

If there is no such sub-string (for example, if we used '\2' in the above replacement expression), the inserted sub-string would be empty. If you want to include a literal '\' in the replacement string, you need to double it up as '\\'.

Search type 2: Extracting sub-strings from matched regular expressions

If you supply 2 as the search type, ⎕SS again does a regular-expression search. But instead of returning the positions of matches, it returns the text of the matches. This can be used to rapidly extract all the strings which match a certain pattern. For example, suppose you want to extract all valid e-mail addresses from a string, but you don't care where they appear. For example, this regular expression can be used to match against most e-mail addresses, if used in an case-insensitive search: '\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b'. It matches against sequences which begin at a word boundary ('\b'), and comprise a series of letters, digits and other characters followed by an '@' sign, then have a series of words separated by periods, and end with a two to four letter top-level domain name and a word boundary. If we use it in a case-insensitive search, it will find what we want:

      MAIL←'Try e-mailing bill.gates@microsoft.com or, better, jim@microapl.co.uk'
      1 1  ⎕SS MAIL '\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b'
15 24
52 18

The search has found the two e-mail addresses, one (of length 24) at position 15, and the other (of length 18) at position 52.

By using search-type 2, we can extract the actual strings immediately:

      2 1  ⎕SS MAIL '\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b'
  bill.gates@microsoft.com   jim@microapl.co.uk

What gets returned is a nested vector, with one element per match. Each element in turn is a nested vector of character vectors. The first is the whole matched pattern, i.e. the equivalent of '\0'. There then follows, for each match, the extracted sub-strings for any parenthesized sub-patterns in the regular expression, i.e. the equivalent of '\1', '\2' etc.

In our example above, there are no parenthesized sub-patterns, so we only get one character vector for each match:

    ⎕DISPLAY 2 1  ⎕SS MAIL '\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b'
┌→────────────────────────────────────────────────────────┐
│ ┌→───────────────────────────┐ ┌→─────────────────────┐ │
│ │ ┌→───────────────────────┐ │ │ ┌→─────────────────┐ │ │
│ │ │bill.gates@microsoft.com│ │ │ │jim@microapl.co.uk│ │ │
│ │ └────────────────────────┘ │ │ └──────────────────┘ │ │
│ └∊───────────────────────────┘ └∊─────────────────────┘ │
└∊────────────────────────────────────────────────────────┘

(Caution: There is one more level of nesting than you might expect here.) So far, this does not do anything more than you could do by extracting the matches yourself from the match positions and lengths. But if you use parentheses in the pattern, you can use this mechanism to extract specific parts of the matches. For example, if we put parentheses around the parts of the pattern before and after the '@' sign, we can easily distinguish the name part from the server/domain part of the address. Each extracted match will comprise a length three-vector:

    ⎕DISPLAY 1⊃¨2 1  ⎕SS MAIL '\b([A-Z0-9._%+-]+)@([A-Z0-9.-]+\.[A-Z]{2,4})\b'
┌→────────────────────────────────────────────────┐
│ ┌→───────────────────────┐ ┌→─────────────────┐ │
│ │bill.gates@microsoft.com│ │jim@microapl.co.uk│ │
│ └────────────────────────┘ └──────────────────┘ │
└∊────────────────────────────────────────────────┘
    ⎕DISPLAY 2⊃¨2 1  ⎕SS MAIL '\b([A-Z0-9._%+-]+)@([A-Z0-9.-]+\.[A-Z]{2,4})\b'
┌→───────────────────┐
│ ┌→─────────┐ ┌→──┐ │
│ │bill.gates│ │jim│ │
│ └──────────┘ └───┘ │
└∊───────────────────┘
    ⎕DISPLAY 3⊃¨2 1  ⎕SS MAIL '\b([A-Z0-9._%+-]+)@([A-Z0-9.-]+\.[A-Z]{2,4})\b'
┌→─────────────────────────────────┐
│ ┌→────────────┐ ┌→─────────────┐ │
│ │microsoft.com│ │microapl.co.uk│ │
│ └─────────────┘ └──────────────┘ │
└∊─────────────────────────────────┘

You can also supply multiple regular-expression patterns when using this form. In this case, the result is of the same form; for each match, you get the extracted sub-string or sub-strings for the pattern which matched.


You can modify the way the search is done by providing the second element of the left argument as the sum of various modifier flags. These work as follows:

Case-insensitive search (modifier 1): By default, ⎕SS does a case-sensitive search. If you supply a modifier flag of 1 (for either a simple or regular-expression search), the search will be case-insensitive. This applies both to the ordinary unaccented letters (A to Z and a to z), and to accented letters, so that for example È will match against è

      TEXT←'Potatoes are bad for you, very bad.'
      0 ⎕SS TEXT 'potatoes'        ⍝ Case-sensitive search, empty result

      0 1 ⎕SS TEXT 'potatoes'      ⍝ Case-insensitive search, one match
1

Stop at first match position (modifier 2): This flag causes the search to terminate after the first match position has been found. In a search operation, this normally means that only one match will be returned, but if you have also set flag 8, it is possible that more than one match will be returned because more than one pattern matches at the first hit position. For a replace operation, this flag has the effect that only the first matched pattern is replaced.

      0 ⎕SS TEXT 'bad'
14 32
      0 2 ⎕SS TEXT 'bad'
14
      0 ⎕SS TEXT 'bad' 'good'
Potatoes are good for you, very good.
      0 2 ⎕SS TEXT 'bad' 'good'
Potatoes are good for you, very bad.

After a match, advance the search position by 1 (modifier 4): Without this flag, once it has found a match at a given position, the dyadic form of ⎕SS advances the starting position for the next search to beyond the end of the matched pattern, and starts searching from there, so that any given position in the string can appear within at most one match. If you set modifier flag 4, it will instead advance the search position by 1. This means that it may find further matches which overlay the first match, either from the same pattern or a later pattern in a list of multiple patterns. Consider this example:

      COMPLAINT←'Even my sandwich was sandy.'
      0 0 ⎕SS COMPLAINT ('sand' 'sandy' 'and')
 9 1
22 1

In this example, without the modifier flag 4, the first match it finds is against the pattern 'sand' at position 9. It then increments the search position to point at the 'w' of 'sandwich', and continues searching. It finds 'sand' again at position 22, and then starts searching from the final 'y'. As a result, it does not return a match for 'and' at positions 10 and 23. In addition, it can never return a match for 'sandy', because it will always find a match for 'sand' first at any position which otherwise would match 'sandy'.

By setting modifier 4, you can cause ⎕SS to find further matches of other sub-patterns starting within matches already found, in this case the 'and' within 'sand':

      0 4 ⎕SS COMPLAINT ('sand' 'sandy' 'and')
 9 1
10 3
22 1
23 3

Find all matching patterns everywhere in the string (modifier 8): Normally, once it has found a match at a given position, ⎕SS stops looking for any further matches at the same position against any later patterns in the list. Thus, in the previous example, it still did not regard 'sandy' as a match at position 22. If you set modifier flag 8, it will also see whether any of the other patterns in the supplied list also match at the current position (because they start with the same prefix), and it will then advance the search position by 1. The effect is to find every possible independent match, including both 'sand' and 'sandy' at position 22:

      0 8 ⎕SS COMPLAINT ('sand' 'sandy' 'and')
 9 1
10 3
22 1
22 2
23 3

Note: Modifiers 4 and 8 have no effect if you are doing a search-and-replace operation, because the next search point is always advanced to the character after the replacement. For upwards compatibility with previous versions of APLX and APL.68000, the monadic form of ⎕SS acts as though modifier 4 were set.

Modifiers 4 and 8 can be used with regular-expression searches, but the results may be very strange if you use repeat operations in the pattern. For example, using our e-mail matching pattern:

    2 5  ⎕SS MAIL '\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b'
  bill.gates@microsoft.com   .gates@microsoft.com   gates@microsoft.com   jim@microapl.co.uk

Because the '.' in 'bill.gates' is valid as a word delimiter, this has matched at three separate places within the same e-mail address, which is probably not what you want!

Treat string as multi-line, so ^ matches start of each line (modifier 16): This flag applies only to regular-expression searches. It is relevant when the string being searched contains multiple, independent lines of text delimited by newline characters. If you use a pattern containing the '^' or '$' metacharacters (match start or end of line), this flag will cause the match to occur on every line, not on the whole string. For example, suppose we have a carriage-return delimited list of names. If we don't set this flag, then the string is considered as a single piece of text:

      Composers←'Ludwig Van Beethoven',⎕R,'Richard Wagner',⎕R,'Gustav Mahler'
      Composers
Ludwig Van Beethoven
Richard Wagner
Gustav Mahler
      1 1 ⎕SS Composers '^[A-Z]+'
1 6
      1 1 ⎕SS Composers '\b[A-Z]+$'
44 6
      2 1 ⎕SS Composers '^[A-Z]+'
  Ludwig
      2 1 ⎕SS Composers '\b[A-Z]+$'
  Mahler

The '^' and '$' meta-characters have matched only at the start and end of the whole string. By adding modifier flag 16, we can treat the three lines individually:

      1 17 ⎕SS Composers '^[A-Z]+'
 1 6
22 7
37 6
      1 17 ⎕SS Composers '\b[A-Z]+$'
12 9
30 6
44 6
      2 17 ⎕SS Composers '^[A-Z]+'
  Ludwig   Richard   Gustav
      2 17 ⎕SS Composers '\b[A-Z]+$'
  Beethoven   Wagner   Mahler

By default, any of CR, LF, or CR-LF pairs are treated as newlines, so text originating on Windows, MacOS and Unix-style systems can be handled automatically. You can force the search to treat only one of these possibilities as a valid newline by including modifier 64 (CR only), 128 (LF only) or 256 (CR-LF pairs only).

Cause a '.' in the pattern to match new-line delimiters (modifier 32): Normally, the period ('.') wild-card character does not match a newline character. This flag forces it to do so.

Performance

For simple searches with a single search pattern, ⎕SS uses the Boyer-Moore-Horspool algorithm, which is extremely fast for most combinations of strings and patterns. Long patterns are generally faster to search for.

For simple searches with multiple search patterns, ⎕SS uses a simple search algorthm with optimizations, so performance should remain good for reasonable numbers of search patterns.

Case-insensitive searching will be slightly slower than case-sensitive searches. Regular-expression searching is slower than the equivalent simple searches.

Regular-expression search engine

The regular-expression search engine used in APLX is based on version 7.1 of Perl Compatible Regular Expressions (PCRE), an open-source product written by Philip Hazel of the University of Cambridge. See the Legal Notice which applies to PCRE.

The maximum length of string which can be searched using regular expressions is 2GB (on 64-bit versions of APLX, simple string searches can be done on strings longer than 2GB).


Topic: APLX Help : Help on APL language : System Functions & Variables : ⎕SS String Search/Replace
[ Previous | Next | Contents | Index | APL Home ]