Tokenizing PHP

PHP offers a simple model for tokenizing a string. Characters of your choice are treated as separators, and the runs of characters between separators are the tokens. You may change the set of separators with each token you pull from a string, which is handy for irregular strings, that is, ones that aren't simply comma-separated lists.

The listing accepts a sentence and breaks it into words using the strtok function. As far as the script is concerned, a word is bounded by spaces, punctuation, or either end of the sentence. Single and double quotes are left as part of the word.

Tokenizing a String
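
The following is a minimal sketch of the approach described above, not the original listing. The function name tokenizeSentence, the exact separator set, and the sample sentence are assumptions made for illustration. It appends an <END> sentinel to the input and pulls tokens with strtok until the sentinel appears.

```php
<?php
// Hypothetical reconstruction for illustration; names and the separator
// set are assumptions, not taken from the original listing.
function tokenizeSentence(string $sentence): array
{
    // Space and common punctuation act as separators. Single and double
    // quotes are deliberately left out, so they stay attached to the word.
    $separators = " .,;:!?()-";

    // Append a sentinel token that should not occur in ordinary input.
    $input = $sentence . " <END>";

    $words = [];
    // The first call passes the string; later calls pass only the
    // separators and continue where the previous call stopped.
    for ($token = strtok($input, $separators);
         $token !== "<END>";
         $token = strtok($separators)) {
        $words[] = $token;
    }
    return $words;
}

print_r(tokenizeSentence("Hello, world! Isn't \"PHP\" fun?"));
```

For the sample sentence, the words collected are Hello, world, Isn't, "PHP" (with its quotes), and fun.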


Notice the addition of <END> to the input variable. This special token allows the algorithm to detect the end of the input string. When strtok encounters the end of input, it returns FALSE, so your first inclination might be to test for FALSE in the for loop. Recall, however, that an empty string is considered equivalent to FALSE in a loose comparison, and so is the string "0". In versions of PHP before 4.1.0, strtok returned an empty string when two separators followed each other; later versions skip the empty part instead, but a token such as "0" will still end a loop that tests loosely for FALSE. Since we don't want to stop tokenizing at the first such token, we place a token at the end that we know won't appear in the input. If we're worried about people purposely putting <END> in the input string, we could strip it out first, but this isn't something that will be typed by accident. Since there's no security risk in the tokenizing ending too soon, I prefer to let hackers get invalid results.
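
As an alternative to the sentinel, the loop can compare against FALSE strictly, so that tokens that are merely equivalent to FALSE do not end it. This is a sketch under that assumption, not the method used in the listing; the function name tokenizeStrict and the sample input are hypothetical.

```php
<?php
// Hypothetical helper for illustration: collects every token strtok
// yields, stopping only on the boolean FALSE itself.
function tokenizeStrict(string $input, string $separators): array
{
    $tokens = [];
    $token = strtok($input, $separators);
    // !== checks type as well as value, so the string "0" does not
    // end the loop the way a loose != FALSE test would.
    while ($token !== false) {
        $tokens[] = $token;
        $token = strtok($separators);
    }
    return $tokens;
}

var_dump("0" == false);   // loose comparison treats "0" as FALSE
var_dump("0" === false);  // strict comparison does not

// The doubled comma is skipped by strtok in PHP 4.1.0 and later.
print_r(tokenizeStrict("10,0,,7", ","));
```

With the sample input, the token "0" is kept and tokenizing continues through to "7".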

The strtok function is useful only in the simplest and most structured situations. An example might be reading a tab-delimited text file. The algorithm might be to read a line from the file, pull each token from the line using the tab character as the separator, and then continue with the next line from the file.
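
The tab-delimited algorithm might be sketched as follows. The function name readTabDelimited and the sample data are assumptions, and an in-memory stream stands in for a real file so the example is self-contained; with a file on disk, the handle would come from fopen on that path instead.

```php
<?php
// Hypothetical helper for illustration: reads lines from an open handle
// and splits each one into fields on the tab character.
function readTabDelimited($handle): array
{
    $rows = [];
    while (($line = fgets($handle)) !== false) {
        // Drop the trailing line ending so it isn't part of the last field.
        $line = rtrim($line, "\r\n");

        $fields = [];
        for ($field = strtok($line, "\t");
             $field !== false;
             $field = strtok("\t")) {
            $fields[] = $field;
        }
        $rows[] = $fields;
    }
    return $rows;
}

// An in-memory stream stands in for a tab-delimited file.
$handle = fopen("php://memory", "w+");
fwrite($handle, "id\tname\tcity\n1\tAda\tLondon\n2\tLinus\tHelsinki\n");
rewind($handle);

$rows = readTabDelimited($handle);
fclose($handle);

print_r($rows);
```

One design caveat: because current versions of strtok skip consecutive separators, an empty field between two tabs is dropped rather than returned, so this sketch suits only files in which every field is filled.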

