April, 2010 | Trigger Happy

First I want to mention that the challenges on databasechallenges.com are a blast. I recommend trying some of them to hone your TSQL skills. I haven’t had as much time as I would have liked to try as many as I would have liked, but I submitted solutions to a couple of them, and one of them was chosen as an accepted solution.

Jacob Sebastian has requested that I write an explanation for my submission, so here goes…

For a statement of the challenge itself, go here: http://beyondrelational.com/blogs/tc/archive/2009/09/28/tsql-challenge-14-identify-the-longest-sequence-of-characters-in-a-string.aspx

For my posted solution, go here: http://databasechallenges.com/SQLServer/TSQL/Challenge14/Submissions/Mike_DeFehr

In a nutshell, you’re given some strings and asked to identify and quantify repeating characters within the string. As with many TSQL problems (and particularly with problems on the challenges site) the solution begins with a table of numbers or a tally table. A table of numbers is just that: a table with one row for each of the integers from 1 to as many as you need. Typically, one might have such a table with potentially millions of rows in it somewhere in your SQL server – the rules of the challenge do not allow you to create one, but they do allow you to use one of the system tables (spt_values) as a table of numbers. You get a table of a couple of thousand numbers with the following query (which is plenty for this challenge):

    SELECT number
    FROM master..spt_values
    WHERE number > 0
        AND type = ‘P’

One of the things we can do with a table of numbers is to effectively turn a string into an array of characters. By feeding the number from the table into the SUBSTRING function, this can be accomplished quite easily:

    DECLARE @String varchar(100)
    SET @String = ‘abcdefghijk’
 
    SELECT      number AS ArrayIndex,
                SUBSTRING(@String,number,1) AS [Character]
    FROM master..spt_values
    WHERE number > 0
          AND type = ‘P’ 
          AND number <= LEN(@String)

    ArrayIndex  Character
    ———– ———
    1           a
    2           b
    3           c
    4           d
    5           e
    6           f
    7           g
    8           h
    9           i
    10          j
    11          k
 
    (11 row(s) affected)         

Note that if you like a zero-based array, a few subtle changes are required, but it’s no problem:

    DECLARE @String varchar(100)
    SET @String = ‘abcdefghijk’
 
    SELECT  number AS ArrayIndex,
            SUBSTRING(@String,number+1,1) AS [Character]
    FROM master..spt_values
    WHERE number >= 0
        AND type = ‘P’
        AND number < LEN(@String)

    ArrayIndex  Character
    ———– ———
    0           a
    1           b
    2           c
    3           d
    4           e
    5           f
    6           g
    7           h
    8           i
    9           j
    10          k
 
    (11 row(s) affected)

Now that we have our string in array form, we can partition a new row number by Character. This puts all like characters together whether they are in sequence or not. This is where the solution starts to depart from traditional iterative thinking. We will not be “going through” the string looking at the characters and seeing if they are in sequence. We are establishing properties about our set of characters that we will use to form our solution. Here is what it looks like with the Partitioned number. I’ll use a slightly more interesting string in this example.

    DECLARE @String varchar(100)
    SET @String = ‘AFBB04BFFF5’
 
    SELECT    Number AS ArrayIndex,
            SUBSTRING(@String,Number,1) AS [Character],
            ROW_NUMBER() OVER (
                    PARTITION BY SUBSTRING(@String,Number,1)
                    ORDER BY Number) AS PartitionedNumber
    FROM master..spt_values
    WHERE Number <= LEN(@String)
        AND number > 0
        AND [type] = ‘P’
   
    ArrayIndex  Character PartitionedNumber
    ———– ——— ——————–
    5           0         1
    6           4         1
    11          5         1
    1           A         1
    3           B         1
    4           B         2
    7           B         3
    2           F         1
    8           F         2
    9           F         3
    10          F         4
 
    (11 row(s) affected)

Notice that when you partition a row number, it starts over every time the character changes. Our string is “out of order” now, but that is fine – this is one of the challenges when starting to think in sets – order is not important. We have preserved the property, or the idea of string order in the ArrayIndex column. The result set happened to come out this way because the optimizer found it most efficient to change the order to satisfy the requirements of the ROW_NUMBER operation, but it could come out in any order for the purposes of our solution. This order *is* convenient, however, to visualize the next step. Notice that the ArrayIndex and The PartitionedNumber columns track together when there is a sequence of characters. They can be completely different numbers, but when a sequence starts happening, they go up together. When the same character exists, but is out of sequence, the pattern is broken (as what happens with the ‘B’). This is a technique I adapted from Itzik Ben-Gan’s solution for finding islands (unbroken ranges of values in a set). If you get a chance to see his “tips and tricks” demos (or any of his presentations for that matter), there is much gold there. Taking the difference between the two columns not only illustrates this better, but gives us something we can use:

    DECLARE @String varchar(100)
    SET @String = ‘AFBB04BFFF5’
 
    SELECT    Number AS ArrayIndex,
            SUBSTRING(@String,Number,1) AS [Character],
            ROW_NUMBER() OVER (
                    PARTITION BY SUBSTRING(@String,Number,1)
                    ORDER BY Number) AS PartitionedNumber,
            number – ROW_NUMBER() OVER (
                    PARTITION BY SUBSTRING(@String,Number,1)
                    ORDER BY Number) AS RNDiff
    FROM master..spt_values
    WHERE Number <= LEN(@String)
        AND number > 0
        AND [type] = ‘P’
   
    ArrayIndex  Character PartitionedNumber    RNDiff
    ———– ——— ——————– ——————–
    5           0         1                    4
    6           4         1                    5
    11          5         1                    10
    1           A         1                    0
    3           B         1                    2
    4           B         2                    2
    7           B         3                    4
    2           F         1                    1
    8           F         2                    6
    9           F         3                    6
    10          F         4                    6
 
    (11 row(s) affected)

We have 2 sequences in our string: 2 B’s and 3 F’s – these are highlighted by identical amounts in the new RNDiff column. In fact, within any given character, if the RNDiff amounts are the same, they represent a sequence – so we can apply a group by. We will make our existing query into a derived table and feed it into a group by query:

    DECLARE @String varchar(100)
    SET @String = ‘AFBB04BFFF5’
 
    SELECT    [Character],
            MIN(ArrayIndex) AS Pos,
            COUNT(*) AS [Len]
    FROM (    SELECT    Number AS ArrayIndex,
                    SUBSTRING(@String,Number,1) AS [Character],
                    ROW_NUMBER() OVER (
                        PARTITION BY SUBSTRING(@String,Number,1)
                        ORDER BY Number) AS PartitionedNumber,
                    number – ROW_NUMBER() OVER (
                        PARTITION BY SUBSTRING(@String,Number,1)
                        ORDER BY Number) AS RNDiff
            FROM master..spt_values
            WHERE Number <= LEN(@String)
                AND number > 0
                AND [type] = ‘P’    ) a
    GROUP BY [Character], RNDiff
   
    Character Pos         Len
    ——— ———– ———–
    A         1           1
    F         2           1
    B         3           2
    0         5           1
    B         7           1
    4         6           1
    F         8           3
    5         11          1
 
    (8 row(s) affected)

What this is telling us is that we have 8 sequences of characters – most of which have a length of only one. We’re only interested in sequences greater than one, so a HAVING clause will take care of that:

    DECLARE @String varchar(100)
    SET @String = ‘AFBB04BFFF5’
 
    SELECT    [Character],
            MIN(ArrayIndex) AS Pos,
            COUNT(*) AS [Len]
    FROM (    SELECT    Number AS ArrayIndex,
                    SUBSTRING(@String,Number,1) AS [Character],
                    ROW_NUMBER() OVER (
                        PARTITION BY SUBSTRING(@String,Number,1)
                        ORDER BY Number) AS PartitionedNumber,
                    number – ROW_NUMBER() OVER (
                        PARTITION BY SUBSTRING(@String,Number,1)
                        ORDER BY Number) AS RNDiff
            FROM master..spt_values
            WHERE Number <= LEN(@String)
                AND number > 0
                AND [type] = ‘P’    ) a
    GROUP BY [Character], RNDiff
   
    Character Pos         Len
    ——— ———– ———–
    A         1           1
    F         2           1
    B         3           2
    0         5           1
    B         7           1
    4         6           1
    F         8           3
    5         11          1
 
    (8 row(s) affected)

Now that this is starting to look suspiciously like the answer to our question – note that the minimum of the ArrayIndex column is the position of the start of the sequence and the count is the length of the sequence. We’re ready to apply this technique to the data in the challenge. Funny I should use the word “apply” because the cross apply operator is what we need here. Cross apply will execute the query using each row of data as it’s input and will expand the number of rows in the result by the number of rows it returns – we see that this is exactly what is required by the expected output. Replacing the @String variable with the data from the sample data table, here is the final result:

    DECLARE @t TABLE (Data VARCHAR(40) )
 
    INSERT @t (Data) SELECT ‘9992EDC6-D117-4DEE-B410-4E5FAE46AE97’ 
    INSERT @t (Data) SELECT ‘0BFC936B-BD9A-4C6A-AFB2-CF3F1752F8B1’ 
    INSERT @t (Data) SELECT ‘4A73E7EB-7777-4A04-9258-F1E75097977C’ 
    INSERT @t (Data) SELECT ‘5AAF477C-274D-400D-9067-035968F33B19’ 
    INSERT @t (Data) SELECT ‘725DA718-30D0-44A9-B36A-89F27CDFEEDE’ 
    INSERT @t (Data) SELECT ‘8083ED5A-D3B9-4694-BB04-F0B09C588888’
 
    SELECT    t.Data,
            c.[Char],
            c.[Pos],
            c.[Len]
    FROM @t t
        CROSS APPLY (SELECT    [Char],
                        MIN(RowNum) AS Pos,
                        COUNT(*) AS [Len]
                FROM    (    SELECT    Number AS RowNum,
                                SUBSTRING(t.Data,Number,1) AS [Char],
                                ROW_NUMBER() OVER (
                                    PARTITION BY SUBSTRING(t.Data,Number,1)
                                    ORDER BY Number) – Number AS RNDiff
                        FROM master..spt_values
                        WHERE Number <= LEN(t.Data)
                            AND number > 0
                            AND [type] = ‘P’
                    ) b
                GROUP BY [Char], RNDiff
                HAVING COUNT(*) > 1
                ) c
    ORDER BY MAX(c.[Len]) OVER (PARTITION BY t.Data) DESC, Data, Pos
   
    Data                                     Char Pos         Len
    —————————————- —- ———– ———–
    8083ED5A-D3B9-4694-BB04-F0B09C588888     B    20          2
    8083ED5A-D3B9-4694-BB04-F0B09C588888     8    32          5
    4A73E7EB-7777-4A04-9258-F1E75097977C     7    10          4
    4A73E7EB-7777-4A04-9258-F1E75097977C     7    34          2
    9992EDC6-D117-4DEE-B410-4E5FAE46AE97     9    1           3
    9992EDC6-D117-4DEE-B410-4E5FAE46AE97     1    11          2
    9992EDC6-D117-4DEE-B410-4E5FAE46AE97     E    17          2
    5AAF477C-274D-400D-9067-035968F33B19     A    2           2
    5AAF477C-274D-400D-9067-035968F33B19     7    6           2
    5AAF477C-274D-400D-9067-035968F33B19     0    16          2
    5AAF477C-274D-400D-9067-035968F33B19     3    32          2
    725DA718-30D0-44A9-B36A-89F27CDFEEDE     4    15          2
    725DA718-30D0-44A9-B36A-89F27CDFEEDE     E    33          2
 
    (13 row(s) affected)

Note that the final piece was to get the sorting right. The challenge required that the string with the longest sequence be on top, but then it should be sorted by the order in which the sequences appear in the string. This is accomplished by partitioning a maximum of the length of the sequences by Data. The important thing to note here is that the OVER clause can be applied to any aggregate and does not require a group by since the value is simply repeated for each member of the group.

This completes the solution. This was an excellent challenge because it required an interesting set-based solution to what is intuitively an iterative problem and made good use of the newer language elements (CROSS APPLY and the OVER clause applied to both an aggregate and ROW_NUMBER).

Enjoy and be sure to try the next TSQL challenge.

April, 2010:

Explanation for TSQL Challenge 14

Categories

Archives

Calendar

Pages

Meta