split

Overview

The split statement is used to create and/or update columns by splitting the textual value of an existing column into multiple parts.

Syntax

splitColNameusingsep

splitColNameusingsepretaining first[column_count]

splitColNameusingsepretaining last[column_count]

splitColNameusingsepretainingfirst_column_index[tolast_column_index]

The keyword following ColName may be using, separator or delimiter. All three work in exactly the same way.

Details

The ColName argument is the name of an existing column whose values are to be split and the sep argument (which must only be a single character in length) is the delimiter by which to split them.

For example a value one:two:three:four, if divided by a sep character of :, would result in the fields one, two, three and four.

To specify a sep value of a space, use " " or\<space> (where <space>is a literal space character)

New columns are created if necessary to contain the values resulting from the split. These additional columns are named ColName_split1 through ColName_splitN where N is the number of values resulting from the process.

If there is an existing column with a name conflicting with any of the columns that split creates then:

  • If the overwrite option is set then all values in that pre-existing column will be overwritten

  • If the overwrite option is not set:

    • If there are blank values in the existing column they will be updated

    • If there are no blank values in the existing column then no changes will be made to it

Keeping only specific fields

If the retaining keyword is specified, split will discard one or more of the resulting columns automatically based on the specification that follows the keyword. The following specifications are supported:

Specification

Result

first or 1

Discard all but the first column

first N or 1 to N

Discard all but the first N columns

last

Discard all but the last (rightmost) column

last N

Discard all but the last N columns

N

Discard all but the _N_th column

N to M

Discard all but the columns from the _N_th to the _M_th inclusive

In the table above, N refers to a result column's number where the first column created by split has a number of 1

Example

Given an input dataset of the form:

Name,ID
VM-One,sales:2293365:37
VM-Two,marketing:18839:division:89AB745
VM-Three,development:34345:engineering:345345:Jake Smith
VM-Four,sales::38
VM-Five,marketing:234234234:testMachine
VM-Six,development:xxxx:test
VM-Seven,1234:5678
VM-Eight,test:::
VM-Nine,field::5
VM-Ten,test::3425:

The statement ...

split ID using :

... will result in the dataset:

Name,ID,ID_split1,ID_split2,ID_split3,ID_split4,ID_split5
VM-One,sales:2293365:37,sales,2293365,37,,
VM-Two,marketing:18839:division:89AB745,marketing,18839,division,89AB745,
VM-Three,development:34345:engineering:345345:Jake Smith,development,34345,engineering,345345,Jake Smith
VM-Four,sales::38,sales,,38,,
VM-Five,marketing:234234234:testMachine,marketing,234234234,testMachine,,
VM-Six,development:xxxx:test,development,xxxx,test,,
VM-Seven,1234:5678,1234,5678,,,
VM-Eight,test:::,test,,,,
VM-Nine,field::5,field,,5,,
VM-Ten,test::3425:,test,,3425,,

Using the same original dataset, the statement ...

split ID using : retaining 3 to 5

... will result in the dataset:

Name,ID,ID_split1,ID_split2,ID_split3
VM-One,sales:2293365:37,37,,
VM-Two,marketing:18839:division:89AB745,division,89AB745,
VM-Three,development:34345:engineering:345345:Jake Smith,engineering,345345,Jake Smith
VM-Four,sales::38,38,,
VM-Five,marketing:234234234:testMachine,testMachine,,
VM-Six,development:xxxx:test,test,,
VM-Seven,1234:5678,,,
VM-Eight,test:::,,,
VM-Nine,field::5,5,,
VM-Ten,test::3425:,3425,,