How often have you looked at a set of data, known exactly what you need to do to transform it, and gone straight in to build an Alteryx workflow that works for the data you currently have, only to find a week, month, or year later that you've got a new extract of the data and the workflow doesn't work as intended?
Now Alteryx is great when faced with this problem, as you can easily work your way along the tools to identify where the workflow failed and update it accordingly. But one thing I am trying to get better at, and also to teach others, is to use a different choice of tools to make the workflow more reusable and suitable to host on Server without needing constant debugging. To show how a change in tool selection can make a workflow more usable, let us look at the following problem.
You’ve been presented with a text file where the data needs cleaning up. Due to the way in which the user inputs the data, the same record may be split over several rows.
First approach using Alteryx
If we look at the input data we can see that there are four lines where the data is split at the same point, and a fifth line where the data is split over three lines.
Now we could bring the data straight in and use the tab delimiter to parse it into the correct columns. However, by doing this we would lose important information about the incomplete lines, which would help us clean up the file at a later stage. So instead we first bring the file in without parsing it on the tab delimiter.
Then we can use a formula tool to count the number of tab delimiters in the field (Field_1).
There is a useful Regex formula for this: REGEX_CountMatches([Field_1], "\t")
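Alteryx formulas aren't code you can run outside the tool, but the same delimiter count can be sketched in Python with the standard `re` module (the function name and sample line here are my own illustration):

```python
import re

def count_tabs(line: str) -> int:
    """Count tab delimiters in a record,
    mirroring REGEX_CountMatches([Field_1], "\t")."""
    return len(re.findall(r"\t", line))

print(count_tabs("a\tb\tc"))  # 2
```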
The result of this formula is a new column with a count of the tab delimiters in each record. Now I won’t go through the other steps in detail, but essentially you check whether each record has the correct number of delimiters and, if so, mark it as complete; otherwise you keep a running total of the delimiters until it matches the expected total, concatenate the partial records using the Summarise tool, and then use the Text to Columns tool to parse the now-complete records on the tab delimiter.
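The running-total-and-concatenate step can be sketched in Python as follows (a minimal illustration of the logic, not the Alteryx tools themselves; the function name and sample lines are my own):

```python
import re

def stitch_records(lines, expected_tabs):
    """Concatenate consecutive partial lines until the running tab
    count reaches expected_tabs, then emit the completed record."""
    records, buffer, running = [], "", 0
    for line in lines:
        buffer += line
        running += len(re.findall(r"\t", line))
        if running == expected_tabs:  # record is now complete
            records.append(buffer)
            buffer, running = "", 0
    return records

# A record split over two lines, followed by a complete record
lines = ["a\tb\t", "c\td", "e\tf\tg\th"]
print(stitch_records(lines, 3))  # ['a\tb\tc\td', 'e\tf\tg\th']
```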
Issues with this approach
Now here lies the problem: the Text to Columns tool requires you to specify the number of columns to split into, which in this case is fine as I know from the REGEX_CountMatches formula how many columns to split on. But what happens when I have more than 18 (18 columns of data, and therefore 17 delimiters)? This would require the user to go into the workflow and manually update it in two places:
- In the Multi-Row Formula tool, where the calculation has a hard-coded value
- In the Text to Columns tool, to update the number of columns to split into
Now we want the workflow to be usable for a different version of the file, for example one with more columns, so how can the existing workflow be tweaked to remove the need for manual updates? The first change is to find the maximum number of delimiters across the records and use that as the value in the calculation. This is done by using the Summarise tool to return the maximum value of the delimiter count and appending this value to the dataset. The Multi-Row Formula can then be updated to replace the hard-coded value with [Max_Num_Delim].
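In Python terms, the Summarise-and-append step amounts to taking the maximum delimiter count over all lines (the sample lines and variable name here are my own illustration):

```python
import re

# Find the maximum delimiter count across all lines - this value
# replaces the hard-coded number in the Multi-Row Formula calculation.
lines = ["a\tb\t", "c\td", "e\tf\tg\th"]
max_num_delim = max(len(re.findall(r"\t", ln)) for ln in lines)
print(max_num_delim)  # 3
```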
The second change is in how the Text to Columns tool is applied. Instead of splitting the data into columns, we choose the ‘Split to rows’ option, which removes the need to know how many columns are in your data.
Adding a few steps to count the number of rows created for each record and Cross Tabbing the data (column number as the header, grouping on the Record ID) now gets the data into the shape we require without any hard-coded numbers, giving a more reusable workflow.
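The split-to-rows-then-cross-tab pattern can be sketched in Python like this (a hypothetical stand-in for the Alteryx tools, with my own function name and sample data), and notice that nothing in it hard-codes a column count:

```python
def split_to_rows_then_pivot(records):
    """Split each record to rows on the tab delimiter, number the
    pieces, then pivot back to columns - no hard-coded column count."""
    # Split to rows: one (record_id, column_number, value) triple per piece
    rows = [(rid, col, val)
            for rid, rec in enumerate(records)
            for col, val in enumerate(rec.split("\t"))]
    # Cross tab: record_id picks the row, column_number the position
    width = max(col for _, col, _ in rows) + 1
    table = [[""] * width for _ in range(len(records))]
    for rid, col, val in rows:
        table[rid][col] = val
    return table

print(split_to_rows_then_pivot(["a\tb\tc", "d\te\tf"]))
```

Because the width is derived from the data itself, a file with more columns flows through unchanged, which is the whole point of the dynamic approach.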
I have published the two solutions to this problem on my github page:
Traditional approach: https://github.com/CG003/Alteryx/blob/master/Client_CL%20cleanup%20(github).yxmd
Dynamic approach: https://github.com/CG003/Alteryx/blob/master/Client_CL%20cleanupv2%20(github).yxmd