Thursday, July 18, 2019

Solving problems across different data preparation tools

Lot of different tools to do the job, and maybe you can learn something new

Lots of people are wedded to certain tools for their data work, and that may be due to what resources they have at work or sticking to what they know. I get to use lots of different tools each day and wanted to demonstrate how each of them perform for data preparation challenges with no bias over preference in the tools.

So this is the output of this work, a whole 45+ minutes walking through me solving two data preparation problems across three different tools (Alteryx, PowerQuery and Tableau Prep). I've done this live so there are a few mistakes here and there (so please be kind on those).



Either watch the video throughout, or just skip to the sections in the video that cover a tool you've not used before (and hopefully learn something new).

What I found useful about going through this process is an appreciation that no tool is perfect. Something that takes you a few steps to do in one tool can be completed in one step in another, or one tool has a better user interface / description / guidance notes and that means depending on the person a transformation process might just click better for someone in another tool.

There's also something around data literacy and helping people understand what they are doing in the ETL process, you can see it in the video that there's a step where you want to unpivot your data and the terminology used across the tools varies from cross-tab, unpivot and pivot (rows to columns). So that again just demonstrates that people will understand it in different ways, but if you know what you want to achieve (I call it, when I deliver some training, making your data wide and short or long and thin), then you should be able to move between different tools fairly easily (and just Google the bits you don't know!!).

I need to give a shout out to the resources used in the videos Alteryx Weekly Challenges and Preppin' Data. Like I said in the video, don't feel that these resources are constrained to the respective vendor's tool, as they are a great resource to learn new things or try a different tool.

Oh and another bit I forgot to mention in the video, is the vendors for each of these products are releasing updates on a monthly/quarterly basis so the functionality and features are constantly evolving.

Wednesday, July 10, 2019

First attempt at vlogging - Alteryx weekly challenge

To save time and an extension to my endeavours to completing all of the Alteryx weekly challenges, I've decided to vlog my attempts at them. I hope there's some interest in them, and to make a start I've gone back to the first challenge posted to do a walk through of this. Let me know what you think.


Taking hints from gamers posting on Twitch, I've invested in a video capture device to record the Alteryx stuff and using OBS to do the recording, as my first attempt resulted in files across many devices and then required compiling into a finished product, so at least this way it's all in one place and ready to go straight after I hit 'end recording'.


Friday, July 5, 2019

My first #SportsVizSunday submission

Ok so I know it's not Sunday... But I've just posted my first #SportsVizSunday submission.

I was really keen to do this one as I love cycling and the TdF. The data set for the monthly challenge is available from Data.World and I wanted to focus on the British riders in the Tour, and because the original dataset didn't have the rider's nationality in it I needed to do some research and luckily there was this website listing all of the British riders.

So here's my visualisation:
 Interactive version

Click on the image to view the interactive version.

There was one feature which I played around with a while and finally got to work by using set actions. I wanted the circles for the British riders to stand out by size and colour, but when you selected a specific rider to make the size of the other riders blend in (as demonstrated in the video below):

So to achieve this I created two sets containing 1) a list of all the British riders and 2) the selected rider. By creating a combined set of the overlap between these two sets I could create a set action which, when a single rider was selected it would just show the rider in set 2 and when no riders were selected it would show all the British riders in set 1. This combined set could then be used on size (values in/out the set are assigned different sized circles), while maintaining the colours which were defined by a calculation:


Thursday, July 4, 2019

What I've learnt completing 100 Alteryx Weekly Challenges

So I have finally earned my Community badge for completing 100 weekly challenges.

Community Badges

So while hitting this milestone I thought it would be worthy analysing what's involved and highlighting some hints and tips I've picked up as part of participating in the weekly challenges, as I find these are a great learning resource for covering areas of the software which I may not ordinarily utilise as part of my day job!

Weekly Challenges are not that time consuming!

So firstly I want to highlight that the Weekly Challenges are not too time consuming, and most solutions need only a few tools to complete.  Looking at my solutions, the simplest workflow contains just one formula tool (excluding the text inputs, comment boxes and the browse tools) to the most complex requiring 64 tools (Challenge 165), with an average of 12 tools. I don't know my total time to complete the challenges, but I did my first challenge in August 2018, so it's taken me 307 days to complete 100 challenges, with a few intensive stints and some periods of not many completions.
  Capture

There's several challenges which I've completed in under 2 minutes, and along with a couple of colleagues at work we do the challenges Grand Prix style to add an element of competition.

It allows you to discover tools you may have never used before

So if I take a look at the tool categories I have used across these 100 challenges*, I can see that the majority fall under categories like Preparation, Join, Transform and Parse, but I've also used things like the Calgary tools which I've never used outside of the Weekly Challenges. Capture2Capture3 *at the time of writing I've actually completed 102 challenges, and the macro results therefore reflects this.

Now some tips and tricks

I've been back over some challenges to identify some techniques which may be considered intermediate level to assist with transformations. What is great is once you've used these techniques once and understand how they work you'll end up going back to them again and again.

Text to rows

Coming from Excel, many Alteryx users will be familiar with text to columns where you split a string based on a delimiter to new columns (common where the input is a comma delimited file). However one useful function in Alteryx is split to rows, I have used this on Challenge 139 where a benefit of using split to rows is to keep the data in one field, so the numbers can be parsed out in the next step. Capture4Capture5

I've also talked about how using splitting to rows can make a workflow more dynamic in another blog post, as it doesn't require knowing how many columns you need to split the field into.

Regex tokenise

So similar to splitting to rows on a delimiter is using the regex tool when there might not be a delimiter. In the case of Challenge 59, where I needed to whether words in two fields were an anagram of each other. To solve this, I split every character for each record to a new row, and from there sort the letters and then concat them back and if they match then it's an anagram.




Modulo

The modulo function is used in Tableau when building small multiple charts, and I used it a lot when doing Advent of Code. What modulo gives you is the denominator of the fraction in the remainder as part of a division. For example the modulo of the value 9 when divided by 4 is 1 as 9/4 = 2 1/4. In Alteryx this is done as part of the formula tool and the syntax is mod([Field],divisor). I used modulo in Challenge 21 to restart the numbering of the records every 12th record as this represented each month of the years 2007 - 2008. So here, I had records 1 - 24 and what I wanted was the numbering to go 1 - 12 for 2007 and then repeat 1 - 12 for 2008. So here the formula tool was used to update the RecordID field with the following function:

mod([RecordID]-1,12)+1

This then allowed me to join the numbers 1 - 12 on a text input to get the months January - December.



Reverse String

So this is a string based function which reverses the order of every character in a string. I found this really useful in Challenge 40 which involved parsing web html data. Again coming from Excel I'm familiar with using Left, Right and Mid (substring in Alteryx) functions to return a certain number of characters, but what I wanted to find every character after the last '>' however the position of this varied by record. An example of the full string is as follows:
So from this I needed a way of identifying the position of the > before Brett Abernathy so that I could get the Doctor's name and discard the rest of the string. The resulting formula was as follows:

 
substring([Doctor Name],IIF(FindString([Doctor Name], ">") > -1, Length([Doctor Name]) - FindString(ReverseString([Doctor Name]), ReverseString(">")), -1),length([Doctor Name]))

So this is used FindString to find the position of the ">", but as it works left to right you reverse the string so the last ">" becomes the first.

So to round this all up

I hope you've found these techniques are useful and will be able to use them, so go ahead and use them in your day to day Alteryx work and in weekly challenges and it encourages you to attempt more weekly challenges to discover some of your own new techniques.        

Ctrl + R = Turn on the lights

At the recent Alteryx User Group, I said that I felt that for data people Alteryx was like a swiss army knife which allows you to do virtually anything. One of the reasons for this is due to it's ability in a no code environment connect to APIs which extend the possibilities of what the tool can do. In my day job I typically use Alteryx for data preparation and analysis, but by using the Download Tool, you can make use of web-based applications through API requests.

A short intro to APIs

If you are unfamiliar with what an API is (Application Programming Interface), then this definition is a good starter for 10 - "a set of functions and procedures allowing the creation of applications that access the features or data of an operating system, application, or other service". An API request is typically made up of a url, request and authorisation. The Alteryx community has these two good posts which outline APIs in a bit more detail:
At work I typically use APIs to pull financial information, such as share price information, but for the purpose of this blog I'm looking at a fun use of Alteryx - using an API request to turn on/off some internet connected lights.

Controlling Hue lights using Alteryx

Applications which use APIs come with documentation on how to set up an account and make a request, for Philips Hue connected bulbs the documentation is outlined here. When I first used APIs I found the documentation difficult to understand, but the more APIs I've used I've found that there are a few basics get set up and these are covered in the Alteryx Community links above. Then the benefit of using a tool like Alteryx is you can turn the workflow into an analytic app which allows a user interface to easily update the values. So I am going to follow the documentation and show you how to translate this to Alteryx to control your home lights and understand how to use APIs. So in the documentation, the first step is to get the IP address of the hub which controls the lights (I've hidden my IP address for obvious reasons).

Image-1

Next we can find out the username which is required to make a request. This is easy to set up in Alteryx only needing two tools (and a browse to easily see the results). Get Username
The text input tool is set up with the URL for the API request and the body of the request.

Username text input

The first request is a POST request which means a new resource is created (i.e. a username). This request is set up in the configuration pane of the download tool, where we select the field which contains the URL of the request, the http action, and the query string of the request.

Username download 1Username download 2

When the workflow is run the field "DownloadData" contains the JSON output of the download tool which gives you the username.
  username result

Now we have a username, we need to understand the information of our lights to make the request. This is acheived with a GET request using the following URL - http://{hub ip address}/api/{username}/lights This can be configured by updating the URL in the text input tool with the above, and changing the configuration of the payload request in the download tool to be a GET request (i.e. read information).

Get light names

The response to this is a detail of all the lights connected to the hub. In my home I have over 50 lights connected to the hub, so using the JSON parse tool you can structure the JSON result in an easily readable table.

  Get light names 2Get light names 3.PNG

From here we now have all the information we need to make the final request which will change the state (turn on / off) the specific light. This is done via a PUT request* with this modified URL - http://{hub ip address}/api/{username}/lights/{light ID}/state and a body which tells the light whether to turn on or off {"on":true}.

Turn Light on

*One quirk I found is Alteryx expects the body of a PUT request to be a blob, this is because typically PUT requests involve uploading data, however the workaround is to make a custom request in the download tool and just type PUT. Running this workflow then results in the following: Now taking this a step further, you can tweak the workflow to use the interface tools which turns the workflow into an Analytic App which creates a user interface to pick which light you want to control. There's also additional requests which mean you can change things like the brightness and colour of the bulb.

  Turn Light on 2
The app in action...
large.gif

Debugging Alteryx Analytic Apps

Analytic Apps in Alteryx are great, they can either be hosted on an Alteryx Server with a web-based user interface which allows non-Alteryx users to run the workflow or a user with Designer can run them on their desktop without devling inside the workflow. This post isn't focused on what an Analytic App is or how to build them. If you want to know more about creating Analytic Apps then check out this Alteryx YouTube video.

I use Analytic Apps a lot as they are a way of having a user easily run a workflow on new data without fiddling around with the workflow to point the input tools at the new datasource. However, as I recently found out, a downside of running an app is you don't get the benefit of running the workflow directly inside Designer to see what happens to the flow of your data (which is one of the reasons I love Alteryx). For example, in Designer it is clear how many records are passing through each tool.

Capture.PNG

However, when you run it as an analytic app it becomes a bit of a black box. Yes you can see the results window which tells you some of the information but you don't get to see your data at each anchor like you do in Designer. So if something unexpected starts happening to your workflow it can tricky to workout why.
Capture2

Debugging mode to the rescue

So I need to admit now, that when Interface Designer window is loaded this big button called 'Open Debug' I'd never pressed before and while it's obvious from the name it was only after a couple of hours of going 'the workflow runs fine in Designer, I can't work out why when it is run as an app that it goes wrong?' that I decided to press it and see what it does! So the workflow that I discovered this on was the one below where I have lots of nested text input boxes and detour tools to run an API via different end points depending on the user's inputs. Capture3

Analytic App.gif

I was finding that I could run the workflow in Designer fine, and I would manually update the direction of the detour tools to run each branch of the workflow and hence check each API call was working correctly. Yet when I tried to do the same in the App Interface I was getting errors as the API calls were not being correctly formed, which is when I turned to Debug mode. Debug mode is activated in the Interface Designer. First press on the magic wand icon to load up this screen (below) and then complete the Analytic App with the data you want, then click on the 'Open Debug' button to the right.
Debug1

This then opens up a new workflow with the settings from the App detailed at the top of the workflow.

  Debug2.PNG

You can now run the workflow and see the data running through the workflow to help understand where the Analytic App is (or in my case not) updating a specific value and flowing through the workflow.

Debug3

As you can see Company X is now passed into the workflow in the q field which was what I selected in the Analytic App so i know it is working as expected and can now save myself many headaches in the future!!!

Making Alteryx workflows more re-usable


Input data Untitled drawing1

Untitled drawing2


How often have you looked at a set of data know what you need to do to transform it and gone straight in to build an Alteryx workflow which works for the data you currently have, but then find a week, month, year later that you’ve got a new extract of the data but the workflow didn’t work as intended? 

Now Alteryx is great when faced with this problem as you can easily work your way along the tools to identify where the workflow failed and update it accordingly.   But one thing I am trying to get better at, and also teach others, is to use a different choice of tools to make the workflow more reusable and able to host on server without needing constant debugging. To show an example of how a change in the tool selection can help make a workflow more usable let us look at it in terms of the following problem.

Problem
You’ve been presented with a text file where the data needs cleaning up.  Due to they way in which the user inputs the data, the same records may be split over several rows.

First approach using Alteryx
If we look at the input data we can see that there are four lines where the data is split at the same point, and a fifth line where the data is split over three lines.

Now we could bring the data straight in and use the tab delimiter to parse the data into the correct columns, however by doing this we would lose important information on the incomplete lines which would help us at a later stage to clean up the file.  So instead we first bring the file in without parsing it on the tab delimiter.

Untitled drawing3

Then we can use a formula tool to count the number of tab delimiters in the field (Field_1). Untitled drawing4
There is a useful Regex formula for this: REGEX_CountMatches([Field_1], "\t")

The result of this formula is a new column with a count of the number of tab delimiters for each record. Now I won’t go through in detail the other steps, but essentially what you do is check each record to check if it has the correct number of delimiters and mark that record as complete, otherwise do a running total of the delimiters until it matches the total, then concatenate the records using the summarise tool before using the text to columns tool to parse out the now complete records on the tab delimiter. Untitled drawing5

Issues with this approach
Now here lies the problem, by using the text to columns tool you need to specify the number of columns to split on, which in this case is fine as I know from the Regex_CountMatches formula how many columns to split on.  But what happens when I have more than 18 (18 columns of data but 17 delimiters)? This would require the user to go into the workflow and manually update it in two places:
  1. In multi-row formula tool where the value to calculation has a hard-coded valueUntitled drawing6
  2. In the Text to columns tool to update the number of columns to split onUntitled drawing7
So how can the workflow be made more dynamic / reusable?
Now we want the workflow to be usable for a different version of the file, for example one with more columns, so how can the existing workflow be tweaked to remove the need for the user to manually update. The first area is around finding the maximum number of delimiters for the records and using this as a value in the calculation. This is done by using the summarise tool to return the maximum value for the count of delimiters, and appending this value to the dataset. Untitled drawing8 This means the multi-row formula can be updated to replace the hardcoded value with [Max_Num_Delim].

Untitled drawing9

Then secondly the next change is in how the text to columns tool is applied. Instead of splitting the data to columns, we chose the option of ‘Split to rows’, which negates the need for knowing how many columns are in your data.

Untitled drawing10

Adding a few steps to to count the number of row created for each record and Cross Tabbing the data (column number as the header) and grouping on the Record ID now gets us the data into the same shape we require without any hardcoding of numbers and a more reusable workflow.

Untitled drawing11

I have published the two solutions to this problem on my github page:
Traditional approach: https://github.com/CG003/Alteryx/blob/master/Client_CL%20cleanup%20(github).yxmd

Dynamic approach: https://github.com/CG003/Alteryx/blob/master/Client_CL%20cleanupv2%20(github).yxmd