Saturday, November 16, 2019

TC19 Wrap up

So that's a wrap: TC19 has finished! So what did I get up to, and what were my thoughts on the conference?



For me, the conference broke down into the following session types:

Keynotes
Hands on training 
Certification; and
Other demonstration sessions

So here are my thoughts on the above.

Keynotes

Opening Keynote

Wow, just wow! I'm always surprised by the sheer number of Tableau users who come to Conference and by the size of the venues. This year Conference was at the Mandalay Bay hotel in Las Vegas, in the arena that normally hosts concerts (I think the capacity was 7,000). And the opening to the keynote lived up to all of the theatre of a concert!



The first part of the keynote was delivered by Adam Selipsky and focused on building a data culture. It drew an analogy between building a data culture and the code-breaking effort of WWII, and made the point that diversity matters in a data culture by focusing on some of the key women in that effort, as highlighted in the book Code Girls by Liza Mundy (who was also present in the opening keynote and received a standing ovation).
This focus on data culture drew out the elements of proficiency, including everyone, agility and community, which led into the introduction of Tableau Blueprint, Tableau's framework and set of resources for building out a data culture in your organisation.


I'm keen to look at the resources available, as part of my role over the last 2+ years has been tech-enabling my team and building an internal community. Hopefully Blueprint will validate the approach we've taken and introduce some resources to help us further on this journey.

There was then a fireside chat with Marc Benioff, the Salesforce CEO. When the conference was held in Berlin earlier this year, Salesforce had only just announced its intention to acquire Tableau; now, with the deal finalised, there was a lot more available around the strategy behind the acquisition, although it was still a very high-level set of initial thoughts. It will be interesting to see if more is said at Salesforce's Dreamforce event next week.


The final part of the opening keynote focused on some of the new features which have either just been released or are likely to be part of the next couple of releases.

Key updates from this that were exciting for me were:

 - Prep Builder moving to the browser: this will mean that data preparation, visualisation and publication can now be done end to end in the cloud. This will be important, along with the AWS partnership, in helping to scale deployments and potentially (and this is me speculating) in adding different tiers to the subscription model, where an online Prep and Desktop could have a reduced feature set at a lower price point.

 - Ask Data and Explain Data: I've put these two together as I see them working together to further democratise data analysis, with both being go-to tools to assist with initial data discovery and to help consumers of dashboards interact with the content and delve deeper into the insights.

 - Data Model: Like the above, this will help democratise data analysis. Users often get confused about how data should be related and about joining data at different levels of aggregation. These are all things that experienced 'data people' will be familiar with and know how to deal with, such as the different types of joins and the use of LOD expressions to roll granular records up to a more aggregate level. I first saw this last year at the New Orleans conference and was initially disappointed: while I totally get the benefit of making analysis easier for people who aren't yet 'data people', I thought it hid what was going on under the hood and could make people feel they didn't need to understand important concepts in data engineering. However, seeing it for a second time I'm much more open to the Data Model features, as they can act as a stepping stone by taking away some of the initial complexity; as users become more familiar with data engineering, they can then look under the hood and make changes if required.

All of the feature announcements were just a teaser for the second day's Devs on Stage content.

Devs on Stage

As a precursor to the Devs on Stage session, I bumped into Andrew Beers (Tableau's CTO) on the morning 5km run. While running with Andrew I didn't know he was the CTO; I just thought he was a product developer. That ignorance was probably a good thing, as we chatted about a broad range of topics instead of focusing on the roadmap, which was covered a few hours later in the keynote.

Here's the full list of what was covered in the keynote, with me expanding on the areas I'm excited about or think could have been implemented in a different way.

Visualisations & Web
 - Mark animations
Animation has been a missing feature in Tableau for a while now, and is one of the reasons people love D3 so much. As Hans Rosling showed in his presentations (TED Talk), movement is a very important pre-attentive attribute when showing change over time or changes in position. Having had a play with mark animations in the alpha release, it looks like Tableau's implementation will make presenting dashboards very engaging.

 - Pages play button on the web
 - PDF subscriptions
 - Extract refresh based on schedules
 - Public - web editing
 - Public - Web Explore
I can see this being an easily misused feature. It's great that you can very easily open up another person's workbook to get under the hood and understand how it was built, but now that you can also edit on the web and save the result to your profile (yes, it automatically attributes it to the original author), how many duplicate versions are we going to see of people's IronViz submissions or weekly challenges like Makeover Monday? Plagiarism of visualisations on Public has already been widely discussed on social media, including this post by Hesham (IronViz winner).


 - Tooltip editing in the browser

Prep Builder 
 - List view
 - Fixed LOD
 - Rank 
 - Reusable steps
This will allow steps which are common across your flows to be saved to Server and reused. Coming from a background with a lot of Alteryx experience, where the equivalent concept is a standard macro and common steps made up of a number of tools can be collapsed down into a single tool in the workflow, I was a bit disappointed to see that Prep instead inserts all of the steps that make up the saved flow, so it doesn't save any screen real estate.

 - Incremental refresh

Data Modelling
I've already covered my thoughts on data modelling above, and I would encourage you to re-watch the keynote for this section, as it is much easier to see a visual representation of the Data Model than to read the functionality written out here.

 - Relationships
 - Automatic LOD
 - Automatic Join
 - Multi-Fact Schema Support

Server administration
 - Custom welcome message
 - Custom login message
 - Login based licences
The change here means that when you change the persona of a user on Server, they will be allocated the corresponding licence; so if you change someone to Creator they will be given a Creator licence, allowing them to use Desktop to build their own dashboards. However, what wasn't clear from the keynote was how this works in practice with licence agreements. For example, when converting someone to a Creator, does it allocate a licence from an asset pool managed on Server, and if there are no available Creator licences, will it go off and buy one? Equally, if a user moves from Creator to Viewer, does their key get given back to be reallocated to another user? If anyone has the answer to this, let me know and I'll update the blog.

 - Resource monitoring tool
 - Content migration tool

Analytics
 - Set control 
With the introduction of Set Actions last year, users could use interactivity on the dashboard to update the members of a set, very useful functionality which Bethany Lyons covered in great detail at last year's conference. But what if you don't want another sheet in your dashboard just to change the members of the set, and would rather have an interface that looks like a multi-select filter? That is what Set Control enables: a simpler interface, although I still love the interactive option.

 - Buffer calculations
I think this has to be my favourite new feature, and I ended up chatting a lot to the developers in the Data Village about additional functionality it would be good to see around buffers. Following the introduction of the spatial calculations in 2019.1(?) with MakePoint() and Distance(), Buffer() now enables you to create a trade area around a point and use an intersect join to spatially match two datasets (a common example being customers and stores) based on the radius of the buffer. Coming from Alteryx, this is something I've used a lot with the Trade Area and Spatial Match tools, but doing it in Tableau makes it easy to change the radius of the buffer and see what that does to the number of customers in overlapping trade areas (there's a rough code sketch of this pattern at the end of this list).

Source: https://twitter.com/hyounpark/status/1195038757054906368?s=21

 - Dynamic parameters
I think this was the feature that got the biggest cheer from the crowd, and it has been one of the most requested features on the community. So now it's finally arrived, although they didn't say which version it will land in. For those that don't know, at the moment when you create a list parameter you can populate the list from a field in the data source; however, that list then becomes static and doesn't get updated when there is a new extract of data (imagine a list of dates). With dynamic parameters the list of values in the parameter control is no longer static, which will save having to recreate the parameter list and republish the workbook whenever there are new values in the list.
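As a rough illustration of that buffer-and-match pattern (outside of Tableau, while we wait for the feature), here's a minimal Python/GeoPandas sketch. The file names, the store_id column and the 5 km radius are all placeholders, and it assumes a recent GeoPandas where sjoin takes a predicate argument.

import geopandas as gpd

# Work in a metre-based projection so the buffer radius is in metres
stores = gpd.read_file("stores.geojson").to_crs(epsg=3857)
customers = gpd.read_file("customers.geojson").to_crs(epsg=3857)

# 5 km trade area around each store
stores["geometry"] = stores.geometry.buffer(5_000)

# Intersect-style join: which customers fall inside which trade area
matches = gpd.sjoin(customers, stores, how="inner", predicate="within")
print(matches.groupby("store_id").size())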

Hands on Training

One of the key reasons I managed to get my employer to support my attendance at conference was the ability to hand-pick training suited to my interests and skill set, which wouldn't be possible with an off-the-shelf training course. So the majority of the sessions I chose were hands-on training sessions, where you have a laptop in the room and follow along with an instructor through some advanced/Jedi-level material. I picked sessions on using Python inside Tableau, the Metadata API, Jedi calculations and advanced table calculations.

The way these sessions are set up, you have access to a virtual environment with all of the required content already available (such as Anaconda pre-installed for the Python session) and a set of instructions to follow along with. You also get to take the data and instructions away with you, so post-conference you can reflect on what you did and practise, instead of leaving behind a saved workbook with all your advanced calculations and only memories of how you did it.

Another benefit of these sessions is that you can do things outside your current knowledge and gain an understanding of how they can be applied. Before the conference I knew nothing about the APIs available or what you would use them for, but having attended a hands-on session for the Metadata API, where I used the GraphiQL interface to make GraphQL queries, I now have an appreciation of what is possible and will be looking out for how this can be applied at work.
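To give a flavour of what those queries look like, here's a minimal sketch of posting a GraphQL query to a Tableau Server's Metadata API endpoint using Python's requests library. The server URL is a placeholder, the token is assumed to come from a REST API sign-in, and the exact fields available depend on the Metadata API schema, so treat this as illustrative rather than a tested recipe.

import requests

SERVER = "https://my-tableau-server"        # placeholder server URL
TOKEN = "<token from a REST API sign-in>"   # used in the X-Tableau-Auth header

# Ask for every workbook and the data sources feeding it
query = """
{
  workbooks {
    name
    upstreamDatasources { name }
  }
}
"""

response = requests.post(
    f"{SERVER}/api/metadata/graphql",
    json={"query": query},
    headers={"X-Tableau-Auth": TOKEN},
)
print(response.json())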


Certification

I’m not going to provide an overview of the DCP exam as there are already a lot of blog posts covering this, such as this one by Mark Edwards.

Another selling point for my business case for going to Conference was that I could take my Desktop Certified Professional exam there (at a $100 discount!). As the DCP exam is three hours, I preferred to do it at conference rather than at home: there's no set-up time with the online proctor, and for me it feels like a more formal exam experience than sitting on my sofa at home (which is what I had to do for my Associate exam due to WiFi issues during set-up).

One consideration if you decide to take your exams at conference is that you sit the exam in a room with around 200 people doing a variety of exams, so there is a constant flow of people in and out as they finish before you. I didn't find this too distracting, as I took the exam at the earliest time available; it seems other people also just wanted to get the exam out of the way, so the room filled up at 9am, which minimised the disruption.

I am now on the 3 week wait for results...

Other sessions

Hackathon Demos

After my DCP exam I attended the demos, where just over 20 people presented what they had built during the six hours of hacking earlier that day.

I found this a really fun session, seeing the variety of hacks the teams produced.

If I get to go to conference again next year I would love to take part, as one of my post-conference promises to myself is to dig into the available APIs and start learning about them, and the hackathon would be a good opportunity to put that to the test.


Explain Data

One of the tips I received before going to Conference last year was "Make sure you go to one of Bethany Lyons' sessions", and I want to pass that tip on to everyone reading this blog, as her session on set actions was the best session I attended last year. So this year, as soon as the app with the conference sessions came out, I made sure I tracked down what Bethany was covering and attended one of those sessions.

I haven't used Explain Data yet, but I saw what it could do from following attendees' tweets from TC Europe, so I was keen to get a demonstration. In this session Bethany took us through how it can be used and, just as importantly, what Explain Data is not. Like the Data Model tools, it is there to help guide you: Explain Data can help guide your data exploration and help you understand trends and outliers in your data. However, she made very clear that just because a data point is an outlier it shouldn't simply be excluded; it might be that the data needs normalising, after which the point isn't an outlier at all. Also, just because Explain Data does some statistics in the background doesn't mean you should take the results at face value, as correlation doesn't mean causation. So I see a feature that will be helpful but should be used with caution, and you should make sure you can at least recall your high-school stats course when using it.


Beyond Design - Secret for Creating Engaging and Effective Information Experiences 

This session was led by two Zen Masters, Michael Cisneros and Liliach Manheim, so they can fairly be seen as masters in their field.
It was a very engaging presentation which looked to answer the community question of 'how do I make effective visualisations?' by walking through a number of visualisations, many of which were familiar to me as they are shared on Tableau Public, though it also included examples not built in Tableau.

The session drew out the design elements you need to consider when building effective dashboards. I really enjoy these sessions, as they reinforce the elements of visualisation design you may have self-taught through reading books and blog posts, but delivered by experts in their field.

Two minute tips and tricks 

This was always going to be a hard slot to fill, being the afternoon session at the end of conference and after Data Night Out! The session was led by Jeffrey Shaffer (co-author of The Big Book of Dashboards) and Luke Stanke (co-contributor to Workout Wednesday), both Zen Masters, so it was sure to be full of tips and tricks.

Unfortunately I didn't get any photos from this session, as it required focus to keep up with how quickly they went through the tips, so it will be a workbook I'll be sure to download from Tableau Public next week (and hopefully there will be a recording to watch again).

Tips included building a top 10 and bottom 10 view on one sheet (which makes comparison much easier without needing to change the axis when putting it on a dashboard), a KPI dashboard, building out a calendar using the INDEX() function to fill in missing dates, and date calculations to get the last whole month and then relative measures such as month-over-month and year-over-year change.

Labs feedback session

The way the schedule works out at conference means you can end up with some good gaps in your schedule to explore the Data Village. There's so much going on there: it's the place to get food, meet vendors, catch breakout sessions and just have a bit of a chill out. But for me the highlight was the dev area, where you can play with the new features, speak to the teams behind them and take part in interactive feedback sessions, like the one I joined exploring the potential of a chatbot for exploring data. I was keen to take part in this session to see where the devs thought the product could go. In my role it's important to know your audience, and when looking at financial information different stakeholders want to consume it in very different ways, so a chatbot could have a use case in this example. Plus, one of the learnings from the effective dashboards session is that people understand text, so not everything needs to be presented as a chart.

The session was based on the Titanic passenger dataset, so you could ask questions via a Slack interface such as "How many passengers were on board?", "What was the distribution of passenger ages?", and "Did age increase your chance of survival?". The results I got from my questions were mixed; for example, I received a chart when I was expecting just a number, or a bar chart when expecting a scatterplot.

The session was a 'Wizard of Oz' experiment, a common method in the Lean methodology. This meant that Tableau hadn't actually built a platform that could do the NLP, query the data and present the results back; what they actually had was someone behind a screen using Tableau Desktop to surface the results and share them back on screen. The technique is very common: think of a betting company testing whether a card account service would be beneficial for its premier accounts, but with minimal set-up and outlay just providing a pre-paid card (think Revolut) whose balance a team adjusts manually (with a slight delay) to test the market for the product. So it will be interesting to see if (and how) Tableau develops this further, as there are lots of considerations to factor in, such as not knowing the field names and being able to encode what the natural-language query actually represents; something you can see developing anyway in the Ask Data product.

Summary

So that's a wrap on my conference, which was a very valuable experience. The next US conference is going to be back in Las Vegas next year (I was hoping for somewhere closer to the East Coast, just so the time difference wasn't as big and the jet lag would be more manageable). But I'm excited to get back to my team and share my learnings with them.


Tuesday, October 29, 2019

Finding unique values in Alteryx

This is an extension of an Alteryx Weekly Challenge (Challenge #8). The weekly challenge asked the user to find combinations of purchased items that form a meal deal (i.e. people who purchased a pizza, a side and a drink); here we are going to use a similar technique to find unique values, across both string-based values (as in the weekly challenge) and spatial data.

I am going to approach this in a few different ways:
1) Using a string field from transactional data
2) Spatially matching customers to stores
    a) using a similar approach to 1 with spatial match
    b) using the spatial process tool

A potential business application of this is market basket analysis. For example, if you know that most of your customers buy both a table and a chair from you, then you can identify those customers who only bought the chair and try to understand why they didn't also buy the table from you.

Approach 1 - string values in transactional data

If we use a dataset that we are familiar with, such as the Tableau Superstore dataset, we have a record for every item purchased by a customer, broken down at Category, Sub-Category, Manufacturer and Product level.

If we take this data into Alteryx, then we can use the Summarize Tool to group by Customer ID and Sub Category.



This results in a list where one record represents a customer and sub-category combination. So if a customer has, for example, purchased many individual products in, say, the Tables sub-category, this summarises those individual transactions down into one record for that customer and Tables.



Once we've got this summarised list, we then turn it into a concatenated string of the sub-categories; for example, if a customer purchased some 'Binders' and 'Tables' the resulting value will be 'Binders,Tables'*. At this stage I also add in a count of the sub-categories, as finding the unique values at the end can either be done by filtering to values where the count = 1 or by the approach I outline below.



* The options allow you to specify the delimiter.

Once we've got this list, it's just a case of using the Filter tool to find customers who only purchased from one sub-category. For example, if I want to find customers who only bought a 'Binder' then I can filter my concatenated field on:
  • Contains([Concat_Sub_Category],'Binder') - which will return all of my customers who bought a binder; AND
  • !Contains([Concat_Sub_Category],',Binder') AND !Contains([Concat_Sub_Category],'Binder,') - which will remove the records where they also bought something else (N.B. as the concatenate option is set to use a comma as the delimiter, requiring that 'Binder' is neither preceded nor followed by a comma removes records that also include items from other sub-categories)
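For anyone who thinks more readily in code than in tools, here's roughly the same logic as a pandas sketch; the file and column names mirror Superstore but are assumptions. Because the whole basket ends up in one string, the 'only Binders' test can simply be an equality check rather than the Contains combination above.

import pandas as pd

orders = pd.read_csv("superstore.csv")   # assumed columns: Customer ID, Sub-Category

# One record per customer, with their distinct sub-categories concatenated
baskets = (
    orders.groupby("Customer ID")["Sub-Category"]
          .agg(lambda s: ",".join(sorted(s.unique())))
          .rename("Concat_Sub_Category")
          .reset_index()
)

# Customers whose entire basket is just Binders
binder_only = baskets[baskets["Concat_Sub_Category"] == "Binders"]
print(binder_only.head())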

Approach 2a - Unique spatial values (using spatial match)

Taking this example to spatial data, imagine you've got a set of stores and you want to know whether you could close one of them to save costs. You would want to make sure there was an alternative store for the affected customers to purchase your product from.

This is where finding unique spatial values comes into play. Within Alteryx there is some pet store data embedded in the sample workflows, so I am going to use that.

Using the customer data (spatial points) and the store trade areas (spatial polygons), the Spatial Match tool can be used to identify when a customer (target) sits within a trade area (universe). If a customer location is within multiple trade areas you get a record for each match.

Using the same technique as set out for the string-based data, it's possible to group by customer and store ID to find all the matches, and then use a filter to extract the records of interest. To expand on the example, you could find all the customers who have only one nearby store and close the store which affects the minimum number of those customers. Or you could find the customers who are within the catchment areas of both Store A and Store B.
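The same grouping idea translates to code as well. Here's a minimal GeoPandas sketch (file and column names are assumptions) that spatially matches customers to trade areas and then picks out the customers with only one nearby store.

import geopandas as gpd

customers = gpd.read_file("customers.shp")      # point geometries
trade_areas = gpd.read_file("trade_areas.shp")  # one polygon per store

# One row per customer/store match (customers in several trade areas repeat)
matches = gpd.sjoin(customers, trade_areas, predicate="within")

# Customers with exactly one nearby store
store_counts = matches.groupby("customer_id")["store_id"].nunique()
single_store_customers = store_counts[store_counts == 1]
print(single_store_customers.head())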



Approach 2b - Unique spatial values (using spatial process)

An alternative approach to the above is to use the Spatial Process tool to cut out the intersection (to find customers who have more than one nearby store) or to cut out the non-intersecting part. This, in combination with the spatial match process outlined above, enables you to understand which stores have customers that are unique to them.





If you want to learn more about spatial processing in Alteryx, then I highly recommend using the Interactive lessons such as this one.

Monday, October 14, 2019

We "R" going on a summer holiday!

As the autumn starts to set in and the summer seems like a distant memory, this is a quick blog on how I used Alteryx to plan my family trip around Europe.

For our family summer holiday we managed a tour of France, Switzerland, Italy, Austria and Germany, staying at Eurocamp sites. However, there were some challenges in planning the trip:

  • We needed to travel in September - a lot of Eurocamp sites close at the end of the season, so we needed to make sure we didn't get stranded
  • With a young family, we couldn't travel more than 3 hours on any one day (we needed to extend this constraint to 5 hours on some days to make the trip work)

Getting the location and opening dates of the campsites

The Eurocamp website is very comprehensive, but we needed a quick way of planning potential locations without browsing lots of pages. So the initial idea was to scrape the website, but this required knowing the site codes which form part of the URL. 

This is where R (and the pun in the blog's title) comes in.

Alongside the website, there is a PDF brochure which gives detailed information for each park.


As you can see from this example from the brochure, it gives the park code, which is used in the URL, and the opening dates. By default it's not possible to read a PDF file into Alteryx, but with the R (and Python) tools you can extend Alteryx's functionality, so I was able to use the R package pdftools to read each line of the PDF document in as a record.


Once the text was read into Alteryx, it was possible to use the RegEx tool to parse out the site name, code (see image) and opening dates.


Once I had the park codes, it was possible to scrape each site's page (the website contained additional information which wasn't available in the brochure) to get the longitude and latitude of each site.



Once I had the longitude and latitude of each site, Alteryx made it easy to see where you could get to from there. Using the Create Points tool, the long/lat is converted into a spatial point, and from that the Trade Area tool enables a drive-time catchment area to be calculated (it was important to use drive time rather than straight-line distance, as the trip was going to the Alps and alpine passes make a big difference).

Not being so analytical about the trip

At this stage I explored the spatial tools in Alteryx a bit more, to work out combinations of routes as part of an iterative macro. However, I'd already got a lot of information quickly from the analysis, so being a bit more 'manual' about the route selection made sense: we could be more flexible about our destinations, and we found we actually needed to relax our travel-time constraint to make sure we didn't get stuck.

So the spatial data from Alteryx was output as a Tableau Hyper extract, which meant we could see on a map where all the parks were and quickly understand (by filtering to the end of September) where we needed to get to before the majority of the campsites closed!


So that is how a data analyst plans their holiday!

Tuesday, September 17, 2019

Two methods for date scaffolding in Alteryx

What is scaffolding?

There are a variety of blogs online which talk about scaffolding and why it's important, such as this post by The Information Lab. But essentially scaffolding is used to fill in dates that are missing from your data. For example, you might be looking at staff turnover where you know each person's start date and, if they are a leaver, their end date, but you also want the dates in between.

So how can you do this in Alteryx?

There are two methods in Alteryx (and potentially other ways as well, as there are many routes to solving a problem in Alteryx) which I will walk through. The first is the way I've always tended to do it, and the second is a method I recently discovered, which has some additional benefits.

For the examples the data is structured in the following format:




First method -  Generate Rows



In this first example I use a Cross Tab tool to get one record for my data, with two fields: one for the start date and a second for the end date. Once I've shaped my data, I use the Generate Rows tool. This enables me, using a DateTimeAdd formula, to create a new record for each date between my start and end date. The set-up is shown in the configuration pane:

Within the config pane, we set up:
1) That we want to create a new field called 'Dates' and that the field type is Date
2) The initialisation expression, which is the start date, i.e. we want to start from this date
3) The condition, which is that the new 'Dates' we create are less than or equal to (<=) the end date
4) The loop expression, which adds 1 day to 'Dates' and creates a new record each time, for as long as the condition in (3) holds. Here we can specify anything from 'seconds' (n.b. the field type would need to be DateTime) through to 'years'.

Then once we run the workflow we will have a record for each day between our start and end dates.
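For comparison, the equivalent scaffolding in pandas is a date_range per row followed by an explode; this is a rough sketch assuming columns called Name, Start and End.

import pandas as pd

staff = pd.DataFrame({
    "Name": ["Alice", "Bob"],
    "Start": pd.to_datetime(["2019-01-01", "2019-03-15"]),
    "End": pd.to_datetime(["2019-01-05", "2019-03-18"]),
})

# One row per person per day between their start and end dates (inclusive)
staff["Dates"] = staff.apply(lambda r: pd.date_range(r["Start"], r["End"], freq="D"), axis=1)
scaffolded = staff.explode("Dates")[["Name", "Dates"]]
print(scaffolded)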

Second Method - TS Filler

The second method uses a macro built in to Alteryx called TS Fill*.

* Despite the R icon in the bottom left suggesting the tool is built using R code, there is actually no R in it: it mostly uses the Generate Rows method outlined above, with interface tools to extend the functionality.

With this tool you don't need to cross-tab your data to get fields for the start and end date; instead it 'fills in' the missing dates. So using the example data, it will treat the first record as the start date and the second record as the end date.

In the config pane you can see that you now have interface tools to select how you want the TS Fill tool to behave.


To set the tool up:
1) Select the Date / DateTime field that you want to create the scaffolding for
2) Choose what your interval is (N.B. as mentioned in the note above, this actually uses an interface tool to configure the Generate Rows tool behind the scenes, so if you select 'Years' it will update the formula in step 4 of the Generate Rows method to be DateTimeAdd([Start],1,'years'))
3) Set the increment, so you can deal with, say, fortnightly data by setting the interval to weeks and the increment to 2.

And there you have it, running the workflow will create each day between your start and end date.

Additional functionality of TS Fill

Apart from providing a user interface over what is essentially the Generate Rows tool underneath (which some may find easier if the DateTimeAdd function feels a bit fiddly), the TS Fill tool is also useful for 'filling in' any missing dates. If your data extract comes from, say, a sales system, you may only have records for days where sales occurred. If you want a complete set of dates, TS Fill can add in the missing days; you can then leave those dates with null values or, depending on your needs, do additional analysis to impute (say using an average) what those values should be.
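The pandas equivalent of that 'filling in' behaviour is a reindex against a complete run of dates; a small sketch with made-up sales data:

import pandas as pd

sales = pd.DataFrame(
    {"Sales": [100, 250, 80]},
    index=pd.to_datetime(["2019-09-01", "2019-09-03", "2019-09-06"]),
)

# Build the full run of days and reindex; missing days become NaN,
# which you can leave as nulls or impute (here: fill with 0)
all_days = pd.date_range(sales.index.min(), sales.index.max(), freq="D")
complete = sales.reindex(all_days)
complete["Sales"] = complete["Sales"].fillna(0)
print(complete)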

So if like me you previously did scaffolding using generate rows, hopefully you can now add another tool to your workflows which is a bit more flexible.




Sunday, September 1, 2019

A lesson in collectable cards and iterative macros (vs. alternative method)

I remember collecting Panini stickers in the school playground every time there was a World Cup/European Championships. There would be so much pleasure in the shiny sticker or getting that elusive rare sticker that everyone wanted, and the key part of it was swapping them with your friends.



Fast forward 23 years and my daughter is now collecting the Disney-themed cards from Sainsbury's*, and while she gets excited about getting an Elsa or Ariel card, feeling nostalgic I have also bought the album to put the cards in.

* You get 1 pack of 4 cards per £10 spend.


Each pack contains 4 cards and there are 144 to collect. So 23 years later, and with an inquisitive mind, I wanted to know how many trips to Sainsbury's I might need to ensure my daughter (me) is satisfied with a complete set!

With a bit of online searching I found this post walking through how to solve the problem in code, so this is where I turned to Alteryx.

Before we start with building the model, some assumptions:
  • There are 4 cards per pack
  • There are 144 cards to collect
  • There is a uniform distribution of cards
  • No duplicate cards in a packet

Building a model for the cards in a packet

To start, I built an iterative macro to assign 4 random cards to a packet. I chose an iterative macro for this step because I'm assuming no duplicate cards exist within a packet, so the macro checks whether there are 4 unique cards in the packet and, if not, repeats until that condition is met.

To get a uniform distribution for the cards I use the RandInt([Cards]) function which will generate a random integer from 1 - 144 ([Cards]).

Building a macro to fill the book

Now that we've got a macro which will randomly assign cards to a packet, we need to 'open' the packets to fill the book.

Here I also use an iterative macro, with a join to match the cards generated by the first macro against the required cards in the book. When there is a match, the count for that card increases by one. There is then a check to ensure that there is at least one of each card, and if that test fails another iteration is initiated, which 'opens' another packet of cards.

When the condition is met, the macro completes and outputs the number of iterations (i.e. packets) that were needed.
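If it helps to see the logic outside of macros, here's a minimal Python sketch of the same pack-by-pack process under the post's assumptions (144 cards, 4 distinct cards per packet, uniform distribution):

import random

N_CARDS, CARDS_PER_PACK = 144, 4

def open_packet() -> set[int]:
    # 4 distinct cards per packet, drawn uniformly from 1-144
    return set(random.sample(range(1, N_CARDS + 1), CARDS_PER_PACK))

def packets_to_fill_book() -> int:
    collected, packets = set(), 0
    while len(collected) < N_CARDS:   # keep opening packets until the book is full
        collected |= open_packet()
        packets += 1
    return packets

print(packets_to_fill_book())   # a different answer every run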

But the 'answer' changes every time this is run

Running the macro above gives an answer, but running it again gives another answer, which may be close to the first run or may be further away. So this still leaves me unsure of how many packets of cards I need to complete my set.

If we think about this in probability terms: if I am very lucky I will get no duplicate cards and will need only 36 packets, but given the random allocation of the cards this outcome is very unlikely. In the worst case I will get loads and loads of duplicates and never finish my book. These two examples are extremes, so to estimate the expected number of packets I need to turn to simulation: run many, many scenarios and then look at the average and the distribution, which lets me better estimate what's required.
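Before simulating, there is also a quick sanity check available: this is the classic coupon collector problem, so the expected number of cards can be computed directly. The calculation below ignores the no-duplicates-within-a-packet rule, so it slightly overstates the answer, but it lands right around 200 packets.

# Coupon collector expectation: n * (1/1 + 1/2 + ... + 1/n) cards
n_cards, cards_per_pack = 144, 4
expected_cards = n_cards * sum(1 / k for k in range(1, n_cards + 1))
print(round(expected_cards))                    # roughly 799 cards
print(round(expected_cards / cards_per_pack))   # roughly 200 packets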

To do this in Alteryx I need to turn to a batch macro, as I want to run the macro above many times.

What this essentially does is run the 'Find Cards' macro, output the result, and then repeat for the number of simulations I have specified. The batch macro is then set up in my master workflow as shown below.


So all of that to end up with a very simple workflow, but all of the magic is hidden behind that blue dot!

I get my answer... but we have a problem

So as shown in the example above I decided to run 100 simulations of the model, which worked and I was able to get an answer, but it took a long time!


Running 100 simulations took just under 20 minutes to complete. While this gives me an answer for how many packets I would need to complete a book, an average of 189 packets (!!), there were scenarios which required upwards of 300 packets and some as low as 142 packets. 



The problem is that 100 scenarios show a lot of variation, and I would need to run many more to home in on an answer: how likely is it that I would need only c.200 packets, or am I more likely to need 300+? Scaling up the original solution would take many hours.

So how can this be improved

I admit that approaching this problem with nested iterative macros inside a batch macro seemed sensible from the outset, and I was happy with the speed at which a single iteration solved, but the simulation stage became the limiting factor.

In December 2018, along with some fellow Alteryx users I participated in Advent of Code, where we tried to solve some complex challenges with only using Alteryx. One of the things myself and the others discovered doing these challenges is that an iterative macro isn't always the quickest way to solve a problem, so with that in mind I sought out an alternative approach.

To simplify things for the alternative approach I had in mind, I removed the assumption that each card in a packet is unique.

Alternative approach - creating a large set of cards

First off, using the Generate Rows tool I created 1,000 numbered packets of cards and assigned each card a value with the RandInt() function. This results in 4,000 randomly allocated cards, which should be more than enough to ensure that at least one of each of the 144 cards is represented*.

* From the steps above a full set of cards should on average be achieved with c.200 packets, so a 5x increase should satisfy that (although there is a very very small chance that a complete set wouldn't be present in 1,000 packets!!!).

Then, using a Summarise tool, for each of the 144 cards I find the first packet that card appears in (across 4,000 cards it's likely each card will appear multiple times). By joining the results of this back to the 144 cards, with another Summarise tool, you can then find the highest-numbered packet required to complete the set.
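Here's a rough NumPy sketch of the same 'open everything up front, then summarise' idea, again under the simplified assumption of no uniqueness within a pack:

import numpy as np

rng = np.random.default_rng()
N_CARDS, CARDS_PER_PACK, N_PACKS = 144, 4, 1000

def packets_needed() -> int:
    # Open all 1,000 packs of 4 random cards in one go
    draws = rng.integers(1, N_CARDS + 1, size=(N_PACKS, CARDS_PER_PACK))
    # For each card, the first pack it appears in; the answer is the latest
    # of those first appearances. (argmax on a boolean array returns the
    # index of the first True; a card missing from all 1,000 packs would
    # need handling, but that's vanishingly unlikely.)
    firsts = [(draws == card).any(axis=1).argmax() + 1
              for card in range(1, N_CARDS + 1)]
    return max(firsts)

results = [packets_needed() for _ in range(10_000)]
print(np.mean(results), np.percentile(results, 75))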




This approach is set up as a batch macro, which allows me to simulate the outcome many times over, in the workflow below.

Notice how this is much simpler than the nested iterative macros...

So what is the performance of this alternative approach?


FAST!

So on a like-for-like comparison of running the simulation 100 times, the workflow finished in under 3 seconds, which is around a 400x improvement in speed! Given it is so quick, I can now achieve my final requirement of being able to run many, many simulations. Here I ran 100,000 simulations and it completed in half the time the nested iterative macro took for just 100.


With 100,000 iterations you get a nice distribution of outcomes and I can better answer the problem.


So the average is still around the 200 mark, but I can confidently estimate that 75% of the time I'll need 225 or fewer packets to fill the book.

So I know I should expect to spend around £2,250 on shopping to finish the book, but what are the takeaways for using Alteryx better?

The Alteryx Engine is fast and Alteryx is really good at processing records line by line. This second approach maximises that: it essentially creates all the information it needs (i.e. opens 1,000 packs of cards at the same time) and then summarises it and funnels it down.

Compare that to the iterative macro approach, which ends up solving the problem much like you would in real life (opening one packet at a time). That can make the process time-consuming, as we've seen in the results: sometimes you need to open 100 packets and other times you need to open 500.

As I also discovered when participating in Advent of Code, the most intuitive approach isn't always the best, either in runtime or in your ability to identify issues. However, building the iterative approach first meant I could fine-tune the logic and understand the parameters required, so overall development time only increased marginally (the time spent on the iterative approach meant I could build the alternative quickly).


Thursday, July 18, 2019

Solving problems across different data preparation tools

Lots of different tools can do the job, and maybe you can learn something new

Lots of people are wedded to certain tools for their data work, whether because of the resources they have at work or because they stick to what they know. I get to use lots of different tools each day and wanted to demonstrate how each of them performs on data preparation challenges, with no bias towards any particular tool.

So this is the output of that work: a whole 45+ minutes of me solving two data preparation problems across three different tools (Alteryx, Power Query and Tableau Prep). I've done this live, so there are a few mistakes here and there (please be kind about those).



Either watch the video all the way through, or skip to the sections covering a tool you've not used before (and hopefully learn something new).

What I found useful about going through this process is an appreciation that no tool is perfect. Something that takes a few steps in one tool can be completed in one step in another, or one tool has a better user interface, descriptions or guidance notes, which means that, depending on the person, a transformation might just click better in another tool.

There's also something here around data literacy and helping people understand what they are doing in the ETL process. You can see in the video that there's a step where you want to unpivot your data, and the terminology varies across the tools: cross-tab, unpivot, and pivot (rows to columns). That again demonstrates that people will understand it in different ways, but if you know what you want to achieve (when I deliver training I call it making your data wide and short, or long and thin), then you should be able to move between different tools fairly easily (and just Google the bits you don't know!).

I need to give a shout out to the resources used in the videos: Alteryx Weekly Challenges and Preppin' Data. As I said in the video, don't feel that these resources are constrained to the respective vendor's tool; they are a great way to learn new things or try a different tool.

Oh, and another bit I forgot to mention in the video: the vendors of each of these products are releasing updates on a monthly/quarterly basis, so the functionality and features are constantly evolving.

Wednesday, July 10, 2019

First attempt at vlogging - Alteryx weekly challenge

To save time, and as an extension of my endeavour to complete all of the Alteryx weekly challenges, I've decided to vlog my attempts at them. I hope there's some interest in them; to make a start I've gone back to the first challenge posted and done a walkthrough of it. Let me know what you think.


Taking hints from gamers posting on Twitch, I've invested in a video capture device to record the Alteryx content and I'm using OBS to do the recording. My first attempt resulted in files spread across many devices that then needed compiling into a finished product; this way it's all in one place and ready to go straight after I hit 'end recording'.


Friday, July 5, 2019

My first #SportsVizSunday submission

Ok so I know it's not Sunday... But I've just posted my first #SportsVizSunday submission.

I was really keen to do this one as I love cycling and the TdF. The dataset for the monthly challenge is available from Data.World. I wanted to focus on the British riders in the Tour, and because the original dataset didn't include each rider's nationality, I needed to do some research; luckily there was this website listing all of the British riders.

So here's my visualisation:
 Interactive version

Click on the image to view the interactive version.

There was one feature which I played around with for a while and finally got to work using set actions. I wanted the circles for the British riders to stand out by size and colour, but when you select a specific rider, the other riders should blend in by size (as demonstrated in the video below):

To achieve this I created two sets containing 1) a list of all the British riders and 2) the selected rider. By creating a combined set from the overlap between these two sets, I could create a set action which, when a single rider is selected, shows just the rider in set 2, and when no riders are selected, shows all the British riders in set 1. This combined set could then be used on size (values in/out of the set are assigned different-sized circles), while maintaining the colours, which were defined by a calculation:


Thursday, July 4, 2019

What I've learnt completing 100 Alteryx Weekly Challenges

So I have finally earned my Community badge for completing 100 weekly challenges.

Community Badges

Having hit this milestone, I thought it would be worth analysing what's involved and highlighting some hints and tips I've picked up from participating in the weekly challenges, as I find they are a great learning resource covering areas of the software I wouldn't ordinarily use in my day job!

Weekly Challenges are not that time consuming!

So firstly I want to highlight that the Weekly Challenges are not too time-consuming, and most solutions need only a few tools to complete. Looking at my solutions, they range from the simplest workflow containing just one Formula tool (excluding the text inputs, comment boxes and browse tools) to the most complex requiring 64 tools (Challenge 165), with an average of 12 tools. I don't know my total time spent on the challenges, but I did my first challenge in August 2018, so it's taken me 307 days to complete 100 challenges, with a few intensive stints and some periods of not many completions.

There are several challenges which I've completed in under 2 minutes, and along with a couple of colleagues at work we do the challenges Grand Prix style to add an element of competition.

It allows you to discover tools you may have never used before

So if I take a look at the tool categories I have used across these 100 challenges*, I can see that the majority fall under categories like Preparation, Join, Transform and Parse, but I've also used things like the Calgary tools, which I've never used outside of the Weekly Challenges.

* At the time of writing I've actually completed 102 challenges, so the results reflect this.

Now some tips and tricks

I've been back over some of the challenges to identify techniques which might be considered intermediate level and which help with transformations. What's great is that once you've used these techniques and understand how they work, you'll end up going back to them again and again.

Text to rows

Coming from Excel, many Alteryx users will be familiar with text-to-columns, where you split a string into new columns based on a delimiter (common where the input is a comma-delimited file). However, one useful option in Alteryx is split to rows. I used this in Challenge 139, where a benefit of splitting to rows is that the data stays in one field, so the numbers can be parsed out in the next step.

I've also talked in another blog post about how splitting to rows can make a workflow more dynamic, as it doesn't require knowing how many columns you need to split the field into.
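The pandas counterpart of text to rows is str.split followed by explode; a tiny sketch with made-up data:

import pandas as pd

df = pd.DataFrame({"id": [1, 2], "values": ["10,20,30", "40,50"]})

# Split the delimited string into a list, then explode each element onto its own row
rows = df.assign(values=df["values"].str.split(",")).explode("values")
print(rows)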

Regex tokenise

Similar to splitting to rows on a delimiter is using the RegEx tool when there might not be a delimiter. In Challenge 59 I needed to check whether the words in two fields were anagrams of each other. To solve this, I split every character of each record onto a new row, sorted the letters, concatenated them back together, and if the two results match then it's an anagram.
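The same sort-and-compare idea is a one-liner in Python, which might make the logic behind the workflow clearer:

def is_anagram(a: str, b: str) -> bool:
    # Sorting the letters is the code equivalent of splitting each character
    # to a row, sorting, and concatenating back together
    return sorted(a.lower()) == sorted(b.lower())

print(is_anagram("listen", "silent"))    # True
print(is_anagram("alteryx", "tableau"))  # False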




Modulo

The modulo function is used in Tableau when building small-multiple charts, and I used it a lot when doing Advent of Code. What modulo gives you is the remainder of a division: for example, 9 mod 4 is 1, because 9/4 = 2 remainder 1. In Alteryx this is done in the Formula tool and the syntax is mod([Field], divisor). I used modulo in Challenge 21 to restart the numbering of the records every 12th record, as each group of 12 represented the months of the years 2007-2008. So here I had records 1-24 and I wanted the numbering to go 1-12 for 2007 and then repeat 1-12 for 2008. The Formula tool was used to update the RecordID field with the following function:

mod([RecordID]-1,12)+1

This then allowed me to join to a Text Input containing the numbers 1-12 to get the months January-December.
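As a quick check of what that formula does, here's the same calculation run over the 24 records in Python:

# mod([RecordID]-1, 12) + 1 restarts the numbering every 12 records
for record_id in range(1, 25):
    month_number = (record_id - 1) % 12 + 1
    print(record_id, month_number)   # 1->1 ... 12->12, then 13->1 ... 24->12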



Reverse String

This is a string function which reverses the order of every character in a string. I found it really useful in Challenge 40, which involved parsing web HTML data. Again, coming from Excel I'm familiar with using the Left, Right and Mid (Substring in Alteryx) functions to return a certain number of characters, but what I wanted was to find every character after the last '>', and the position of that varied by record. An example of the full string is as follows:
From this I needed a way of identifying the position of the '>' before Brett Abernathy so that I could get the doctor's name and discard the rest of the string. The resulting formula was as follows:

 
substring([Doctor Name],IIF(FindString([Doctor Name], ">") > -1, Length([Doctor Name]) - FindString(ReverseString([Doctor Name]), ReverseString(">")), -1),length([Doctor Name]))

So this uses FindString to find the position of the '>', but as FindString works left to right, you reverse the string so that the last '>' becomes the first.
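In Python the same result drops out of rfind(), which searches from the right and is effectively the reverse-the-string trick. The example value below is made up, since the original string isn't shown here, but the name matches the one mentioned above.

doctor = "<td class='doctor'>Brett Abernathy"   # illustrative value only

# rfind() searches from the right, so this grabs everything after the last '>'
name = doctor[doctor.rfind(">") + 1:]
print(name)   # Brett Abernathy

# Or, mirroring the Alteryx formula more literally:
pos_from_end = doctor[::-1].find(">")
print(doctor[len(doctor) - pos_from_end:])   # Brett Abernathy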

So to round this all up

I hope you've found these techniques useful and will be able to apply them in your day-to-day Alteryx work and in the weekly challenges, and hopefully this encourages you to attempt more weekly challenges and discover some new techniques of your own.