Sunday, September 1, 2019

A lesson in collectable cards and iterative macros (vs. alternative method)

I remember collecting Panini stickers in the school playground every time there was a World Cup/European Championships. There would be so much pleasure in the shiny sticker or getting that elusive rare sticker that everyone wanted, and the key part of it was swapping them with your friends.

Fast forward 23 years and my daughter is now collecting the Disney themed cards from Sainsbury's* and while she gets excited about getting an Elsa or Ariel card, feeling nostalgic I have also bought the album to put the cards in.

* You get 1 pack of 4 cards per £10 spend.

Each pack contains 4 cards and there are 144 to collect. So, 23 years later and with an inquisitive mind, I wanted to know how many trips to Sainsbury's I might need to ensure my daughter (me) is satisfied with a complete set!

With a bit of online searching I found a post walking through how to solve the problem in code, so this is where I turn to Alteryx.

Before we start with building the model, some assumptions:
  • There are 4 cards per pack
  • There are 144 cards to collect
  • There is a uniform distribution of cards
  • No duplicate cards in a packet

Building a model for the cards in a packet

So to start I built an iterative macro to assign 4 random cards to a packet. I chose an iterative macro for this step because I am assuming that no duplicate cards exist in a packet, so the macro checks whether the packet contains 4 unique cards and, if not, repeats until that condition is met.

To get a uniform distribution for the cards I use the RandInt([Cards]) function, which generates a random integer from 1 to 144 ([Cards]).
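The logic of this packet macro can be sketched outside Alteryx. Here is a minimal Python equivalent (the names are mine, not from the workflow), using Python's random.randint in place of RandInt and redrawing the whole packet when a duplicate appears, just as the iterative macro does:

```python
import random

NUM_CARDS = 144   # distinct cards in the set ([Cards])
PACK_SIZE = 4     # cards per packet

def open_packet():
    """Draw PACK_SIZE uniform random cards, redrawing the whole packet
    until all cards are unique -- the same check the iterative macro makes."""
    while True:
        packet = [random.randint(1, NUM_CARDS) for _ in range(PACK_SIZE)]
        if len(set(packet)) == PACK_SIZE:
            return packet
```

With only 4 draws from 144 cards, duplicates are rare, so the redraw loop almost always exits on the first pass.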

Building a macro to fill the book

Now we've got a macro which will randomly assign cards to a packet, we need to 'open' the packets to fill the book.

Here I also use an iterative macro, with a join to match the cards generated by the first macro against the required cards in the book. When there is a match, the count of that card increases by one. There is then a check to ensure there is at least one of each card; if that test fails, another iteration is initiated which 'opens' another packet of cards.

When the condition is met the macro completes and outputs the number of iterations (packets opened) to the user.
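The two macros together amount to the loop below, again a hedged Python sketch rather than the workflow itself (function and variable names are my own). random.sample stands in for the packet macro, since it draws distinct cards:

```python
import random

def packets_to_fill_book(num_cards=144, pack_size=4):
    """Keep opening packets (each with unique cards) until every card has
    been collected at least once; return how many packets it took."""
    collected = set()
    packets = 0
    while len(collected) < num_cards:
        # random.sample draws pack_size distinct cards, matching the
        # no-duplicates-per-packet assumption
        collected.update(random.sample(range(1, num_cards + 1), pack_size))
        packets += 1
    return packets
```

Each call is one complete 'fill the book' run, equivalent to one solve of the nested iterative macros.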

But the 'answer' changes every time this is run

So running the macro above gives an answer, but running it again gives a different answer, which may be close to the first run or some way off. This leaves me still unsure of how many packets of cards I need to complete my set.

If we think of this in probability terms: if I am very lucky then I will get no duplicate cards and will need only 36 packets (144 cards at 4 per packet), but given the random allocation of cards this outcome is very unlikely. In the worst case I will get loads and loads of duplicates and never finish my book. These two examples are extremes, so to estimate the expected number of packets I need to turn to simulation: run many, many scenarios and then take an average and a distribution, which lets me better estimate what's required.
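This is the classic coupon collector problem, and its textbook expectation gives a useful sanity check before simulating. Ignoring the no-duplicates-per-packet rule, the expected number of individual cards needed to complete a set of n is n times the n-th harmonic number; a quick sketch:

```python
from math import fsum

n = 144
harmonic_n = fsum(1 / k for k in range(1, n + 1))  # H_144, about 5.55
expected_cards = n * harmonic_n                    # roughly 800 individual cards
expected_packets = expected_cards / 4              # roughly 200 packets
```

That back-of-the-envelope figure of around 200 packets is worth keeping in mind for the simulation results that follow.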

To do this in Alteryx I need to turn to a batch macro, as I want to run the macro above many times.

What this essentially does is run the 'Find Cards' macro, output the result, and then repeat for the number of simulations I have specified. This batch macro is then set up in my master workflow as set out below.

So all of that to end up with a very simple workflow, but all of the magic is hidden behind that blue dot!

I get my answer... but we have a problem

So as shown in the example above, I decided to run 100 simulations of the model, which worked and gave me an answer, but it took a long time!

Running 100 simulations took just under 20 minutes to complete. This gives me an answer for how many packets I would need to complete a book, an average of 189 packets (!!), but there were scenarios requiring upwards of 300 packets and some as low as 142.

The problem is that 100 scenarios leave a lot of variation, and I would need to run many more to home in on an answer: how likely is it that I would need only c.200 packets, or am I more likely to need 300+? Scaling up the original solution would take many hours.

So how can this be improved?

I admit that approaching this problem with iterative macros nested within a batch macro seemed sensible from the outset, and I was happy with the speed at which a single run solved, but the simulation stage became the limiting factor.

In December 2018, along with some fellow Alteryx users, I participated in Advent of Code, where we tried to solve some complex challenges using only Alteryx. One of the things we discovered doing these challenges is that an iterative macro isn't always the quickest way to solve a problem, so with that in mind I sought out an alternative approach.

To simplify things a little for the alternative approach I had in mind, the assumption that each card in a packet is unique is removed.

Alternative approach - creating a large set of cards

So first off, using the Generate Rows tool I created 1,000 packets of cards and applied the RandInt() function to them. This results in 4,000 randomly allocated cards, which should be more than enough to ensure that at least 1 of each of the 144 cards is represented*.

* From the steps above, a full set should on average be achieved with c.200 packets, so a 5x margin should satisfy that (although there is a very, very small chance that a complete set wouldn't be present in 1,000 packets!!!).

Then, using the Summarise tool, for each of the 144 cards I find the first packet that card appears in (across 4,000 cards it's likely that each card will appear multiple times). Joining the results back to the 144 cards, with another Summarise tool, then finds the highest-numbered packet required to complete the set.
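The first-appearance logic can be sketched in Python as well, a rough analogue of the Generate Rows and Summarise steps with my own function names (and with duplicates within a packet now allowed):

```python
import random

NUM_CARDS, PACK_SIZE, NUM_PACKETS = 144, 4, 1000

def packets_needed():
    """Generate 1,000 packets up front, record the first packet each card
    appears in, and take the maximum of those first appearances."""
    first_seen = {}
    for packet_no in range(1, NUM_PACKETS + 1):
        for _ in range(PACK_SIZE):
            card = random.randint(1, NUM_CARDS)
            first_seen.setdefault(card, packet_no)  # keep only the first packet
    if len(first_seen) < NUM_CARDS:
        return None  # the very rare case where 1,000 packets miss a card
    return max(first_seen.values())
```

Because all the randomness is generated up front and then summarised, there is no per-packet loop condition to re-test, which is what makes this shape so much cheaper than the iterative version.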

This approach is set up as a batch macro, which allows me to simulate the outcome many times over, in the workflow below.

Notice how this is much simpler than the nested iterative macros...

So what is the performance of this alternative approach?


So on a like-for-like comparison of running the simulation 100 times, the workflow finished in under 3 seconds, which is around a 400x improvement in speed! Given it is so quick, I can now achieve my final requirement of running many, many simulations: 100,000 simulations completed in half the time the nested iterative macros took for just 100.

With 100,000 iterations you get a nice distribution of outcomes and I can better answer the problem.

So the average is still around the 200 mark, but I can confidently estimate that 75% of the time I'll need 225 or fewer packets to fill the book.
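As a sanity check on those figures, a self-contained Python simulation (my own sketch, not the Alteryx workflow) reproduces both the mean and the upper quartile:

```python
import random
from statistics import mean, quantiles

def trial(num_cards=144, pack_size=4):
    """Packets opened until the set is complete (duplicates allowed)."""
    seen, packets = set(), 0
    while len(seen) < num_cards:
        seen.update(random.randint(1, num_cards) for _ in range(pack_size))
        packets += 1
    return packets

results = [trial() for _ in range(5_000)]
avg = mean(results)                # lands around 200 packets
p75 = quantiles(results, n=4)[2]   # upper quartile, around 225 packets
```

quantiles(results, n=4) returns the three quartile cut points, so index [2] is the 75th percentile quoted above.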

So I know I should expect to spend around £2,250 on shopping to finish the book (225 packs at 1 pack per £10 spend), but what are the takeaways for using Alteryx better?

The Alteryx Engine is fast, and Alteryx is really good at processing records line by line. The second approach maximises this: it essentially creates all the information it needs up front (i.e. opens 1,000 packs of cards at the same time), then summarises it and funnels it down.

Compare that to the iterative macro approach, which ends up solving the problem in a similar way to how you would in real life: opening one packet at a time. This can make the process time-consuming since, as we've seen in the results, the number of packets you need to open varies widely from run to run.

As I also discovered when participating in Advent of Code, the most intuitive approach isn't always the best, either in runtime or in your ability to spot issues. However, building the iterative approach first meant I could fine-tune the logic and learn the parameters required, so overall development time was only marginally increased: the time spent on the iterative approach meant I could build the alternative quickly.
