I was talking recently with a fellow analytics expert I know from my ACL days. The question came up: does Alteryx have a function to calculate the Dice coefficient between two text values? It was a great question, because while Alteryx ships several fuzzy matching algorithms, the Dice coefficient is not one of them. I had also forgotten about the Dice coefficient and suddenly wanted it in my Alteryx tool belt. After a brief search I found a Python implementation of the algorithm and decided to integrate it into Alteryx. In a stroke of defiance against myself, I decided to use standard Alteryx tools and avoid creating anything custom.
Unfortunately, the standard Alteryx tools did not work out so well. I started with the Python tool, which embeds Jupyter notebooks into an Alteryx workflow. The algorithm I found earlier fit nicely with minor additions and one fix to the logic:
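The screenshot of the code is not reproduced here, but the classic bigram-based approach — my own sketch, not necessarily the exact code from the workflow — looks like this:

```python
def dice_coefficient(a: str, b: str) -> float:
    """Similarity of two strings based on shared character bigrams.

    Returns a value between 0.0 (no bigrams in common) and 1.0 (identical).
    """
    if not a or not b:
        return 0.0
    if a == b:
        return 1.0
    if len(a) < 2 or len(b) < 2:
        return 0.0  # single characters have no bigrams

    # Build sorted lists of overlapping two-character slices.
    a_bigrams = sorted(a[i:i + 2] for i in range(len(a) - 1))
    b_bigrams = sorted(b[i:i + 2] for i in range(len(b) - 1))

    # Walk both sorted lists in tandem, counting matching bigrams.
    matches = i = j = 0
    while i < len(a_bigrams) and j < len(b_bigrams):
        if a_bigrams[i] == b_bigrams[j]:
            matches += 2
            i += 1
            j += 1
        elif a_bigrams[i] < b_bigrams[j]:
            i += 1
        else:
            j += 1

    return matches / (len(a_bigrams) + len(b_bigrams))
```

For example, dice_coefficient("night", "nacht") returns 0.25: the two strings share one of their eight total bigrams ("ht"), and 2 × 1 / 8 = 0.25.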
I tested the logic against some text and it worked great. The next step was a stress test. I packaged the Python tool into a macro and sent it to my work laptop. The stress test used a dataset of 10,000 phone numbers from my company's master data, cross-joined against itself: 10,000 × 10,000 means calculating the Dice coefficient for 100,000,000 record pairs.
The first issue I ran into was the Python virtual environment. Alteryx seems to have changed the name of the default Python Tool virtual environment between the different releases on my two laptops. It was a minor issue that was quickly resolved, but one which hurts the portability of workflows using the Python tool. However, that was not the only issue.
The biggest drawback of the Python Tool solution was an out-of-memory error during the stress test:
During the call to Alteryx.read("#1"), the Python Tool apparently reads all of the incoming records into a single dataframe. I do not know of any way to ask Alteryx to batch or stream the records, so the Python Tool will run out of memory on large datasets. 100,000,000 records may seem like a lot, but it represents a cross-join of only 10,000 records.
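A back-of-envelope estimate shows why a single dataframe is doomed here. This sketch assumes ~12-character phone number strings and CPython's per-object string overhead; actual pandas memory use varies (duplicated strings may share storage, and pandas adds its own overhead on top):

```python
import sys

# Assumption: phone numbers are stored as ~12-character Python strings.
sample = "555-123-4567"
bytes_per_string = sys.getsizeof(sample)  # ~61 bytes on CPython 3.x

pairs = 10_000 * 10_000                   # the 100,000,000-row cross-join
columns_per_pair = 2                      # left value and right value

approx_bytes = pairs * columns_per_pair * bytes_per_string
approx_gib = approx_bytes / 2**30         # roughly 11 GiB of string objects
                                          # alone, before dataframe overhead
```

Holding all of that in memory at once is well beyond what most laptops can absorb, which matches the out-of-memory error above.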
If I want to use the Dice coefficient on large datasets (and I do), the Python Tool will not work for me. That leaves another option: re-creating the Python logic with standard, no-code Alteryx tools. So I built a macro to reproduce the algorithm as best I could. I made it this far before I gave up:
Even the above macro has edge cases which produce the wrong Dice coefficient value. The while loop in the Python code is particularly difficult to implement in an Alteryx workflow. I say 'gave up' because I believe that, with enough time and effort, I could figure out how to express the algorithm in standard Alteryx tools. My objection here is one of readability and expressiveness. It is much easier for me to understand what is happening in the Python code than in my own macro. Adding more complexity to the macro to properly reflect the while loop logic would only compound the readability issue. Sometimes you have to know when to stop.
In addition to the readability issue, the macro built with standard tools is unlikely to perform well on large datasets. Each bigram in the text is extracted into its own row, which means a dataset of 100,000,000 comparisons could easily produce billions of bigram rows that have to be sent through multiple joins. Even with the AMP engine, that is a lot of overhead.
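To put rough numbers on that claim (assuming ~12-character phone numbers; real field lengths may differ):

```python
# Rough row-count estimate for the no-code, bigrams-as-rows macro approach.
# Assumption: each phone number is about 12 characters long.
chars_per_value = 12
bigrams_per_value = chars_per_value - 1      # 11 overlapping bigrams

pairs = 10_000 * 10_000                      # 100,000,000 comparisons
bigram_rows = pairs * 2 * bigrams_per_value  # both sides exploded into rows

# bigram_rows == 2,200,000,000 rows flowing through the joins
```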
If the standard code and code-free tools do not work, that leaves custom tools to solve the problem. This is a classic example where custom tools solve the problem more elegantly and with greater performance. The Python code for the Dice coefficient fits easily into the custom tool framework, as you can see in my GitHub repository. This means I get the benefit of code's superior expressiveness for this particular algorithm. But does the custom tool give me a performance benefit? Can I actually calculate the Dice coefficient for 100,000,000 records? Yes, I can, in just over 25 minutes:
Furthermore, there was no memory pressure on my system while I ran the workflow. Custom tools can do something the Python Tool cannot: stream incoming records. Now, you may object that 25 minutes is still a long time to calculate the dice coefficient, even if it does work. And you would be correct. So I decided to create a second custom tool using my beta Go SDK to see what kind of difference there would be in execution time:
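The streaming difference is easy to sketch in plain Python (a toy illustration of the principle, not the SDK's actual API): materializing a cross-join holds every pair in memory at once, while a generator yields one pair at a time, so memory stays flat no matter how many pairs flow through.

```python
left = [f"555-{n:04d}" for n in range(1_000)]
right = list(left)

# Materialized: the whole cross-join exists in memory at once.
all_pairs = [(a, b) for a in left for b in right]  # 1,000,000 tuples held

# Streamed: only one pair exists at a time; memory use is constant.
def stream_pairs(xs, ys):
    for a in xs:
        for b in ys:
            yield (a, b)

streamed_count = sum(1 for _ in stream_pairs(left, right))
```

A streaming tool pays only the cost of the current record, which is why the custom tool finished the full 100,000,000-pair run without memory pressure.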
Four minutes to calculate the Dice coefficient on 100,000,000 records is stupid fast. This tool makes the Dice coefficient cheap: I no longer have to think twice about using it wherever and whenever I want. As for expressiveness, it is still there. I personally find the Go code easier to follow, but I will allow for the fact that I am likely biased toward my favorite programming language.
Ultimately, these are the reasons I make custom tools. Custom tools can leverage code to solve problems where code is actually more expressive than no-code solutions. And unlike the built-in code-friendly tools, they have fewer performance limitations and can achieve higher throughput. They enhance my tool belt and make me a better analyst, able to solve business problems better and faster than I could without them. I may have embraced the no-code philosophy that makes Alteryx both powerful and user-friendly, but I will never abandon the superpowers I gain when code is added to the mix.