“clear glass Erlenmeyer flask lot” by Elevate on Unsplash

Testing a Hypothesis to Improve Alteryx Performance

What’s faster than an integer? Certainly not text strings; those things are slow. String comparison usually involves some sort of loop to compare the individual characters of one string to another. Integers? Those can be compared right in the CPU registers at the hardware level, no looping required. If you want to compare two values quickly, integers are a great way to go.

Want to hazard a guess how most SAP primary keys are stored? Hint: it’s not integers. This usually is not a problem; many of the datasets I interact with are sufficiently small that I haven’t even thought about performance. But lately we have been developing FI-focused datasets, which means we need to touch the BSEG and FAGLFLEXA tables. For context, a year or two of data from these tables can easily contain several hundred million records. Our workflows which perform the data prep take a very long time to run and I am looking for ways to improve performance. One possibility involves integers: my hypothesis is that we can improve performance of our workflows by using integer keys rather than the native string keys. Sorts, joins, and lookups should be faster, allowing the workflows to execute more quickly.

Converting strings to integers

I faced an immediate challenge to testing my hypothesis: many SAP keys contain non-numeric characters. How do I go about converting alphanumeric strings into a 64-bit integer? After a day or two of thinking about numbers, letters, bits, and bytes, I finally decided on treating text strings as a base 40 integer which I could convert to a base 10 integer, very much like hexadecimal strings. My arbitrary mapping of base 40 characters looks like this:

Obviously, there are limitations in my process: It will not convert lowercase letters, nor will it convert punctuation other than the 4 characters I arbitrarily chose. These limitations are acceptable to me because our SAP keys abide by those terms.

The algorithm for converting between different bases is actually quite simple. For a base 40 to base 10 conversion, it looks like this:

Now, if you’re like me and you have trouble reading this kind of notation, or I wrote it wrong and it’s just unreadable, an example would help. Let’s convert the string A1B from base 40 to base 10 using my mapping provided above. We start at the right-most character, which is B. This is position 0 in the string. The value of B is:

The next character, in position 1, is 1. It’s value in base 10 is:

The final character, in position 2, is A. It’s value in base 10 is:

Adding up all three values gives us a base 10 integer of 22,455. See? Easy algorithm; it shouldn’t be too difficult to code this.

Integer limitations

After figuring out the algorithm thanks to a helpful website, I ran into another challenge. The largest integer in Alteryx is int64, which is a 64-bit (8-byte) integer. The size of this integer limits the maximum string size I can convert from base 40 to base 10. The largest value an unsigned 64-bit integer can hold is 18,446,744,073,709,551,615, which equates to a 13-character-long base 40 string of BB_M HP3R24F1. This is an awkward upper bound, so I arbitrarily limit the conversion to 12-character string ZZZZZZZZZZZZ. This means I can compress a 12-byte character string into an 8-byte integer, which is pretty cool.

Going back to the maximum integer size: did you catch where I mentioned it related to unsigned integers? Unsigned integers do not have negative values; their minimum value is 0. Signed integers, on the other hand, can be negative. Alteryx does not have unsigned integers; int64 is signed. This means, rather than a range of:

0 to 18,446,744,073,709,551,615

the range is actually:

-9,223,372,036,854,775,808 to 9,223,372,036,854,775,807

Complicating matters further, converting from unsigned to signed integers is not a simple, safe operation. In addition, the base 10 value of ZZZZZZZZZZZZ is 16,777,215,999,999,999,999, which is too big to fit into a signed integer. My approach to deal with these issues involves the following steps:

  1. Calculate the base 10 value of the base 40 string using an unsigned integer
  2. Split the unsigned integer into two signed integers by dividing by 2
  3. Add both signed integers to -9,223,372,036,854,775,808

Now, finally, I have all of the ‘thought’ infrastructure in place. Time to do some coding.

Implementing the conversion in Alteryx

My first thought for implementing this conversion in Alteryx was to use the C++ forumla add-in API. I would create a custom formula, called SimpleStrToInt(), which I could call from a formula tool. After hours of debugging I finally got something working, only to find that the formula was incredibly slow. So slow, it was obvious up front that the idea was a non-starter. The issue could be Alteryx’s hand-off between the raw data and the formula API, it could be my complete unfamiliarity with C++, or it could be a bit of both. Whatever the reason, the C++ API was not going to work…much to my relief.

I decided to try again, this time implementing the conversion in a custom tool. Alteryx has 2 SDKs for custom tools: Python and .Net. I chose the .Net SDK because it is much closer to the hardware than the Python SDK and is therefore significantly faster. In this project, speed matters.

I like to use a test-driven approach whenever I code, so I developed the logic separate from the Alteryx SDK. In this way, I could verify my code was working exactly as expected before I dealt with the complexities of hooking into the Alteryx engine. The finished logic looks something like this:

Let’s start at line 17, which is the main conversion function. It’s fairly short, 13 lines long, so you should be able to see the algorithms and processes we already discussed. Lines 19 and 20 perform some error checking and return null if the string is not in the expected format (I know, I know. I rag on null a lot, but it makes sense here). Lines 22 to 30 are the base 40 to base 10 conversion algorithm. We loop through each character, make sure it’s a valid character, and then calculate its base 10 value. In lines 32 through 34, the unsigned integer gets split up and added to the int64 lower limit. That’s it, the core of the conversion algorithm. The rest of the file is a custom function I created to calculate exponents on integers and my base 40 value mapping. The final tool looks like this in Alteryx:

I won’t go through the the code to actually build the tool in Alteryx; it’s really just plumbing. If you want to see all of the code, you can browse it on GitHub. I need to give a huge shout-out to Alteryx here for the SDK. The plumbing was a small part of this project, which is a good thing. I was able to spend most of my time thinking about the problems I was trying to solve rather than how to use the SDK.

Testing the hypothesis

That hypothesis I had seems like a long time ago, doesn’t it? Now we are at a point where we can begin testing it. I designed the test in the following manner:

  • I will record how long it takes Alteryx to import a yxdb data file and sort the data.
  • The data files contain 4 text fields. The first field contains 4-digit numeric strings, the second field contains 10-digit numeric strings, the third field contains 4-digit alphanumeric strings, and the fourth field contains 6-digit numeric strings. Integer keys will be created from the first 2 fields and the last 2 fields. I chose this structure because it mimics the keys I work with in SAP. Here is an example of the input data:
  • Half of the data sets will contain the integer keys for the integer sort test, the other half will be for the string sort test and will not contain the integers. I do not want to penalize the string sorting with extra fields. This is a cost the integer sort will have to overcome.
  • There will be 3 sets of data: 10 million records, 100 million records, and 1 billion records. This should give us a picture of how data size affects the time to sort. I work with datasets approaching 1 billion records, so this seems like a good stopping point.
  • There will be 3 sort tests: String sorting, integer sorting, and integer sorting after the integer keys are added to the data. This will give us a baseline comparison of integer vs string, as well as an idea of the cost of adding the integer keys to the data.
  • I will run each scenario 3 times to try and mitigate the influence of external factors from the results.
  • I will run each scenario on data that is randomly distributed, already sorted in ascending order, and already sorted in descending order. If Alteryx can detect presorted data and optimize its algorithm accordingly, I want to see the impact on integer vs string sorting.

The results

Here are results for all of the tests run on the 10 million and 100 million count datasets:

Ok ok, it’s probably the worst chart you’ve seen today, but I found it useful just to get a high-level sense of the data. Perhaps more helpful is to compare the average time per scenario:

As we can see here, straight integer sorting is faster, although on small datasets hardly noticeable. As the data gets larger, integers become more effective but a worrying trend is that the time it takes to generate the integers significantly increases, too. There is a lot of one-time cost to get those integers added to the data.

Let’s look at a bit more detail and see how the order of the data affects speed of sorting:

As expected, randomly distributed data takes a lot longer to sort than sorted data. Whether the data is already sorted in ascending or descending order does not seem to impact performance. So the primary driver of performance is how randomly the keys are distributed. And again, we see that the cost of adding keys is quite significant.

One interesting takeaway by looking at the results of the 10 million and 100 million row datasets side-by-side is that integer performance is scaling mostly linearly. The randomly distributed string performance is almost linear as well, but the presorted string performance is not. I really don’t know how important this is, but it’s something I am putting in my back pocket to think about later.

Finally, you will notice that these charts have not included the 1 billion row dataset. I ran 1 test of the randomly distributed 1 billion row dataset and compared it to the other 2 datasets on a logarithmic scale:

My conclusion is that 1 billion rows behaves very similarly to 10 million and 100 million rows, only more so. Because it takes over an hour to run a single test on 1 billion rows, I stopped here to save time.

Conclusion

Honestly, this wasn’t the answer I was hoping for. Integers are obviously faster than strings, but there is a significant upfront cost of generating the integers. This situation seems to be a classic break-even analysis. Unless there are a lot of sorts or joins over which the fixed cost of generating the integers can be spread, the cost of creating the integers does not make sense. At this point I need to decide which, if any, of my datasets would benefit from adding integer keys. There may also be opportunities for optimizing the algorithm which generates the keys to make it faster. But at this point in time, the answer seems to be that adding integer keys does not make sense for the vast majority of the data I use.