Postmortem of “Returning to CloudPOS” EFTPOS issue of October 26

On the 26th of October 2017, we experienced a widespread bug that caused PC-EFTPOS integrated EFTPOS to stop working on the second transaction attempted after each page load.

The incident was caused by faulty code introduced to our production server while implementing a new feature. We believe this affected approximately one-quarter of our customers.

We apologise to all of our customers who were affected. This article explains why the issue occurred and what we are doing to prevent it from happening in the future.

The Feature

When you use our EFTPOS integration, we load the integration page in what is known as a “frame”, which is essentially another webpage loaded inside the current webpage. We then send and receive data from that page to process the EFTPOS transaction.

Upon finalising the sale and selecting EFTPOS as the payment method, we then reload and display the frame. This works well for the most part, as the frame is cached so it can work offline. However, when your internet connection is slow or drops in and out, you may notice a white flash for a couple of seconds while the page loads, which results in a slower transaction time.

Our solution to this problem was to prevent the page load from occurring after the first load, which is what led to the incident.

The Problem

For this change to work, we had to rewrite the way we handle integrations and modify how the integration communicated with the POS. We tested with both an existing install and a new install and, after determining that the changes worked as we expected, we put the code into production.

However, the code didn’t work as expected due to how we cache data in the register for when you go offline.

Upon visiting the register for the first time, we complete a “full synchronisation” of your data and we also cache a large amount of code that allows the register to operate when offline. Upon future requests (page loads) we then serve the code from the cache and we update the code in the background (which means that the following request then uses the latest code). We use this method as it results in swapping back to the register quickly from any page and being able to work instantaneously when offline rather than having to wait for a request to fail before swapping to the cached code.
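The caching behaviour described above can be sketched roughly as follows. This is a simplified, synchronous model for illustration only, not our production code: the `RegisterCache` class, its `load` method and the `server` map are hypothetical names, and the real refresh happens in the background rather than inline.

```typescript
// Hypothetical model of "serve from cache, refresh in the background".
// The server is modelled as a simple map of resource name -> code version.
class RegisterCache {
  private cached = new Map<string, string>();

  constructor(private server: Map<string, string>) {}

  // Models one page load for a single resource.
  load(name: string, online: boolean): string {
    // Whatever is cached right now is what this page load will run.
    const hit = this.cached.get(name);

    // Refresh the cache for the *next* page load (modelled inline here;
    // in a browser this would happen in the background).
    if (online) {
      const latest = this.server.get(name);
      if (latest !== undefined) {
        this.cached.set(name, latest);
      }
    }

    if (hit !== undefined) {
      return hit; // possibly stale, but available instantly and offline
    }

    // First visit: nothing was cached, so use what was just fetched.
    const first = this.cached.get(name);
    if (first === undefined) {
      throw new Error(`"${name}" is not cached and could not be fetched`);
    }
    return first;
  }
}
```

In this model, a page load that follows a deployment still runs the old code, and only the page load after that picks up the new version.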

So, you may have spotted the issue already. We refresh the cache when you visit the page, which means that the register code would be updated before the EFTPOS integration code if you did not complete an EFTPOS transaction when the page first loaded (the two pieces of code would then be running different versions). This resulted in the “Returning to CloudPOS” issue, as the EFTPOS integration was waiting for the page to reload.
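To make the version mismatch concrete, here is a minimal sketch under simplifying assumptions: versions are plain numbers, the register code is refreshed on every page load, and the EFTPOS frame is only refreshed on page loads where an EFTPOS transaction is made. The `CachedVersions` type and `pageLoad` function are hypothetical and exist only for this illustration.

```typescript
// Hypothetical model: which version of each piece of code is cached.
interface CachedVersions {
  register: number;
  eftposFrame: number;
}

interface PageLoadResult {
  running: CachedVersions; // versions actually executed on this page load
  updated: CachedVersions; // cache contents after the background refresh
  mismatch: boolean;       // true when the two running versions differ
}

function pageLoad(
  cache: CachedVersions,
  serverVersion: number,
  usesEftpos: boolean
): PageLoadResult {
  // This page load runs whatever was cached before the refresh.
  const running: CachedVersions = {
    register: cache.register,
    eftposFrame: cache.eftposFrame,
  };

  // The register is always refreshed; the frame only when it is loaded.
  const updated: CachedVersions = {
    register: serverVersion,
    eftposFrame: usesEftpos ? serverVersion : cache.eftposFrame,
  };

  return {
    running,
    updated,
    mismatch: running.register !== running.eftposFrame,
  };
}
```

Starting from a cache where both pieces are on version 1 and the server has moved to version 2, a page load without an EFTPOS transaction leaves the cache at register 2 and frame 1, so the following page load runs mismatched code.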

The Solution

Upon learning of the problem, we immediately rolled back the code, which prevented the problem from occurring for additional customers. However, this didn’t fix the problem for people already experiencing the issue (for the same reason as above). We then made additional code changes and uploaded a fix which attempted to force the browser to update the entire cache. At the time of writing we have not yet fully implemented the original feature, as it is undergoing rigorous testing in all environments that we can find. We also published a way to manually update the cache on this blog.

Publication

We first shared the post on the blog. However, to reach as many clients as possible, we also posted the fix on our Facebook page and made use of the “Team Message” to display the fix to all clients.

Future Prevention

In order to prevent this from occurring in the future we are implementing version splitting across Vendors (which was the plan from the start once we reached enough clients for this to become viable). This will mean that certain clients will run on the “latest” code, other clients will run on “beta” code and all other clients will run on “stable” code. New code will then have to pass through from “latest” to “beta” before making its way to “stable”. While this will mean slower roll-out of new features, it will result in errors being picked up earlier.
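As a rough sketch of how this version splitting could be modelled (the names `Channel`, `nextChannel` and `channelFor`, and the idea of a per-vendor assignment map, are assumptions made for this example rather than a description of our implementation):

```typescript
// Hypothetical release channels, in the order new code passes through them.
type Channel = "latest" | "beta" | "stable";

const PROMOTION_ORDER: Channel[] = ["latest", "beta", "stable"];

// Returns the channel that code is promoted to next, or null once it has
// reached "stable" and has nowhere further to go.
function nextChannel(current: Channel): Channel | null {
  const index = PROMOTION_ORDER.indexOf(current);
  return index < PROMOTION_ORDER.length - 1 ? PROMOTION_ORDER[index + 1] : null;
}

// Vendors without an explicit assignment run on "stable".
function channelFor(vendorId: string, assignments: Map<string, Channel>): Channel {
  return assignments.get(vendorId) ?? "stable";
}
```

Defaulting unassigned vendors to “stable” means a client only receives newer code if they have been deliberately placed on an earlier channel.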

We are also implementing automatic tests for our EFTPOS integration. We already have a large and constantly growing number of tests for other areas of the POS, but we had not yet implemented tests for our EFTPOS integrations due to the additional requirement of EFTPOS hardware.

Conclusion

Once again, we sincerely apologise for any inconvenience this has caused. We are hard at work to prevent issues like this from happening in the future.