The key to improving your code may be right under your nose
In the world of software, processing speed is an “absolute good” – you could call it the single most important performance metric. On a machine equipped with a modern CPU and plenty of RAM, a program that accesses simple data types in small quantities will probably run just fine. When a program handles large amounts of data, however, its architecture (its high-level design and design philosophy) has a clear impact on processing speed.
Inexperienced software engineers can become so fixated on meeting requirements that processing speed becomes an afterthought. This story is about the first time I was entrusted with developing a subsystem on my own, around six months after joining Kikusui, and how I overcame a processing-speed problem I ran into along the way. I hope some readers will find it helpful.
Developing Data Migration Tools
The subsystem I was tasked with developing migrated data to a new system so that the existing system could be upgraded. Following the practice at the time, I was to write a set of SQL procedures (blocks of code that perform several operations at once) that a batch file would run in sequence, converting the existing database’s data and migrating it into the new database. I initially thought this would be a simple program to write (such baseless optimism is not uncommon amongst novices), but once I started considering not just the data migration itself but also issues like maintenance and usability, I realised that processing time would present a major hurdle.
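To make the setup concrete, here is a minimal sketch of that procedure-plus-batch-file arrangement. The dialect (T-SQL), the table names and the procedure names are all my own illustrative assumptions, not taken from the actual project:

    -- Hypothetical sketch: one conversion procedure per table.
    CREATE PROCEDURE ConvertCustomers AS
    BEGIN
        -- Read rows from the old table, write converted rows to the new one.
        INSERT INTO new_customer (id, name)
        SELECT id, name
        FROM old_customer;
    END;

    -- A batch file then runs each procedure in sequence, for example:
    --   sqlcmd -S <server> -d MigrationDb -Q "EXEC ConvertCustomers"
    --   sqlcmd -S <server> -d MigrationDb -Q "EXEC ConvertOrders"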
When I began work on the project, I foolishly assumed that converting the data would simply be a matter of applying a fixed set of rules. As the job progressed, however, it became evident that things would not be so straightforward: I had hit a wall in the form of processing time. Long processing times not only impede the usability of the software and create stress for users; they also make it harder to respond when something goes wrong, both during debugging and once the software is live.
Why My Program Ran So Slowly
There were two main reasons for the long processing time.
(1) The existing system used irregular data specifications and multiple data formats.
The existing system left users free to edit formats and create their own rules, so they could create whatever kind of data they wanted. To accommodate this, I had built a separate conversion routine for each database table so that all user data could be migrated to the new system. This created a great deal of conditional branching, which slowed things down more than expected. In fact, not only was the data conversion slow; importing the data specifications from the existing system was also time-consuming.
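As an illustration of the kind of branching involved, here is a hedged sketch; the format names and tables are invented, but the shape is typical. Every CASE branch has to be evaluated for every row, and with millions of rows per table the cost adds up quickly:

    -- Hypothetical sketch: each per-table conversion had to branch on
    -- every format a user might have defined in the existing system.
    INSERT INTO new_customer (id, phone)
    SELECT c.id,
           CASE f.format_type
                WHEN 'HYPHENATED' THEN REPLACE(c.phone, '-', '')
                WHEN 'SPACED'     THEN REPLACE(c.phone, ' ', '')
                ELSE c.phone        -- fall through for unrecognised formats
           END
    FROM old_customer AS c
    JOIN user_format  AS f ON f.table_name = 'old_customer';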
(2) A huge volume of data had to be migrated.
The database in question comprised over 100 tables, each containing millions of records, and processing time grew in proportion to the number of records. This was not so noticeable during debugging, because I used sample records to cut down the amount of data being processed, so I was stunned to find that the first integrated test of the code took over half a day to complete. I initially hoped to resolve the speed issue by redesigning the database used in the new system, but because I was already committed to a particular data conversion specification, all I could really do in terms of redesign was tweak the table indices. I was unable to make the program run any faster. This newbie engineer was stumped.
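To give a sense of how little room that left, tweaking the indices amounts to statements like the following (the table and column names are hypothetical). Indexing the columns that the conversion procedures join or filter on speeds up lookups, but it does nothing about the sheer volume of rows each procedure must touch:

    -- Hypothetical sketch: about all the redesign allowed was tuning indices.
    CREATE INDEX idx_old_customer_code ON old_customer (customer_code);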
Another Angle
The situation initially seemed hopeless. After some thought, however, I decided that if it really was hopeless, I needed to change my approach and try anything and everything in the hope that something would work. I decided to split the tables and add new ones in order to distribute the processing. Since splitting processes and adding new ones actually increases the total amount of work to be performed, you would expect this approach to slow the program down even further; I fully expected that the extra programs to execute would cancel out any benefit at integrated testing. When I tried it, however, the process ran 10 percent faster.
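A minimal sketch of what I mean by splitting tables, with hypothetical names and ranges: the one huge source table becomes several smaller work tables, and each gets its own conversion procedure that the batch file can run as a separate process:

    -- Hypothetical sketch: split one huge source table into smaller work
    -- tables so that each conversion procedure scans far fewer rows.
    SELECT * INTO work_customer_1 FROM old_customer WHERE id <= 1000000;
    SELECT * INTO work_customer_2 FROM old_customer WHERE id >  1000000;
    -- Each work table then gets its own conversion procedure in the batch run.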
When I stepped back from the problem, it dawned on me that the sheer size of the existing database meant the program spent a long time simply accessing all the records. If I split the tables in advance, distributing the database access time across several processes, I could achieve a marked improvement in processing time. Once I realised this, I made faster progress and was able to improve all the tables.

Before long, I found another area where improvements could be made. It involved SQL’s ability to extract data from multiple joined tables, nested across several levels. Because I had simply followed the data conversion and migration rules when writing my original SQL code, my procedures were full of joins. When the program appeared to be accessing actual database tables, it was in fact accessing a virtual table composed of the results returned by SQL queries, so processing a given number of records took significantly longer than it would have with simple SQL. Seizing on this fact, I split up the existing tables and added a new, intermediate table, eventually achieving a 30 percent reduction in processing time. At last, my mission was complete.
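In code terms, the change looked something like the following sketch (again with hypothetical names): instead of a conversion query that re-evaluates a multi-level join on every pass, the join is materialised once into an intermediate table, and the conversion then reads from that flat table:

    -- Hypothetical sketch: build the joined "virtual table" once...
    SELECT o.id, o.amount, c.region_code
    INTO   work_order_customer              -- intermediate table, built once
    FROM   old_order    AS o
    JOIN   old_customer AS c ON c.id = o.customer_id;

    -- ...then convert from the flat intermediate table.
    INSERT INTO new_order (id, amount, region)
    SELECT id, amount, region_code
    FROM   work_order_customer;             -- plain scan, no repeated joins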
Turning the Problem Around
When I first started out as an engineer, I focused only on meeting project requirements (so as to eliminate the need for later modifications) and gave no thought to issues like processing speed. My experience with this database made me keenly aware of the need to consider a problem from every angle when writing code. While we engineers tend to assume that adding more programs means slower execution, sometimes a process can be sped up dramatically by focusing on how the processing is distributed. This applies not only to SQL but to other programming languages as well. Coming up against a brick wall can actually be an opportunity to reset your thinking and consider the problem objectively, from a different angle.
When we get stuck, we tend to develop tunnel vision without realising it. The answer is usually a matter of changing perspective, but those working alone seldom realise this at first. That is why I believe development should be performed not by a single engineer but by the whole team, right from the initial design and review stage. Noticing problems sooner cuts down on reworking, and reworking stalls development. I would urge any new developers reading this to take that advice on board.