AWS Startups Blog
Tests Not Included: How LoanStreet Built a PPP Platform In One Week
Guest post by Joel Feinstein, Principal Software Engineer, LoanStreet
LoanStreet is the first fully-integrated, online platform that streamlines the process of sharing, managing, and originating loans for credit unions, banks, and direct lenders. Many of LoanStreet’s clients lend to small businesses and individuals: those most in need of funding from the Paycheck Protection Program (PPP) and unable to snag a piece of the initial $310bil.
It was obvious that we had to do something to help our customers. Our clients were relying on us to get their loans funded. The only catch: a hard deadline of one week. It was the sort of do-or-die situation that makes adrenaline course through the veins of a serial entrepreneur – the prospect of coffee-filled nights and the potential to bring an impactful product to market.
The mandate for the engineering team was to deliver an application from a borrower to the Small Business Administration (SBA) via an API. The contents of the application were subject to much confusion and were modified numerous times. Congress was rewriting the rules on the fly, and the SBA was tweaking their process and API in tandem.
To our advantage, LoanStreet had a nearly complete pre-release comprehensive commercial lending product, designed to support collaboration between borrowers and lenders for a broad spectrum of complex loans. We (the engineering team) determined that core parts of this solution could be repurposed to serve as the basis of LoanStreet’s PPP platform.
The primary components were standard: a TypeScript frontend, Django backend, and Celery workers. It was simple enough to slice out the existing business logic and replace it with whatever the business and design teams deemed necessary. [We stressed to them that by “necessary” we meant spaghetti without sauce.]
A scalable production-ready environment was quickly pieced together using CloudFormation templates. Our compressed deadline meant that we needed components that were simple, scalable, and required zero maintenance. The code was already dockerized, so Elastic Container Service (ECS) was a natural fit. The Fargate launch type meant that we didn’t need to spend precious time fiddling with individual instances. Relational Database Service offered a serverless Aurora option, which would allow our infrastructural components to scale together in harmony.
However, the programmatic integration with the SBA was entirely new to our organization. An XML-based SOAP API was provided, around which we authored a quick wrapper using Zeep and some DTOs for communicating with our existing code. While Zeep handled the majority of the interaction with the remote API, it became painfully clear that the complexity of the task lay within the interpretation of the API specification. The accompanying XSD provided by the SBA used regexes to validate the input, but questions such as “90 or 0.9” were common. In some cases, the question was in the interpretation of a given situation, such as classifying a solo practitioner as an individual or a business with a single owner. Both were valid, in theory, but only one resulted in a funded PPP loan.
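To illustrate the “90 or 0.9” kind of question, here is a minimal Python sketch of a DTO that owns the conversion from an internal value to the string sent over the wire. It is not the actual LoanStreet code: the `OwnerDTO` class, its field names, and the `PERCENT_PATTERN` regex are all hypothetical, and the choice of “percent with two decimals” is an assumption made purely for the example, not something the SBA specification is known to require.

```python
import re
from dataclasses import dataclass
from decimal import Decimal, ROUND_HALF_UP

# Hypothetical pattern in the style of an XSD pattern facet:
# one to three integer digits, a dot, and exactly two decimals.
PERCENT_PATTERN = re.compile(r"\d{1,3}\.\d{2}")


@dataclass(frozen=True)
class OwnerDTO:
    """Hypothetical DTO; stores ownership as a fraction (0.9 means 90%)."""

    name: str
    ownership_fraction: Decimal

    def ownership_percent_field(self) -> str:
        """Render the fraction as a percent string, rounding half up."""
        percent = (self.ownership_fraction * 100).quantize(
            Decimal("0.01"), rounding=ROUND_HALF_UP
        )
        text = str(percent)
        # Validate locally so a bad value fails before it reaches the remote API.
        if not PERCENT_PATTERN.fullmatch(text):
            raise ValueError(f"{text!r} does not match the expected pattern")
        return text
```

Keeping the unit conversion and rounding in one place means that when the interpretation of a field changes, only one method has to change with it. Using `Decimal` rather than `float` keeps the rounding explicit.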
It came as quite a surprise when the SBA decided to terminate their development fleet the weekend prior to launch. The stated reason was that their API clients were load testing against the development servers, resulting in flaky behavior across the board. Everybody involved in the PPP process expected a massive volume of traffic once the funds became available to the public, volume far exceeding what the SBA may have seen prior to COVID-19. It made sense then that resources were being directed towards battle-proofing the production environment rather than fixing the development environment.
LoanStreet engineers spent the weekend prior to launch tuning code to pass the XSD spec. There was no way to know if the program would actually work on April 27th because there was no way to test the business logic. Regardless, the switch was flipped promptly at 10:00 AM New York time. Few engineers balked at the literal 0% success rate.
The arduous task of debugging the submission process began. The vast majority of failed submissions had experienced some sort of connection issue. Every client of the SBA service was submitting requests as quickly as possible, at the exact same time, so the suspected cause was a DDoS. LoanStreet’s platform included a “down detector” of sorts, which worked by attempting to authenticate with the SBA’s service. It was constantly reporting “down”.
A few requests had somehow connected, providing a sample of logical and rounding errors that required corrections. Fixes were swiftly staged and deployed, and the existing applications were placed back into the queue for resubmission. These failed again, with additional errors, and the cycle began anew.
It became apparent after a few iterations of this process that the concept of staging had become a hindrance. These fixes needed to be in production immediately, and the staging deploy process was consuming valuable time. It was decided that the best option was to remove staging altogether, which was accomplished with a few clicks in CodePipeline. While LoanStreet normally employs a strict QA process, our best practices had temporarily shifted to promote immediate functionality, as the priority was funding loans rather than long-term stability.
More fixes were followed by resubmissions. The PM suddenly piped up over Zoom: “check your email!” There was a field in the API with a vague description that called for an email address. It was unclear if it was for the lender, borrower, or software vendor. Lacking clear guidance from the SBA, we’d populated it with the LoanStreet support address, and it was to this address that we received notification of our first approved application. The team was ecstatic.
The single successful application was followed by many more. The majority were still failing due to connection issues, providing a hint of a pattern. From the LoanStreet platform’s perspective, the submissions were failing, but there were emails indicating otherwise. It was as if the submissions were processed even after the connection was closed due to something like…a timeout.
The SBA’s development environment, before it was scrapped, had been fast. Consequently, the LoanStreet platform was configured with a connection timeout of one second and, expecting issues, 10 retries before bailing. Within hours of the initial launch our team made the decision to remove this timeout. Each single-process Python worker would remain connected to the SBA’s API for as long as necessary. It made sense that one would remain connected to a DDoS target if able to connect at all.
As it turned out, it sometimes took up to 16 minutes for a single HTTP request to succeed. This went against best practices on both the client and server sides: the client should protect its resources using a reasonable timeout, while the server should shed load by terminating connections early. The remote endpoint was clearly accepting the connection and submission but was starved for processing resources. We cranked up our auto-scaling configuration to allow for hundreds of simultaneously hanging requests.
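The failure mode described above, where the client reports a failure while the remote still processes the request, can be reproduced with a small simulation. The sketch below is an illustration under stated assumptions, not the LoanStreet implementation: `SlowRemote` and `submit_with_retries` are hypothetical names, and a thread pool with `time.sleep` stands in for a real HTTP call to an overloaded server.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import List, Optional


class SlowRemote:
    """Stand-in for an overloaded API: accepts every request, answers slowly."""

    def __init__(self, latency: float) -> None:
        self.latency = latency
        self.processed: List[str] = []
        self._lock = threading.Lock()

    def submit(self, application_id: str) -> str:
        time.sleep(self.latency)  # simulated processing delay
        with self._lock:
            self.processed.append(application_id)
        return "approved"


def submit_with_retries(
    remote: SlowRemote,
    application_id: str,
    timeout: Optional[float],
    retries: int,
    pool: ThreadPoolExecutor,
) -> Optional[str]:
    """Try up to `retries` times; `timeout=None` waits indefinitely."""
    for _ in range(retries):
        future = pool.submit(remote.submit, application_id)
        try:
            return future.result(timeout=timeout)
        except FutureTimeout:
            # The client stops waiting, but the remote keeps working.
            continue
    return None
```

With a latency of 0.2 seconds, a timeout of 0.05 seconds, and three retries, the client returns `None` even though the simulated remote ends up processing the application three times. Passing `timeout=None` with a single attempt returns `"approved"` and the remote sees the application once.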
The floodgates opened and every application was successfully funded. A week later LoanStreet’s PPP platform had received approval for over 2,000 applications, and over $50mil. We were heroes in the eyes of our customers, financial institutions serving countless small businesses. We had seamlessly connected to the SBA, transforming the arduous and unknown process of funding a PPP loan into a simple form. As engineers, we had learned valuable lessons in building simple, robust systems, and knowing when to deviate from the beaten path.
Technical Takeaways
- Serverless options excel when you want a low-effort touchless environment.
- Use a mock when load testing, otherwise the remote might cut you off.
- Securely store raw requests and responses when interacting with a remote server.
- Expect external production and development environments to behave differently.
- Create separate deploy paths for each environment.
- Pay attention to math and rounding.
- Engineering standards must adapt to meet the needs of deliverables.
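As one illustration of the takeaway about storing raw requests and responses, here is a minimal Python sketch that writes each exchange to its own owner-only file. The `record_exchange` function, the file naming scheme, and the JSON layout are assumptions made for this example. A real deployment handling loan data would likely also need encryption and access controls, which are out of scope here.

```python
import json
import os
import time
from pathlib import Path


def record_exchange(log_dir: Path, request_body: str, response_body: str) -> Path:
    """Persist one raw request/response pair and return the file path."""
    log_dir.mkdir(parents=True, exist_ok=True)
    path = log_dir / f"exchange-{time.time_ns()}.json"
    payload = json.dumps({"request": request_body, "response": response_body})
    # O_EXCL refuses to overwrite an existing record; mode 0o600 limits
    # access to the owning user on POSIX systems.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(payload)
    return path
```

Having the exact bytes that were sent and received makes it possible to replay a failed submission, or to compare the local view of an exchange with what the remote side reports later.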