Case Study: Unreliable Search

That couch you're going to lounge on tonight to watch TV, that was conceived by a furniture designer and built by a furniture company. Good chance they got their ideas for textile patterns and materials from my client's website. The trouble is, they haven't been designing very many couches because they can't get my client's website to load. It crashes every time they try to do a search.

Customer flow is the key to Web Reliability (WR). A customer who can't complete a search on a website is a customer who is not in the flow. When customers don't flow, neither does money. Our client realized this and called us in to help because their developer couldn't get site search to load and couldn't find a solution.

Getting Stuck

My client's website, let's call it ghchaise.com, was built on a CMS that we have expertise in. My company had even developed a performance evaluation service for troubleshooting and resolving clogs and bottlenecks in a web property. My client reached out to me because their developer had gotten stuck and couldn't get site search to load.

At the time, the Web Reliability Framework did not exist. If it had, we would have used it to troubleshoot the problem. The experience I gained from this client is part of why there is such a thing as the Web Reliability Framework. So let's retroactively apply the framework and see if it can get us to a solution.


Clogged Search

Our client, let's call her Claire, was under a huge amount of stress when she called us. Revenue was down. Money wasn't flowing through the ghchaise.com website. The money wasn't flowing because the customers weren't flowing. Everyone who came to the site was getting stuck on search. They couldn't move past it. They were clogged.

The ghchaise.com website had tens of thousands of products to choose from. These were fabrics and textiles of high quality that designers could use to create furniture, drapery, wall hangings, etc. The catalogue was so vast that web search was really the only sensible way to get customers to the products they wanted. Search was not about keywords here. It was about filtering and browsing across at least 15 attributes, the permutations of which yielded millions of search combinations.
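To see how 15 attributes turn into millions of combinations, here is a back-of-the-envelope sketch. It assumes each attribute is either left unset or set to exactly one value. The figure of two values per attribute is hypothetical, chosen only to show the scale, and is not ghchaise.com's real data.

```python
from math import prod


def count_filter_combinations(options_per_attribute):
    """Count the distinct filter states across all attributes.

    Each attribute contributes (options + 1) states: one per option,
    plus one for "not filtered on this attribute".
    """
    return prod(options + 1 for options in options_per_attribute)


# Hypothetical catalogue: 15 attributes with only 2 options each.
print(count_filter_combinations([2] * 15))  # 14348907
```

Even with this deliberately small assumption, the count passes 14 million. Real attributes such as color or material would likely have many more options.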

You can probably already see the problem taking shape. Our client did not have reliable customer flow through her website. So there was no customer flow to turn into reliable revenue.

The Board

In WR we start by drawing a tic-tac-toe board. You can use a piece of paper, a napkin, a whiteboard. It doesn't matter. Across the top of the board, above the 3 columns, write Team, Plan, Action. Down the side, next to each of the 3 rows, write Motivation, Resistance, Management. Above the board, at the top of the page, write the customer mission statement. This is the who, what and why: the customer's desire in coming to the website. Work your way through the board, ranking the 9 attributes of WR with X's and O's. O's are awesome. X's are terrible.
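If you would rather keep the board in code than on a napkin, it could be sketched as a small data structure. This is a minimal illustration only: the class name, method names and placeholder mission text are assumptions made for this example, not part of any published WR tooling.

```python
COLUMNS = ("Team", "Plan", "Action")
ROWS = ("Motivation", "Resistance", "Management")


class Board:
    """A 3x3 board with the customer mission statement written above it."""

    def __init__(self, mission):
        self.mission = mission
        # None means the cell has not been scored yet.
        self.cells = {(column, row): None for row in ROWS for column in COLUMNS}

    def score(self, column, row, mark):
        """Record an O (awesome) or an X (terrible) in one cell."""
        if (column, row) not in self.cells:
            raise KeyError(f"no such cell: {column}/{row}")
        if mark not in ("X", "O"):
            raise ValueError("mark must be 'X' or 'O'")
        self.cells[(column, row)] = mark

    def render(self):
        """Return the mission statement and the board as aligned plain text."""
        header = " " * 13 + "".join(c.ljust(7) for c in COLUMNS)
        lines = [self.mission, "", header.rstrip()]
        for row in ROWS:
            marks = "".join((self.cells[(c, row)] or "").ljust(7) for c in COLUMNS)
            lines.append((row.ljust(13) + marks).rstrip())
        return "\n".join(lines)


# Placeholder mission statement and score, for illustration only.
board = Board("Who the customer is, what they want, and why.")
board.score("Team", "Motivation", "O")
print(board.render())
```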

Claire made it clear to us who her customer was, and why they were coming to the website. So let’s convert her description into an empathetic mission statement written from the customer's point of view.

Janice

"I am Janice. I design high-end couches for L.A. and N.Y. clients. Right now I am working with the biggest client I've ever had, a celebrity in L.A. They're cool. They're hip. They're demanding. I need to blow them away with my design for their living room remodel. I've always wanted to use ghchaise.com, but have rarely had a budget to allow it. This is a huge opportunity. I need to get into the site, find what I want fast, and order the samples I need to wow my clients."

That's too long, of course. Let's revise it down to this: "I'm Janice. I'm an up-and-coming designer under a lot of pressure. I need to search quickly and easily on GH Chaise so that I can find awesome textiles to wow my clients."

Scoring

We have our customer mission statement. Now we fill in the board. We score each Web Reliability attribute with an X or an O. O's represent smooth flow. X's represent blockage. We like to dive straight into the worst and most obvious issues in Web Reliability. Then we work our way around the board from there. The customer mission statement guides us.

Our client Claire came to us and made it clear that the site was crashing during search. Someone like Janice would come in and start filtering and searching on products. The browser or the server would somehow get overloaded and die. Janice would get stuck, then frustrated, and would soon go and order samples from some other provider's website. Claire would lose a customer, most likely permanently.

Dive Into Trouble

WR encourages us to be honest and fearless, and dive into trouble. If we think we may have a big problem somewhere, we start with that. In our case, it's site search that's crashing. Any kind of crash on a website is a form of resistance. A website can resist customers by being slow or it can resist them by being totally dead. Since the ghchaise.com website is already live on the web, and not in a planning stage, our focus goes to the Action/Resistance cell of our board (the intersection of the Resistance row with the Action column). The Action column refers to in-progress, real-time, live functioning systems. We're giving this cell an X. Dead websites are the worst.

             Team   Plan   Action
Motivation
Resistance                 X
Management

The Ripple Effect of the Dreaded X

WR can show us why we are getting an X in one cell based on causes in other cells. Problems with flow are almost always connected to multiple issues across a site. When we get an X in Action/Resistance, we immediately know to be suspicious about the Plan, in this case the architecture of the site. For example, if a site is crashing due to unexpected traffic spikes, the problem isn't that too many people are coming to the site. The problem is that the site was not prepared for success, which is a problem at the planning level. Our site crashes when people search, so the plan for supporting search was faulty somehow. A crash gets an X, so the planning that resulted in a crash also gets an X. That goes in the Plan/Resistance cell. The site architects did not properly plan to handle the resistance that can come from complex search functionality overloading the server or the browser.

             Team   Plan   Action
Motivation
Resistance          X      X
Management

Plan Validation

Now we need to look harder at the plan. Was this plan validated effectively before being put into action? Based on the consistent crashing of search, it's clear that whatever type of validation was used, and whatever type of management oversight was involved, both failed. The fact that search was going to cause a site crash was completely missed by everyone involved. So we give an X to the Plan/Management cell. It sounds harsh, but the site is crashing. This is a failure. We need to be brutally honest in order to fix it.

When we look deeper into the issue, we trace it back to the people who came up with this plan that has failed so badly. We learn that the plan did not go through a validation or testing process, and nobody in charge of oversight of the team or the project appears to have ever asked for one. This type of failure, really an abdication of responsibility, deserves an X.

             Team   Plan   Action
Motivation
Resistance          X      X
Management          X

3 X's and Counting

Our board already has 3 X's. When you have an X anywhere on the board, it's tempting to get to work right away and not worry about the other cells. However, WR sees things differently. Websites are highly complex and intricate. Knotty web problems usually involve interconnecting layers of issues. WR tries to untie these knots and show the interconnections. So we stay focussed and keep working our way through the board.

We can give an O to Team/Motivation. The team who built the site and the client who hired them were all motivated to do excellent work. They just had a bad plan. At the Plan/Motivation level, though, the plan was good. It factored in a number of methods to keep the customer, Janice, engaged and motivated. We can give this an O. Janice's ongoing action in real time was motivated too. When the site did not crash, she was able to act smoothly on her motivation. So Action/Motivation also gets an O.

All that's left is Team/Resistance, Action/Management and Team/Management. Not only was the team motivated, but they worked well together. Communication was solid and methods were good. There was no team resistance, so Team/Resistance gets an O. There was just a failure of planning and a failure by management to detect it through plan validation.

Action/Management was also fine, so it gets an O. This is the monitoring level, where we have systems that watch server performance and signal problems. Our search problem was a fast crash, not a slow burn. Real-time monitoring would not have flagged anything that could have been remedied by, say, allocating more resources.

That leaves Team/Management. Remember that we have an X in Plan/Management. The plan was not validated. Plans are created by teams. Teams are held together by managers. Team/Management failed here somehow. This gets an X.

We focus on X's in WR. X's in planning get priority. A bad plan cannot be remedied by good action. So there is something wrong with our plan for supporting search. We know we need to dive in here and question the architecture.

             Team   Plan   Action
Motivation   O      O      O
Resistance   O      X      X
Management   X      X      O
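The rule that X's in planning get priority can be sketched as a simple ordering over the scored board. The dictionary below mirrors the board in this case study. The function name is an assumption for illustration, and so is the choice to leave the remaining X's in plain board order, since WR as described here only says that Plan comes first.

```python
COLUMNS = ("Team", "Plan", "Action")
ROWS = ("Motivation", "Resistance", "Management")

# The scored board from this case study, keyed by (column, row).
BOARD = {
    ("Team", "Motivation"): "O", ("Plan", "Motivation"): "O", ("Action", "Motivation"): "O",
    ("Team", "Resistance"): "O", ("Plan", "Resistance"): "X", ("Action", "Resistance"): "X",
    ("Team", "Management"): "X", ("Plan", "Management"): "X", ("Action", "Management"): "O",
}


def prioritized_blockages(board):
    """List the X cells, with anything in the Plan column first."""
    blocked = [(c, r) for r in ROWS for c in COLUMNS if board[(c, r)] == "X"]
    # sorted() is stable: Plan cells move to the front, and every other
    # cell keeps its original left-to-right, top-to-bottom board order.
    return sorted(blocked, key=lambda cell: cell[0] != "Plan")


for column, row in prioritized_blockages(BOARD):
    print(f"{column}/{row}")
```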

Finally the Root Problem

In an effort to optimize search speed and convenience, the first developers of ghchaise.com chose to build a JavaScript-based search function. They thought they could speed up search by loading all products into the customer's browser in a JSON object. The web page would then sort through this data locally in the browser and show results very quickly. The server would be hit once, instead of once for each search permutation. This was a perfectly sound idea, but it was never validated. That was the key problem. The search plan probably worked great with only 20 or so products. But on ghchaise.com, more than 10,000 products were going to be subject to search at a time. This large block of data choked the web browser. The browser simply could not render the page with that much data.
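One way a plan like this could have been validated before launch is to measure the payload at realistic volume rather than with a handful of test products. The sketch below is an illustration under stated assumptions: the catalogue is synthetic, the field names (attr_0, attr_1 and so on) are hypothetical, and real product records with descriptions and image URLs would likely be much larger. It measures only the serialized size, not the cost of rendering the results in the browser.

```python
import json


def build_catalog(product_count, attribute_count=15):
    """Build a synthetic catalogue with hypothetical field names and values."""
    return [
        {
            "id": i,
            "name": f"Fabric {i}",
            **{f"attr_{a}": f"value_{(i + a) % 7}" for a in range(attribute_count)},
        }
        for i in range(product_count)
    ]


def payload_bytes(catalog):
    """Size of the JSON document the browser would have to download and parse."""
    return len(json.dumps(catalog).encode("utf-8"))


# Compare the "works on my machine" case with the production-scale case.
for count in (20, 10_000):
    size_kb = payload_bytes(build_catalog(count)) / 1024
    print(f"{count:>6} products: {size_kb:,.0f} KB of JSON")
```

Running a check like this with the real catalogue, in a real browser, is the kind of validation step that was missing from the original plan.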

Start With Caching

We knew we had to undo this wholesale JavaScript-based search approach. We had to pare down how much data was given to the web browser. But we also knew that we had to test this plan. Even if the browser itself no longer crashed because its data set was smaller, could the server send the data to the browser fast enough? We did some testing and found that even this was an issue, mainly due to CMS constraints. The data was stored in a way that made filtering across multiple attributes tedious. We remedied this through caching. We put a Cloudflare CDN in front of the site and made sure that each search permutation could be represented by a unique URL that could be cached across the relevant Cloudflare network nodes.
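For URL-based caching to work, the same set of filters has to produce the same URL every time, no matter what order the customer clicked them in. A minimal sketch of that idea follows. The /search path and the filter names are hypothetical, and this is not the code used on ghchaise.com.

```python
from urllib.parse import urlencode


def canonical_search_url(filters, base_path="/search"):
    """Build one stable URL per filter combination.

    Sorting the attribute names and their values means two customers who
    pick the same filters in a different order get the same cacheable URL.
    """
    pairs = []
    for name in sorted(filters):
        values = filters[name]
        if isinstance(values, str):
            values = [values]
        for value in sorted(values):
            if value:  # skip empty selections so they don't create new URLs
                pairs.append((name, value))
    if not pairs:
        return base_path
    return f"{base_path}?{urlencode(pairs)}"


# Same filters, chosen in a different order, give the same URL.
print(canonical_search_url({"material": "linen", "color": ["navy", "blue"]}))
print(canonical_search_url({"color": ["blue", "navy"], "material": "linen"}))
```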

Iterate, Iterate, Iterate

Eventually even this CDN caching plan began to fail. The site's traffic was too low relative to the number of search permutations, so we would never have a 'warm cache'. A 'warm cache' refers to having a suitable number of pages ready and available for use in the caching tool, in this case Cloudflare. Cloudflare naturally purges stale and unused URLs from its cache. If the site had had a lot more traffic, Cloudflare would have kept more pages warm in the cache. There was a mismatch here. The plan was still bad.
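The mismatch between traffic and permutations can be estimated with a rough model. The sketch below assumes every search URL is equally popular and that cached pages all expire after one fixed window. Both are simplifications (real traffic is skewed toward popular searches, and a CDN's eviction behavior is more complicated), and the traffic figures are hypothetical.

```python
def expected_hit_ratio(distinct_urls, requests_per_window):
    """Estimate the share of requests served from cache in one expiry window.

    Model: each request picks one of `distinct_urls` uniformly at random.
    The first request for a URL in the window is a miss; repeats are hits.
    """
    if distinct_urls < 1:
        raise ValueError("distinct_urls must be at least 1")
    if requests_per_window <= 0:
        return 0.0
    # Chance that one particular URL is never requested during the window.
    never_requested = (1.0 - 1.0 / distinct_urls) ** requests_per_window
    expected_distinct = distinct_urls * (1.0 - never_requested)
    return 1.0 - expected_distinct / requests_per_window


# Hypothetical numbers: about 14 million search URLs, two traffic levels.
for requests in (50_000, 50_000_000):
    ratio = expected_hit_ratio(14_348_907, requests)
    print(f"{requests:>11,} requests per window: {ratio:.1%} cache hits")
```

Under these assumptions, a modest amount of traffic spread over millions of URLs almost never lands on a page that is already cached, which is the cold-cache situation described above.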

Algolia

At this point we still had X's on the board, but fewer of them thanks to our rearchitecting and caching work. Though we had changed several X's to O's, the goal was to get rid of all X's. We still had the original X, now in the form of slow search.

After more trial and error, more Plan/Management activity, we eventually found a new hosted search service called Algolia. Algolia allows you to populate its search indexes with your data. You can then query the indexes over their API and get results back incredibly quickly. The user gets an experience that feels almost instantaneous. The speed of the system, since it was designed explicitly for fast searches, blew our minds. With Algolia no caching was needed. We just kept the Algolia index fresh by pushing the current product content into it. Having all of the product information kept ‘warm’ in the Algolia index meant we received lightning fast multifaceted search results. This simple, elegant architecture finally transformed the rest of the X's on the board into O's.
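Keeping a search index fresh comes down to working out which records to save and which to delete on each sync. The sketch below illustrates only that planning step, under stated assumptions: the function name and product fields are hypothetical, and the actual push to the search service is left out. The one detail taken from Algolia itself is that each record is identified by an objectID.

```python
def plan_index_sync(cms_products, indexed_records):
    """Work out what must change so the index matches the CMS.

    Returns (to_save, to_delete): records that are new or changed, and
    the objectIDs of records whose products no longer exist in the CMS.
    """
    indexed_by_id = {record["objectID"]: record for record in indexed_records}
    to_save = []
    current_ids = set()
    for product in cms_products:
        # Map the CMS id onto the index's objectID; copy the other fields.
        record = {"objectID": str(product["id"])}
        record.update({k: v for k, v in product.items() if k != "id"})
        current_ids.add(record["objectID"])
        if indexed_by_id.get(record["objectID"]) != record:
            to_save.append(record)
    to_delete = sorted(set(indexed_by_id) - current_ids)
    return to_save, to_delete


# Hypothetical data: one unchanged product, one changed, one removed.
cms = [
    {"id": 1, "name": "Linen", "color": "blue"},
    {"id": 2, "name": "Velvet", "color": "red"},
]
indexed = [
    {"objectID": "1", "name": "Linen", "color": "blue"},
    {"objectID": "2", "name": "Velvet", "color": "green"},
    {"objectID": "3", "name": "Discontinued", "color": "grey"},
]
to_save, to_delete = plan_index_sync(cms, indexed)
print(to_save)    # [{'objectID': '2', 'name': 'Velvet', 'color': 'red'}]
print(to_delete)  # ['3']
```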

The search function had not just been fixed but upgraded. Customers coming to the site were able to find what they were looking for and achieve their goals. Customer flow was restored. And with reliable customer flow came reliable revenue.

             Team   Plan   Action
Motivation   O      O      O
Resistance   O      O      O
Management   O      O      O

Solspace, Inc.
PO Box 7282
Santa Cruz, Ca. 95061

https://solspace.com

© 2019 Solspace, Inc.