
Over the past 6 years of working on Websolr at One More Cloud, I’ve talked with hundreds of customers about how to improve their Solr performance. One of the most common complaints I’ve seen involves latency. Customers use a tool like NewRelic to benchmark their queries, see hundreds of ms per request, and write in to ask why Solr is so slow.

The thing is, Solr is a battle-tested beast of a search engine, and is blazing fast. What NewRelic is actually measuring here, more than Solr, is the network transit time. It’s not uncommon for a round-trip, over-the-wire request to spend 95% of its time in transit, and 5% of its time in Solr.

When I explain this, I usually offer some suggestions for reducing network latency as a first step before tinkering with query tuning. These suggestions are usually along the lines of:

1. Use HTTP Keep-Alive. Every new HTTP connection carries setup overhead, and the cost is much higher for encrypted connections because of the SSL/TLS handshake. You can dramatically reduce this overhead with a persistent connection, whereby multiple requests share the same connection over a period of time (see the sketch after this list).

2. Utilize load balancing. This one is particular to Websolr, where all indices have at least two cores per shard: a primary and a replica. By default, all requests hit the primary, and the replica is treated as a shadow copy. This allows for near-real time (NRT) search. However, this also means that the primary core is running double-duty, which is a problem under load. The Websolr routing proxy can route searches to the replica if a certain header or request parameter is present.

3. Compress your payloads. Past a certain payload size, compressing your data at the source and decompressing it at the destination takes less time than transmitting the data uncompressed (especially via SSL). This is particularly useful when bulk indexing.
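To make the first two suggestions concrete, here’s a minimal sketch using Ruby’s built-in Net::HTTP. The hostname matches the test config further down, but the index path is a placeholder; the point is the pattern: open one connection, reuse it for many requests, and send the routing header on reads.

require 'net/http'

# Minimal sketch: reuse a single TCP/TLS connection for many requests.
# The index path is a placeholder; the routing header is Websolr-specific.
Net::HTTP.start('us-east-1.websolr.com', 443, use_ssl: true) do |http|
  # Inside this block, Net::HTTP keeps the socket open, so the TCP and
  # TLS handshakes are paid once rather than once per request.
  10.times do
    res = http.get(
      '/solr/my-index/select?q=*:*',
      'X-Websolr-Routing' => 'prefer-replica' # suggestion 2: read from the replica
    )
    puts res.code
  end
end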

Typically when a customer does the legwork to implement these three things, they see dramatic and immediate improvements in latency.

I realized recently that, having made these recommendations over and over, I’d never actually had to do the work myself. It seemed like something worth documenting so that when a customer inevitably comes to us with a sad NewRelic chart, I can point them to some sample code.

I did some research and noted that a majority of customers who had latency issues with Solr were running some kind of Ruby app (typically Rails) and were using the Sunspot gem. So that’s where I started. I spent an afternoon tinkering and was shocked that implementing these suggestions sped things up by nearly 350%.

Here’s how I did it…

Sunspot Under The Hood

If you have a Ruby app using Solr, there is a good chance that you’re relying on the RSolr library. Even if you’re using the Sunspot gem, it’s still using RSolr under the hood. RSolr is responsible for managing the connection to Solr, via the Faraday HTTP client library.

Faraday provides a common interface between an application and a variety of HTTP client libraries via adapters. I had to dig through a lot of code to discover that RSolr uses Faraday’s default HTTP adapter, which is the dated Net::HTTP library.

I suspect that RSolr and Faraday are using Net::HTTP because it’s part of the Ruby Standard Library and doesn’t introduce external dependencies. If you have Ruby, you probably also have Net::HTTP, so it’s likely the safest option. Just not the best, especially in 2019 (or whatever year you’re reading this).

Personally, I like Typhoeus because it wraps libcurl (so you can pass curl params directly to it), and it supports parallel requests. It’s also much faster and more efficient at handling SSL/TLS requests. It’s possible to run compression and HTTP Keep-Alive with Typhoeus. It’s just generally better. And there’s a Typhoeus adapter for Faraday!
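Swapping adapters is a one-liner once the right libraries are loaded. As a rough sketch (the URL is a placeholder), wiring Faraday to Typhoeus looks like this:

require 'faraday'
require 'typhoeus'
require 'typhoeus/adapters/faraday' # registers the :typhoeus adapter with Faraday

# Build a Faraday connection backed by libcurl instead of Net::HTTP
conn = Faraday.new(url: 'https://example.com') do |f|
  f.adapter :typhoeus
end

conn.get('/solr/my-index/select', q: '*:*')

The trick is getting RSolr and Sunspot to actually use a connection like this, which is what the rest of this post is about.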

Adding Typhoeus

It’s sort of a pain, because simply adding Typhoeus to your Gemfile will not be enough. Neither Sunspot nor RSolr will pick it up automatically, and Sunspot doesn’t let you specify Faraday adapters in the sunspot.yml configuration file. I had to create a custom connection class that invoked RSolr::Client with a Faraday object that had the Typhoeus adapter.

It’s straightforward conceptually, but involves more code than I wanted to add to a Rails codebase. I like clean solutions. So, I packaged it into a gem.

You can check it out here: https://github.com/omc/websolr-gem. The important bits are:

module Websolr
  class Railtie < ::Rails::Railtie
    initializer 'setup_solr' do
      require 'rsolr'
      # Swap in our connection class, with a default header telling the
      # Websolr proxy to route reads to the replica core
      Sunspot::Session.connection_class = Websolr::Connection.new({
        'X-Websolr-Routing': 'prefer-replica'
      })
    end
  end
end

This creates an initializer that overrides the default Sunspot::Session connection class with a custom class of my own. That class looks like this:

module Websolr
  class Connection
    attr_accessor :connection, :default_headers

    def initialize(default_headers = {})
      self.default_headers = default_headers
      self.connection      = create_connection
    end

    # Sunspot calls this to build the RSolr client, and we hand it our
    # preconfigured Faraday connection
    def connect(opts = {})
      RSolr::Client.new(connection, opts)
    end

    def create_connection
      conn_opts = { request: {} }
      # Solr expects repeated params (e.g. multiple fq clauses), so use the
      # flat encoder rather than collapsing them
      conn_opts[:request][:params_encoder] = Faraday::FlatParamsEncoder
      conn_opts[:headers] = default_headers
      Faraday.new(conn_opts) do |conn|
        conn.response :raise_error
        conn.headers = {
          user_agent: 'Websolr Client (Faraday with Typhoeus)',
          # Ask the server to keep the connection open for reuse
          'Keep-Alive': 'timeout=10, max=1000'
        }.merge(conn_opts[:headers])
        conn.adapter :typhoeus
      end
    end
  end
end

Now when Sunspot runs, it will use the custom class, which is using Typhoeus, HTTP Keep-Alive, and a header that tells the Websolr proxy to load balance read/write operations between the replica and primary cores, respectively.

I didn’t implement compression because RSolr doesn’t support it at all. I verified that it’s possible with some meta-programming and class_eval, but it feels like a PR on the RSolr project is probably a more appropriate way to go there.
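For the curious, here’s a rough sketch of the general idea: a Faraday middleware that gzips outgoing request bodies. To be clear, this is an illustration, not the class_eval patch I tested, and RSolr would still need changes to opt into it; the middleware name and size threshold are my own inventions.

require 'faraday'
require 'zlib'

# Hypothetical middleware: compress large request bodies before they go
# over the wire, and label them so the server knows to inflate them
class GzipRequestBody < Faraday::Middleware
  def call(env)
    if env.body.is_a?(String) && env.body.bytesize > 1024
      env.body = Zlib.gzip(env.body)
      env.request_headers['Content-Encoding'] = 'gzip'
    end
    @app.call(env)
  end
end

Faraday::Request.register_middleware(gzip_request_body: -> { GzipRequestBody })

With that registered, a client could opt in with conn.request :gzip_request_body inside the Faraday.new block, assuming the server accepts gzipped requests.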

Benchmarking

We have a public repo with a basic Rails app, a single model, and a Rake task for generating records. The purpose of this app was to give our Success team a framework for dogfooding our documentation. I checked out a branch we already had for Sunspot. We also already had a test index on Websolr available, so I updated the demo app’s config/sunspot.yml file:

development:
  solr:
    scheme: https  # Make sure we're testing SSL!
    hostname: us-east-1.websolr.com
    port: 443
    log_level: INFO
    path: /solr/xxxxxxxxxx # Not the actual path :)

Next, I generated 50K User records using the app’s built-in Rake task. When that was done, I ran a few quick tests using the Benchmark library:

# Get all the User records into memory:
users = User.all

# Benchmark a complete reindex:
puts Benchmark.measure { User.reindex }

# Benchmark 1,000 searches:
puts Benchmark.measure { 1_000.times { User.search { fulltext users.sample.first_name } } }

# Benchmark 1,000 random requests, roughly 1/3 searches and 2/3 updates:
puts Benchmark.measure {
  1_000.times do
    if rand(12) % 3 == 0
      User.search { fulltext users.sample.first_name }
    else
      User.where(id: users.sample.id).update_all(
        first_name: rand(36**10).to_s(36)
      )
    end
  end
}

These tests gave me some baseline numbers for how Sunspot performed by default. Then, because I’d shoved my code into a gem, I just added the gem to the project’s Gemfile:

gem 'websolr', path: '/local/path/to/gem'

And ran bundle install. I cleared the test Websolr index and re-ran the benchmark tests. Here’s what I found, averaging the results over multiple trials:

                   Reindex 50K Documents   1K Random Searches   1K Random Operations
                   (batch size of 50)                           (read/update)
No Gem             496.8 s                 196.7 s              63.6 s
With Gem           279.1 s                 44 s                 17.6 s
Speed Increase     78%                     347%                 261%
Latency Decrease   43.8%                   77.6%                72.3%

That’s a crazy performance improvement over the Net::HTTP adapter. Typhoeus’ libcurl-based solution, along with HTTP Keep-Alive, was enough to give almost a 350% speed increase in searches.

Wrapping Up

It was pretty surprising to me that I was able to wring out such a massive performance boost in Sunspot with such a basic gem. I repeated the trials a few times, and the results were pretty consistent. I even A/B tested the gem with Websolr’s authentication system, and the results held up.

The only downside is that I had to monkey-patch RSolr to get authentication and compression to work, which is a pretty terrible approach in my opinion. Ideally, I’d like to write a PR for RSolr that adds this support, and then have my gem pass through the proper parameters.