Last week we added the ability to search posts using Graph Search, a
feature that has been two years in the making. With one billion new
posts added every day, the posts index contains more than one trillion
total posts, comprising hundreds of terabytes of data. Indexing these
posts and building a system to return real-time results has been a
significant engineering challenge, and this is just the beginning.
Collecting the Data
One of the biggest challenges we faced in building this product was that
Facebook's underlying data schema reflects the needs of a rapidly
iterated web service. New features often demand changes to data schemas,
and our culture aims to make these changes easy for engineers. However,
these variations make it difficult to sort posts by time, location,
and tags, since wall posts, photos, and check-ins all store this information
differently. We currently have 70 different kinds of data we sort and
index on, many of them specific to certain types of posts. In addition,
the data is contained in a production MySQL database. This means that
harvesting this data puts a significant load on databases that are
simultaneously serving production traffic, and the process must be
carefully monitored.
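As a rough illustration of the normalization problem, here is a minimal
Python sketch that maps a few hypothetical post types onto one common
indexable record. The field names, extractors, and record layout are
invented for illustration; they are not Facebook's actual schema.

    from dataclasses import dataclass
    from typing import Callable, Dict, Optional

    @dataclass
    class IndexRecord:
        # Common shape every post type is normalized into before indexing.
        post_id: int
        author_id: int
        created_time: Optional[int]   # unix timestamp, if available
        location: Optional[str]
        tagged_ids: tuple

    # Each post type stores time, location, and tags differently, so we
    # keep one extractor per type that maps a raw row to the common record.
    def from_wall_post(row: Dict) -> IndexRecord:
        return IndexRecord(row["id"], row["actor"], row["time"],
                           None, tuple(row.get("tagged", ())))

    def from_checkin(row: Dict) -> IndexRecord:
        return IndexRecord(row["id"], row["author"], row["timestamp"],
                           row.get("place_name"), tuple(row.get("with", ())))

    EXTRACTORS: Dict[str, Callable[[Dict], IndexRecord]] = {
        "wall_post": from_wall_post,
        "checkin": from_checkin,
        # ... one entry per post type; the real system indexes on
        # roughly 70 different kinds of data.
    }

    def normalize(post_type: str, row: Dict) -> IndexRecord:
        return EXTRACTORS[post_type](row)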
Building the Index
We store our harvested data in an HBase cluster, from which we execute Hadoop
map-reduce jobs to build our index in a highly parallel process. This
index-building process converts the raw data into a search index that
works with Unicorn, our search infrastructure. We separate the data
into two parts: the
document data and the inverted index. The document data for each post
contains information that will later be used for ranking results. The
inverted index contains what is traditionally considered to be a search
index, and building the inverted index requires going through each post
and determining which hypothetical search filters match.
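The shape of those two parts can be sketched in a few lines of Python.
This toy, single-process builder stands in for the Hadoop map-reduce
jobs and for Unicorn's real on-disk format; the filter names and
document-data fields are assumptions made for the example.

    from collections import defaultdict

    def map_post(record):
        # Emit (filter, post_id) for every hypothetical search filter
        # this post matches; real filters would cover authors, tags,
        # locations, time ranges, and so on.
        yield (f"author:{record['author_id']}", record["post_id"])
        for term in record["text"].lower().split():
            yield (f"term:{term}", record["post_id"])

    def build_index(records):
        document_data = {}             # post_id -> signals used for ranking
        inverted = defaultdict(list)   # filter  -> list of matching post ids
        for rec in records:
            document_data[rec["post_id"]] = {"author": rec["author_id"]}
            for filt, pid in map_post(rec):
                inverted[filt].append(pid)
        # The reduce step sorts each postings list so lists can be
        # intersected efficiently at query time.
        return document_data, {f: sorted(p) for f, p in inverted.items()}

    docs, idx = build_index([
        {"post_id": 1, "author_id": 42, "text": "Hello Graph Search"},
    ])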
Updating the Index
To update the index, we subscribe to changes in the MySQL databases
using a technology called Wormhole.
Whenever a new post is created, an existing post is modified, a post is
deleted, or some connection to a post is edited, we schedule an update
on the corresponding post. To avoid code duplication, we run these
updates through the same logic mentioned in the "Collecting the Data"
section above. However, there is one major distinction: when we
originally collect the data, we intentionally bypass our caches, as we
want to avoid sending them requests for old data that is likely missing
from the cache. When processing an update to the index we hit those
caches, as we expect the data to have been recently accessed and be
present in the cache.
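The two read paths can be summarized in a short hedged sketch, where
load_post, cache, and db_fetch are hypothetical stand-ins rather than
real internal APIs.

    def load_post(post_id, cache, db_fetch, initial_bulk_load):
        if not initial_bulk_load:
            # Update path: the post was just created or modified, so it
            # is likely in cache and reading through the cache is cheap.
            hit = cache.get(post_id)
            if hit is not None:
                return hit
        # Bulk-collection path (or a cache miss): go straight to MySQL,
        # so requests for old, cold posts don't hit caches they would
        # almost certainly miss anyway.
        return db_fetch(post_id)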
Serving the Index
The posts index is much larger than other search indexes that Facebook maintains.
Before we started working on the ability to search through posts, all
Facebook search indexes were served entirely from RAM. This was ideal
for quick lookup and was a tenable setup given reasonably small search
indexes. However, storing more than 700 terabytes in RAM imposes a large
amount of overhead, as it involves maintaining an index that is spread
across many racks of machines. The performance cost of having these
machines coordinate with each other drove the Unicorn team to look into
new solutions for serving the posts index. The solution we decided on
involves storing the majority of the index on solid-state flash memory.
We managed to preserve performance by carefully separating out the most
frequently accessed data structures and placing those in RAM.
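One way to picture that split is the tiered lookup below: the hot,
frequently accessed postings are pinned in a plain dict (standing in
for RAM), while everything else is read from a slower key-value store
(shelve here, standing in for flash). This is only a sketch of the
idea under assumed names, not the Unicorn team's implementation.

    import shelve

    class TieredIndex:
        def __init__(self, flash_path, hot_filters):
            self.flash = shelve.open(flash_path)   # flash-resident bulk
            # Pin the most frequently accessed postings lists in memory.
            self.ram = {f: self.flash[f] for f in hot_filters
                        if f in self.flash}

        def postings(self, filt):
            hit = self.ram.get(filt)
            if hit is not None:
                return hit                         # served from RAM
            return self.flash.get(filt, [])        # one flash read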
Ranking Results
With a trillion posts in the index, most queries return many more results
than anyone could ever read. This leads us to the results ranking step.
To surface content that is valuable and relevant to the user, we use
two primary techniques: query rewriting and dynamic result scoring.
Query rewriting happens before the execution of the query, and involves
tacking on optional clauses to search queries that bias the posts we
retrieve towards results that we think will be more valuable to the
user. Result scoring involves sorting and selecting documents based on a
number of ranking "features," each of which is based on the information
available in the document data. In total, we currently calculate well
over a hundred distinct ranking features that are combined with a
ranking model to find the best results. We will continue to work on
refining these models as we roll out to more users and listen to
feedback.
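Both techniques can be sketched with invented weights and feature
names; the real system combines well over a hundred features in a
learned ranking model, so this is an illustration of the structure
only.

    def rewrite_query(terms, friend_ids):
        # Query rewriting: append optional clauses that bias retrieval
        # toward posts we expect to be more valuable, e.g. by friends.
        optional = [f"author:{fid}" for fid in friend_ids]
        return {"required": terms, "optional": optional}

    # Hypothetical ranking features and hand-picked weights.
    WEIGHTS = {"is_friend_author": 2.0, "recency": 1.0, "like_count": 0.5}

    def score(doc_features):
        # Dynamic result scoring: a weighted combination of per-document
        # ranking features drawn from the document data.
        return sum(WEIGHTS[f] * doc_features.get(f, 0.0) for f in WEIGHTS)

    candidates = [
        {"is_friend_author": 1, "recency": 0.8, "like_count": 3},
        {"is_friend_author": 0, "recency": 0.9, "like_count": 10},
    ]
    top = sorted(candidates, key=score, reverse=True)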
Project History
Like many other products at
Facebook, the ability to search over posts was originally conceived as a
Hackathon project. My second day as a Facebook intern coincided with a
company-wide Hackathon, and I spent the night aiming to implement a way
for my friends and me to find old posts we had written. I quickly
discovered that the project was much more challenging than I had first
anticipated. However, the engineering culture at Facebook meant that I
was supported and encouraged to continue working on it, despite the
scope of the project. The majority of the work (infrastructure, ranking,
and product) has been accomplished in the past year by a few dozen
engineers on the Graph Search team. We are very excited to share the
results of our hard work with people using Facebook, and look forward to
seeing how people use the new feature and improving it based on their
feedback. We hope that being able to search for posts will enable a
richer Facebook experience for everyone.
Source: Facebook Engineering