Last week we added the ability to search posts using Graph Search, a
feature that has been two years in the making. With one billion new
posts added every day, the posts index contains more than one trillion
total posts, comprising hundreds of terabytes of data. Indexing these
posts and building a system to return real-time results has been a
significant engineering challenge, and this is just the beginning.
Collecting the Data
One of the biggest challenges we faced in building this product was that
Facebook's underlying data schema reflects the needs of a rapidly
iterated web service. New features often demand changes to data schemas,
and our culture aims to make these changes easy for engineers. However,
these variations make it difficult to sort posts by time, location,
and tags, since wall posts, photos, and check-ins all store this information
differently. We currently have 70 different kinds of data we sort and
index on, many of them specific to certain types of posts. In addition,
the data is contained in a production MySQL database. This means that
harvesting this data puts a significant load on databases that are
simultaneously serving production traffic, and the process must be
carefully monitored.
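As a rough illustration of the normalization problem, here is a minimal
Python sketch that maps a few hypothetical post types onto one common
indexable record. The field names, extractors, and record layout are
invented for illustration; they are not Facebook's actual schema.

    from dataclasses import dataclass
    from typing import Callable, Dict, Optional

    @dataclass
    class IndexRecord:
        # Common shape every post type is normalized into before indexing.
        post_id: int
        author_id: int
        created_time: Optional[int]   # unix timestamp, if available
        location: Optional[str]
        tagged_ids: tuple

    # Each post type stores time, location, and tags differently, so we
    # keep one extractor per type that maps a raw row to the common record.
    def from_wall_post(row: Dict) -> IndexRecord:
        return IndexRecord(row["id"], row["actor"], row["time"],
                           None, tuple(row.get("tagged", ())))

    def from_checkin(row: Dict) -> IndexRecord:
        return IndexRecord(row["id"], row["author"], row["timestamp"],
                           row.get("place_name"), tuple(row.get("with", ())))

    EXTRACTORS: Dict[str, Callable[[Dict], IndexRecord]] = {
        "wall_post": from_wall_post,
        "checkin": from_checkin,
        # ... one entry per post type; the real system indexes on
        # roughly 70 different kinds of data.
    }

    def normalize(post_type: str, row: Dict) -> IndexRecord:
        return EXTRACTORS[post_type](row)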
Building the Index
We store our harvested data in an HBase cluster, from which we execute Hadoop
map-reduce jobs to build our index in a highly parallel process. This
index-building process converts the raw data into a search index that
works with Unicorn, our search infrastructure. We separate the data
into two parts: the
document data and the inverted index. The document data for each post
contains information that will later be used for ranking results. The
inverted index contains what is traditionally considered to be a search
index, and building the inverted index requires going through each post
and determining which hypothetical search filters match.
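The shape of those two parts can be sketched in a few lines of Python.
This toy, single-process builder stands in for the Hadoop map-reduce
jobs and for Unicorn's real on-disk format; the filter names and
document-data fields are assumptions made for the example.

    from collections import defaultdict

    def map_post(record):
        # Emit (filter, post_id) for every hypothetical search filter
        # this post matches; real filters would cover authors, tags,
        # locations, time ranges, and so on.
        yield (f"author:{record['author_id']}", record["post_id"])
        for term in record["text"].lower().split():
            yield (f"term:{term}", record["post_id"])

    def build_index(records):
        document_data = {}             # post_id -> signals used for ranking
        inverted = defaultdict(list)   # filter  -> list of matching post ids
        for rec in records:
            document_data[rec["post_id"]] = {"author": rec["author_id"]}
            for filt, pid in map_post(rec):
                inverted[filt].append(pid)
        # The reduce step sorts each postings list so lists can be
        # intersected efficiently at query time.
        return document_data, {f: sorted(p) for f, p in inverted.items()}

    docs, idx = build_index([
        {"post_id": 1, "author_id": 42, "text": "Hello Graph Search"},
    ])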
Updating the Index
To update the index, we subscribe to changes in the MySQL databases
using a technology called Wormhole.
Whenever a new post is created, an existing post is modified, a post is
deleted, or some connection to a post is edited, we schedule an update
on the corresponding post. To avoid code duplication, we run these
updates through the same logic mentioned in the "Collecting the Data"
section above. However, there is one major distinction: when we
originally collect the data, we intentionally bypass our caches, as we
want to avoid sending them requests for old data that is likely missing
from the cache. When processing an update to the index we hit those
caches, as we expect the data to have been recently accessed and be
present in the cache.
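The two read paths can be summarized in a short hedged sketch, where
load_post, cache, and db_fetch are hypothetical stand-ins rather than
real internal APIs.

    def load_post(post_id, cache, db_fetch, initial_bulk_load):
        if not initial_bulk_load:
            # Update path: the post was just created or modified, so it
            # is likely in cache and reading through the cache is cheap.
            hit = cache.get(post_id)
            if hit is not None:
                return hit
        # Bulk-collection path (or a cache miss): go straight to MySQL,
        # so requests for old, cold posts don't hit caches they would
        # almost certainly miss anyway.
        return db_fetch(post_id)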
Serving the Index
The posts index is much larger than other search indexes that Facebook maintains.
Before we started working on the ability to search through posts, all
Facebook search indexes were served entirely from RAM. This was ideal
for quick lookup and was a tenable setup given reasonably small search
indexes. However, storing more than 700 terabytes in RAM imposes a large
amount of overhead, as it involves maintaining an index that is spread
across many racks of machines. The performance cost of having these
machines coordinate with each other drove the Unicorn team to look into
new solutions for serving the posts index. The solution we decided on
involves storing the majority of the index on solid-state flash memory.
We managed to preserve performance by carefully separating out the most
frequently accessed data structures and placing those in RAM.
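One way to picture that split is the tiered lookup below: the hot,
frequently accessed postings are pinned in a plain dict (standing in
for RAM), while everything else is read from a slower key-value store
(shelve here, standing in for flash). This is only a sketch of the
idea under assumed names, not the Unicorn team's implementation.

    import shelve

    class TieredIndex:
        def __init__(self, flash_path, hot_filters):
            self.flash = shelve.open(flash_path)   # flash-resident bulk
            # Pin the most frequently accessed postings lists in memory.
            self.ram = {f: self.flash[f] for f in hot_filters
                        if f in self.flash}

        def postings(self, filt):
            hit = self.ram.get(filt)
            if hit is not None:
                return hit                         # served from RAM
            return self.flash.get(filt, [])        # one flash read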
Ranking Results
With a trillion posts in the index, most queries return many more results
than anyone could ever read. This leads us to the results ranking step.
To surface content that is valuable and relevant to the user, we use
two primary techniques: query rewriting and dynamic result scoring.
Query rewriting happens before the execution of the query, and involves
tacking on optional clauses to search queries that bias the posts we
retrieve towards results that we think will be more valuable to the
user. Result scoring involves sorting and selecting documents based on a
number of ranking "features," each of which is based on the information
available in the document data. In total, we currently calculate well
over a hundred distinct ranking features that are combined with a
ranking model to find the best results. We will continue to work on
refining these models as we roll out to more users and listen to
feedback.
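Both techniques can be sketched with invented weights and feature
names; the real system combines well over a hundred features in a
learned ranking model, so this is an illustration of the structure
only.

    def rewrite_query(terms, friend_ids):
        # Query rewriting: append optional clauses that bias retrieval
        # toward posts we expect to be more valuable, e.g. by friends.
        optional = [f"author:{fid}" for fid in friend_ids]
        return {"required": terms, "optional": optional}

    # Hypothetical ranking features and hand-picked weights.
    WEIGHTS = {"is_friend_author": 2.0, "recency": 1.0, "like_count": 0.5}

    def score(doc_features):
        # Dynamic result scoring: a weighted combination of per-document
        # ranking features drawn from the document data.
        return sum(WEIGHTS[f] * doc_features.get(f, 0.0) for f in WEIGHTS)

    candidates = [
        {"is_friend_author": 1, "recency": 0.8, "like_count": 3},
        {"is_friend_author": 0, "recency": 0.9, "like_count": 10},
    ]
    top = sorted(candidates, key=score, reverse=True)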
Project History
Like many other products at
Facebook, the ability to search over posts was originally conceived as a
Hackathon project. My second day as a Facebook intern coincided with a
company-wide Hackathon, and I spent the night aiming to implement a way
for my friends and me to find old posts we had written. I quickly
discovered that the project was much more challenging than I had first
anticipated. However, the engineering culture at Facebook meant that I
was supported and encouraged to continue working on it, despite the
scope of the project. The majority of the work (infrastructure, ranking,
and product) has been accomplished in the past year by a few dozen
engineers on the Graph Search team. We are very excited to share the
results of our hard work with people using Facebook, and look forward to
seeing how people use the new feature and improving it based on their
feedback. We hope that being able to search for posts will enable a
richer Facebook experience for everyone.
Source: Facebook Engineering