created by sacredtremor on 17/05/2021 at 16:24 UTC*
134 upvotes, 3 top-level comments (showing 3)
By Jovan Sardinha, Yue Jin, Alexander Trimm and Garrett Hoffman
Over the years, Reddit has evolved to become a vast and diverse place. At its core, Reddit is a network of communities. From the content in your feeds to the culture you find in discussions across the site, communities are the lifeblood that makes Reddit what it is today. Reddit’s growth over the years has put extreme pressure on the data processing and serving systems that have served us in the past.
This is the journey of how we are building systems that adapt to Reddit and what this has to do with a search for better guides.
​
Getting comfortable navigating a new place is never easy. Whether it’s learning a new subject or exploring a different environment, we’ve all experienced that overwhelming feeling at some point. This feeling holds us back until we have good guides that help us navigate the new terrain.
The sheer scale and diversity that Reddit embodies can be challenging to maneuver at first. If Reddit were a city, the r/popular[1] page would be the town hall, where you can see what is drawing the most discussion. This is where new users get their first taste of Reddit and our core users stumble upon new communities to add to their vast catalogue. The home feed at reddit.com[2] would be the equivalent to a neighborhood park and where each user gets personalized content based on what they subscribed to. For our users, these feeds act as important guides that help them navigate Reddit and discover content that is relevant to their interests.
1: https://www.reddit.com/r/popular/
​
In 2016, our machine learning models promoted discussion and content that was fresh and liked by people similar to you. This promoted new content and communities that showcased what Reddit had to offer at a point in time.
With more diversity of content being published to the platform, our original approach started breaking down. Today, content on Reddit completely changes in minutes; while content that would be relevant to a user could change depending on what they recently visited.
The users that make up Reddit are more diverse than ever before. People with a variety of backgrounds, beliefs and situations visit Reddit everyday. In addition, our user interests and attitudes change over time and expect their Reddit experience to reflect this change.
Our traditional approaches did not personalize the Reddit experience to accommodate this dynamic environment. Given the amount of change that was taking place, we knew we were quickly approaching a breaking point.
​
To build something our users would love:
To do this, we broke down user personalization into a collection of supervised learning subtasks. These subtasks enable our systems to learn a general personalization policy. To help us iteratively learn this policy, we set up a closed loop system (as illustrated below) where each experiment builds on previous learnings:
This system is made up of four key components. These components work together to generate a personalized feed experience for each Reddit user. A further breakdown of each component:
​
These datasets contain features that are aggregated on a per user, per post basis across a bounded time horizon (as shown in the image above). Models that train on these datasets simultaneously embed users, subreddits, posts, and user contexts which allow them to predict user actions for a specific situation. For example, for each Reddit user, the model is able to assign a probability the user will upvote any new post, while also assigning a probability the user will subscribe to that subreddit, and if they will comment on the post. These probabilities can be used to estimate long term measures such as retention.
Multi-task models have become particularly important at Reddit. Users engage with content in many ways, with many content types, and their engagement tells us what content and communities they value. This type of training also implicitly captures negative feedback - content the user chose not to engage with, downvotes, or communities they unsubscribe from.
We train our multi-task neural network models (example architecture shown below) using simple gradient descent-style optimization - like that provided by TensorFlow. At Reddit, we layer sequential Monte Carlo algorithms on top to search for model topology given a collection of subtasks. This allows us to start simple and systematically explore the search space in order to demonstrate the relative value of deep and multi-task structures.
We have a system that allows anyone at Reddit to easily create new machine learning features. Once these features are created, this system takes care of updating, storing and making these features available to our models in a performant manner.
For real-time features, an event processing system that is built on Kafka pipelines and Flink stream processing directly consumes every key event in real-time to compute features. Similar to the batch features, our systems take care of making these features available to the model in a performant manner.
This component maintains a 99.9% uptime and constructs a feed with p99 in the low hundred milliseconds. Which means that this design should hold as we scale to handle trillions of recommendations per day.
​
​
​
As the world around has changed, we’ve evolved Reddit’s platform:
‘Evolve’ is a core value for all of us at Reddit. This system not only gives us the ability to deal with an ever growing platform, but to try different approaches at a much faster rate. Our next steps will involve experimentation at a new scale as we better understand what makes this place special for our users.
We believe we are just taking the first steps in our journey and our most important changes are yet to come. If this is something that interests you and you would like to join our machine learning teams, check out our careers page[3] for a list of open positions.
3: https://www.redditinc.com/careers/#hiring
Comment by L18CP at 17/05/2021 at 20:19 UTC
2 upvotes, 0 direct replies
First
Comment by yourtalllife at 18/05/2021 at 00:43 UTC
2 upvotes, 0 direct replies
How do you measure the quality of the recommendations? The "model evaluation" sections seems a bit sparse.
Comment by P0pMan20 at 19/05/2021 at 11:49 UTC
2 upvotes, 0 direct replies
Interesting.