Recommend personalized content for your app users using BigQuery

Abdul Rahman Babil
Google Cloud - Community
7 min read · Jan 9, 2021


How many times have you been surprised when your favorite app showed you relevant content?

Whether you're using YouTube, Netflix, or any other app, you'll notice that the recommendation system is a main player in its success. But building a recommendation system can be heavy for a startup: you add machine learning algorithms like collaborative filtering or content-based filtering, then you train a model on your big data, and in the end you're left with a static model for predicting content relevance to the user.

That approach is hard and expensive to apply: every model training run takes hours, which can make your recommendation engine stale over time! Model drift is an issue that will chase you constantly, because users' data keeps changing at a very fast pace.

Alternative solution

The solution I propose here is tailored to news aggregators. It uses explicit rules, unlike ML, and you can change the rules to match your own domain!

News aggregators build a curated list of articles from different publishers and topics, and let users follow publishers and read articles. Smart news aggregators try to serve the user with relevant articles. Studying this use case, we noticed a few rules that improve the user experience when it comes to serving articles:

The user is interested in articles from specific topics.

The user prefers some publishers over others.

The user prefers articles that contain certain main keywords.

The recommender should not recommend articles that the user has already read.

This recommender is simple: it's a command that runs as a cron job to aggregate recommended articles for users on an hourly basis. The work is done offline, preparing recommended articles to be pushed or highlighted for users, and the job can take only a few minutes to execute even when you have thousands of articles and hundreds of thousands of users!

BigQuery

Why did I prefer BigQuery in the first place?

BigQuery is a serverless service from GCP (Google Cloud Platform): very efficient and powerful for big-data work, easy to integrate with other GCP services, and with no sysops or hosting to manage. It has a CLI and SDKs for many modern programming languages like PHP, Go, Node.js, etc., it is easy to load data into, and it offers two different approaches to billing, among many other strengths; just take a look over it from here.

Make sure you use BigQuery as a warehouse: you keep a copy of your data there for long-term storage. You will probably need to build ETL pipelines to keep moving data from your DB/ES to BigQuery; GCP provides a tool to handle that for you, Cloud Data Fusion.
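If you prefer to hand-roll part of that pipeline instead, one common pattern is a periodic MERGE from a freshly loaded staging table into the warehouse copy. A minimal sketch, assuming exports land in a staging table first (article_staging is a hypothetical name, not part of the sample dataset):

-- Upsert fresh rows from staging into the warehouse copy of article.
merge test_data.article as t
using test_data.article_staging as s
on t.id = s.id
when matched then
  update set score = s.score, topic_id = s.topic_id, publisher_id = s.publisher_id
when not matched then
  insert (id, publisher_id, topic_id, score)
  values (s.id, s.publisher_id, s.topic_id, s.score)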

Main dish!

First, all the queries here and the example data are available on GitHub.

I won't discuss how you are going to run this query; personally, I would use Laravel, which has a nice way to build commands using the Artisan Console. You just need to add the PHP BigQuery SDK to your composer dependencies!

Here is the data we need to process: the article table holds the candidate articles, the user table holds the users, and each article belongs to a topic and a publisher.

user_article_feed keeps articles that the user has seen in the feed.

user_article_read keeps articles that the user has read before.
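To make the structure concrete, here is a rough DDL sketch of these tables as I infer them from the queries below; the exact column names and types in the sample dataset may differ:

-- Inferred schema sketch, not the exact sample-dataset DDL.
create table test_data.article (
  id INT64, publisher_id INT64, topic_id INT64, score INT64
);
create table test_data.user (id INT64, name STRING);
create table test_data.topic (id INT64, name STRING);
create table test_data.publisher (id INT64, name STRING);
create table test_data.user_article_feed (user_id INT64, article_id INT64);
create table test_data.user_article_read (user_id INT64, article_id INT64);
-- Pre-calculated relevance tables (0 -> 1), explained in a moment:
create table test_data.user_publisher (user_id INT64, publisher_id INT64, relevance FLOAT64);
create table test_data.user_topic (user_id INT64, topic_id INT64, relevance FLOAT64);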

To explore the article table, let us run:

select * from test_data.article

In the user table, we have 1,000 users (id, name).

Based on historical user data, we can notice that user X may be interested in certain topics and publishers, judging by views, shares, reactions, CTR, etc. From these signals we can generate a user-to-publisher and user-to-topic relevance score (0 -> 1), and this is where user_publisher and user_topic come in: both are pre-calculated tables that have to be well maintained and updated frequently.
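How you derive the relevance is up to your domain. As a minimal sketch, assuming relevance is approximated purely by each user's share of reads per topic (a real pipeline would blend views, shares, reactions, and CTR):

-- Refresh user_topic: each user's fraction of reads that fall in each topic.
create or replace table test_data.user_topic as
select
  r.user_id,
  a.topic_id,
  count(*) / sum(count(*)) over (partition by r.user_id) as relevance
from test_data.user_article_read as r
join test_data.article as a on a.id = r.article_id
group by r.user_id, a.topic_id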

In this example, we have 100 different articles and 1,000 users, which means 100,000 different combinations: each user has 100 different articles he could read, but in the end a single user only needs the top X most relevant ones. BigQuery offers a great, scalable way to do that using nothing but Standard SQL!

Here is the simple query that achieves that:

with article as 
(
select * from test_data.article
),
user as (
select * from test_data.user
),
topic as (
select * from test_data.topic
),
user_read as (
select * from `test_data.user_article_read`
),
user_feed as (
select * from `test_data.user_article_feed`
),
user_publisher as (
select * from test_data.user_publisher
),
user_topic as (
select * from test_data.user_topic
),
-- Normalize each article's score to a 0 -> 1000 scale within its topic.
enhanced_articles as (
select *,
ROW_NUMBER() over (partition by article.topic_id order by score desc) as article_rank,
max(score) over (partition by article.topic_id) as article_max_score,
min(score) over (partition by article.topic_id) as article_min_score,
1000 * (score / max(score) over (partition by article.topic_id)) as enhanced_score
from article
),
-- Pair every user with every candidate article.
user_article as (
select user.id as user_id, article.id as article_id, article.score as original_score, article.publisher_id, article.topic_id, article.enhanced_score
from user cross join enhanced_articles as article
),
-- Attach the user's publisher and topic relevance to each combination.
user_article_with_dimension_relevance as (
select user_article.*,
user_publisher.relevance as user_publisher_relevance,
user_topic.relevance as user_topic_relevance
from user_article
left join user_publisher on user_publisher.user_id = user_article.user_id and user_article.publisher_id = user_publisher.publisher_id
left join user_topic on user_topic.user_id = user_article.user_id and user_topic.topic_id = user_article.topic_id
),
-- Flag combinations the user has already read or already seen in the feed.
user_article_check_read_and_feed as (
select distinct user_article_with_dimension_relevance.*,
IF(user_read.article_id is null, 0, 1) as read,
IF(user_feed.article_id is null, 0, 1) as feed
from user_article_with_dimension_relevance
left join user_feed on user_feed.user_id = user_article_with_dimension_relevance.user_id and user_feed.article_id = user_article_with_dimension_relevance.article_id
left join user_read on user_read.user_id = user_article_with_dimension_relevance.user_id and user_read.article_id = user_article_with_dimension_relevance.article_id
),
-- Keep only unseen, unread articles with non-zero relevance on both dimensions.
user_article_filtered as (
select * from user_article_check_read_and_feed where read = 0 and feed = 0 and IFNULL(user_publisher_relevance, 0) != 0 and IFNULL(user_topic_relevance, 0) != 0
),
user_article_with_relevance_score as (
select u.user_id, u.article_id, u.original_score, u.publisher_id, u.topic_id, ((user_publisher_relevance + user_topic_relevance) / 2) as relevance_score, u.enhanced_score
from user_article_filtered as u
),
-- Final per-user score: normalized article score weighted by user relevance.
user_article_with_calculated_score as (
select *, (enhanced_score * relevance_score) as user_article_relevance_score,
ROW_NUMBER() over (partition by user_id order by (enhanced_score * relevance_score) desc) as user_article_rank
from user_article_with_relevance_score
),
-- One row per user, top 5 articles packed into a JSON array, ordered by rank.
user_with_top_articles as (
select user_id,
to_json_string(ARRAY_AGG(
STRUCT<article_id INT64, original_score INT64, publisher_id INT64, topic_id INT64, relevance_score FLOAT64, enhanced_score FLOAT64, user_article_relevance_score FLOAT64, user_article_rank INT64>(article_id, original_score, publisher_id, topic_id, relevance_score, enhanced_score, user_article_relevance_score, user_article_rank)
order by user_article_rank
) ) data
from user_article_with_calculated_score
where user_article_rank <= 5
group by user_id
)
select * from user_with_top_articles

This query is a pipeline of steps expressed as chained CTEs: the final select at the bottom pulls from the steps defined above it, but we will discuss them top to bottom.

I will skip the first few sub-queries because they are trivial. enhanced_articles establishes common ground for article scores: since raw score ranges can vary between topics, we rescale each article's score to a value from 0 -> 1000 within its topic.
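Here is the same normalization applied to made-up inline numbers, just to show the effect:

-- Made-up scores for two topics, purely for illustration:
with article as (
  select 1 as id, 10 as topic_id, 80 as score union all
  select 2, 10, 40 union all
  select 3, 20, 400 union all
  select 4, 20, 100
)
select id, topic_id, score,
       1000 * (score / max(score) over (partition by topic_id)) as enhanced_score
from article
-- id 1 -> 1000.0, id 2 -> 500.0, id 3 -> 1000.0, id 4 -> 250.0:
-- the top article of each topic now lands on the same 1000 scale.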

user_article_with_dimension_relevance will hold all user-article combinations, along with the article's score, the user's topic relevance, and the user's publisher relevance.

user_article_check_read_and_feed flags each combination with read and feed indicators: if the user has already read the article, or it has already appeared in his feed, there is no need to recommend it again, so user_article_filtered drops those rows (along with combinations that have zero topic or publisher relevance), lowering the number of candidate articles per user.

user_article_with_calculated_score here we calculate the user-article score from the article's enhanced score and the user's topic and publisher relevance.

The first recommended article for user #2 has a low raw score compared with other articles, but its topic and publisher are the user's preferred ones, it has the best enhanced score (1000) within its topic, and the user hasn't read it yet, which means there is a high chance it will grab the user's attention!

Also, we notice the 2nd recommended article for user #2 has a lower raw score than the 3rd one; both have the same publisher but come from different topics, and this user is more interested in topic #34 than in topic #20. The final score used for ranking is the article's enhanced score multiplied by the relevance score, where the relevance score is (publisher relevance + topic relevance) / 2.
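To see that multiplication at work on illustrative numbers (not taken from the sample dataset):

-- Illustrative numbers only: relevance reorders two articles for one user.
select article, enhanced_score,
       (publisher_rel + topic_rel) / 2 as relevance_score,
       enhanced_score * ((publisher_rel + topic_rel) / 2) as final_score
from unnest([
  struct('A' as article, 900.0 as enhanced_score, 0.9 as publisher_rel, 0.3 as topic_rel),
  struct('B', 1000.0, 0.9, 0.7)
])
order by final_score desc
-- B wins: 1000 * 0.8 = 800 versus A: 900 * 0.6 = 540.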

user_with_top_articles this last step just reshapes the output to make it easier for your command to consume the results: rather than getting X rows per user, you get one row per user, with the articles grouped into a JSON array!

Conclusion

BigQuery is a powerful tool that can process terabytes of data in seconds or minutes, and you can use that power in background jobs in many different ways. Here I showed a simple approach I use to process ~100 GB of data in about 2 minutes, using nothing but SQL.

A recommendation engine is a crucial part of your app. This one is very simple, a good fit for teams that don't have an ML scientist, and it also serves as an example of how BigQuery can be used.

Feel free to share your thoughts in the comments below. If you have any questions or feedback, don't hesitate to ask!
