Behavioral Signal Engine:
User Engagement Profiling and Recommendation Intelligence
Recommendation systems are easy to build and hard to build well. The difference usually comes down to one thing that most implementations skip: not all behavioral data is equally trustworthy. A user who rated three films five years ago and never returned is not the same kind of signal as a user who has been rating consistently across genres for two years. Treating them the same way produces a model that is technically trained but practically unreliable.
This project starts from that problem. Before building a recommendation system, it asks which users are worth building one for, and how much to trust each user's behavioral data when doing so. The dataset is real: 100,836 ratings from 610 users across 9,724 films, collected by GroupLens Research at the University of Minnesota and published as the MovieLens dataset. No synthetic data was generated or used at any stage of this analysis.
The work runs across four areas: profiling user engagement quality from behavioral signals, comparing two recommendation approaches against each other on real held-out data, detecting anomalous behavioral patterns statistically, and evaluating the comparison through a structured experiment framework that states what it proves and what it doesn't.
What are we going to analyze?
The analysis works through the data in four stages.
The first stage builds a behavioral profile for each user from signals that are observable in any engagement dataset: how much they engage, how broadly, how consistently, and whether their ratings distinguish between content they like and content they don't.
The second stage compares collaborative filtering and content-based ranking, not to declare one a winner in the abstract, but to understand where each approach works and where it breaks down on this specific population.
The third stage runs statistical checks on the rating stream to identify users whose behavioral patterns don't reflect genuine engagement.
The fourth stage formalizes the comparison between approaches into an experiment with a measurable result and an honest statement of its limitations.
User Profile Quality Distribution
The bar chart shows how 610 users distribute across four quality tiers, Low, Medium, High, and Power, based on their behavioral signals. The horizontal axis represents the four tiers and the vertical axis shows how many users fall into each.
The most immediate observation is that 473 users, the clear majority, sit in the Medium tier. These are users who engage with the platform regularly enough to generate useful signal, but whose patterns don't show the depth or breadth that would move them into the High tier. The 105 High tier users are the more valuable segment: they rate across a wider range of genres, they return to the platform with some consistency over time, and their ratings separate content they like from content they don't rather than giving everything a similar score. The 32 Low tier users have ratings in the dataset, but either too few of them or too little spread across genres and time to make their preferences readable.
The Power tier sitting at zero reflects something honest about the dataset itself. With 610 users as a development sample, the density required to reach that tier, high volume, high genre diversity, high consistency, and high discrimination all together, simply wasn't present. A production dataset with millions of users would fill that tier and the recommendations built from it would be meaningfully stronger.
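The write-up names the signals behind the tiers but not their formulas, so the following is only a sketch of how three of them could be computed. Every definition here is an assumption: genre diversity as Shannon entropy scaled by the catalog's genre count, consistency as the share of 30-day windows with at least one rating, and discrimination as the spread of ratings relative to a 0.5 to 5.0 scale. The function names and parameters are hypothetical.

```python
import math
from collections import Counter
from statistics import pstdev


def genre_entropy(genres, n_genres):
    """Shannon entropy of a user's genre counts, scaled to [0, 1].

    `genres` holds one genre label per rating; `n_genres` is the number of
    genres in the catalog (assumed to be known by the caller).
    """
    counts = Counter(genres)
    total = sum(counts.values())
    if total == 0 or n_genres < 2:
        return 0.0
    h = -sum((c / total) * math.log(c / total) for c in counts.values())
    return h / math.log(n_genres)


def activity_consistency(timestamps, period_start, period_end,
                         window_s=30 * 86400):
    """Share of fixed-length windows in the period with at least one rating."""
    n_windows = max(1, math.ceil((period_end - period_start) / window_s))
    active = {int((t - period_start) // window_s)
              for t in timestamps if period_start <= t < period_end}
    return len(active) / n_windows


def rating_discrimination(ratings, scale_min=0.5, scale_max=5.0):
    """Spread of ratings relative to the widest spread the scale allows."""
    if len(ratings) < 2:
        return 0.0
    return pstdev(ratings) / ((scale_max - scale_min) / 2)
```

Under these definitions, a user who rates everything in one sitting gets a low consistency value regardless of volume, which is the behavior the tiers are meant to capture. How the individual signals are combined into a single score is not specified in the analysis, so no weighting is shown.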
Engagement Signal Correlations
The heatmap shows the relationships between five behavioral signals: rating count, genre entropy, activity consistency, rating discrimination, and the overall profile quality score. Each cell shows the correlation coefficient between the two signals on its axes, ranging from -1 for a perfect negative relationship to 1 for a perfect positive one.
The strongest relationship in the matrix is between activity consistency and profile quality score, at 0.73. Users who engage over a spread of time, returning across weeks and months rather than rating in a single concentrated session, tend to have stronger overall profiles. This makes intuitive sense for a recommendation system: a user whose preferences you can observe across different moods, seasons, and contexts gives you a more reliable picture than one who rated 200 films in a weekend and never came back.
The -0.39 between rating count and activity consistency is the most interesting cell in the matrix. It tells you that users who have given the most ratings tend to have done so in bursts rather than evenly over time. Volume and consistency are not the same thing. A recommendation system that treats them as equivalent would overvalue users who engaged heavily in a single session and undervalue the more consistent but less prolific users whose preferences are actually more stable.
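For readers who want to see how a single cell of such a matrix is produced, here is a minimal sketch. It assumes the coefficient is Pearson's r, which the analysis does not state explicitly, and the five users in the example are made up for illustration rather than taken from the dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sx * sy)


# Made-up values for five hypothetical users, not taken from the dataset:
# the heaviest raters here are also the least consistent over time.
rating_count = [40, 90, 150, 600, 2000]
consistency = [0.7, 0.8, 0.5, 0.3, 0.1]
r = pearson(rating_count, consistency)  # negative for this toy data
```

A negative value like this is what the rating count versus consistency cell expresses: as one signal rises across users, the other tends to fall.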
Trust Signals Detected
The horizontal bar chart shows how many users were flagged by each of the four trust detectors. The total is 27 users out of 610, a flag rate of 4.43%.
The largest bar belongs to implausible daily volume, 13 users who rated more items in a single day than genuine engagement would explain. The rapid bulk rating bar shows 10 users whose consecutive ratings came within two seconds of each other, a pattern more consistent with automated behavior than with someone actually watching or considering films. Three users showed outlier rating patterns, sitting more than three standard deviations away from the population mean in their average scores. One user gave near-identical ratings to everything they rated, showing no differentiation at all.
Flagged users were not all handled the same way. The 13 flagged for high-risk patterns were excluded from the recommendation training data. The remaining 14, flagged for lower-severity signals, were down-weighted rather than discarded: their data still contributed to the model, but with less influence. Even imperfect behavioral signal carries some information, and removing it entirely would leave the model with less to work from than treating it carefully.
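The four detectors and the exclude-or-down-weight policy can be sketched roughly as follows. Only the two-second gap and the three-standard-deviation rule come from the description above. The minimum number of fast pairs, the flat-rating tolerance, the 0.5 down-weight, and all function names are assumptions made for illustration. The daily-volume cutoff and the mapping from detector to severity are not given in the analysis, so the sketch leaves both to the caller.

```python
from collections import Counter
from datetime import datetime, timezone
from statistics import mean, pstdev


def has_rapid_bulk_ratings(timestamps, max_gap_s=2, min_pairs=5):
    """True if enough consecutive ratings arrive within `max_gap_s` seconds."""
    ts = sorted(timestamps)
    fast = sum(1 for a, b in zip(ts, ts[1:]) if b - a <= max_gap_s)
    return fast >= min_pairs  # min_pairs is an assumed threshold


def max_daily_volume(timestamps):
    """Largest number of ratings on any single UTC calendar day."""
    days = Counter(datetime.fromtimestamp(t, tz=timezone.utc).date()
                   for t in timestamps)
    return max(days.values(), default=0)


def is_flat_rater(ratings, eps=0.1):
    """True if the user's ratings are near-identical (assumed tolerance)."""
    return len(ratings) > 1 and pstdev(ratings) < eps


def mean_rating_outliers(user_means, z_threshold=3.0):
    """Users whose mean rating is more than `z_threshold` SDs from the mean."""
    values = list(user_means.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return set()
    return {u for u, m in user_means.items()
            if abs(m - mu) / sigma > z_threshold}


def training_weight(severity, low_severity_weight=0.5):
    """0.0 excludes a user from training; below 1.0 down-weights them."""
    if severity == "high":
        return 0.0
    if severity == "low":
        return low_severity_weight  # assumed value, not from the analysis
    return 1.0
```

Keeping the policy in a single weight function means exclusion and down-weighting go through the same code path, and the down-weight can be tuned without touching the detectors.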
Profile Quality Score Distribution
The histogram shows how profile quality scores are distributed across all 610 users. The horizontal axis runs from 0 to 1 and represents the quality score. The vertical axis shows how many users received each score. The dotted vertical lines mark the tier boundaries at 0.25, 0.50, and 0.75.
The distribution is roughly bell-shaped and centered just above 0.4, with the mean sitting at 0.413. Most users cluster in the 0.3 to 0.6 range, which covers most of the Medium band (0.25 to 0.50) and the lower end of the High band. The left tail below 0.25 represents the 32 Low tier users whose signals were too sparse or too uniform to generate a stronger score. The right side shows the gradual thinning of the High tier users above 0.5, and nothing crosses the 0.75 boundary where the Power tier would begin.
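Mapping a score to its tier with the boundaries at 0.25, 0.50, and 0.75 is a simple lookup. The sketch below assumes a score that lands exactly on a boundary belongs to the upper tier, which is not stated in the analysis. The names are hypothetical.

```python
from bisect import bisect_right

TIER_BOUNDS = (0.25, 0.50, 0.75)
TIER_NAMES = ("Low", "Medium", "High", "Power")


def quality_tier(score):
    """Return the tier name for a profile quality score in [0, 1]."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    # bisect_right counts how many boundaries are <= score,
    # which is exactly the index of the tier.
    return TIER_NAMES[bisect_right(TIER_BOUNDS, score)]
```

With this mapping, the mean score of 0.413 falls in the Medium tier.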
What the shape of this distribution tells you practically is that the dataset is usable but not exceptional. A system built on it can generate reasonable recommendations for the users in the middle of the distribution. The edges represent the limits of what a development-scale dataset can support, and a production deployment would want to monitor this distribution over time as user behavior evolves.
Ratings per User Distribution
The histogram shows how many ratings each of the 610 users has given. The horizontal axis represents the number of ratings and the vertical axis shows how many users fall into each range. The mean sits at 165 and the median at 70.
The gap between those two numbers carries more information than either number alone. A mean of 165 and a median of 70 means the distribution is heavily right-skewed: a small number of very active users have each given more than a thousand ratings, pulling the average well above what most users actually do. Those users are visible in the long right tail that extends past 1,000 and reaches up to about 2,500. The majority of users sit in the first two bars of the chart, having rated somewhere between 20 and a few hundred films.
This pattern appears in almost every real-world engagement dataset. A model trained without accounting for this skew learns preferences that reflect the small minority of hyperactive users rather than the typical user. The profile quality score addresses this by weighting consistency and diversity alongside volume: a user with 2,500 ratings concentrated in a single week scores lower than a user with 150 ratings spread evenly across two years, because the latter gives the system a more reliable picture of stable preferences.
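A toy example shows how one heavy rater pulls the mean away from the median and dominates the rating pool. The counts below are invented for illustration and are not taken from the dataset.

```python
from statistics import mean, median

# Made-up ratings-per-user counts: six typical users and one heavy rater.
counts = [20, 35, 50, 70, 90, 120, 2500]

avg = mean(counts)                    # pulled upward by the heavy rater
mid = median(counts)                  # unaffected by the size of the tail
top_share = max(counts) / sum(counts)  # fraction of all ratings from one user
```

Here the median stays at 70 while the mean rises above 400, and a single user supplies most of the ratings an unweighted model would train on.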
A/B Experiment: Collaborative Filtering vs Content-Based Ranking
The experiment compared two ranking approaches on the same population of users, using mean absolute error on held-out ratings as the evaluation metric. For each of 200 users, ratings were split by time: the earlier 80 percent were used for building the model and the later 20 percent were held out for evaluation. Content-based ranking served as the control. Collaborative filtering using SVD matrix factorization served as the treatment.
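A per-user time-based split of this kind can be sketched as below. The tuple layout and the function name are assumptions. The only detail taken from the experiment is the 80/20 split with earlier ratings going to training.

```python
def temporal_split(user_ratings, train_frac=0.8):
    """Split one user's ratings by time.

    `user_ratings` is a list of (timestamp, movie_id, rating) tuples.
    Returns (train, test), where every training rating is at least as old
    as every held-out rating.
    """
    ordered = sorted(user_ratings, key=lambda r: r[0])
    cut = int(len(ordered) * train_frac)
    return ordered[:cut], ordered[cut:]
```

Splitting by time rather than at random keeps the evaluation honest: the model never sees ratings made after the ones it is asked to predict.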
Content-based ranking produced a mean absolute error (MAE) of 0.8384 on held-out ratings. Collaborative filtering produced an MAE of 1.3473. The difference is statistically significant, with a p-value that rounds to 0.0000 at four decimal places (p < 0.0001) and an effect size of -0.7218. On this dataset, content-based ranking predicts held-out ratings more accurately than matrix factorization does.
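The two quantities in this comparison can be illustrated with a short sketch. MAE is standard. The effect size, however, is an assumption: the analysis does not say which statistic or test it used, so the example uses Cohen's d for paired samples on per-user errors, with the sign taken as control minus treatment. It is not meant to reproduce the reported -0.7218.

```python
from statistics import mean, stdev


def mae(actual, predicted):
    """Mean absolute error between actual and predicted ratings."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    return mean(abs(a - p) for a, p in zip(actual, predicted))


def paired_effect_size(control_errors, treatment_errors):
    """Cohen's d for paired samples (assumed choice of effect size).

    Each list holds one error value per user, in the same user order.
    A negative result means the control had the lower error.
    """
    diffs = [c - t for c, t in zip(control_errors, treatment_errors)]
    return mean(diffs) / stdev(diffs)
```

For example, `mae([3, 4], [4, 2])` averages the absolute errors 1 and 2. Under the control-minus-treatment convention, a negative effect size is consistent with the control (content-based ranking) having the lower error.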
The caveat matters as much as the result. This experiment ran on historical data, not live users. Mean absolute error on held-out ratings measures how well the model predicts scores, not whether users actually engage with recommended content. Those are related but not identical. The practical interpretation is that content-based ranking is worth prioritizing for this population, particularly given the 95.9% sparsity of the rating matrix, which limits how much collaborative filtering can learn from the available data. Confirming that with a live engagement experiment would be the right next step before making a deployment decision based on these numbers alone.
What this analysis tells us
The results of this project point in a consistent direction. On a dataset with high sparsity and a user population that skews toward moderate engagement, content-based ranking outperforms collaborative filtering on the metric that was measured. But the more durable finding is in the profiling layer: not all behavioral data is the same quality, and a system that accounts for that, by weighting users differently based on the reliability of their signals and by excluding genuinely anomalous patterns from training, is building on a more honest foundation than one that treats every rating as equally informative.
The 4.43% flag rate from trust detection, the 0.73 correlation between consistency and quality, and the mean quality score of 0.413 are not impressive numbers on their own. They are accurate numbers. They describe a real dataset honestly. And a recommendation system built from honest data, even imperfect data, is more reliable than one built from data that was accepted uncritically.