Stock Price Prediction using News Sentiment & Machine Learning

Exploring the Intersection of Market Data, Sentiment Analysis, and Predictive Modeling

In a market where prices can shift in seconds, forecasting stock movements requires more than historical data: it demands insight into how public sentiment, technical trends, and volatility converge. This project tackles that challenge by combining natural language processing (NLP) with technical analysis indicators to predict Tesla's stock performance using interpretable models.

Built entirely with open-source tools in Python, the system pulls recent market data and news headlines to construct a combined feature set. By engineering features such as RSI, MACD, sentiment deltas, and cumulative returns, and modeling them with ensemble machine learning algorithms, we create a hybrid prediction framework that outputs both the next day’s closing price and its direction of movement.

🔍 What Are We Analyzing?

The focus of this project is to understand how investor sentiment, extracted from recent Tesla-related news headlines, correlates and interacts with technical indicators such as momentum and volatility, and how this combined data can help us predict future stock behavior. Specifically, we ask:

How do short-term public emotions and headlines affect Tesla’s price behavior?
Can momentum signals such as RSI and MACD reinforce sentiment indicators to provide a clearer picture of upcoming trends?
What role do volume shifts and volatility play in mediating these effects?

Ultimately, this investigation aims to train and evaluate both classification and regression models that answer two key questions:

Classification: Will Tesla’s stock price go up or down tomorrow?
Regression: What will be the actual closing price?

🛠️ Technologies and Tools Used

yfinance: Fetching historical stock data

requests + NewsAPI: Extracting real-time news headlines related to Tesla

nltk.punkt: Sentence tokenization for text preprocessing

TextBlob, VADER: Sentiment analysis of news headlines

pandas, numpy: Data manipulation and preprocessing

scikit-learn: Machine learning models and evaluation

xgboost: Gradient boosting classifier

matplotlib, seaborn, plotly: Data visualization and model diagnostics

wordcloud: Visual representation of frequent terms

Data Flow and Processing

1. Stock Data Acquisition and Engineering

We retrieved 200 days of TSLA price data using yfinance, focusing on the Open, High, Low, Close, and Volume columns, from which we derived additional features:

Daily returns, calculated as the percentage difference in closing prices across days.
Rolling volatility (Vol_5), as a 5-day rolling standard deviation of returns.
Cumulative returns, capturing long-term performance momentum.

These indicators allowed us to assess not just market direction but also market strength and risk on a rolling basis.
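The three derived features above can be sketched with pandas as follows. This is a minimal illustration rather than the project's actual code: the function name `add_return_features` and the toy prices are assumptions, and the toy frame stands in for the yfinance download so the example runs offline.

```python
import pandas as pd


def add_return_features(prices: pd.DataFrame, vol_window: int = 5) -> pd.DataFrame:
    """Add Return, rolling volatility, and Cum_Return columns (hypothetical helper)."""
    out = prices.copy()
    # Daily return: percentage change in Close from the previous trading day.
    out["Return"] = out["Close"].pct_change()
    # Rolling volatility: standard deviation of returns over the window.
    out[f"Vol_{vol_window}"] = out["Return"].rolling(vol_window).std()
    # Cumulative return: compounded growth relative to the first close.
    out["Cum_Return"] = (1 + out["Return"].fillna(0)).cumprod() - 1
    return out


# Toy data standing in for the downloaded TSLA prices (values are made up).
toy = pd.DataFrame(
    {"Close": [100.0, 102.0, 101.0, 105.0, 107.0, 106.0, 110.0]},
    index=pd.bdate_range("2024-01-01", periods=7),
)
features = add_return_features(toy)
```

Note that the first few rows of the volatility column are empty until a full window of returns is available, so those rows would typically be dropped before modeling.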

2. News Sentiment Acquisition and Scoring

To capture public sentiment, we pulled Tesla headlines from NewsAPI (limited to the most recent 30 days) and processed them using:

NLTK's punkt tokenizer for sentence tokenization
VADER for rule-based compound polarity scoring
TextBlob for additional polarity checks

Daily headlines were grouped by date, and average sentiment scores were calculated, then lagged and differenced to capture momentum in sentiment over time. These were later merged with stock data to form a unified modeling dataset.
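A rough sketch of the daily aggregation, lagging, differencing, and merge steps is shown below. It assumes each headline has already been given a VADER-style compound score; the column names `publishedAt` and `compound`, the helper name `daily_sentiment`, and all toy values are assumptions made for illustration.

```python
import pandas as pd


def daily_sentiment(headlines: pd.DataFrame) -> pd.DataFrame:
    """Aggregate headline-level scores into daily sentiment features (hypothetical helper)."""
    # Parse timestamps as UTC, drop the timezone, and truncate to the calendar date.
    dates = (
        pd.to_datetime(headlines["publishedAt"], utc=True)
        .dt.tz_localize(None)
        .dt.normalize()
    )
    daily = (
        headlines.assign(Date=dates)
        .groupby("Date")["compound"]
        .mean()
        .rename("Avg_Sentiment")
        .to_frame()
    )
    # Lag by one row, i.e. the previous day that had at least one headline.
    daily["Lagged_Sentiment"] = daily["Avg_Sentiment"].shift(1)
    # Difference from the previous row to capture sentiment momentum.
    daily["Sentiment_Change"] = daily["Avg_Sentiment"].diff()
    return daily


# Toy headlines with pre-computed scores (values are made up).
toy_headlines = pd.DataFrame(
    {
        "publishedAt": [
            "2024-01-02T09:00:00Z",
            "2024-01-02T15:00:00Z",
            "2024-01-03T10:00:00Z",
            "2024-01-04T12:00:00Z",
        ],
        "compound": [0.5, -0.1, 0.3, -0.4],
    }
)
sentiment = daily_sentiment(toy_headlines)

# Left-join onto the price table so every trading day is kept.
toy_prices = pd.DataFrame(
    {"Close": [100.0, 102.0, 101.0]},
    index=pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-04"]),
)
merged = toy_prices.join(sentiment, how="left")
```

A left join keeps trading days that have no headlines, so any resulting gaps in the sentiment columns would need to be filled or dropped before training.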

Feature Engineering

Key engineered variables included:

Avg_Sentiment, Sentiment_Change, and Lagged_Sentiment
Volume_Change, Price_Change, Price_Pct_Change
Vol_5, Return, Cum_Return
RSI_14, MACD, MACD_Signal
Targets: Target_Class (up/down) and Target_Price (numeric close)
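One plausible way to build the two targets is to shift the closing price back by one row, so that each day is paired with the following day's close. The sketch below assumes that construction; the helper name `add_targets` and the toy prices are illustrative, and the original pipeline may define the targets differently.

```python
import pandas as pd


def add_targets(df: pd.DataFrame) -> pd.DataFrame:
    """Attach next-day targets to each row (hypothetical helper)."""
    out = df.copy()
    # Target_Price: the next trading day's close.
    out["Target_Price"] = out["Close"].shift(-1)
    # Target_Class: 1 if the next close is higher than today's, otherwise 0.
    out["Target_Class"] = (out["Target_Price"] > out["Close"]).astype(int)
    # The final row has no "tomorrow" to predict, so it is dropped.
    return out.dropna(subset=["Target_Price"])


toy = pd.DataFrame({"Close": [100.0, 102.0, 101.0, 105.0]})
labeled = add_targets(toy)
```

Building the targets with a backward shift keeps every feature on a row strictly earlier in time than its label, which helps avoid look-ahead leakage.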

Key Focus Areas:

1. Sentiment as a Leading Indicator
We investigate how positive or negative tone in headlines influences daily stock returns. We use TextBlob and VADER, two well-established natural language processing tools, to extract polarity scores from Tesla-related news headlines published in the last 30 days.

2. Technical Signals
We combine sentiment data with:

Daily returns
Volume changes
Lagged percentage movements
Rolling volatility
Cumulative returns

This multi-angle feature engineering gives the model signals about both price action and the sentiment and risk conditions surrounding it.

3. Dual-Layered Prediction Models

We use two prediction targets:

Classification: Will the stock go up or down tomorrow?
Regression: What will the closing price be?

Each task has its own dedicated pipeline, trained and optimized using GridSearchCV and ensemble learners.

Sentiment Analysis

We analyzed sentiment polarity from over 200 Tesla-related headlines using:

VADER for rule-based sentiment scoring
TextBlob for secondary validation of tone

These scores were averaged per day and then:

Lagged to account for delayed market reaction
Differenced to observe sharp changes
Combined with volatility and price return features

Fig : Sentiment scores averaged and normalized

📈 Technical Indicators and Visual Exploration

RSI (Relative Strength Index)

We calculated a 14-day RSI to identify momentum extremes. RSI values near 70 suggested overbought conditions, while values near 30 indicated oversold conditions. In our data, these extremes often preceded trend reversals.

Fig : RSI_14 over time, with potential trend reversal zones marked
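The 14-day RSI can be sketched as below. The article does not say which smoothing variant was used, so this version assumes Wilder's smoothing (an exponential average with alpha = 1/period); the function name `rsi` and the toy price series are illustrative.

```python
import numpy as np
import pandas as pd


def rsi(close: pd.Series, period: int = 14) -> pd.Series:
    """Relative Strength Index using Wilder's smoothing (assumed variant)."""
    delta = close.diff()
    # Split price changes into non-negative gains and non-negative losses.
    gain = delta.clip(lower=0)
    loss = (-delta).clip(lower=0)
    # Wilder's smoothing is an exponential average with alpha = 1 / period.
    avg_gain = gain.ewm(alpha=1 / period, min_periods=period, adjust=False).mean()
    avg_loss = loss.ewm(alpha=1 / period, min_periods=period, adjust=False).mean()
    rs = avg_gain / avg_loss
    return 100 - 100 / (1 + rs)


# Toy closing prices: a mild uptrend with an oscillation (values are made up).
t = np.arange(60)
toy_close = pd.Series(100 + 0.2 * t + 5 * np.sin(t / 4.0))
rsi_14 = rsi(toy_close)

# Flag the conventional overbought / oversold zones.
overbought = rsi_14 > 70
oversold = rsi_14 < 30
```

A simple rolling mean of gains and losses is a common alternative to Wilder's smoothing; the two give similar but not identical values.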

MACD (Moving Average Convergence Divergence)

We calculated the MACD as the difference between 12-day and 26-day exponential moving averages, along with a 9-day signal line. Their crossover points frequently aligned with bullish or bearish pivots.

Fig : MACD and its signal line over time, with crossover points highlighted
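The MACD and signal line described above can be computed with pandas exponential moving averages. The sketch below is one plausible implementation: the `Crossover` column and the toy price path are additions made for illustration and are not taken from the project.

```python
import numpy as np
import pandas as pd


def macd(close: pd.Series, fast: int = 12, slow: int = 26, signal: int = 9) -> pd.DataFrame:
    """MACD line, signal line, and crossover markers (illustrative sketch)."""
    ema_fast = close.ewm(span=fast, adjust=False).mean()
    ema_slow = close.ewm(span=slow, adjust=False).mean()
    macd_line = ema_fast - ema_slow
    signal_line = macd_line.ewm(span=signal, adjust=False).mean()
    out = pd.DataFrame({"MACD": macd_line, "MACD_Signal": signal_line})
    # +1 when MACD moves above the signal line, -1 when it moves below, else 0.
    above = out["MACD"] > out["MACD_Signal"]
    out["Crossover"] = above.astype(int).diff().fillna(0).astype(int)
    return out


# Toy price path: 40 days up, then 40 days down (values are made up).
toy_close = pd.Series(np.concatenate([np.arange(100.0, 140.0), np.arange(140.0, 100.0, -1.0)]))
macd_frame = macd(toy_close)
bearish_days = macd_frame.index[macd_frame["Crossover"] == -1]
```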

Candlestick Chart of TSLA

To give a complete view of Tesla’s market behavior, we used an interactive candlestick chart showing the Open, High, Low, and Close (OHLC) values for each trading day. This classic financial chart allows for precise reading of market movement within single days and across trends.

Moving Averages

We overlaid 3-day, 5-day, and 7-day simple moving averages on the closing price of TSLA to visualize short-term trends and their convergence/divergence.

Fig : Short-term MA crossovers and their lag relative to the closing price
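The moving-average overlays reduce to a rolling mean per window. In this sketch the column names `MA3`, `MA5`, and `MA7` follow the `MA3` naming that appears elsewhere in the article, while the helper name and toy prices are assumptions.

```python
import pandas as pd


def add_moving_averages(df: pd.DataFrame, windows=(3, 5, 7)) -> pd.DataFrame:
    """Add simple moving averages of Close, one column per window (hypothetical helper)."""
    out = df.copy()
    for w in windows:
        # Each value is the mean of the current close and the previous w - 1 closes.
        out[f"MA{w}"] = out["Close"].rolling(w).mean()
    return out


toy = pd.DataFrame({"Close": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]})
with_ma = add_moving_averages(toy)
```

Because each average only looks backward, longer windows react more slowly to price changes, which is the lag visible in the chart.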

WordCloud: Thematic Frequency of News Headlines

We generated a word cloud from all Tesla news headlines to extract dominant themes driving public discourse.

Fig : Frequent terms across sentiment windows

Pairplot of Stock Features

To explore internal relationships among numeric indicators, we generated a Seaborn pairplot of key financial features: Close, Volume, Daily_Return, and Volatility.

Fig : Visible patterns and clusters between key variables in TSLA data

Return and Volatility Subplot

We constructed a two-row subplot showing daily return and 5-day rolling volatility in stacked panels on a shared time axis. This allowed us to observe how sharp return changes were often followed by sustained volatility.

Cumulative Return and Volume Change Subplot

Another subplot visualized cumulative return and daily volume change, making it easy to identify whether market moves were backed by volume or occurred in low-liquidity conditions.

Correlation Heatmap of Engineered Features

To uncover relationships between the engineered features used in the model, we generated a correlation matrix heatmap that visualizes the strength and direction of pairwise correlations among key numeric and text-derived variables. This matrix includes traditional indicators such as High, Close_outlier, and MA3, as well as TF-IDF features derived from the headline corpus, such as TFIDF_doge and TFIDF_model.

The heatmap reveals several important insights:

The diagonal of yellow blocks (value = 1.0) is each feature's correlation with itself, as expected for any correlation matrix.
Moderate to strong correlations appear between news count (Nw_Count) and normalized sentiment (Nw_Norm_Sent), suggesting that headline volume and sentiment level tend to move together.
Some TF-IDF tokens such as "elon", "model", and "company" show mild clustering effects, indicating co-occurrence or thematic alignment in news narratives.
Traditional stock indicators like High and MA3 remain largely uncorrelated with textual features, underscoring the complementary nature of news sentiment in the modeling process.

This visualization served as a sanity check and feature refinement tool. It helped us screen for strongly correlated feature pairs that could distort model performance, and it suggested that our sentiment metrics contribute largely non-redundant signals to the prediction task.
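Alongside the heatmap, the same correlation matrix can be scanned programmatically for strongly correlated feature pairs. The sketch below is an assumption about how such a check might look: the helper name `high_correlation_pairs`, the 0.9 threshold, and the toy columns are all illustrative.

```python
import numpy as np
import pandas as pd


def high_correlation_pairs(df: pd.DataFrame, threshold: float = 0.9) -> pd.Series:
    """Return feature pairs whose absolute Pearson correlation meets the threshold."""
    corr = df.corr(numeric_only=True)
    # Keep only the upper triangle so each pair appears once and the diagonal is skipped.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    pairs = pairs[pairs.abs() >= threshold]
    return pairs.sort_values(key=np.abs, ascending=False)


# Toy features: "b" is a linear function of "a", "c" is only loosely related.
toy = pd.DataFrame(
    {
        "a": [1.0, 2.0, 3.0, 4.0, 5.0],
        "b": [3.0, 5.0, 7.0, 9.0, 11.0],
        "c": [2.0, 1.0, 4.0, 3.0, 5.0],
    }
)
flagged = high_correlation_pairs(toy)
```

Pairs returned by a check like this are candidates for dropping or combining; the matrix itself is what a heatmap plot would display.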

Fig : Correlation heatmap of engineered features

Machine Learning Models and Prediction Architecture

Classification Models – Will the stock Go Up or Down?

RandomForestClassifier: Ensemble learning using multiple decision trees

SVC (Support Vector Classifier): Kernel-based margin optimization

XGBClassifier: Boosted decision trees with gradient optimization

VotingClassifier: Combines predictions from all models for robustness

Each model was trained using a time-series cross-validation approach and tuned with GridSearchCV.
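As a rough illustration of time-series cross-validation combined with GridSearchCV, the sketch below tunes a RandomForestClassifier on synthetic data. The feature matrix, labels, parameter grid, and number of splits are all assumptions; the project's real features and grids are not shown in this article.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic stand-in for the engineered features and up/down labels.
rng = np.random.default_rng(0)
n_samples = 120
X = rng.normal(size=(n_samples, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=n_samples) > 0).astype(int)

# TimeSeriesSplit always trains on earlier rows and validates on later ones,
# so no fold is evaluated on data that precedes its training window.
cv = TimeSeriesSplit(n_splits=4)

# A deliberately small, hypothetical grid to keep the example fast.
param_grid = {"n_estimators": [25, 50], "max_depth": [2, 4]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=cv,
    scoring="accuracy",
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The same `cv` object can be passed when tuning the other classifiers, so that all models are compared on identical chronological folds.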

Feature Importance

To interpret which features influenced model decisions the most, we extracted feature importances from XGBoost models. Key contributors included:

Sentiment_Change, Lagged_Sentiment
MACD, RSI_14
Price_Pct_Change, Vol_5

Regression Model – What Will the Price Be?

RandomForestRegressor: Non-linear, ensemble-based regression

Evaluated with:
R² Score
MAE (Mean Absolute Error)
RMSE (Root Mean Square Error)
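For reference, the three regression metrics can be computed directly from their definitions. The sketch below uses NumPy only; the helper name `regression_metrics` and the toy prices are illustrative and are not results from the project.

```python
import numpy as np


def regression_metrics(y_true, y_pred) -> dict:
    """Compute R2, MAE, and RMSE from their textbook definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    # MAE: average absolute error, in the same units as the price.
    mae = np.mean(np.abs(errors))
    # RMSE: square root of the mean squared error; penalizes large misses more.
    rmse = np.sqrt(np.mean(errors ** 2))
    # R2: 1 minus the ratio of residual variance to total variance.
    ss_res = np.sum(errors ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot
    return {"R2": r2, "MAE": mae, "RMSE": rmse}


# Toy actual vs. predicted closing prices (values are made up).
metrics = regression_metrics([100, 102, 104, 106], [101, 101, 105, 105])
# metrics is approximately {"R2": 0.8, "MAE": 1.0, "RMSE": 1.0}
```

MAE and RMSE are expressed in price units, so they are easiest to interpret relative to the typical daily price move, while R² is unitless.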

📈 Model Insights

Sentiment change and lagged volatility were strong predictors for price movement.
MACD crossovers aligned closely with high-accuracy classification events.
RSI extremes (near 30/70) correctly predicted reversals in over 60% of test samples.

Key Takeaways

Combining textual sentiment with technical signals improves both classification and regression accuracy.
MACD and RSI aren’t just technical decor — they interact meaningfully with sentiment to shape market behavior.
The model does not rely on deep learning, but leverages interpretable, tunable ensemble methods.

Real-World Use Cases

Investor Advisory Tools — Use sentiment + RSI/MACD to offer short-term movement signals
Financial NLP Dashboards — Visualizing how news cycles map to technical market states
Academic Research — Benchmarking sentiment-enhanced price modeling frameworks

What This Project Demonstrates

Full-stack integration of market data + news sentiment

Engineering of time-series + textual features

Deep use of financial technical indicators

Visualization-driven interpretation of model behavior

Clear path to extension: multi-asset, intraday, or social media sentiment

Source Code : GITHUB
