In this project, we aim to use machine learning models to help predict the price and price direction of oil.
- Description
- Goals
- Technologies
- Instructions
- Conclusion
- Contributors
- References and Resources
- License
Our goal is to compare two or more machine-learning models for predicting the price and price direction of oil. For our predictions, we will use natural language processing to draw insights from news articles from the past 22 years. In addition, we will use oil close prices/returns, gold prices, the S&P 500, and a period of unrest (the Iraq War, 2003-2011). Machine learning typically requires extensive data preparation before a model can be trained. We will use Jupyter to prepare training and testing datasets, and to train and compare the machine-learning models.
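As a rough illustration of the data preparation step, the sketch below splits a date-indexed table into training and testing sets in chronological order, so the model is never trained on rows that come after the test period. The function name `chronological_split`, the column names, and the synthetic values are assumptions for illustration only; they are not taken from the project notebooks.

```python
import numpy as np
import pandas as pd


def chronological_split(frame: pd.DataFrame, train_fraction: float = 0.75):
    """Split a date-indexed frame into earlier (train) and later (test) rows."""
    ordered = frame.sort_index()
    cutoff = int(len(ordered) * train_fraction)
    return ordered.iloc[:cutoff], ordered.iloc[cutoff:]


# Synthetic stand-in data; the real notebooks would load prices and sentiment here.
dates = pd.date_range("2020-01-01", periods=8, freq="D")
demo = pd.DataFrame(
    {"oil_close": np.linspace(50.0, 57.0, 8), "sentiment": np.zeros(8)},
    index=dates,
)

train, test = chronological_split(demo, train_fraction=0.75)
print(len(train), len(test))
```

A chronological split is used here instead of a random shuffle because price data is ordered in time.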
Our analysis uses the following technologies:
- pandas
- numpy
- datetime
- pathlib
- nltk
- matplotlib
- analyzer
- dotenv
- New York Times API
- yfinance API
- warnings
- tensorflow
- To get the project started on your local machine, clone the GitHub repository.
- First, run the crude_news_data notebook. It retrieves article data from the New York Times API for a set number of years and may take around 45 minutes to run.
- When it finishes, the notebook exports a combined_csv file to a headlines folder, alongside the articles collected for each month.
- Next, run the crude_sentiment notebook, which reads the news data from the combined_csv file, runs a sentiment analysis, and exports an oil_sentiments csv file.
- Once we have the sentiment analysis data, run the oil_series_analysis notebook, which loads historical oil data and applies time series analysis and modeling to determine whether there is any predictable behavior.
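The last step combines price history with sentiment scores. The following is a minimal sketch of how daily returns and an up/down direction label might be derived from close prices and joined to sentiment by date. The column names `close` and `compound`, the helper `add_returns_and_direction`, and the sample values are hypothetical; the actual layout of the oil_sentiments csv file may differ.

```python
import pandas as pd


def add_returns_and_direction(prices: pd.DataFrame, close_col: str = "close") -> pd.DataFrame:
    """Add a daily percentage return and a 1/0 label for up vs. flat-or-down days."""
    out = prices.copy()
    out["return"] = out[close_col].pct_change()
    # The first row has no previous close, so its return is undefined.
    out = out.dropna(subset=["return"])
    out["direction"] = (out["return"] > 0).astype(int)
    return out


# Made-up values standing in for historical oil prices and sentiment scores.
dates = pd.date_range("2021-01-04", periods=5, freq="D")
prices = pd.DataFrame({"close": [50.0, 51.0, 50.5, 50.5, 52.0]}, index=dates)
sentiment = pd.DataFrame({"compound": [0.1, -0.2, 0.0, 0.3, 0.05]}, index=dates)

# An inner join keeps only the dates present in both tables.
features = add_returns_and_direction(prices).join(sentiment, how="inner")
print(features)
```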
The LSTM model predicted oil prices better than the linear regression and Bayesian ridge models. While the linear regression used one feature to predict the price, the Bayesian ridge model used all five features considered and modeled the predicted price probabilistically, using a normal distribution. For price direction, the random forest classifier performed slightly better than logistic regression. The feature importance of war in the price prediction was minimal compared to the other features considered, which may be because we considered only one war period (due to limited data availability).
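To make the difference between the two regression setups concrete, here is a small sketch on synthetic data, assuming scikit-learn is available (it is not in the technology list above). It fits a one-feature linear regression and a five-feature Bayesian ridge model, then asks the Bayesian model for a predictive mean and standard deviation. The data, coefficients, and split are invented for illustration and say nothing about the project's actual results; the LSTM is omitted.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge, LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data: five features and a noisy linear target (not real oil data).
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 5))
y = 60.0 + X @ np.array([3.0, 1.5, -2.0, 0.5, 0.1]) + rng.normal(scale=0.5, size=n)

# Keep row order when splitting, as one would with time series.
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

# Linear regression restricted to a single feature.
linear = LinearRegression().fit(X_train[:, [0]], y_train)
linear_mse = mean_squared_error(y_test, linear.predict(X_test[:, [0]]))

# Bayesian ridge on all five features; return_std=True also returns the
# standard deviation of the normal predictive distribution for each row.
bayes = BayesianRidge().fit(X_train, y_train)
bayes_mean, bayes_std = bayes.predict(X_test, return_std=True)
bayes_mse = mean_squared_error(y_test, bayes_mean)

print(f"linear MSE: {linear_mse:.3f}, Bayesian ridge MSE: {bayes_mse:.3f}")
```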
1. How have oil prices behaved over the past 22 years?
2. What is the sentiment toward oil across the period, based on NLP analysis of news articles?
3. Identify other features for oil price movements (based on availability of data).
4. Compare model performances with each other when predicting oil prices.
   - Linear Regression
   - LSTM
   - Bayesian Ridge
5. Compare model performances with each other when predicting oil returns direction.
   - Logistic Regression
   - Random Forest
6. Compare feature importance in the movement of oil prices.
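Items 5 and 6 can be sketched together. The example below, again assuming scikit-learn and purely synthetic data, trains a logistic regression and a random forest on a direction label, compares their test accuracy, and reads feature importances from the random forest. The feature names are hypothetical placeholders for the five features, and the target is constructed so that the `war` flag carries no signal.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical feature names; the project's actual columns may differ.
feature_names = ["sentiment", "gold_return", "sp500_return", "lagged_oil_return", "war"]

rng = np.random.default_rng(1)
n = 400
X = rng.normal(size=(n, 5))
X[:, 4] = rng.integers(0, 2, size=n)  # binary flag standing in for a war period

# Synthetic direction label driven by two of the features plus noise.
signal = 1.5 * X[:, 0] + 1.0 * X[:, 2] + rng.normal(scale=0.5, size=n)
y = (signal > 0).astype(int)

split = 300  # first 300 rows for training, remaining 100 for testing
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X[:split], y[:split])
    scores[name] = accuracy_score(y[split:], model.predict(X[split:]))

# Impurity-based importances from the fitted forest; they sum to 1.
importances = dict(zip(feature_names, models["random_forest"].feature_importances_))

print(scores)
print(importances)
```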
Our team:
CNN Iraq War News
Yahoo Finance
How to Collect Data From The New York Times Over Any Period of Time
New York Times API
Introduction to Bayesian Linear Regression
Bayesian Ridge Regression
Copyright © 2022