[{"content":" ","date":null,"permalink":"/blog/","section":"Blog","summary":" ","title":"Blog"},{"content":"","date":null,"permalink":"/tags/gaming/","section":"Tags","summary":"","title":"gaming"},{"content":"","date":null,"permalink":"/tags/python/","section":"Tags","summary":"","title":"python"},{"content":" The Backstory: Why I Did This #I\u0026rsquo;ve been playing this new game for a month now (Path of Exile 2) and to tackle the hardest content, you need really good gear, and to get that gear, you need a ton of in-game currency. The problem? The economy is wild. Players are constantly discovering new ways to farm currency, but as soon as a method gets popular, it stops being profitable.\nAs someone who loves programming and data (and maybe spends too much time thinking about spreadsheets), I thought: “Why not use my coding skills to figure out the best way to make currency?”\nSo, I set out to simulate currency conversion strategies in the game, specifically focusing on reforging essences to determine if I could find a profitable pattern, for instance, obtaining higher-tier essences that sell for the highest amount of currency (Exalted Orbs).\nThe Problem: Essence Reforging in PoE #In PoE, \u0026ldquo;Essences\u0026rdquo; are crafting materials that can be reforged into higher-tier versions. The catch? Success rates are low, and outcomes are random. My question was simple:\nIf I reforged thousands of essences, would I turn a profit or go bankrupt?\nTo answer this, I built a Python simulator to model the process, incorporating:\nConversion Probabilities: A 2.26% success rate for upgrades to a higher tier (from local and community data). Market Prices: Exalted Orb values for each essence type (spoiler: Greater Essence of Haste = $$$). Protected Resources: Some lower tier essences (e.g., Electricity) are too valuable to reforge. Building the Simulator #1. 
Modeling the Economy #Real-World Limitations\nReforging essences comes with constraints that shape realistic simulations:\nInventory Space: Player inventory and chest storage are limited. Market Volatility: Cheap essences are scarce in bulk, and prices fluctuate rapidly. Taxes: A transaction tax applies to marketplace sales in gold. Time: Reforging is time-consuming, and market shifts can erase profits. Simulation Design\nTo mirror these challenges, the simulator:\nRuns 100,000 independent investment attempts (not endless reforging). Each attempt consists of at least 1,000 reforges. That many reforges corresponds to an affordable investment, reasonable inventory space, gold taxes, and time spent (25 minutes). Tracks metrics like average profit, loss streaks, and risk-adjusted returns. Key Parameters\nInvestment: 1,300 Exalted Orbs (avoids ruin in 95% of scenarios). Essence Input: 3,000 low-tier essences (2.3 Exalted each). Mechanics: Success (2.26%): Gain a random higher-tier essence. Failure: Receive one random minor essence (excluding protected Electricity/Haste). # Simplified simulation loop while reforges_possible: consume_3_essences() if success(probability=0.0226): reward = random_greater_essence() else: refund = random_minor_essence() 2. Statistical Significance #In PoE, reforging is a game of extremes. Most reforges grant you a low-tier essence, but a few can reward you with high-value essences like Greater Essence of Haste (worth 186 Exalted) or Greater Essence of the Infinite (worth 96 Exalted). Here’s the catch:\nHaste has a 0.1% drop rate. That means, on average, you’d need to reforge 1,000 times just to see one Greater Essence of Haste. Infinite has a 0.25% drop rate. Slightly better, but still rare. In my early simulations, these rare events skewed the results. A single Haste drop could make an attempt look insanely profitable, while a streak of bad luck could wipe out my entire bankroll. 
This made it impossible to draw meaningful conclusions from even 100,000 attempts.\nThe Turning Point: Scaling Up #To get reliable results, I realized I needed to simulate a lot more reforges.\nLaw of Large Numbers:\nThe more trials you run, the closer your results will get to the true probabilities. With 100,000 attempts (100M reforges), I might see ~100,000 Haste drops. But with 1 million attempts (1 billion reforges), I’d expect ~1,000,000 Haste drops, giving a much clearer picture of their impact. Reducing Variance:\nSmall samples are noisy. By scaling up, I could smooth out the randomness and see the underlying trends. Capturing Rare Events:\nRare events like Haste drops dominate the profit distribution. Without enough trials, their impact is either overstated (if they happen early) or understated (if they don’t happen at all). Scaling up ensures these events are properly represented in the results. So, I scaled up to 1 million simulations.\nResults #Average profit per attempt: 385 Exalted (95% CI: 384–386)\n1. Profit Breakdown by Essence Type #Minor Essence of Electricity is the most significant contributor, accounting for 30% of the total profit, followed by Greater Essence of Infinite and Greater Essence of Haste, contributing 18% and 15%, respectively. Despite being classified as a minor essence, minor Essence of Electricity has a larger impact than several greater essences.\nEssence Profit Contribution 1 Minor Essence of Electricity 30% 2 Greater Essence of Infinite 18% 3 Greater Essence of Haste 15% 4 Greater Essence of Mind 7% 5 Greater Essence of Electricity 6% 6 Greater Essence of Sorcery 6% 7 Greater Essence of Torment 4% 8 Minor Essence of Haste 3% 9 Greater Essence of Enhancement 2% 10 Greater Essence of Battle 1% 2. Probability of Loss # Formula: \\(\\text{Probability of Loss} = \\frac{\\text{Losing Attempts}}{\\text{Total Attempts}} \\)\n\\(\\text{Probability of Loss} = 9.8\\% \\)\n3. 
Win Rate / Loss Rate Ratio #Formula: \\( \\text{ WL Ratio } = \\frac{\\text{Probability of Profit}}{\\text{Probability of Loss}} \\)\n\\( \\text{ WL Ratio } = 9.20 \\)\n4. Value at Risk (VaR) #Formula: \\( \\text{VaR} = \\text{Percentile of Losses at Confidence Level} \\)\n\\( \\text{VaR}_{95\\%}= 86.85 \\text{ Exalted Orbs} \\)\n5. Risk-Adjusted Return (Sharpe Ratio) # Formula: \\(\\text{Sharpe Ratio} = \\frac{\\text{Average Profit}}{\\text{Standard Deviation of Profit}} \\)\n\\(\\text{Sharpe Ratio} = 1.24 \\)\n6. Expected Value (EV) per Attempt # Formula: \\({EV} = (\\text{Average Profit} \\times \\text{Probability of Profit}) - (\\text{Average Loss} \\times \\text{Probability of Loss}) \\)\n\\({EV} = 358.68 \\text{ Exalted Orbs } \\)\nConclusion #After simulating 1 million reforging attempts (1 billion reforges), the results are clear: reforging essences in Path of Exile 2 is a low-risk, high-reward strategy but only if you’re willing to spend 30 minutes repeating the same task. Here’s what the numbers tell us:\nProfitability:\nAverage Profit: 385 Exalted per attempt (95% CI: 384–386). Win/Loss Ratio: 9.2 (90.2% win rate vs. 9.8% loss rate). Expected Value: 358.68 Exalted per attempt. Risk Management:\nSharpe Ratio: 1.24. Value at Risk (VaR): 95% of attempts lose \u0026lt;86.85 Exalted. Key Drivers:\nElectricity Essences: Contributed 30% of total profits. Haste \u0026amp; Infinite: Combined for 33.6% of profits, despite their rarity. Thank you for reading! If you have any questions or comments, please feel free to contact me. 
Your feedback is highly appreciated.\nKeywords: Python, Simulation, Path of Exile 2, Poe2\n","date":"17 January 2025","permalink":"/blog/simulation-essence-poe2/","section":"Blog","summary":"The Backstory: Why I Did This #I\u0026rsquo;ve been playing this new game for a month now (Path of Exile 2) and to tackle the hardest content, you need really good gear, and to get that gear, you need a ton of in-game currency.","title":"Running Path of Exile 2 Simulations: Essence"},{"content":"","date":null,"permalink":"/tags/simulation/","section":"Tags","summary":"","title":"simulation"},{"content":" ","date":null,"permalink":"/tags/","section":"Tags","summary":" ","title":"Tags"},{"content":" ","date":null,"permalink":"/","section":"Welcome to my webpage!","summary":" ","title":"Welcome to my webpage!"},{"content":"","date":null,"permalink":"/tags/aws/","section":"Tags","summary":"","title":"AWS"},{"content":" Introduction #Data is the foundation of decision-making in today’s digital world. From detecting fraud to optimizing business operations, organizations rely on efficient data pipelines to ingest, process, and analyze vast amounts of information. But managing these pipelines at scale requires more than just traditional databases, it demands a robust, scalable architecture.\nThis project was born from a course I took during my master’s in Data Analytics, specifically on Advanced Database Systems, where we set out to design and deploy a Data Lake on AWS. The goal? To build a system capable of seamlessly integrating data from multiple sources while ensuring efficiency, automation, and scalability. 
Rather than just setting up services, I focused on structuring a pipeline that could handle real-world data challenges such as processing structured and semi-structured data, automating ingestion, and optimizing costs.\nIn this post, I’ll walk you through my journey. Along the way, I’ll highlight key lessons learned, trade-offs in architectural decisions, and improvements I’d make in future iterations.\nBackground and Context #For our final project, we were given two options by our instructor:\n1️⃣ Option A → Develop an advanced query system that interacts with multiple databases in the cloud.\n2️⃣ Option B → Build a Data Lake in AWS, following an end-to-end architecture demonstrated in a reference video.\nThe video, Building Data Lakes on AWS, provided a step-by-step guide to setting up a Data Lake using services like AWS Glue, Athena, and S3. However, the instructor made it clear that we were not required to replicate the video exactly, giving us the flexibility to modify the implementation as needed.\nWe decided to go with Option B, but we wanted to avoid AWS Glue to eliminate costs during development. This required additional changes, such as replacing Glue with other AWS services and setting up a scheduler to automate data processing. These modifications introduced new technical challenges and added complexity to the development process.\nProject Overview #At its core, this project was about building a cloud-based Data Lake using AWS. But what does that mean in practice? Here’s a high-level breakdown:\nData Storage: An AWS S3 bucket serves as the main repository for incoming database data. Database Infrastructure: Multiple databases (MySQL, SQL Server, and PostgreSQL) are deployed using Amazon RDS. Cost Management with CloudFormation: CloudFormation templates were used to deploy and delete databases daily during development. This approach helped minimize costs while the project was still evolving. 
ETL Processing: AWS Lambda functions handle the Extract, Transform, Load (ETL) processes to clean and move data. Orchestration: Amazon EventBridge triggers and schedules Lambda functions to process data efficiently. Security: AWS Secrets Manager ensures safe handling of database credentials. Challenges \u0026amp; Learnings #Like any technical project, this one came with its fair share of challenges:\nSetting up secure connections between RDS instances and Lambda functions. Optimizing costs by leveraging AWS free tier resources. Automating the entire workflow while maintaining flexibility for future changes. By the end, I had a functional Data Lake architecture, capable of ingesting data from multiple sources and preparing it for analysis. It’s not perfect, but it’s a solid foundation.\nThe Personal Journey #Every project has a technical side, but there’s always a personal journey behind the code. This project was no exception.\nHow It Started #Setting up a Data Lake on AWS was a great opportunity to work with multiple AWS services in a structured way. I was already familiar with AWS, but this project required integrating multiple services while keeping costs low. It was the perfect chance to learn, break things, and figure out how to make them work in a real-world scenario.\nAWS documentation is useful, but it often focuses on how to configure services rather than why certain decisions matter. Instead of just following guides, we tested different setups to find the best approach for our project. This hands-on experimentation helped us refine our architecture, troubleshoot issues, and make well-informed choices.\nKey Challenges \u0026amp; How We Overcame Them #Database Connectivity Issues #Before launching AWS RDS, we tested connectivity with our existing databases on Azure and MongoDB. 
This early testing helped us verify database interactions, but due to time constraints, we decided to host everything in Amazon RDS for consistency.\nThe next challenge came when configuring VPC networking. Running Lambda inside a VPC required setting up proper security groups, subnets, and NAT gateways. To debug network access, we launched an EC2 instance inside the VPC and used it to test connections to RDS. This helped us quickly identify and adjust security rules, allowing smooth database communication.\nSolution # Used an EC2 instance inside the VPC to test connections before deploying Lambda functions. Configured security groups and subnet routing to allow Lambda functions to access RDS securely. Ensured proper IAM roles and VPC endpoints were in place for efficient database interaction. Automating Infrastructure with CloudFormation #I had previous experience with CloudFormation, so deploying resources through templates wasn’t a challenge. However, the main issue was ensuring database configurations worked with AWS low tier instances. Some RDS parameter settings weren’t compatible with low-tier instances, causing deployment failures.\nSolution # Adjusted database parameter groups to match tier limitations. Iteratively deployed CloudFormation stacks to identify resource compatibility issues. This approach allowed us to deploy databases quickly and delete them after daily development to keep costs low.\nETL Processing \u0026amp; Event Scheduling #The main challenge with AWS Lambda was setting up custom layers for handling database queries. We needed external Python libraries, but we encountered local machine limitations, layer size limits and library compatibility issues.\nTo solve this, we used our EC2 instance to build the Lambda layer, ensuring all dependencies were properly packaged. The layer was then uploaded to S3 and linked to the Lambda functions.\nSolution # Built the Lambda layer in an EC2 instance to handle package dependencies. 
Compressed and uploaded the layer to S3 for easy reuse. Adjusted Lambda function memory and timeout settings for better performance. Once the Lambda functions were running efficiently, we scheduled them using Amazon EventBridge to automate the ETL pipeline.\nWhat I Would Improve Next Time #🔹 More automation: While CloudFormation helped, incorporating Terraform might provide even greater flexibility in infrastructure management.\n🔹 Logging \u0026amp; monitoring: Adding AWS CloudWatch alerts would improve visibility into failures and system performance.\nTechnical Deep Dive #Now that we’ve covered the journey, let’s get into the nitty-gritty of how this Data Lake was built. This section will break down each major component, from data ingestion to processing and storage, highlighting key configurations and best practices along the way.\nData Ingestion: Setting Up AWS S3 as the Data Lake #The first step in building a Data Lake is defining where data will be stored. In this case, Amazon S3 serves as the foundation, acting as a scalable, cost-effective storage solution.\nCreating the S3 Bucket #To set up the Data Lake bucket, I used the AWS console and followed best practices:\nDisabled ACLs → Ensures all objects remain owned by my account. Blocked public access → Prevents unintended data exposure. Versioning enabled → Maintains historical versions of objects in case of errors. Default encryption (AES-256) → Protects data at rest. aws s3api create-bucket --bucket utp-database-data-lake-project --region us-east-1 Once the bucket was created, it became the central repository for data ingestion. All raw data from multiple databases (MySQL, SQL Server, and PostgreSQL) was first stored here before further processing.\nInfrastructure as Code: AWS CloudFormation #Manually provisioning databases and networking resources is inefficient, so I automated the process using AWS CloudFormation. 
CloudFormation allows you to define infrastructure in a YAML template, making deployments repeatable and scalable.\nDatabase Stack (RDS Instances) #I modified an AWS-provided CloudFormation template to deploy three Amazon RDS instances:\nMySQL (for e-commerce transactions) SQL Server (for CRM data) PostgreSQL (for enterprise management data) Each database instance had the following configuration:\nInstance type: db.t3.micro Storage: 20GB Security: IAM role integration with AWS Secrets Manager for credential storage Networking: Private subnet with VPC security groups Resources: MySQLInstance: Type: AWS::RDS::DBInstance Properties: DBInstanceIdentifier: mysql-instance DBName: db_mysql DBInstanceClass: db.t3.micro AllocatedStorage: 20 Engine: MySQL EngineVersion: \u0026#34;8.0.39\u0026#34; MasterUsername: Fn::Sub: \u0026#34;{{resolve:secretsmanager:arn:aws:secretsmanager:us-east-1:123456789㊙️utp/database/rds-ASDAS:SecretString:MySQLDBUsername}}\u0026#34; Once deployed, the CloudFormation Outputs section provided endpoints for connecting to each database instance.\nData Processing with AWS Lambda (ETL Pipelines) #Extracting, transforming, and loading (ETL) data efficiently is crucial for a well-functioning Data Lake. 
AWS Lambda was used to automate data extraction, process raw data, and store refined versions in S3.\nETL Workflow #Each Lambda function was responsible for a different step in the ETL pipeline:\nExtract data from RDS Transform raw records into structured formats (Parquet, CSV) Load processed data back into S3 import os import json import logging from io import BytesIO import pymysql import boto3 from botocore.exceptions import ClientError import pandas as pd # Setup logging logger = logging.getLogger() logger.setLevel(logging.INFO) # Initialize AWS clients secrets_client = boto3.client(\u0026#39;secretsmanager\u0026#39;, region_name=\u0026#39;us-east-1\u0026#39;) s3_client = boto3.client(\u0026#39;s3\u0026#39;) # Retrieve configuration from environment variables SECRET_ARN = os.environ.get(\u0026#39;SECRET_ARN\u0026#39;) def lambda_handler(event, context): try: # Retrieve database credentials creds = get_db_credentials(SECRET_ARN) endpoint = creds[\u0026#39;MySQLODBEndpoint\u0026#39;] # Query the database and obtain the result as a pandas DataFrame df = query_database(endpoint, creds) ... Event-Driven Processing # AWS EventBridge was used to trigger Lambda functions every 5 minutes, ensuring near real-time data updates. Each Lambda function processed a different table, ensuring modularity. Processed data was stored in an S3 \u0026ldquo;Refined\u0026rdquo; bucket, ready for analysis. Security \u0026amp; Credential Management with AWS Secrets Manager #Storing database credentials in code is risky. To enhance security, AWS Secrets Manager was used to store RDS credentials securely.\nSteps Taken:\nCreated a secret for each database instance. Restricted access using AWS IAM policies (only Lambda functions could retrieve credentials). Enabled automatic credential rotation for better security. 
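The Lambda handler shown earlier calls `get_db_credentials` without showing its body. Below is a minimal sketch of what such a helper might look like, not the actual implementation from the project. It assumes the secret is stored as a JSON string of key/value pairs, and the optional `client` parameter is a hypothetical addition that lets the function be exercised without AWS access.

```python
import json


def get_db_credentials(secret_arn, client=None):
    # Hypothetical helper: fetch a secret and parse it into a dict.
    if client is None:
        # Imported lazily so the sketch can be tested without boto3 installed.
        import boto3
        client = boto3.client('secretsmanager', region_name='us-east-1')
    # get_secret_value returns the secret payload under 'SecretString'.
    response = client.get_secret_value(SecretId=secret_arn)
    # Assumption: the secret was saved as a JSON string of key/value pairs.
    return json.loads(response['SecretString'])
```

Keeping the client injectable means the function's IAM role only needs read access to the specific secret, and the parsing logic can be checked locally with a stub.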
{ \u0026#34;SecretId\u0026#34;: \u0026#34;rds-secret\u0026#34;, \u0026#34;Database\u0026#34;: \u0026#34;MySQL\u0026#34;, \u0026#34;Username\u0026#34;: \u0026#34;admin\u0026#34;, \u0026#34;Password\u0026#34;: \u0026#34;securepassword\u0026#34; } Data Governance: Monitoring \u0026amp; Logging with CloudWatch #Keeping track of ETL job failures and database health was essential. AWS CloudWatch Logs helped monitor:\nLambda execution success/failure rates Query execution times \u0026amp; potential bottlenecks Database CPU and memory utilization Setting Up CloudWatch Alarms #CloudWatch was configured to send email alerts if:\n✅ A Lambda function failed more than 3 times in a row\n✅ An RDS instance exceeded 80% CPU usage for more than 5 minutes\nResources: LambdaErrorAlarm: Type: AWS::CloudWatch::Alarm Properties: AlarmName: LambdaErrorCount MetricName: Errors Namespace: AWS/Lambda Statistic: Sum Period: 60 EvaluationPeriods: 3 Threshold: 3 AlarmActions: - !Ref SNSAlertTopic Data Querying \u0026amp; Analysis: AWS Athena \u0026amp; Visualization Tools #Once data was stored in the Refined S3 Bucket, it needed to be easily accessible for analysis. 
Instead of setting up a traditional data warehouse, AWS Athena was used to run SQL queries directly on S3 data.\nDefining the Athena Table #CREATE EXTERNAL TABLE IF NOT EXISTS tickit_sales ( salesid INT, listid INT, sellerid INT, buyerid INT, eventid INT, dateid SMALLINT, qtysold SMALLINT, pricepaid DOUBLE, commission DOUBLE, saletime STRING, refined_timestamp TIMESTAMP ) STORED AS PARQUET LOCATION \u0026#39;s3://utp-database-data-lake-project/tickit/refined/sales/\u0026#39;; Now, data could be queried instantly using standard SQL.\nConnecting to BI Tools #The final step was integrating Athena with Power BI \u0026amp; Grafana for real-time visualizations.\nWrapping Up the Tech Stack #By the end of the project, the full architecture looked like this:\n1️⃣ AWS S3 → Stores raw \u0026amp; processed data.\n2️⃣ AWS RDS (MySQL, SQL Server, PostgreSQL) → Databases feeding the Data Lake.\n3️⃣ AWS CloudFormation → Automates database \u0026amp; infrastructure setup.\n4️⃣ AWS Lambda → Runs ETL jobs for data transformation.\n5️⃣ AWS EventBridge → Automates ETL scheduling.\n6️⃣ AWS Secrets Manager → Manages database credentials securely.\n7️⃣ AWS CloudWatch → Monitors system health \u0026amp; logs failures.\n8️⃣ AWS Athena → Enables SQL querying on S3 data.\n9️⃣ Power BI / Grafana → For reporting \u0026amp; monitoring.\nLessons Learned and Future Directions #Every project brings a mix of successes and challenges. Some things work exactly as planned and others don’t, and that’s where the best learning happens. This section dives into my biggest takeaways from building the AWS-based Data Lake and what I would do differently next time.\nKey Lessons Learned #1. Managing costs in AWS requires careful planning #I realized that poor resource allocation can drive up costs fast. Keeping unused RDS instances running can lead to unnecessary expenses.\n✅ Takeaway:\nUse AWS Cost Explorer to monitor spending in real time. Leverage S3 lifecycle policies to automatically move old data to Glacier. 
Consider on-demand vs. reserved instances for RDS if running long-term projects. 2. ETL pipelines should be more modular #Initially, each AWS Lambda function handled a different data extraction process, but as I added more databases, managing multiple ETL functions became complex. If one function failed, debugging became a nightmare.\n✅ Takeaway:\nInstead of multiple small Lambda functions, consider orchestrating ETL workflows with AWS Step Functions for better error handling. Use AWS Glue instead of Lambda for large-scale ETL workloads. 3. Logging and Monitoring are lifesavers #AWS CloudWatch logs helped me debug ETL job failures and database connectivity issues.\n✅ Takeaway:\nSet up CloudWatch Alarms to get notified of failures. Use Athena query logs to optimize performance and avoid slow queries. Future Improvements and Next Steps #This project was a great learning experience, but there are a few things I’d improve or expand on in the future:\n1. Replace Lambda ETL with AWS Glue #AWS Glue is serverless, scalable, and better suited for complex ETL tasks. Moving from Lambda to Glue would reduce complexity and provide better schema management.\n2. Implement Data Lake Permissions with AWS Lake Formation #Right now, S3 stores all the data, but there’s no fine-grained access control. AWS Lake Formation would allow better permissions management by setting role-based access policies on stored datasets.\n3. Automate More with Terraform #CloudFormation was great, but Terraform offers multi-cloud compatibility and more flexibility. Migrating infrastructure automation to Terraform would make deployments even smoother.\n4. Expand Data Visualization with Amazon QuickSight #I used Power BI and Grafana, but Amazon QuickSight could be a better alternative for integrating directly with AWS.\nConclusion #In this post, I’ve shared our approach to building a scalable data lake on AWS and the lessons I learned along the way. 
A special thanks to my friends Andy Sanjur, Harris Yearwood and Isaac Ávila for their contributions and support throughout this journey.\nAlong the way, we faced several challenges: debugging network issues, optimizing cost efficiency, and improving security measures. But in overcoming them, we gained valuable hands-on experience with AWS services and infrastructure as code (IaC).\nFinal Thoughts \u0026amp; Takeaways #✅ Cloud infrastructure requires a balance between automation and flexibility\nUsing CloudFormation simplified deployment, but fine-tuning configurations required manual intervention. Next time, I’d explore Terraform for more flexibility.\n✅ ETL pipelines should be scalable and maintainable\nLambda worked for small-scale ETL, but AWS Glue would be a better long-term solution. Step Functions could also improve error handling.\n✅ Cost optimization is an ongoing process\nEven within the free tier, AWS costs can spiral if not monitored. Tools like Cost Explorer and S3 lifecycle policies help control expenses.\n✅ Security should be a priority from the start\nStoring credentials in AWS Secrets Manager, implementing IAM role-based permissions, and blocking public access to S3 were key security measures.\nThank you for reading! If you have any questions or comments, please feel free to contact me. 
Your feedback is highly appreciated.\nKeywords: AWS, Data Lake, Cloud Computing, Database Systems, ETL, CloudFormation, S3, RDS, Lambda, Data Engineering\n","date":"20 December 2024","permalink":"/portfolio/aws-database-data-lake/","section":"Portfolio","summary":"Introduction #Data is the foundation of decision-making in today’s digital world.","title":"Building a Data Lake on AWS"},{"content":"","date":null,"permalink":"/tags/datalake/","section":"Tags","summary":"","title":"Datalake"},{"content":" ","date":null,"permalink":"/portfolio/","section":"Portfolio","summary":" ","title":"Portfolio"},{"content":"","date":null,"permalink":"/tags/classification/","section":"Tags","summary":"","title":"classification"},{"content":" Predicting customer churn is not just about identifying who is likely to leave; it\u0026rsquo;s about understanding the financial implications behind each customer\u0026rsquo;s departure. In addition, one of the complexities in churn prediction is dealing with imbalanced datasets, where the number of non-churning customers vastly outnumbers the churning ones.\nToday, I\u0026rsquo;ll build a cost-sensitive customer churn prediction model using machine learning. I\u0026rsquo;ll delve into data exploration, feature engineering, model training, and evaluation with the goal to minimize financial losses associated with customer churn. At the same time, I\u0026rsquo;ll analyze the performance of a classification model during an imbalance dataset scenario.\nIntroduction #Customer churn is a critical metric for businesses, especially in industries like telecommunications, banking, and subscription-based services. Churn prediction models help identify customers who are likely to discontinue using a company\u0026rsquo;s products or services. 
By proactively addressing churn, companies can implement targeted retention strategies, thereby saving significant revenue.\nTraditional churn prediction models often focus solely on accuracy, neglecting the financial ramifications of different types of prediction errors. For instance, the cost of incorrectly predicting that a loyal customer will churn (false positive) is different from failing to identify a churning customer (false negative). To address this, I adopt a cost-sensitive approach that incorporates the business costs associated with each type of error directly into the model.\nIn this post, I’ll build a churn prediction model that considers these costs. Plus, I’ll address the issue of class imbalance, where most data points are for non-churn customers. Imbalance like this can make the model overlook actual churners, which isn’t helpful for a business looking to keep customers around.\nBusiness Scenario #Before diving into the data and code, it\u0026rsquo;s essential to frame our problem in a real-world business context. Our goal is not just to predict churn but to minimize the financial impact of churn on the business.\nThe Cost Matrix #Define a cost matrix that quantifies the financial consequences of different prediction outcomes:\nPredicted Stay (0) Predicted Churn (1) Actual Stay (0) $0 -$200 Actual Churn (1) -$750 $550 True Negative (TN): Correctly predicting a customer will stay. Cost: $0. False Positive (FP): Predicting a customer will churn when they won\u0026rsquo;t. Cost: -$200 (cost of unnecessary retention efforts). False Negative (FN): Failing to predict a customer will churn. Cost: -$750 (loss due to customer leaving). True Positive (TP): Correctly predicting a customer will churn and taking action. Gain: $550 (benefit from retaining the customer). 
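As a minimal sketch, here is how the cost matrix above could be applied to a batch of predictions to get a single business figure. The function name `total_business_value` and the sample labels are hypothetical, chosen only for illustration; the payoffs are the ones listed in the matrix.

```python
# Payoffs from the cost matrix above, keyed by (actual, predicted).
# Negative values are expenses, positive values are gains.
PAYOFF = {
    (0, 0): 0,     # true negative: customer stays, no action taken
    (0, 1): -200,  # false positive: unnecessary retention effort
    (1, 0): -750,  # false negative: churner missed
    (1, 1): 550,   # true positive: churner identified and retained
}


def total_business_value(y_true, y_pred):
    # Sum the payoff of every (actual, predicted) pair.
    return sum(PAYOFF[(actual, predicted)]
               for actual, predicted in zip(y_true, y_pred))


# Hypothetical labels, for illustration only.
y_true = [0, 0, 1, 1, 0]
y_pred = [0, 1, 1, 0, 0]
print(total_business_value(y_true, y_pred))
```

A figure like this can stand in for accuracy when comparing models or decision thresholds, since it weights each error type by what it costs the business.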
Note: Negative costs represent expenses, while positive costs represent gains.\nBy integrating this cost matrix into our model, I ensure that our predictions align with business objectives, focusing on maximizing profit rather than just statistical accuracy.\nUnderstanding the Class Imbalance Problem #Class imbalance occurs when one class in a classification problem is represented much more than other classes. In churn prediction, typically, most customers do not churn, leading to an imbalanced dataset. This imbalance can bias models towards the majority class, causing poor performance in predicting the minority class.\nThere is ongoing debate in the data science community about the best approach to handle class imbalance:\nResampling Techniques: Such as oversampling the minority class or undersampling the majority class. Class Weighting: Assigning higher weights to the minority class during model training. Algorithmic Adjustments: Using algorithms that are robust to class imbalance. Leave Data As-Is: Some argue that altering the dataset may distort the true distribution, and models should learn from the original data. In this project, I focus on applying class weighting and compare it with models trained on the original imbalanced data.\nDataset Overview #The dataset used in this project is sourced from Kaggle’s Bank Customer Churn Dataset by Radheshyam Kollipara.\nThe dataset includes the following features:\nRowNumber: Represents the row number. CustomerId: Unique identifier for each customer. Surname: The last name of the customer. CreditScore: Credit score of the customer. Geography: Country of residence. Gender: Customer’s gender. Age: Customer’s age. Tenure: Number of years the customer has been with the bank. Balance: Customer’s account balance. NumOfProducts: Number of bank products the customer is using. HasCrCard: Indicates if the customer has a credit card (1) or not (0). IsActiveMember: Indicates if the customer is an active member (1) or not (0). 
EstimatedSalary: Estimated annual income of the customer. Exited: Target variable showing whether the customer has churned (1) or stayed (0). Complain: Indicates if the customer has filed a complaint (1) or not (0). Satisfaction Score: Customer satisfaction score (1-5). Card Type: Type of card held by the customer (e.g., Silver, Gold). Points Earned: Loyalty points accumulated by the customer. Dataset preview:\ndataset.sample(5) RowNumber CustomerId Surname CreditScore Geography Gender Age Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary Exited Complain Satisfaction Score Card Type Point Earned 2503 2504 15583364 McGregor 476 France Female 32 6 111871.93 1 0 0 112132.86 0 0 3 GOLD 988 1362 1363 15683841 Hamilton 555 Germany Male 41 10 113270.20 2 1 1 185387.14 0 0 1 SILVER 398 842 843 15599433 Fanucci 660 Germany Male 35 8 58641.43 1 0 1 198674.08 0 0 5 PLATINUM 815 7919 7920 15634564 Aksyonov 593 Spain Male 31 8 112713.34 1 1 1 176868.89 0 0 2 GOLD 710 3512 3513 15657779 Boylan 806 Spain Male 18 3 0.00 2 1 1 86994.54 0 0 2 GOLD 768 Exploratory Data Analysis (EDA) #Data Cleaning #First, let’s check for missing values:\ndataset.isnull().sum() Result:\nCustomerId 0 CreditScore 0 Geography 0 Gender 0 Age 0 Tenure 0 Balance 0 NumOfProducts 0 HasCrCard 0 IsActiveMember 0 EstimatedSalary 0 Exited 0 Complain 0 Satisfaction Score 0 Card Type 0 Point Earned 0 dtype: int64 All columns have zero missing values.\nNext, let’s drop irrelevant columns that won\u0026rsquo;t contribute to our project\u0026rsquo;s objective:\ndata.drop(columns=[\u0026#39;RowNumber\u0026#39;, \u0026#39;CustomerId\u0026#39;, \u0026#39;Surname\u0026#39;], inplace=True) Statistical Summary #Generate a statistical summary to understand the distribution of numerical features:\ndata.describe() CustomerId CreditScore Age Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary Exited Complain Satisfaction Score Point Earned count 10000 10000 10000 10000 10000 10000 10000 
10000 10000 10000 10000 10000 10000 mean 1.56909e+07 650.529 38.9218 5.0128 76485.9 1.5302 0.7055 0.5151 100090 0.2038 0.2044 3.0138 606.515 std 71936.2 96.6533 10.4878 2.89217 62397.4 0.581654 0.45584 0.499797 57510.5 0.4028 0.4033 1.40592 225.925 min 1.55657e+07 350 18 0 0 1 0 0 11.58 0 0 1 119 25% 1.56285e+07 584 32 3 0 1 0 0 51002.1 0 0 2 410 50% 1.56907e+07 652 37 5 97198.5 1 1 1 100194 0 0 3 605 75% 1.57532e+07 718 44 7 127644 2 1 1 149388 0 0 4 801 max 1.58157e+07 850 92 10 250898 4 1 1 199992 1 1 5 1000 Insights:\nAge: Mean age is around 39 years, with a standard deviation of 10.5 years. Balance: Average balance is \\( \\$76,486 \\), but the standard deviation is high at \\( \\$62,397 \\), indicating significant variability. Exited: Approximately 20% of customers have churned, showing class imbalance. Handling Class Imbalance #Class imbalance can bias the model towards the majority class (non-churned customers). I\u0026rsquo;ll address this issue during model training using techniques like class weighting and threshold adjustment.\nFeature Exploration and Engineering #Understanding relationships between features and the target variable is crucial.\nCorrelation Analysis #Compute the correlation matrix to identify linear relationships:\nKey Findings:\nAge has a weak positive correlation with Exited (0.29), suggesting older customers are more likely to churn. Complain has a strong positive correlation with Exited (0.99). This suggests that complaints occur almost exclusively among churned customers. Further analysis confirms this: almost every customer who churned had filed a complaint before leaving, whereas customers who did not churn rarely filed one. Complaints are therefore a strong indicator of dissatisfaction, which often leads to churn. This feature might lead to data leakage. 
I\u0026rsquo;ll consider dropping it.\ncomplain = dataset.groupby([\u0026#39;Complain\u0026#39;,\u0026#39;Exited\u0026#39;]).size().reset_index(name=\u0026#39;Count\u0026#39;) total = complain[\u0026#39;Count\u0026#39;].sum() complain[\u0026#39;Proportion\u0026#39;] = (complain[\u0026#39;Count\u0026#39;] / total ) Complain Exited Count Proportion 0 0 7952 0.7952 0 1 4 0.0004 1 0 10 0.001 1 1 2034 0.2034 Visualizing Key Features #Age Distribution # Observation:\nChurned customers tend to be older. Balance Distribution # Observation:\nChurned customers generally have higher account balances. Number of Products # The notch represents the mean. Observation:\nCustomers with one product are more likely to churn than those with multiple products. Tenure # Observation:\nTenure distribution is similar for both churned and non-churned customers, with both groups having a median tenure of 5 years. Churned customers display more variability in tenure, indicating they may leave at different stages in their bank relationship. Estimated Salary # Observation:\nChurned and non-churned customers have similar median estimated salaries, around 100,000. Both groups show a similar spread in salaries with no outliers. Satisfaction Score # Observation:\nBoth churned and non-churned customers have the same distribution, with a median of 3 and identical interquartile ranges. The lower and upper fences are also identical, with no outliers for either group. Geography # Observation:\nFrance: Has the largest customer base, with a churn rate of 16.2%. Germany: Shows a significantly higher churn rate at 32.4%, suggesting that German customers are more likely to churn. Spain: Has a churn rate similar to France at 16.7%. These differences indicate that geography might influence churn, with German customers showing a higher likelihood of leaving compared to those from France and Spain.\nCard Type # Observation:\nDiamond Card Holders: Have the highest churn rate at 21.8%. 
Gold Card Holders: Show the lowest churn rate at 19.3%. Platinum and Silver Card Holders: Have similar churn rates, around 20.3% and 20.1%, respectively. These results suggest that Diamond card holders may be more likely to churn, while Gold card holders are slightly more likely to stay. However, the differences in churn rates across card types are relatively small.\nFeature Extraction #Dropping Potentially Problematic Features #I drop the Complain feature due to its near-perfect correlation with Exited, which could cause data leakage:\ndata.drop(columns=[\u0026#39;CustomerId\u0026#39;,\u0026#39;Complain\u0026#39;], inplace=True) Data Preprocessing #Train-Test Split #Split the data into training, validation, and test sets. This is done before feature scaling to avoid data leakage.\nfrom sklearn.model_selection import train_test_split X = dataset.drop(columns=\u0026#39;Exited\u0026#39;) y = dataset[\u0026#39;Exited\u0026#39;] # Initial split to separate out the hold-out set X_train_val, X_test, y_train_val, y_test = train_test_split( X, y, test_size=0.2, random_state=42, stratify=y) # Split the remaining data into training and validation sets X_train, X_val, y_train, y_val = train_test_split( X_train_val, y_train_val, test_size=0.25, random_state=42, stratify=y_train_val) Based on these splits, the final distribution is:\nDataset Total Count Percentage of Total Data Training 6,000 60% Validation 2,000 20% Testing 2,000 20% Column Transformer #from sklearn.compose import ColumnTransformer from sklearn.preprocessing import OneHotEncoder, StandardScaler num_cols = [\u0026#39;CreditScore\u0026#39;, \u0026#39;Age\u0026#39;, \u0026#39;Tenure\u0026#39;, \u0026#39;Balance\u0026#39;, \u0026#39;NumOfProducts\u0026#39;, \u0026#39;EstimatedSalary\u0026#39;, \u0026#39;Satisfaction Score\u0026#39;,\u0026#39;Point Earned\u0026#39;] cat_cols = [\u0026#39;Geography\u0026#39;, \u0026#39;Card Type\u0026#39;, \u0026#39;Gender\u0026#39;] bin_cols = 
[\u0026#39;HasCrCard\u0026#39;,\u0026#39;IsActiveMember\u0026#39;] preprocessor = ColumnTransformer( transformers=[ # One-hot encoding for categorical variables (\u0026#39;one_hot_encoder\u0026#39;, OneHotEncoder(drop=\u0026#39;first\u0026#39;, sparse_output=False), cat_cols), # Standard scaling to numerical features (\u0026#39;standard_scaler\u0026#39;, StandardScaler(), num_cols), ], # Passthrough binary features remainder=\u0026#39;passthrough\u0026#39; ) preprocessor.fit(X_train) X_train = preprocessor.transform(X_train) X_val = preprocessor.transform(X_val) X_test = preprocessor.transform(X_test) feature_names = list(preprocessor.named_transformers_[\u0026#39;one_hot_encoder\u0026#39;] \\ .get_feature_names_out(input_features=cat_cols)) feature_names = feature_names + num_cols + bin_cols Defining the Cost Function #We define a custom cost function to evaluate our models based on the business cost matrix:\nfrom sklearn.metrics import confusion_matrix, make_scorer def cost_function(y_true, y_pred, neg_label=0, pos_label=1): cm = confusion_matrix(y_true, y_pred, labels=[neg_label, pos_label]) cost_matrix = np.array([ [0, -200], # [TN cost, FP cost] [-750, 550] # [FN cost, TP gain] ]) total_gain = np.sum(cm * cost_matrix) return total_gain cost_scorer = make_scorer(cost_function, greater_is_better=True, neg_label=0, pos_label=1) This function calculates the total profit (or loss) for a set of predictions, considering the costs associated with each type of prediction outcome.\nModel Training #I\u0026rsquo;ll train three different models:\nLogistic Regression Random Forest Classifier XGBoost Classifier Cross-Validation #Use stratified k-fold cross-validation to ensure that each fold has a similar class distribution:\nfrom sklearn.model_selection import cross_val_score, StratifiedKFold skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42) Baseline Models #Start by training baseline models without handling class imbalance.\nLogistic Regression #from 
sklearn.linear_model import LogisticRegression lr_base = LogisticRegression(random_state=42) lr_base.fit(X_train, y_train) Random Forest #from sklearn.ensemble import RandomForestClassifier rf_base = RandomForestClassifier(random_state=42) rf_base.fit(X_train, y_train) XGBoost #from xgboost import XGBClassifier #The fraction of positive instances in the y_train_val dataset pos_frac = y_train_val.mean() xgb_base = XGBClassifier(random_state=42, base_score=pos_frac) xgb_base.fit(X_train, y_train) Weighted Models #Applied class weighting to address class imbalance.\nLogistic Regression with Class Weights #lr_balanced = LogisticRegression(random_state=42, class_weight=\u0026#39;balanced\u0026#39;) lr_balanced.fit(X_train, y_train) Random Forest with Class Weights #rf_balanced = RandomForestClassifier(random_state=42, class_weight=\u0026#39;balanced\u0026#39;) rf_balanced.fit(X_train, y_train) XGBoost with Scale Pos Weight #Calculate scale_pos_weight as the ratio of negative to positive classes.\nfrom collections import Counter counter = Counter(y_train) neg_class = counter[0] pos_class = counter[1] scale_pos_weight = neg_class / pos_class xgb_balanced = XGBClassifier(random_state=42, scale_pos_weight=scale_pos_weight, base_score=pos_frac) xgb_balanced.fit(X_train, y_train) Model Evaluation on Validation Data #Evaluation Metrics #Evaluate models using several metrics:\nAccuracy: Overall correctness. Precision: Correct positive predictions over total positive predictions. Recall: Correct positive predictions over actual positives. F1-Score: Harmonic mean of precision and recall. ROC-AUC: Area under the Receiver Operating Characteristic curve. MCC: Delivers a balanced assessment by considering all elements of the confusion matrix. Brier Score: Assesses the accuracy of probabilistic predictions by measuring the mean squared difference between predicted probabilities and actual outcomes. 
Model Name Accuracy Precision Recall F1 Score ROC AUC MCC LR Base 0.8040 0.5564 0.1818 0.2741 0.7719 0.2339 LR Balanced 0.7045 0.3808 0.7224 0.4987 0.7743 0.3492 RF Base 0.8610 0.7452 0.4816 0.5851 0.8418 0.5236 RF Balanced 0.8575 0.7480 0.4521 0.5636 0.8500 0.5065 XGB Base 0.8535 0.6696 0.5528 0.6057 0.8362 0.5203 XGB Balanced 0.8270 0.5695 0.6143 0.5910 0.8364 0.4821 Observation:\nRF stands out with the highest MCC and ROC AUC, indicating robust performance across multiple metrics. While RF Balanced has the higher ROC AUC, RF Base achieves the highest MCC. In terms of the F1 score, XGB Base achieves the best result (0.6057). Cross Validated ROC AUC Scores # Model ROC AUC RF Balanced 0.8273 RF Base 0.8256 XGB Base 0.8143 XGB Balanced 0.8075 LR Balanced 0.7659 LR Base 0.7641 Calibration and Feature Importance #Calibration Plot #Calibration plots help assess how well predicted probabilities reflect actual probabilities.\nBrier Scores (lower is better) # Model Brier Score RF Balanced 0.1068 RF Base 0.1082 XGB Base 0.1111 XGB Balanced 0.1263 LR Base 0.1350 LR Balanced 0.1966 Observation:\nBalanced Random Forest has the best Brier score, making it the most accurate model for probability predictions. Balancing improves Random Forest slightly, but worsens performance for Logistic Regression and slightly for XGBoost. Feature Importance #To determine the most important features, I use SHAP (SHapley Additive exPlanations), which helps to understand how much each feature contributes to a model\u0026rsquo;s predictions.\nObservation:\nAge, NumOfProducts, and IsActiveMember are consistently the most important features in RF and XGB models, indicating that age, number of products and active membership status significantly influence churn. Class weighting has little effect on feature importance for each model. Tuning the Decision Threshold #By default, models classify samples as positive if the predicted probability is ≥ 0.5. 
However, this threshold might not be optimal for our cost-sensitive scenario.\nThreshold Tuning Process #Search for a threshold that maximizes our custom cost function with TunedThresholdClassifierCV from scikit-learn.\nfrom sklearn.model_selection import TunedThresholdClassifierCV tuned_model = TunedThresholdClassifierCV( model, scoring=cost_scorer, #Custom Business Scoring store_cv_results=True, ) tuned_model.fit(X_train, y_train) Results After Threshold Tuning # Observation:\nThe Post-tuned Balanced RF model yields the highest profit at $22,850, making it the top-performing model on the validation data. Both tuned Logistic Regression models result in negative profits. Final Evaluation on Hold-Out Data #Test the models on a hold-out dataset to evaluate their final performance. Due to poor performance, Logistic Regression was excluded from these evaluations.\nModel Profitability # Observation:\nWith the exception of Balanced XGB, the threshold-tuned models outperform all non-tuned models in terms of profit on unseen data. The most profitable model is the threshold-tuned Random Forest baseline. Model Overall # Model Name Accuracy Precision Recall F1 Score ROC AUC MCC Base RF 0.8690 0.7897 0.4877 0.6030 0.8578 0.5518 Post-tuned Base RF 0.7125 0.4016 0.8358 0.5426 0.8578 0.4212 Balanced RF 0.8700 0.8217 0.4632 0.5925 0.8632 0.5526 Post-tuned Balanced RF 0.6480 0.3549 0.8873 0.5070 0.8632 0.3820 Base XGB 0.8470 0.6635 0.5074 0.5750 0.8403 0.4902 Post-tuned Base XGB 0.6585 0.3598 0.8652 0.5083 0.8403 0.3794 Balanced XGB 0.8400 0.5987 0.6544 0.6253 0.8439 0.5247 Post-tuned Balanced XGB 0.6980 0.3874 0.8260 0.5274 0.8439 0.3992 Best Model #Post-tuned Random Forest # Model Name Accuracy Precision Recall F1 Score ROC AUC MCC Post-tuned Base RF 0.7125 0.4016 0.8358 0.5426 0.8578 0.4212 Conclusion #Predicting customer churn is not just about identifying who is likely to leave, but also about understanding the financial impact of losing a customer. 
By adopting a cost-sensitive approach, this project aimed to align predictive modeling with business objectives, ensuring that retention strategies maximize financial gains.\nThe analysis demonstrated that handling class imbalance is crucial for effective churn prediction. While traditional models tend to favor the majority class, applying techniques like class weighting improved recall, allowing the model to better capture customers at risk of churning.\nAmong the models tested, the Post-Tuned Base Random Forest emerged as the most profitable option on the hold-out data, while the Post-Tuned Balanced Random Forest led on the validation data. By fine-tuning decision thresholds, these models optimized the balance between correctly identifying churners and minimizing false positives, ultimately leading to the highest financial gains on unseen data.\nMoreover, feature importance analysis revealed that Age, Number of Products, and Active Membership were key drivers of churn. Understanding these factors allows businesses to develop targeted interventions to retain customers effectively.\nKey Takeaways # Cost-Sensitive Approach Enhances Business Alignment\nIncorporating a cost matrix into the model evaluation ensured that predictions were aligned with financial outcomes rather than pure statistical accuracy. This approach helped balance profit maximization while mitigating unnecessary costs.\nClass Imbalance Affects Model Performance\nThe dataset exhibited a significant class imbalance, which was addressed using class weighting and cost-sensitive learning. 
Balanced models performed better in terms of recall, ensuring that more churn cases were correctly identified.\nPost-Tuned Random Forest Model Performed Best\nAfter threshold tuning, the Post-tuned Base Random Forest emerged as the most profitable model, achieving the highest financial gain on unseen data.\nFeature Importance Highlights Customer Behavior Patterns\nFeatures like Age, Number of Products, and Active Membership had the greatest impact on churn predictions, emphasizing the need for targeted retention strategies for specific customer groups.\nNext Steps # Customer Lifetime Value (CLV) Integration\nFuture models could incorporate Customer Lifetime Value (CLV) into the cost function to prioritize high-value customers and optimize resource allocation.\nDynamic Cost Model\nThe fixed cost matrix assumes uniform costs and gains across all customers, which may not reflect real-world variations. A dynamic, data-driven approach to cost estimation could improve financial decision-making.\nCustomer Experience Consideration\nWhile profit maximization was the primary goal, businesses should balance predictive actions with customer experience to avoid retention strategies that may backfire.\nOperational Readiness for Deployment\nBefore deploying the model, it is crucial to assess real-world constraints such as latency, scalability, and integration with existing customer management systems.\nReferences # Scikit-learn Documentation Learning from imbalanced data - EuroSciPy 2023 What Is Your Model Hiding? A Tutorial on Evaluating ML Models Thank you for reading! If you have any questions or comments, please feel free to contact me. 
Your feedback is highly appreciated.\nKeywords: Machine Learning, Cost-Sensitive Learning, Classification, Data Science, Business Analytics\n","date":"26 October 2024","permalink":"/portfolio/cost-sensitive-model/","section":"Portfolio","summary":"Predicting customer churn is not just about identifying who is likely to leave; it\u0026rsquo;s about understanding the financial implications behind each customer\u0026rsquo;s departure.","title":"Cost-Sensitive Customer Churn Prediction"},{"content":"","date":null,"permalink":"/tags/machine-leaning/","section":"Tags","summary":"","title":"machine leaning"},{"content":"Background #From an early age, I have been passionate about technology. Over time, my interests evolved from computer networks and cybersecurity to data analysis and data science, where I found my true motivation. During university, my favorite courses included research methodologies, statistics, expert systems, and data analysis, which further solidified my commitment to this field. Currently, I program in Python and C++, developing tools to optimize personal tasks and engaging in projects that facilitate continuous learning.\nExperience #I earned a bachelor\u0026rsquo;s degree in Systems and Computer Engineering from Universidad Tecnológica de Panamá. I gained experience through collaborative projects, which honed my skills in data analysis and problem-solving. 
During my final years in college, I took on side jobs in programming and tutoring college students.\nFrom 2022 to 2024, I worked for Credicorp Bank as an Information Security Analyst, handling the following responsibilities:\nFraud investigations, vulnerability assessments, interactive reports, and dashboards using Power BI Administer and maintain the department databases Development and implementation of pipelines for analyzing fraud patterns in on-premise and cloud environments Automation of security tasks and Infrastructure as Code (IaC) using Python Configuration of security policies, Identity and Access Management (IAM), incident detection and response, and infrastructure security in AWS and Azure Thanks to my friends at CCB 🎉 Currently, I am pursuing a Master of Science in Data Analytics at Universidad Tecnológica de Panamá and preparing for a cloud certification. I plan to focus my career on statistics and machine learning. I intend to participate in tech communities, attend hackathons and workshops to expand my knowledge, and network with professionals in the field.\nPython Community Meeting at Pycon 2023 Hobbies #In my spare time, I love playing video games, keeping up with the latest tech, and writing little scripts to make everyday tasks easier.\n","date":null,"permalink":"/about/","section":"Welcome to my webpage!","summary":"Background #From an early age, I have been passionate about technology.","title":"About"},{"content":"","date":null,"permalink":"/tags/cybersecurity/","section":"Tags","summary":"","title":"Cybersecurity"},{"content":"","date":null,"permalink":"/tags/flask/","section":"Tags","summary":"","title":"Flask"},{"content":"","date":null,"permalink":"/tags/machine-learning/","section":"Tags","summary":"","title":"Machine Learning"},{"content":" Introduction #Phishing attacks are one of the most common cybersecurity threats today, tricking users into providing sensitive information through malicious websites. 
With phishing techniques evolving, automated detection systems are crucial to stay ahead. That’s why I built a phishing URL classifier, a machine learning-powered web app that predicts whether a given URL is legitimate or fraudulent.\nIn this blog, I’ll walk through how I developed this project, from feature extraction to model training and building the web app using Flask.\nHow the App Works #The application allows users to input a URL. The system then performs the following steps:\nExtract Features: The app analyzes various characteristics of the URL, such as length, presence of special characters, domain age, and HTTPS usage. Model Prediction: A Random Forest classifier predicts if the URL is phishing (-1) or legitimate (1) based on these features. Display Results: The app returns the classification result along with the probability score. This automated process enables quick phishing detection without requiring manual intervention.\nFeature Extraction: What Makes a URL Suspicious? #A key part of phishing detection is feature engineering, which means defining measurable characteristics of URLs that help determine whether they might be fraudulent. Attackers often use deceptive techniques to trick users, and by analyzing different aspects of a URL, we can detect suspicious patterns. Here are the features our app considers:\nBasic URL Characteristics # URL Length: Longer URLs tend to hide malicious content. Legitimate (\u0026lt; 54 characters) Suspicious (54-75 characters) Phishing (\u0026gt; 75 characters) Presence of \u0026ldquo;@\u0026rdquo; Symbol: If a URL contains an @, it’s often a phishing attempt to mislead users. Redirects (\u0026quot;//\u0026quot; Usage): If a URL contains multiple \u0026ldquo;//\u0026rdquo; outside its normal position, it may be trying to disguise its real destination. Use of Hyphens in Domain: A domain with hyphens (e.g., \u0026ldquo;secure-bank-login.com\u0026rdquo;) is often a sign of phishing. 
Subdomain Count: Legitimate (0-1 subdomains) Suspicious (2 subdomains) Phishing (3+ subdomains) Shortened URLs: Attackers often use services like bit.ly or tinyurl.com to mask phishing links. Domain \u0026amp; Security Indicators # HTTPS Usage \u0026amp; SSL Certificate Validity: A missing HTTPS or an invalid SSL certificate increases phishing risk. Domain Age: New domains (less than 6 months old) are often created for phishing before being flagged. Domain Expiry: If a domain expires in ≤ 1 year, it’s a red flag, as legitimate businesses usually register domains for longer periods. WHOIS Record Availability: If WHOIS data is missing, it may indicate a fraudulent site. Website Structure \u0026amp; Content # Favicon Source: If the website’s favicon (icon in browser tab) is loaded from an external domain, it might be phishing. Non-Standard Ports: Phishing sites may use uncommon ports instead of standard ones like 80 (HTTP) or 443 (HTTPS). \u0026ldquo;HTTPS\u0026rdquo; in Domain Name: If the domain itself contains \u0026ldquo;https\u0026rdquo; (e.g., \u0026ldquo;https-secure-login.com\u0026rdquo;), it’s likely deceptive. External Requests \u0026amp; Links: A high percentage of external requests (images, scripts) can indicate a phishing attempt. Too many external anchor links (clickable links) also suggest the page is redirecting users elsewhere. If external links are placed inside \u0026lt;meta\u0026gt;, \u0026lt;script\u0026gt;, or \u0026lt;link\u0026gt; tags, it’s another suspicious sign. User Interaction \u0026amp; Behavior # Server Form Handler (SFH): If form data is sent to a different domain or an empty handler, the site may be stealing credentials. Submitting Info to Email: If the page sends data via mailto: or uses PHP’s mail() function, it might be phishing. Abnormal URL Format: A legitimate site’s domain should match the actual hostname. If it doesn’t, something’s off. 
Website Forwarding: Legitimate: 0-1 redirects Suspicious: 2-3 redirects Phishing: 4+ redirects Status Bar Manipulation: If the site modifies the status bar (e.g., using JavaScript onMouseOver tricks), it’s likely phishing. Disabling Right-Click: Phishing sites often disable right-click to prevent users from inspecting elements or copying text. Popup Windows with Input Fields: If a popup contains form fields, it might be trying to capture login credentials. Iframe Usage: Phishing sites frequently use \u0026lt;iframe\u0026gt; tags to embed malicious content from another source. Reputation \u0026amp; Popularity # Website Traffic: Sites ranked below 100,000 (by Tranco) are usually legitimate, while low-traffic sites are more suspicious. PageRank Score: If the site has a low PageRank (\u0026lt; 0.2), it’s a potential phishing risk. Google Indexing: If a website isn’t indexed by Google, it might be unsafe. Backlinks (External Links Pointing to Site): Legitimate: More than 2 backlinks Suspicious: 1-2 backlinks Phishing: 0 backlinks Statistical Reports-Based Analysis: If the domain appears in phishing databases (like PhishTank), it’s almost certainly malicious. By extracting all these features, the app builds a dataset that helps identify phishing attempts with greater accuracy. These signals work together to detect patterns commonly found in fraudulent sites, improving our ability to flag suspicious URLs before they cause harm.\nTraining the Machine Learning Model #Detecting phishing websites requires a strong classification model capable of identifying deceptive patterns in URLs. I chose a Random Forest classifier, a powerful ensemble learning algorithm that effectively handles complex data structures while offering interpretability. Below is a breakdown of the process I followed:\n1. 
Exploratory Data Analysis (EDA) #Before diving into model training, I conducted an Exploratory Data Analysis (EDA) to better understand the dataset and its characteristics.\nDataset Overview #The dataset was downloaded from the UCI Machine Learning Repository, where it was donated by R. Mohammad and L. McCluskey in 2015. How the data was collected is not documented, but the features are well documented. The dataset contains 11,055 URLs, with 6,157 labeled as phishing (1) and 4,898 as legitimate (-1). This slight class imbalance means phishing websites are slightly more prevalent, but still close enough that it doesn’t require drastic resampling techniques.\nTo get a clearer picture, I analyzed the dataset’s features and correlations to identify the strongest indicators of phishing behavior.\nhaving_IP_Address URL_Length Shortining_Service having_At_Symbol double_slash_redirecting Prefix_Suffix having_Sub_Domain SSLfinal_State Domain_registeration_length Favicon port HTTPS_token Request_URL URL_of_Anchor Links_in_tags SFH Submitting_to_email Abnormal_URL Redirect on_mouseover RightClick popUpWidnow Iframe age_of_domain DNSRecord web_traffic Page_Rank Google_Index Links_pointing_to_page Statistical_report Result 0 -1 1 1 1 -1 -1 -1 -1 -1 1 1 -1 1 -1 1 -1 -1 -1 0 1 1 1 1 -1 -1 -1 -1 1 1 -1 -1 1 1 1 1 1 1 -1 0 1 -1 1 1 -1 1 0 -1 -1 1 1 0 1 1 1 1 -1 -1 0 -1 1 1 1 -1 2 1 0 1 1 1 -1 -1 -1 -1 1 1 -1 1 0 -1 -1 -1 -1 0 1 1 1 1 1 -1 1 -1 1 0 -1 -1 3 1 0 1 1 1 -1 -1 -1 1 1 1 -1 -1 0 0 -1 1 1 0 1 1 1 1 -1 -1 1 -1 1 -1 1 -1 4 1 0 -1 1 1 -1 1 1 -1 1 1 1 1 0 0 -1 1 1 0 -1 1 -1 1 -1 -1 0 -1 1 1 1 1 Feature Correlations # The heatmap analysis shows the correlation between each feature and the target variable (phishing or legitimate). Key observations include:\nSSL Certificate Validity: A strong positive correlation (0.71) indicates that websites with a valid SSL certificate are much more likely to be legitimate. Phishing sites often lack proper certificates. Anchor Tags Linking to External Domains: A high positive correlation (0.69) suggests that phishing sites tend to have a high percentage of outbound anchor links redirecting users to different domains. 
Presence of Subdomains: A moderate positive correlation (0.30) indicates that the presence of multiple subdomains can be a sign of phishing activity. Prefix/Suffix in Domain: A moderate positive correlation (0.35) suggests that the presence of hyphens or other prefixes/suffixes in the domain name can be indicative of phishing. Request URLs from External Sources: A positive correlation (0.25) suggests that a higher proportion of externally loaded resources (images, scripts) can be a red flag. Domain Registration Length: A negative correlation (-0.23) suggests that phishing sites are more likely to have shorter domain registration periods. This heatmap effectively visualizes the relationships between features and the likelihood of a website being phishing, highlighting the most influential factors.\nHandling Imbalanced Data #Although the dataset isn’t severely imbalanced, phishing URLs slightly outnumber legitimate ones. To ensure the model learned effectively from both classes, I applied class weighting instead of oversampling or undersampling. This prevents the model from being biased toward the majority class.\n2. Model Selection \u0026amp; Training #To find the best model for phishing detection, I tested several algorithms using LazyPredict, an automated benchmarking tool. The top-performing models included:\nModel Accuracy Balanced Accuracy ROC AUC F1 Score Extra Trees Classifier 97.6% 97.6% 97.6% 97.6% Random Forest 97.4% 97.3% 97.3% 97.4% XGBoost 97.3% 97.2% 97.2% 97.3% 🔹 Why Random Forest?\nWhile Extra Trees performed slightly better, Random Forest provided comparable accuracy while being easier to interpret. It also handles overfitting well by averaging multiple decision trees, ensuring robust performance on new data.\nCross-Validation #To validate the model’s stability, I performed 5-fold cross-validation, which confirmed a mean accuracy of 97.1%. This consistency across different splits of the dataset indicated that the model generalizes well.\n3. 
Model Evaluation #Once trained, I evaluated the Random Forest classifier on the test set, which contained 2,211 URLs. The results were impressive:\n✔️ Accuracy: 98% – The model correctly classified 98% of phishing and legitimate URLs.\n✔️ Precision: 98% – Out of all URLs classified as phishing, 98% were truly phishing sites.\n✔️ Recall: 98% – The model successfully detected 98% of actual phishing URLs.\n✔️ F1 Score: 98% – A high balance between precision and recall.\n✔️ ROC-AUC Score: 98% – Indicates strong performance in distinguishing between phishing and legitimate sites.\nConfusion Matrix Analysis #A confusion matrix helps visualize the model’s performance:\nPredicted Legitimate (-1) Predicted Phishing (1) Actual Legitimate (-1) 951 (True Negatives) 29 (False Positives) Actual Phishing (1) 21 (False Negatives) 1210 (True Positives) 🔹 False Positives (29 cases): These are legitimate URLs incorrectly flagged as phishing. A lower false positive rate reduces unnecessary user frustration.\n🔹 False Negatives (21 cases): These are phishing URLs incorrectly classified as legitimate. Minimizing false negatives is crucial since missing a phishing attempt can lead to security breaches.\nPrecision vs. Recall Tradeoff # A high precision means the model makes fewer false accusations (legitimate sites misclassified as phishing). A high recall means the model catches more phishing sites but might flag some legitimate ones by mistake. With both at 98%, the model achieves an excellent balance. Building the Web App with Flask #Once the model was trained, I built a Flask web application to allow users to interact with it. The app consists of:\nFrontend (HTML, CSS): A simple UI where users enter a URL. Backend (Flask API): The /predict endpoint receives the URL input. The FeatureExtractor class extracts relevant features. The Random Forest model predicts whether the URL is phishing. Results are returned as a JSON response. 
Flask predict route:\n@app.route(\u0026#34;/predict\u0026#34;, methods=[\u0026#34;POST\u0026#34;]) def predict(): try: data = request.get_json() if not data or \u0026#34;url\u0026#34; not in data: return jsonify({\u0026#34;success\u0026#34;: False, \u0026#34;message\u0026#34;: \u0026#34;No URL provided.\u0026#34;}), 400 url = data[\u0026#34;url\u0026#34;] # Preprocess the input data extractor = FeatureExtractor(url) X_processed = extractor.extract_all_features() features = parse_features(X_processed) # Make prediction prediction = model.predict(X_processed) probability = model.predict_proba(X_processed) probability = np.max(probability) return jsonify({ \u0026#34;success\u0026#34;: True, \u0026#34;prediction\u0026#34;: int(prediction[0]), \u0026#34;probability\u0026#34;: probability, \u0026#34;features\u0026#34;: features }) except Exception as e: logging.error(f\u0026#34;Error: {e}\u0026#34;) status_code = extract_status_code(str(e)) if status_code: return jsonify({\u0026#34;success\u0026#34;: False, \u0026#34;message\u0026#34;: status_code}), 500 else: return jsonify({\u0026#34;success\u0026#34;: False, \u0026#34;message\u0026#34;: \u0026#34;Invalid URL\u0026#34;}), 500 This API enables real-time URL classification, making phishing detection accessible to users.\nKey Directories and Files:\napp/: Contains Flask application files. static/: Static assets like CSS and JavaScript. templates/: HTML templates. __init__.py: Initializes the Flask app and caching. routes.py: Defines Flask routes and prediction logic. data/: Data storage. raw/: Original, unprocessed data. processed/: Cleaned and processed data. external/: External datasets or resources. notebooks/: Jupyter notebooks for exploration and modeling. src/: Source code for ML pipelines. feature_pipeline.py: Feature engineering and selection. model_pipeline.py: Model training and evaluation. inference_pipeline.py: Data inference for direct predict in console. config.py: Configuration parameters. 
utils.py: Utility functions. models/: Serialized models and pipelines. phishing_model.pkl: Trained machine learning model. reports/: Documentation and reports. requirements.txt: Python dependencies. setup.py: Package setup script. run_pipeline.py: Script to execute ML pipelines. run_app.py: Script to start the Flask application. Dockerfile: Docker configuration for containerization. .gitignore: Specifies files and directories to ignore in Git. README.md: Project documentation. App Interface #Here’s how the application looks in action:\n1. Inputting a URL #The main interface provides a simple input field where users can enter a URL to check for phishing threats.\n2. Scanning the URL #Once the URL is submitted, the app processes the request and returns a prediction. Below, the URL \u0026ldquo;randolphrogers.me\u0026rdquo; has been classified as safe with 95.00% probability.\n3. Debug View #For deeper insights, a debug version shows a breakdown of all extracted features and their individual scores, giving transparency to the classification process.\nResults and Analysis #After successfully building the app and feature extraction pipeline, I tested the model on completely new data, including confirmed phishing sites. However, the results were disappointing. The model, which had performed almost perfectly during evaluation, struggled to correctly classify phishing sites in real-world scenarios.\nIdentifying the Issue: Overfitting or Dataset Limitations? #At first, I suspected overfitting. I revisited my training and testing procedures, but all performance metrics suggested a well-trained model. To further investigate, I created a new holdout dataset, simulating real-world conditions, and evaluated the model again. The results? 
Excellent performance, just like in training.\nThis raised a critical question: Why did the model fail on actual phishing sites but perform well on test data?\nDebugging with Feature Inspection #Using the app’s debug mode, I manually examined the results of every incorrectly classified phishing site, comparing their feature values with what the model had learned. This led to a key discovery:\nEvery new phishing website followed almost all of the most important phishing detection features from my dataset.\nThe real issue became evident. The dataset I used for training was obsolete.\nThe Cybersecurity Arms Race: Why Fresh Data Matters #In cybersecurity, there is an ongoing race between attackers (red team) and defenders (blue team). New phishing techniques emerge as security measures evolve, and old detection patterns become ineffective. My dataset was outdated, meaning the model had learned to detect past phishing trends rather than the latest threats.\nSeeking Updated Data: A New Dataset, New Challenges #After realizing this, I searched for a more recent dataset. The best I found was collected two years later than my original dataset. However, it had only 17 features compared to my 30. I retrained and tested the model using this dataset, and while the results were slightly weaker, they were still comparable.\nThis confirmed that while data freshness is critical, feature richness also plays a huge role in maintaining strong model performance.\nLimitations of Modern Data Collection #One of the biggest challenges in cybersecurity-related machine learning is access to up-to-date data. 
Many sources that previously provided useful insights are no longer available.\nFor example:\nAlexa Internet, which provided web traffic rankings for millions of websites, was shut down in 2022. Several key threat intelligence databases now restrict access behind costly APIs or enterprise-level services. Many features from my original dataset are now harder to extract due to increased security measures on websites. For a side project, these costs are prohibitive, making it difficult to continuously update and improve the model.\nReflections: What This Project Taught Me #While the results were not what I expected, this project turned out to be a valuable learning experience. It forced me to:\nReevaluate my training process and test my model under more realistic conditions. Develop alternative evaluation methods to simulate real-world data. Think critically about data validity, rather than just model accuracy. This experience reinforced a key lesson: in cybersecurity, models are only as good as the data they are trained on.\nAdditionally, I realized that the probability score displayed by my model might not be calibrated properly. Users might interpret it differently than what the model actually represents. A probability calibration step could improve interpretability.\nChallenges and Lessons Learned #Every project presents obstacles, and this one was no different. Here are some key challenges I faced:\nOutdated Training Data. The dataset I used was no longer effective in identifying modern phishing attacks. Limited WHOIS Data. WHOIS records were often incomplete, limiting domain age analysis. Balancing Model Performance. Reducing false positives was crucial. Incorrectly flagging legitimate sites could create user frustration. Access to Fresh Data. 
Many useful data sources are now restricted behind paid services, limiting feature extraction capabilities. Despite these challenges, I gained invaluable insights into both machine learning in cybersecurity and the importance of continuously evolving datasets.\nFuture Improvements #There is always room for improvement. Here are a few areas I would like to explore next:\n✅ Use Deep Learning. Experiment with neural networks for improved classification accuracy.\n✅ Enhance Feature Engineering. Explore new feature extraction techniques, especially from webpage content analysis.\n✅ Integrate Threat Intelligence. Cross-check URLs against real-time phishing databases for better validation.\n✅ Deploy as a Browser Extension. Allow users to check URLs directly from their browsers, making the tool more accessible.\n✅ Calibrate Model Probability Scores. Ensure displayed probabilities reflect actual confidence levels rather than misleading users.\nConclusion #This project was an exciting blend of cybersecurity and machine learning, allowing me to build a practical tool that can help users stay safe online. By extracting key features from URLs and using a trained model for classification, the app provides an automated phishing detection system.\nHowever, the biggest takeaway was not about model accuracy. It was about data relevance. No matter how advanced a machine learning model is, if it is trained on outdated information, its predictions will become unreliable over time.\nMoving forward, I aim to explore more dynamic methods for continuously updating and adapting phishing detection models.\nThank you for reading. 
If you are interested in similar projects or have suggestions for enhancements, feel free to reach out.\nKeywords: Phishing Detection, Machine Learning, Flask, Cybersecurity, URL Classification\n","date":"26 September 2024","permalink":"/portfolio/phishing-domain-classifier/","section":"Portfolio","summary":"Introduction #Phishing attacks are one of the most common cybersecurity threats today, tricking users into providing sensitive information through malicious websites.","title":"Phishing Domain Classifier"},{"content":"","date":null,"permalink":"/tags/covid-19/","section":"Tags","summary":"","title":"Covid-19"},{"content":"","date":null,"permalink":"/tags/nlp/","section":"Tags","summary":"","title":"NLP"},{"content":"","date":null,"permalink":"/tags/sentiment-analysis/","section":"Tags","summary":"","title":"Sentiment Analysis"},{"content":" Introduction #Social media has become a key platform for discussions on global issues, and the Covid-19 pandemic was no exception. Millions of users shared their opinions on Twitter regarding Covid-19 vaccines, ranging from strong approval to skepticism and misinformation. To understand these opinions better, I conducted a Twitter Covid Vaccine Sentiment Analysis using Natural Language Processing (NLP) for an assignment during my bachelor\u0026rsquo;s degree.\nThis project aimed to explore how the public reacted to Covid-19 vaccines over time, which vaccines were more favored, and how misinformation played a role in shaping discussions. In this blog, I’ll walk you through the data collection process, sentiment analysis techniques, and key insights obtained from over 614,000 tweets related to Covid-19 vaccines.\nData Collection \u0026amp; Preprocessing #1. Data Source #The dataset was obtained from Kaggle, which contained tweets about Covid-19 vaccines collected by different users. 
The dataset consisted of two main sources:\nCovid Vaccine Tweets COVID-19 All Vaccines Tweets These datasets were merged, resulting in a final dataset of 614,074 tweets spanning from January 2020 to April 2022. The dataset provided an extensive snapshot of public sentiment throughout different stages of the pandemic, including vaccine development, approvals, and rollouts.\niduser_nameuser_locationuser_descriptionuser_createduser_followersuser_friendsuser_favouritesuser_verifieddatetexthashtagssourceretweetsfavoritesis_retweet01340539111971516416Rachel RohLa Crescenta-Montrose, CAAggregator of Asian American news; scanning diverse sources 24/7/365. RT\\'s, Follows and \\'Likes\\' will fuel me 👩\\u200d💻2009-04-08 17:52:4640516923247False2020-12-20 06:06:44Same folks said daikon paste could treat a cytokine storm #PfizerBioNTech https://t.co/xeHhIMg1kF[\\'PfizerBioNTech\\']Twitter for Android00False11338158543359250433Albert FongSan Francisco, CAMarketing dude, tech geek, heavy metal \u0026amp; \\'80s music junkie. Fascinated by meteorology and all things in the cloud. Opinions are my own.2009-09-21 15:27:30834666178False2020-12-13 16:27:13While the world has been on the wrong side of history this year, hopefully, the biggest vaccination effort we\\'ve ev… https://t.co/dlCHrZjkhmNaNTwitter Web App11False 2. Preprocessing Steps #Before applying sentiment analysis, the data underwent extensive cleaning and transformation to remove noise and standardize text for analysis. The following steps were implemented:\nEliminating Duplicate Tweets and Bot-Generated Content: To avoid skewing results. Removing URLs, mentions, and hashtags to focus only on the textual content. Tokenization: Splitting sentences into individual words. Lemmatization: Converting words to their base forms (e.g., “running” → “run”). Removing Stop Words: Filtering out common words like “the,” “and,” and “is” that don’t contribute to sentiment. 
Handling Special Characters and Emojis: Converting emojis into text representations to retain sentiment. For these tasks, Python libraries such as TextBlob, NLTK, pandas, and NeatText were used. The goal was to create a dataset that accurately reflects human sentiment without irrelevant data points affecting the results. After cleaning, the resulting dataset consisted of 482,523 tweets.\nSentiment Analysis Methodology #1. Sentiment Classification #Each tweet was classified into one of three sentiment categories:\nPositive: Favorable opinions about Covid-19 vaccines. Neutral: Informational or non-opinionated tweets. Negative: Skepticism, misinformation, or distrust toward vaccines. This classification was done using TextBlob, a Python library that assigns polarity scores to text:\nPolarity ranges from -1 (negative) to +1 (positive). A polarity score \u0026gt;0 is considered positive, \u0026lt;0 is negative, and 0 is neutral. 2. Subjectivity Analysis #We also measured subjectivity, which indicates how factual vs. opinionated a tweet is, on a scale from 0 (objective) to 1 (subjective). Subjectivity scores helped distinguish factual news reports from personal opinions, allowing us to see how much of the vaccine discourse was based on emotions rather than verifiable facts.\nSentiment Analysis with TextBlob #This Python function leverages the TextBlob library to analyze the sentiment of a given text input. 
It returns a dictionary containing the polarity, subjectivity, and overall sentiment classification of the text.\nfrom textblob import TextBlob def analyze_sentiment(text): analysis = TextBlob(text) polarity = analysis.sentiment.polarity subjectivity = analysis.sentiment.subjectivity if polarity \u0026gt; 0: sentiment = \u0026#39;Positive\u0026#39; elif polarity == 0: sentiment = \u0026#39;Neutral\u0026#39; else: sentiment = \u0026#39;Negative\u0026#39; result = { \u0026#39;polarity\u0026#39;: polarity, \u0026#39;subjectivity\u0026#39;: subjectivity, \u0026#39;sentiment\u0026#39;: sentiment } return result Key Findings #1. Overall Sentiment Distribution # The dataset showed the following sentiment distribution:\n42.6% Positive 43.8% Neutral 13.6% Negative This indicates that while neutral tweets formed the largest group, positive sentiment toward vaccines outweighed negative sentiment by roughly three to one. This is an encouraging insight, showing that, despite vaccine hesitancy and misinformation, social media users were largely supportive or at least informative about vaccines.\n2. Vaccine-Specific Sentiment #The sentiment scores for different vaccines were as follows:\nVaccine Polarity Subjectivity Pfizer 0.1163 0.3176 AstraZeneca 0.114 0.2685 Sputnik 0.1082 0.3041 Covaxin 0.1080 0.2541 Moderna 0.1047 0.2954 Pfizer had the highest average polarity, although the gap between vaccines was small (0.1047 to 0.1163). Moderna had the lowest polarity but was still above 0, indicating positive sentiment overall. Covaxin had the lowest subjectivity, meaning more objective statements were made about it. These results reflect how different vaccines were received by the public and provide insights into brand trust and perception.\n3. Time-Series Analysis of Sentiment # Analyzing sentiment over time revealed key trends:\nEarly 2020 had low tweet activity about vaccines due to the lack of available information. Sentiment spiked in December 2020, aligning with the release of Pfizer’s vaccine under EUA. 
The highest spike in sentiment occurred in August 2021, coinciding with the approval of the third dose in the U.S. 4. Most Common Words in Sentiment Categories #Using word clouds, we identified frequently used words in different sentiment categories:\nPositive Words: # Vaccine, Efficient, Thankful, Safe, Amazing, Voluntary Negative Words: # Dangerous, Scared, Misinformation, Side-effects, Risky Neutral Words: # Vaccine, Doses, Health, Available, Announcement Challenges \u0026amp; Limitations #While the analysis provided valuable insights, it also faced some limitations:\nBias in Twitter Data: The dataset may not represent the global population’s opinion. Irony \u0026amp; Sarcasm Detection: Some tweets with sarcasm may have been misclassified. Bot-Generated Tweets: Despite filtering, some automated tweets could have influenced results. Conclusion \u0026amp; Takeaways #This project provided a data-driven perspective on public sentiment toward Covid-19 vaccines, highlighting key trends and reactions. The main takeaways are:\nPublic sentiment was largely neutral to positive. Pfizer had the most positive perception among vaccines. Sentiment spiked during key vaccine approval milestones. Understanding public opinion is crucial for public health campaigns, combating misinformation, and improving vaccine distribution strategies. Future improvements could include deep learning sentiment models and real-time analysis of vaccine perception.\nThank you for reading! If you have any questions or comments, please feel free to contact me. 
Your feedback is highly appreciated.\nKeywords: NLP, Sentiment Analysis, Covid, Machine Learning, Data Science\n","date":"5 August 2022","permalink":"/portfolio/twitter-covid-vaccines-sentiment/","section":"Portfolio","summary":"Introduction #Social media has become a key platform for discussions on global issues, and the Covid-19 pandemic was no exception.","title":"Twitter Covid-19 Vaccine Sentiment Analysis"},{"content":"","date":null,"permalink":"/tags/twitter-data/","section":"Tags","summary":"","title":"Twitter Data"}]