Machine Learning Methods

Evaluating Personal Job Market Prospects in 2024

Author

Anu Sharma, Cindy Guzman, Gavin Boss

Published

October 11, 2025

1 Overview

The analysis examines trends in Business Analytics, Data Science, and Machine Learning job postings, with a focus on the skills required for these roles. The study evaluates how varying skill combinations influence salary levels, remote work availability, and career progression pathways.

This analysis applies three machine learning approaches to job posting data: clustering to group roles by skill requirements, regression to examine how skills and experience influence salary, and classification to distinguish ML/Data Science positions from Business Analytics and other jobs. Using 25 technical skills along with experience and remote work indicators, the analysis shows that Business Analytics dominates the market (35% of postings with valid salary data), while ML and DS remain smaller but specialized segments. Results highlight that experience is the strongest salary driver, that jobs group into six clusters with different pay and remote work patterns, and that BA, ML, and DS roles each display distinct skill signatures that make them easy to differentiate.

2 Data Loading and Setup

The analysis starts by loading the Lightcast job postings dataset and identifying relevant skill columns. The dataset contains comprehensive information about job postings including titles, salaries, required skills, and other job characteristics.

Code

import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.io as pio
import json
import re
from collections import Counter

pio.templates.default = "plotly_white"
pio.renderers.default = "notebook"

# Load data from csv
df = pd.read_csv("data/lightcast_job_postings.csv", low_memory=False)
print(f"Dataset loaded: {len(df):,} rows, {len(df.columns)} columns")

# print(df.head())

Dataset loaded: 72,498 rows, 131 columns

2.1 Important Skills columns

The dataset contains multiple skill-related columns. After examining the schema, the columns ‘SKILLS_NAME’, ‘SOFTWARE_SKILLS_NAME’ and ‘SPECIALIZED_SKILLS_NAME’ provide the most detailed skill information for this analysis. These columns list the specific technical skills mentioned in each job posting.

3 Skills Data Preprocessing

The next step involves filtering the data to include only records with valid salary and title information. Then, binary features are created for 25 key technical skills covering ML, Data Science, and Business Analytics domains to enable machine learning analysis.

Code

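# Apply filters. Take an explicit copy so the column assignment below
# modifies df_filtered itself rather than a view of df.
df_filtered = df.dropna(subset=['SALARY', 'TITLE']).copy()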

# Convert salary to numeric and filter
df_filtered['SALARY'] = pd.to_numeric(df_filtered['SALARY'], errors='coerce')
df_filtered = df_filtered[df_filtered['SALARY'] > 0]

print(f"Records after filtering: {len(df_filtered):,}")

df_skills = df_filtered.copy()

# Focus on key Business Analytics/ML/Data Science skills. Key skills for
# BA/ML/DS roles identified manually.
key_skills =  [
        'Python (Programming Language)',
        'R (Programming Language)',
        'SQL (Programming Language)',
        'Machine Learning',
        'Data Science',
        'Data Analysis',
        'Statistics',
        'Artificial Intelligence',
        'TensorFlow',
        'PyTorch (Machine Learning Library)',
        'Pandas (Python Package)',
        'NumPy (Python Package)',
        'Scikit-Learn (Python Package)',
        'Big Data',
        'Apache Spark',
        'Apache Hadoop',
        'Amazon Web Services',
        'Microsoft Azure',
        'Google Cloud Platform (Gcp)',
        'Data Visualization',
        'Tableau (Business Intelligence Software)',
        'Power BI',
        'Natural Language Processing (NLP)',
        'Computer Vision',
        'Deep Learning'
    ]

print(f"Using focused {len(key_skills)} BA/ML/DS technical skills for analysis")

# Create binary features for each key skill.
for skill in key_skills:
    # Clean skill name for column naming
    # Eg: R (Programming Language) --> has_r_programming_language
    skill_col_name = f'has_{skill.lower().replace(" ", "_").replace("-", "_").replace("(", "").replace(")", "")}'


    df_skills[skill_col_name] = (
        df_skills['SKILLS_NAME'].str.contains(skill, case=False, na=False, regex=False) |
        df_skills['SOFTWARE_SKILLS_NAME'].str.contains(skill, case=False, na=False, regex=False) |
        df_skills['SPECIALIZED_SKILLS_NAME'].str.contains(skill, case=False, na=False, regex=False)
    ).astype(int)

print("Binary skill features created")

# Create ML/DS role indicator using focused skills
core_ml_skills = [
    'has_machine_learning', 'has_artificial_intelligence', 'has_tensorflow', 'has_pytorch_machine_learning_library',
    'has_deep_learning', 'has_natural_language_processing_nlp', 'has_computer_vision'
]

core_ds_skills = [
    'has_python_programming_language', 'has_r_programming_language', 'has_statistics',
    'has_data_science', 'has_pandas_python_package', 'has_numpy_python_package',
    'has_scikit_learn_python_package', 'has_big_data'
]

core_ba_skills = [
    'has_data_analysis', 'has_data_visualization', 'has_sql_programming_language',
    'has_tableau_business_intelligence_software', 'has_power_bi'
]

# Role indicators
# ML roles are straightforward.
df_skills['is_ml_role'] = (
    (df_skills[core_ml_skills].sum(axis=1) > 0)
).astype(int)

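# R is primarily associated with the Data Science field, so a job counts as
# a DS role if it requires R or lists more than one data science skill.
# The == comparison needs its own parentheses: | binds tighter than ==, so
# the unparenthesized form reduces to the R check alone. The outputs recorded
# in this report were produced with that narrower rule.
df_skills['is_ds_role'] = (
    (df_skills['has_r_programming_language'] == 1) | (df_skills[core_ds_skills].sum(axis=1) > 1)
).astype(int)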

# Business Analytics roles typically require SQL, visualization tools (Tableau, Power BI)
# and data analysis capabilities. If a job has two or more BA skills, consider it a BA role.
df_skills['is_ba_role'] = (
    df_skills[core_ba_skills].sum(axis=1) >= 2
).astype(int)

# Remote work indicator
df_skills['is_remote'] = df_skills['REMOTE_TYPE'].fillna(0).astype(int)
df_skills['experience_years'] = df_skills['MIN_YEARS_EXPERIENCE'].fillna(0)

df_final = df_skills
print(f"Final dataset size: {len(df_final):,}")
print(f"ML roles identified: {df_final['is_ml_role'].sum():,}")
print(f"Data Science roles identified: {df_final['is_ds_role'].sum():,}")
print(f"Business Analytics roles identified: {df_final['is_ba_role'].sum():,}")

Records after filtering: 30,808
Using focused 25 BA/ML/DS technical skills for analysis

Binary skill features created
Final dataset size: 30,808
ML roles identified: 3,226
Data Science roles identified: 2,877
Business Analytics roles identified: 10,831

For each of the 25 key skills, a binary indicator variable is created (1 if the skill is mentioned, 0 otherwise). This transforms the text skill data into numerical features suitable for machine learning models.
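The indicator construction can be illustrated on a toy table. This is a minimal sketch: the three rows are invented, only two of the skill columns are included, and `skill_column` is a hypothetical helper that mirrors the column-naming rule used above.

```python
import pandas as pd

# Toy postings: column names mirror the dataset, the rows are invented.
toy = pd.DataFrame({
    "SKILLS_NAME": [
        "SQL (Programming Language), Tableau (Business Intelligence Software)",
        None,
        "Deep Learning",
    ],
    "SOFTWARE_SKILLS_NAME": ["Power BI", "Python (Programming Language)", None],
})

skills = ["SQL (Programming Language)", "Power BI", "Deep Learning"]


def skill_column(skill: str) -> str:
    """Turn a skill label into a has_* column name (hypothetical helper)."""
    cleaned = (
        skill.lower().replace(" ", "_").replace("-", "_").replace("(", "").replace(")", "")
    )
    return f"has_{cleaned}"


for skill in skills:
    found = pd.Series(False, index=toy.index)
    for col in ["SKILLS_NAME", "SOFTWARE_SKILLS_NAME"]:
        # regex=False treats the parentheses literally;
        # na=False maps missing text to "skill not found".
        found |= toy[col].str.contains(skill, case=False, na=False, regex=False)
    toy[skill_column(skill)] = found.astype(int)

print(toy.filter(like="has_"))
```

Because the match is a plain substring test, a short skill label that happens to appear inside a longer one would also be flagged, which is worth keeping in mind when choosing skill names.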

3.1 Role Classification Logic

Three role categories are identified based on technical skills:

ML roles: Require at least one ML/AI skill (Machine Learning, Artificial Intelligence, TensorFlow, PyTorch, Deep Learning, NLP, Computer Vision)
Data Science roles: Require R programming, or at least two of the core data science skills (Python, R, Statistics, Data Science, Pandas, NumPy, Scikit-learn, Big Data)
Business Analytics roles: Require at least two of SQL, Data Analysis, Data Visualization, Tableau, and Power BI

The analysis examines how these specialized skills impact salary and career opportunities. Machine learning models are used to find patterns that can guide job seekers in choosing which skills to develop.
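Rules that combine a comparison with `|` depend on how the expression is parenthesized. The sketch below uses an invented three-row table and a hypothetical `ds_skill_count` column to show the difference. In Python, `|` binds more tightly than `==`, so each comparison should be wrapped or named before it is combined.

```python
import pandas as pd

toy = pd.DataFrame({
    "has_r_programming_language": [1, 0, 0],
    "ds_skill_count": [1, 3, 0],  # hypothetical: number of core DS skills per posting
})

uses_r = toy["has_r_programming_language"] == 1
multi_skill = toy["ds_skill_count"] > 1

# Intended rule: R, or more than one data science skill.
toy["is_ds_role"] = (uses_r | multi_skill).astype(int)

# Without parentheses this parses as  has_r == (1 | multi_skill),
# which ignores the skill count entirely.
unparenthesized = toy["has_r_programming_language"] == 1 | multi_skill

print("parenthesized:  ", toy["is_ds_role"].tolist())
print("unparenthesized:", unparenthesized.astype(int).tolist())
```

The second posting has three data science skills but no R, so only the parenthesized rule flags it.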

4 Feature Engineering for ML

Before building models, the dataset is prepared by selecting relevant columns. This includes the salary (target variable), skill indicators, remote work status, and experience years.

Code

# Just prepare the modeling dataset
modeling_cols = ['SALARY', 'is_ml_role', 'is_ds_role', 'is_ba_role', 'is_remote', 'experience_years'] + \
            [col for col in df_final.columns if col.startswith('has_')]

df_modeling = df_final[modeling_cols].copy()

print("Features for modeling:")
print(f"Dataset shape: {df_modeling.shape}")
print(f"Columns: {list(df_modeling.columns)}")
print(f"Missing values: {df_modeling.isnull().sum().sum()}")

Features for modeling:
Dataset shape: (30808, 31)
Columns: ['SALARY', 'is_ml_role', 'is_ds_role', 'is_ba_role', 'is_remote', 'experience_years', 'has_python_programming_language', 'has_r_programming_language', 'has_sql_programming_language', 'has_machine_learning', 'has_data_science', 'has_data_analysis', 'has_statistics', 'has_artificial_intelligence', 'has_tensorflow', 'has_pytorch_machine_learning_library', 'has_pandas_python_package', 'has_numpy_python_package', 'has_scikit_learn_python_package', 'has_big_data', 'has_apache_spark', 'has_apache_hadoop', 'has_amazon_web_services', 'has_microsoft_azure', 'has_google_cloud_platform_gcp', 'has_data_visualization', 'has_tableau_business_intelligence_software', 'has_power_bi', 'has_natural_language_processing_nlp', 'has_computer_vision', 'has_deep_learning']
Missing values: 0

The modeling dataset now contains binary skill features, experience, remote work indicator, and salary information. This structured format allows application of various machine learning techniques.

5 Unsupervised Learning

5.1 KMeans Clustering Based on Skills

The first machine learning approach uses KMeans clustering to discover natural groupings in the job market. This unsupervised technique groups jobs with similar skill profiles together, without using salary information. The goal is to see if jobs naturally segment into distinct categories based on their requirements.

Code

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score, f1_score, confusion_matrix, classification_report

# Prepare features for clustering using skills and other features
skill_feature_cols = [col for col in df_modeling.columns if col.startswith('has_')]
print(f"Available skill features: {len(skill_feature_cols)}")

# Base clustering features
clustering_features = skill_feature_cols + ['experience_years', 'is_remote']

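# Encode ONET and NAICS6 as integer labels. Note: label encoding imposes an
# arbitrary order on these nominal codes, and KMeans treats it as distance.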
le_onet = LabelEncoder()
df_modeling['onet_encoded'] = le_onet.fit_transform(df_final['ONET'].fillna('Unknown'))
clustering_features.append('onet_encoded')

le_naics = LabelEncoder()
df_modeling['naics_encoded'] = le_naics.fit_transform(df_final['NAICS6'].fillna('Unknown'))
clustering_features.append('naics_encoded')

# Prepare clustering data
X_cluster = df_modeling[clustering_features].fillna(0)

# Scale features
scaler_cluster = StandardScaler()
X_cluster_scaled = scaler_cluster.fit_transform(X_cluster)

# KMeans clustering
kmeans = KMeans(n_clusters=6, random_state=42, n_init=10)
clusters = kmeans.fit_predict(X_cluster_scaled)
df_modeling['cluster'] = clusters

# print("Skills based clustering completed")
# print("Cluster centers:")
# for i, center in enumerate(kmeans.cluster_centers_):
#     print(f"Cluster {i}: {center}")

Available skill features: 25

The clustering model groups similar jobs together using skill patterns, experience requirements, and job characteristics. The algorithm assigns each job to one of 6 clusters. Now the characteristics of each cluster can be examined to understand what makes them distinct.
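The report fixes `n_clusters=6` without showing how that value was chosen. One common sanity check, sketched here on synthetic data rather than the job postings, is to compare silhouette scores across several values of k. The blob centers and the range of k are assumptions made for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in for the scaled feature matrix: three well-separated blobs.
centers = np.array([[0, 0], [6, 6], [0, 6]])
X_toy = np.vstack([c + rng.normal(scale=0.5, size=(100, 2)) for c in centers])
X_toy_scaled = StandardScaler().fit_transform(X_toy)

# Fit KMeans for each candidate k and record the mean silhouette score.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_toy_scaled)
    scores[k] = silhouette_score(X_toy_scaled, labels)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("best k:", best_k)
```

On real posting data the curve is usually much flatter than on clean blobs, so the score is better read as one input alongside how interpretable the clusters are.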

Code

# Analyze clustering.
cluster_summary = df_modeling.groupby('cluster').agg({
    'SALARY': ['count', 'mean'],
    'is_ml_role': 'mean',
    'is_ds_role': 'mean',
    'is_ba_role': 'mean',
    'is_remote': 'mean',
    'experience_years': 'mean'
}).round(2)

cluster_summary.columns = ['count', 'avg_salary', 'ml_role_pct', 'ds_role_pct', 'ba_role_pct',
                        'remote_percentage', 'avg_experience']
cluster_summary = cluster_summary.reset_index()

# Compute combined BA/ML/DS percentage on-the-fly
# A job has BA/ML/DS if it has any of the three role types
cluster_summary['ml_ds_ba_combined_pct'] = cluster_summary.apply(
    lambda row: ((df_modeling[df_modeling['cluster'] == row['cluster']][['is_ml_role', 'is_ds_role', 'is_ba_role']].sum(axis=1) > 0).mean()),
    axis=1
).round(2)

print("Skills based Cluster Summary:")
print(cluster_summary)

# Visualize cluster characteristics.
fig = make_subplots(
    rows=2, cols=3,
    subplot_titles=('Cluster Size', 'Average Salary', 'BA/ML/DS Role %',
                'Remote Work %', 'Avg Experience', 'Salary Distribution'),
    specs=[[{"type": "bar"}, {"type": "bar"}, {"type": "bar"}],
        [{"type": "bar"}, {"type": "bar"}, {"type": "scatter"}]]
)

fig.add_trace(go.Bar(x=cluster_summary['cluster'], y=cluster_summary['count'], name="Count"), row=1, col=1)
fig.add_trace(go.Bar(x=cluster_summary['cluster'], y=cluster_summary['avg_salary'], name="Avg Salary"), row=1, col=2)
fig.add_trace(go.Bar(x=cluster_summary['cluster'], y=cluster_summary['ml_role_pct'], name="ML %"), row=1, col=3)
fig.add_trace(go.Bar(x=cluster_summary['cluster'], y=cluster_summary['ds_role_pct'], name="DS %"), row=1, col=3)
fig.add_trace(go.Bar(x=cluster_summary['cluster'], y=cluster_summary['ba_role_pct'], name="BA %"), row=1, col=3)
fig.add_trace(go.Bar(x=cluster_summary['cluster'], y=cluster_summary['remote_percentage'], name="Remote %"), row=2, col=1)
fig.add_trace(go.Bar(x=cluster_summary['cluster'], y=cluster_summary['avg_experience'], name="Experience"), row=2, col=2)

# Salary distribution by cluster.
fig.add_trace(
    go.Scatter(
        x=df_modeling['cluster'],
        y=df_modeling['SALARY'],
        mode='markers',
        opacity=0.6,
        name="Jobs"
    ),
    row=2, col=3
)

fig.update_layout(
    height=650,
    showlegend=False,
    template="plotly_white",
    title={
        'text': "Skills-Based KMeans Clustering Results",
        'y': 0.98,
        'x': 0.5,
        'xanchor': 'center',
        'yanchor': 'top'
    },
    margin=dict(t=80)
)
fig.show()

Skills based Cluster Summary:
   cluster  count  avg_salary  ml_role_pct  ds_role_pct  ba_role_pct  \
0        0    583   139707.42         0.60         0.26         0.70   
1        1  10189   144796.54         0.14         0.00         0.04   
2        2  13313   100969.83         0.01         0.01         0.28   
3        3   6573   108557.35         0.17         0.39         0.95   
4        4     77   140001.35         1.00         0.32         0.31   
5        5     73   117793.86         0.55         0.32         0.93   

   remote_percentage  avg_experience  ml_ds_ba_combined_pct  
0               0.44            4.45                   0.90  
1               0.25            7.80                   0.17  
2               0.39            2.00                   0.29  
3               0.48            3.27                   0.99  
4               0.34            4.23                   1.00  
5               0.56            3.01                   0.96

5.1.1 Insights from KMeans Clustering

The clustering analysis grouped jobs based on their skill requirements and characteristics. With k fixed at 6, the algorithm produced six job clusters that differ in salary levels, remote work availability, and skill profiles.

Key Findings:

Business Analytics dominates: 10,831 BA roles vs. 3,226 ML and 2,877 DS
Cluster 0 (583 jobs, $140K): High-skill hybrid (60% ML, 26% DS, 70% BA)
Cluster 1 (10,189 jobs, $145K): Mostly general tech, only 17% BA/DS/ML, highest pay and highest average experience (7.8 yrs)
Cluster 2 (13,313 jobs, $101K): Entry-level, lowest experience (2 yrs); 28% BA and almost no ML or DS
Cluster 3 (6,573 jobs, $109K): BA-heavy (95%) with DS overlap (39%)
Cluster 4 (77 jobs, $140K): ML specialists (100% ML, with about a third also flagged DS or BA), niche but high-paying
Cluster 5 (73 jobs, $118K): Hybrid roles (96% BA/DS/ML), most remote-friendly (56%)
Remote work: 25%–56% across clusters
Experience: Cluster averages range from 2.0 to 7.8 years, and the highest-paid large cluster is also the most experienced

Career Implications:

Most opportunities: Business Analytics (SQL, Tableau, Power BI, visualization)
Highest pay + volume: Cluster 1 ($145K, 10K+ jobs), general tech roles
Entry path: Cluster 2 ($101K, 13K jobs), lowest experience needed, with BA the most common specialization
BA-focused growth: Cluster 3 ($109K), strong BA demand with DS hybrid edge
Specialist track: Cluster 4 ($140K), ML-centered, fewer jobs but high pay
Hybrid advantage: Cluster 0 ($140K) and Cluster 5 ($118K, 56% remote), multi-skill roles with flexibility

6 Supervised Learning

6.1 Multiple Regression

The second approach uses supervised learning to predict salary based on skills and experience. Two regression models are trained: Multiple Linear Regression and Random Forest. This analysis identifies which skills and factors most strongly influence compensation.

Code

# Regression features.
# Focus on skills (not role labels) to understand how skills directly affect salary
regression_features = skill_feature_cols + ['experience_years', 'is_remote']

# Preparing regression data using salary as the target variable
X_reg = df_modeling[regression_features].fillna(0)
y_reg = df_modeling['SALARY']

X_train, X_test, y_train, y_test = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)

print(f"Training set size: {len(X_train):,}")
print(f"Test set size: {len(X_test):,}")

# Scale features
scaler_reg = StandardScaler()
X_train_scaled = scaler_reg.fit_transform(X_train)
X_test_scaled = scaler_reg.transform(X_test)

# Multiple Linear Regression
lr = LinearRegression()
lr.fit(X_train_scaled, y_train)

# Random Forest Regression
rf_reg = RandomForestRegressor(n_estimators=100, random_state=42)
rf_reg.fit(X_train_scaled, y_train)

print("Skills based regression models training completed")

# Regression statistics for Multiple linear regression
y_train_pred = lr.predict(X_train_scaled)

# Residuals and RSS
residuals = y_train - y_train_pred
rss = np.sum(residuals**2)
n = len(y_train)
k = len(regression_features)

# Model statistics for Multiple linear regression
mse_lr = rss / n
rmse_train_lr = np.sqrt(mse_lr)
r2_train_lr = r2_score(y_train, y_train_pred)
adj_r2_lr = 1 - (1 - r2_train_lr) * (n - 1) / (n - k - 1)

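# AIC and BIC for Multiple linear regression, from the Gaussian likelihood
# with additive constants dropped; k counts the slope coefficients only.
aic_lr = n * np.log(rss / n) + 2 * k
bic_lr = n * np.log(rss / n) + k * np.log(n)
# Log-likelihood on the same scale: equals -(n / 2) * log(rss / n)
log_likelihood_lr = -aic_lr/2 + k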

# Random Forest statistics
y_train_pred_rf = rf_reg.predict(X_train_scaled)
r2_train_rf = r2_score(y_train, y_train_pred_rf)
rmse_train_rf = np.sqrt(mean_squared_error(y_train, y_train_pred_rf))
residuals_rf = y_train - y_train_pred_rf
rss_rf = np.sum(residuals_rf**2)

# Test set performance
y_test_pred_lr = lr.predict(X_test_scaled)
y_test_pred_rf = rf_reg.predict(X_test_scaled)
r2_test_lr = r2_score(y_test, y_test_pred_lr)
r2_test_rf = r2_score(y_test, y_test_pred_rf)
rmse_test_lr = np.sqrt(mean_squared_error(y_test, y_test_pred_lr))
rmse_test_rf = np.sqrt(mean_squared_error(y_test, y_test_pred_rf))

# Regression statistics table
print("\n=== REGRESSION MODEL STATISTICS ===\n")

regression_stats = pd.DataFrame({
    'Statistic': [
        'Intercept',
        'Number of Features',
        'Number of Observations (Train)',
        'R-squared (Training)',
        'R-squared (Test)',
        'Adjusted R-squared',
        'RMSE (Training)',
        'RMSE (Test)',
        'RSS (Residual Sum of Squares)',
        'MSE (Mean Squared Error)',
        'AIC',
        'BIC',
        'Log-Likelihood'
    ],
    'Multiple linear regression': [
        f"{lr.intercept_:.4f}",
        f"{k}",
        f"{n:,}",
        f"{r2_train_lr:.4f}",
        f"{r2_test_lr:.4f}",
        f"{adj_r2_lr:.4f}",
        f"${rmse_train_lr:,.2f}",
        f"${rmse_test_lr:,.2f}",
        f"{rss:,.2f}",
        f"{mse_lr:,.2f}",
        f"{aic_lr:.2f}",
        f"{bic_lr:.2f}",
        f"{log_likelihood_lr:.2f}"
    ],
    'Random Forest': [
        'N/A',
        f"{k}",
        f"{n:,}",
        f"{r2_train_rf:.4f}",
        f"{r2_test_rf:.4f}",
        'N/A',
        f"${rmse_train_rf:,.2f}",
        f"${rmse_test_rf:,.2f}",
        f"{rss_rf:,.2f}",
        f"{rmse_train_rf**2:,.2f}",
        'N/A*',
        'N/A*',
        'N/A*'
    ]
})

print(regression_stats.to_string(index=False))
print("\n* AIC/BIC/Log-Likelihood are only applicable to parametric linear models")
print("\nNote: R-squared (Test) shows model performance on unseen data")

print("\n=== FEATURE COEFFICIENTS / IMPORTANCE COMPARISON ===\n")

#  Sanity Check
assert len(regression_features) == len(lr.coef_) == len(rf_reg.feature_importances_), \
    "Mismatch between features, LR coefficients, and RF importances!"

# Combined DataFrame
coef_comparison = pd.DataFrame({
    'Feature': regression_features,
    'MLR_Coefficient': lr.coef_,
    'RF_Importance': rf_reg.feature_importances_
})

# Remove zero coefficients from MLR
coef_comparison = coef_comparison[coef_comparison['MLR_Coefficient'] != 0.0]

# Impact Type (Positive / Negative only)
coef_comparison['Impact'] = coef_comparison['MLR_Coefficient'].apply(
    lambda x: 'Positive' if x > 0 else 'Negative'
)

coef_comparison['MLR_Coefficient'] = coef_comparison['MLR_Coefficient'].round(4)
coef_comparison['RF_Importance'] = coef_comparison['RF_Importance'].round(4)

# Top 15 Positive Features by MLR Coefficient
top_positive = (
    coef_comparison[coef_comparison['Impact'] == 'Positive']
    .sort_values(by='MLR_Coefficient', ascending=False)
    .head(15)
)

print("Top 15 Features by Multiple Linear Regression Coefficient (Positive Impact):")
print(top_positive[['Feature', 'MLR_Coefficient', 'RF_Importance']].to_string(index=False))

#  Top 15 Negative Features by MLR Coefficient
top_negative = (
    coef_comparison[coef_comparison['Impact'] == 'Negative']
    .sort_values(by='MLR_Coefficient', ascending=True)
    .head(15)
)

print("\nTop 15 Features by Multiple Linear Regression Coefficient (Negative Impact):")
print(top_negative[['Feature', 'MLR_Coefficient', 'RF_Importance']].to_string(index=False))

# Top 15 Features by Random Forest Importance
top_rf = coef_comparison.sort_values(by='RF_Importance', ascending=False).head(15)

print("\nTop 15 Features by Random Forest Importance:")
print(top_rf[['Feature', 'RF_Importance', 'MLR_Coefficient']].to_string(index=False))

print(f"\nMultiple Linear Regression Intercept: {lr.intercept_:.4f}")

# Interpretation ---
print("\nNote:")
print("- MLR Coefficients show the direction and strength of linear relationships with the target.")
print("- Positive coefficients increase predicted salary; negative coefficients decrease it.")
print("- RF Importance reflects how much each feature contributes to model accuracy (non-linear).")
print("- RF does not provide directionality, but captures feature interactions and non-linear effects.")

Training set size: 24,646
Test set size: 6,162
Skills based regression models training completed

=== REGRESSION MODEL STATISTICS ===

                     Statistic Multiple linear regression         Random Forest
                     Intercept                117744.2020                   N/A
            Number of Features                         27                    27
Number of Observations (Train)                     24,646                24,646
          R-squared (Training)                     0.2678                0.5237
              R-squared (Test)                     0.2780                0.4672
            Adjusted R-squared                     0.2670                   N/A
               RMSE (Training)                 $38,730.71            $31,238.34
                   RMSE (Test)                 $37,899.01            $32,558.54
 RSS (Residual Sum of Squares)      36,970,667,997,179.88 24,050,395,305,765.03
      MSE (Mean Squared Error)           1,500,067,678.21        975,833,616.24
                           AIC                  520793.81                  N/A*
                           BIC                  521012.85                  N/A*
                Log-Likelihood                 -260369.91                  N/A*

* AIC/BIC/Log-Likelihood are only applicable to parametric linear models

Note: R-squared (Test) shows model performance on unseen data

=== FEATURE COEFFICIENTS / IMPORTANCE COMPARISON ===

Top 15 Features by Multiple Linear Regression Coefficient (Positive Impact):
                        Feature  MLR_Coefficient  RF_Importance
               experience_years       17850.7023         0.4932
has_python_programming_language        5665.9793         0.0300
        has_amazon_web_services        3354.0141         0.0361
                   has_big_data        3255.5115         0.0257
    has_artificial_intelligence        1905.1339         0.0239
            has_microsoft_azure        1587.5365         0.0192
           has_machine_learning        1501.0951         0.0265
               has_data_science        1297.6624         0.0261
      has_pandas_python_package         934.5885         0.0042
has_scikit_learn_python_package         733.9453         0.0003
              has_deep_learning         404.1220         0.0023
  has_google_cloud_platform_gcp         213.1297         0.0084
                      is_remote         201.2772         0.0728
            has_computer_vision         130.9562         0.0006
              has_apache_hadoop          70.5746         0.0075

Top 15 Features by Multiple Linear Regression Coefficient (Negative Impact):
                                   Feature  MLR_Coefficient  RF_Importance
                         has_data_analysis       -7222.0464         0.0426
                has_r_programming_language       -2811.3382         0.0193
                              has_power_bi       -1737.9884         0.0232
                            has_statistics       -1298.4926         0.0302
              has_sql_programming_language       -1087.0998         0.0350
      has_pytorch_machine_learning_library        -801.3560         0.0005
has_tableau_business_intelligence_software        -789.6728         0.0372
                  has_numpy_python_package        -635.2314         0.0004
                    has_data_visualization        -524.2779         0.0249
       has_natural_language_processing_nlp        -260.2604         0.0019
                            has_tensorflow        -256.5688         0.0005
                          has_apache_spark        -248.2513         0.0073

Top 15 Features by Random Forest Importance:
                                   Feature  RF_Importance  MLR_Coefficient
                          experience_years         0.4932       17850.7023
                                 is_remote         0.0728         201.2772
                         has_data_analysis         0.0426       -7222.0464
has_tableau_business_intelligence_software         0.0372        -789.6728
                   has_amazon_web_services         0.0361        3354.0141
              has_sql_programming_language         0.0350       -1087.0998
                            has_statistics         0.0302       -1298.4926
           has_python_programming_language         0.0300        5665.9793
                      has_machine_learning         0.0265        1501.0951
                          has_data_science         0.0261        1297.6624
                              has_big_data         0.0257        3255.5115
                    has_data_visualization         0.0249        -524.2779
               has_artificial_intelligence         0.0239        1905.1339
                              has_power_bi         0.0232       -1737.9884
                has_r_programming_language         0.0193       -2811.3382

Multiple Linear Regression Intercept: 117744.2020

Note:
- MLR Coefficients show the direction and strength of linear relationships with the target.
- Positive coefficients increase predicted salary; negative coefficients decrease it.
- RF Importance reflects how much each feature contributes to model accuracy (non-linear).
- RF does not provide directionality, but captures feature interactions and non-linear effects.

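Both models are trained on 80% of the data and evaluated on the remaining 20% test set. The Random Forest model can capture non-linear relationships and interactions between skills, while Multiple Linear Regression provides a baseline for comparison. Because the features are standardized before fitting, each linear coefficient is the change in predicted salary for a one-standard-deviation increase in that feature.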
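For reference, the fit statistics in the table above follow the formulas below, written as the code computes them, with n = 24,646 training observations, k = 27 features, and RSS the training residual sum of squares. The last line is the standard full Gaussian log-likelihood at the maximum-likelihood variance estimate. It is not computed in the report and is included here only to show the constant that the reported value omits.

```latex
\begin{align*}
\bar{R}^2 &= 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1} \\
\mathrm{AIC} &= n \ln\!\left(\frac{\mathrm{RSS}}{n}\right) + 2k \\
\mathrm{BIC} &= n \ln\!\left(\frac{\mathrm{RSS}}{n}\right) + k \ln n \\
\ell_{\text{reported}} &= -\tfrac{1}{2}\,\mathrm{AIC} + k
  = -\frac{n}{2} \ln\!\left(\frac{\mathrm{RSS}}{n}\right) \\
\ell_{\text{Gaussian}} &= -\frac{n}{2}\left[\ln(2\pi) + \ln\!\left(\frac{\mathrm{RSS}}{n}\right) + 1\right]
\end{align*}
```

As a check, substituting R² = 0.2678 into the first line gives 1 − 0.7322 × 24,645 / 24,618 ≈ 0.2670, which matches the adjusted R² in the table. Differences in AIC or BIC between models fit to the same data are unaffected by the dropped constant.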

Code

# Test metrics already calculated above
r2_lr = r2_test_lr
r2_rf = r2_test_rf
rmse_lr = rmse_test_lr
rmse_rf = rmse_test_rf
y_pred_rf = y_test_pred_rf

print("Skills-based Regression Model Performance (Test Set):")
print(f"Multiple Linear Regression - RMSE: ${rmse_lr:,.2f}, R²: {r2_lr:.4f}")
print(f"Random Forest - RMSE: ${rmse_rf:,.2f}, R²: {r2_rf:.4f}")

# Feature importance for Random Forest
#Features that actually exist in the model
actual_feature_names = [col for col in regression_features if col in X_train.columns]
importances = rf_reg.feature_importances_

# Visualize feature importance
fig = px.bar(x=actual_feature_names, y=importances,
            title="Skills Impact on Salary (Random Forest Feature Importance)",
            labels={'x': 'Features', 'y': 'Importance'})
fig.update_layout(template="plotly_white", xaxis_tickangle=-45)
fig.show()

# Top skills by salary impact
skill_importance = list(zip(actual_feature_names, importances))
skill_importance.sort(key=lambda x: x[1], reverse=True)
print("\nTop skills by salary impact:")
for skill, importance in skill_importance[:10]:
    print(f"{skill}: {importance:.4f}")

Skills-based Regression Model Performance (Test Set):
Multiple Linear Regression - RMSE: $37,899.01, R²: 0.2780
Random Forest - RMSE: $32,558.54, R²: 0.4672


Top skills by salary impact:
experience_years: 0.4932
is_remote: 0.0728
has_data_analysis: 0.0426
has_tableau_business_intelligence_software: 0.0372
has_amazon_web_services: 0.0361
has_sql_programming_language: 0.0350
has_statistics: 0.0302
has_python_programming_language: 0.0300
has_machine_learning: 0.0265
has_data_science: 0.0261

6.1.1 Regression Analysis: What drives salary?

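Prediction models were built to understand how skills influence salary. The Random Forest model achieved a test R² of 0.47 compared to 0.28 for Multiple Linear Regression, which suggests that the skill-salary relationship involves non-linear effects and interactions that a linear model misses. The Random Forest's training R² (0.52) is somewhat higher than its test R², a modest sign of overfitting.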

Model Performance:

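Random Forest: R² = 0.47 (explains 47% of salary variation on the test set), RMSE = $32,559
Multiple Linear Regression: R² = 0.28, RMSE = $37,899
Insight: Even the stronger model leaves about half of salary variation unexplained, so factors beyond these skills, experience, and remote status also matter.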

Key Salary Drivers (Feature Importance):

Experience (0.49): Largest factor, nearly half of the Random Forest's total feature importance
Remote work (0.07): Second-ranked feature; importance alone does not show whether it raises or lowers pay
Data Analysis (0.04): Core analytical capability
Tableau (0.04): Visualization and BI tool
AWS (0.04): Cloud computing platform
SQL (0.04): Database querying and manipulation
Statistics (0.03): Analytical foundation
Python (0.03): Programming language
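The importances listed above are the Random Forest's built-in impurity-based values, which are computed on the training data and sum to 1. A complementary check, not part of the original analysis, is permutation importance on held-out data. The sketch below runs both on synthetic data. The salary formula and the feature names `has_skill` and `noise_flag` are invented for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 600

# Synthetic data: salary depends strongly on experience, weakly on one
# skill flag, and not at all on the noise flag.
experience = rng.integers(0, 15, size=n)
has_skill = rng.integers(0, 2, size=n)
noise_flag = rng.integers(0, 2, size=n)
salary = 60_000 + 5_000 * experience + 8_000 * has_skill + rng.normal(0, 5_000, size=n)

X = np.column_stack([experience, has_skill, noise_flag])
names = ["experience_years", "has_skill", "noise_flag"]
X_tr, X_te, y_tr, y_te = train_test_split(X, salary, test_size=0.25, random_state=0)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Impurity-based importances: derived from the training data.
impurity = dict(zip(names, model.feature_importances_))

# Permutation importance: mean drop in test R² when one column is shuffled.
perm = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
permuted = dict(zip(names, perm.importances_mean))

for name in names:
    print(f"{name}: impurity={impurity[name]:.3f}, permutation={permuted[name]:.3f}")
```

Neither measure indicates direction. Both only rank how much the model relies on a feature.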

Career Implications:

Experience is critical: it is the strongest driver of salary in both models.
Remote status matters, but its direction is unclear: it ranks second in Random Forest importance, while its linear coefficient is small (about +$200 per standard deviation).
Skill combinations matter: technical, analytical, and cloud skills together shape salary outcomes.

Summary: Salary is not determined by skills alone. Experience is the key factor, remote status plays a secondary role, and technical skills provide additional differentiation.
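The linear model above was fit on standardized features, so each coefficient in its table is a salary change per standard deviation, not per year of experience or per skill. Dividing by the scaler's per-feature standard deviation converts a coefficient back to raw units. The sketch below demonstrates this on synthetic data. The dollar effects used to generate it are invented.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 500

experience = rng.uniform(0, 15, size=n)
has_skill = rng.integers(0, 2, size=n).astype(float)
# Invented ground truth: $4,000 per year of experience, $6,000 for the skill.
salary = 70_000 + 4_000 * experience + 6_000 * has_skill + rng.normal(0, 3_000, size=n)

X = np.column_stack([experience, has_skill])
scaler = StandardScaler().fit(X)
model = LinearRegression().fit(scaler.transform(X), salary)

# Coefficients on standardized inputs: dollars per standard deviation.
per_sd = model.coef_
# Dividing by each feature's standard deviation: dollars per raw unit.
per_unit = per_sd / scaler.scale_

print("per standard deviation:", np.round(per_sd, 1))
print("per raw unit:          ", np.round(per_unit, 1))
```

Applied to the fitted objects in this report, the same conversion would be `lr.coef_ / scaler_reg.scale_`.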

6.2 Classification to Identify BA/ML/DS Roles

Although the project required only one supervised learning model, this analysis also explores classification to distinguish ML/Data Science roles from Business Analytics and other positions. A Random Forest Classifier is trained to predict whether a job is an ML/DS role based on its skill requirements. This analysis reveals which skills are the strongest “signature” indicators that distinguish ML/DS positions from BA roles.

Code

# Prepare features for classification.
classification_features = skill_feature_cols + ['experience_years', 'is_remote']

# Classification data
X_clf = df_modeling[classification_features].fillna(0)
# Target: ML/DS roles (computed from is_ml_role OR is_ds_role)
y_clf = ((df_modeling['is_ml_role'] == 1) | (df_modeling['is_ds_role'] == 1)).astype(int)

# Train/test split for classification
X_train_clf, X_test_clf, y_train_clf, y_test_clf = train_test_split(X_clf, y_clf, test_size=0.2, random_state=42)

# Scale features
scaler_clf = StandardScaler()
X_train_clf_scaled = scaler_clf.fit_transform(X_train_clf)
X_test_clf_scaled = scaler_clf.transform(X_test_clf)

# Random Forest Classification
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train_clf_scaled, y_train_clf)

print("Skills-based classification model trained successfully!")

Skills-based classification model trained successfully!

The classifier learns patterns that distinguish ML/DS roles from BA and other positions based on their skill profiles. The model is now evaluated to see how accurately it can identify these specialized ML/DS roles versus the more common BA positions.
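
The StandardScaler step in the training code is harmless but not required for a Random Forest, because tree splits depend only on the ordering of feature values, which standardization preserves. A minimal check on synthetic data might look like the following; the features and the labeling rule are invented for illustration and are not the project's data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 2000

# Synthetic stand-ins: two binary skill flags plus years of experience
X = np.column_stack([
    rng.integers(0, 2, n),
    rng.integers(0, 2, n),
    rng.integers(0, 16, n),
]).astype(float)

# Assumed labeling rule (illustration only) with 5% label noise
y = ((X[:, 0] == 1) & (X[:, 2] > 3)).astype(int)
flip = rng.random(n) < 0.05
y = np.where(flip, 1 - y, y)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train the same forest on raw and on standardized features
scaler = StandardScaler().fit(X_train)
rf_raw = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
rf_scaled = RandomForestClassifier(n_estimators=100, random_state=42).fit(
    scaler.transform(X_train), y_train
)

pred_raw = rf_raw.predict(X_test)
pred_scaled = rf_scaled.predict(scaler.transform(X_test))
agreement = float(np.mean(pred_raw == pred_scaled))
print(f"Share of identical predictions: {agreement:.3f}")
```

Scaling would matter if the same features were later fed to a distance-based or regularized linear classifier, which is one reason to keep the step in a shared preprocessing pipeline.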

Code

# Random Forest predictions
y_pred_rf_clf = rf_clf.predict(X_test_clf_scaled)
accuracy_rf = accuracy_score(y_test_clf, y_pred_rf_clf)
f1_rf = f1_score(y_test_clf, y_pred_rf_clf)

print("Skills based Classification Model Performance:")
print(f"Random Forest - Accuracy: {accuracy_rf:.4f}, F1 Score: {f1_rf:.4f}")

# Confusion Matrix for Random Forest
cm = confusion_matrix(y_test_clf, y_pred_rf_clf)

# Visualize confusion matrix
fig = px.imshow(cm, text_auto=True, aspect="auto",
                title="Confusion Matrix - ML/DS Role Classification",
                labels=dict(x="Predicted", y="Actual"),
                color_continuous_scale="Blues")

fig.update_layout(template="plotly_white")
fig.update_xaxes(tickvals=[0,1], ticktext=['Not ML/DS', 'ML/DS'])
fig.update_yaxes(tickvals=[0,1], ticktext=['Not ML/DS', 'ML/DS'])
fig.show()

print("Classification Report:")
print(classification_report(y_test_clf, y_pred_rf_clf))

# Features that actually exist in the classification model
clf_actual_feature_names = [col for col in classification_features if col in X_train_clf.columns]
clf_importances = rf_clf.feature_importances_

# Visualize classification feature importance
fig = px.bar(x=clf_actual_feature_names, y=clf_importances,
            title="Skills Impact on ML/Data Science Role Classification",
            labels={'x': 'Features', 'y': 'Importance'})
fig.update_layout(template="plotly_white", xaxis_tickangle=-45)
fig.show()

Skills based Classification Model Performance:
Random Forest - Accuracy: 0.9995, F1 Score: 0.9986

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      5082
           1       1.00      1.00      1.00      1080

    accuracy                           1.00      6162
   macro avg       1.00      1.00      1.00      6162
weighted avg       1.00      1.00      1.00      6162

6.2.1 Classification Results: Identifying ML/Data Science Roles

A Random Forest classifier was used to predict whether a job is an ML/Data Science role based on its skill requirements. The model achieved very strong performance in separating ML/DS roles from Business Analytics and other positions.

Model Performance:

Accuracy: 99.95% of test postings classified correctly, with an F1 score of 0.9986 for the ML/DS class
Insight: ML/DS roles have distinct skill patterns compared to BA and general analyst jobs
Conclusion: Skill-based criteria effectively distinguish ML/DS roles from BA positions
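
Because ML/DS roles make up under a fifth of the test set, accuracy is best read against a majority-class baseline. The arithmetic below uses only the support counts and the accuracy printed in the classification output; the variable names are illustrative.

```python
# Support counts and accuracy taken from the printed classification output
support_not_mlds = 5082
support_mlds = 1080
reported_accuracy = 0.9995

total = support_not_mlds + support_mlds

# Accuracy of a model that always predicts the majority class ("Not ML/DS")
baseline_accuracy = support_not_mlds / total

# Number of misclassified test postings implied by the reported accuracy
implied_errors = round((1 - reported_accuracy) * total)

print(f"Test postings: {total}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
print(f"Implied misclassifications: {implied_errors}")
```

A model that always answers "Not ML/DS" would already score about 82.5%, so the F1 score for the ML/DS class is the more informative figure here.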

Key Predictive Skills (Feature Importance)

Programming: Python, R
ML Frameworks: TensorFlow, PyTorch
Statistical Modeling: Core differentiator for ML/DS
BA-Oriented Skills: SQL, Tableau, Power BI, Data Analysis (more common in BA roles)
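
Random Forest importances show how much a feature contributes to the split decisions, not which class it points toward. One simple way to read direction is to compare the ML/DS share among postings with and without each skill. The sketch below uses a tiny made-up table; the column names and values are hypothetical and do not come from the dataset.

```python
import pandas as pd

# Tiny invented sample; column names are illustrative only
postings = pd.DataFrame({
    "has_tensorflow": [1, 1, 0, 0, 0, 1, 0, 0],
    "has_tableau":    [0, 0, 1, 1, 1, 0, 1, 0],
    "is_mlds":        [1, 1, 0, 0, 0, 1, 0, 1],
})

rows = []
for skill in ["has_tensorflow", "has_tableau"]:
    # Mean of a 0/1 target within each group is the ML/DS share
    share = postings.groupby(skill)["is_mlds"].mean()
    rows.append({
        "skill": skill,
        "mlds_share_without": share.loc[0],
        "mlds_share_with": share.loc[1],
    })

direction = pd.DataFrame(rows).set_index("skill")
print(direction)
```

A skill whose "with" share is well above its "without" share signals ML/DS roles, and the reverse pattern signals BA-oriented roles.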

Career Implications

Distinct skill sets: ML/DS roles require clearly different capabilities than BA roles
ML/DS focus: Programming, modeling, and ML frameworks are the strongest signals
BA focus: SQL, visualization, and reporting tools dominate BA roles
Career development: Building expertise in high-importance ML/DS features directly improves readiness for ML/DS positions

Summary: The Random Forest classifier confirms that ML/DS roles are defined by specialized technical skills, while BA roles emphasize analysis and visualization tools. This distinction provides a clear roadmap for professionals aiming to transition into ML/DS careers.

7 Model Results Visualization

This section provides a consolidated view of the regression modeling approaches. The comparison shows how different models perform on salary prediction and highlights the most impactful skills across different analyses.

Code

# Model performance
model_summary = pd.DataFrame({
    'Model': ['Multiple Linear Regression', 'Random Forest (Regression)'],
    'R² (Test)': [r2_lr, r2_rf],
    'RMSE (Test)': [rmse_lr, rmse_rf]
})
print(model_summary)

# Visualization of model results
fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=('R² Comparison (Test Set)', 'RMSE Comparison (Test Set)',
                    'Skills vs Salary Impact', 'Predicted vs Actual Salary'),
    specs=[[{"type": "bar"}, {"type": "bar"}],
           [{"type": "bar"}, {"type": "scatter"}]]
)

# Row 1, Col 1: R² comparison
models = ['Multiple Linear Regression', 'Random Forest']
r2_values = [r2_lr, r2_rf]
fig.add_trace(go.Bar(x=models, y=r2_values, name="R² Score",
                     marker_color=['steelblue', 'darkgreen']), row=1, col=1)

# Row 1, Col 2: RMSE comparison
rmse_values = [rmse_lr, rmse_rf]
fig.add_trace(go.Bar(x=models, y=rmse_values, name="RMSE",
                     marker_color=['coral', 'orange']), row=1, col=2)

# Row 2, Col 1: Skills vs salary impact (top 10)
top_skills_salary = skill_importance[:10]
fig.add_trace(go.Bar(
    x=[s[1] for s in top_skills_salary],
    y=[s[0] for s in top_skills_salary],
    orientation='h',
    name="Feature Importance",
    marker_color='purple'), row=2, col=1)

# Row 2, Col 2: Predicted vs Actual for Random Forest
# A seeded generator keeps the plotted sample reproducible across runs
sample_size = min(500, len(y_test))
sample_indices = np.random.default_rng(42).choice(len(y_test), sample_size, replace=False)
fig.add_trace(go.Scatter(
    x=y_test.iloc[sample_indices],
    y=y_pred_rf[sample_indices],
    mode='markers',
    name='RF Predictions',
    marker=dict(color='darkgreen', size=5, opacity=0.6)), row=2, col=2)

# Prediction line
min_val = min(y_test.min(), y_pred_rf.min())
max_val = max(y_test.max(), y_pred_rf.max())
fig.add_trace(go.Scatter(
    x=[min_val, max_val],
    y=[min_val, max_val],
    mode='lines',
    name='Perfect Prediction',
    line=dict(color='red', dash='dash')), row=2, col=2)

# Axis labels
fig.update_xaxes(title_text="Model", row=1, col=1)
fig.update_yaxes(title_text="R² Score", row=1, col=1)
fig.update_xaxes(title_text="Model", row=1, col=2)
fig.update_yaxes(title_text="RMSE ($)", row=1, col=2)
fig.update_xaxes(title_text="Importance", row=2, col=1)
fig.update_yaxes(title_text="Feature", row=2, col=1)
fig.update_xaxes(title_text="Actual Salary ($)", row=2, col=2)
fig.update_yaxes(title_text="Predicted Salary ($)", row=2, col=2)

fig.update_layout(
    height=800,
    showlegend=False,
    template="plotly_white",
    title={
        'text': "Regression Model Comparison - BA/ML/DS Salary Prediction",
        'y': 0.98,
        'x': 0.5,
        'xanchor': 'center',
        'yanchor': 'top',
    },
    margin=dict(t=80)
)
fig.show()

                        Model  R² (Test)   RMSE (Test)
0  Multiple Linear Regression   0.278032  37899.005358
1  Random Forest (Regression)   0.467166  32558.537199
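
The two rows of this table can be compared directly with a little arithmetic. The sketch below assumes that R² and RMSE were both computed on the same test set in the usual way (R² = 1 - MSE / Var(y)); under that assumption both models should imply the same standard deviation of test-set salaries, which serves as a consistency check.

```python
import math

# Test-set metrics copied from the model summary table
r2_lr, rmse_lr = 0.278032, 37899.005358
r2_rf, rmse_rf = 0.467166, 32558.537199

# Relative RMSE reduction of Random Forest over linear regression
rmse_reduction = (rmse_lr - rmse_rf) / rmse_lr

# Since R² = 1 - MSE / Var(y), RMSE / sqrt(1 - R²) recovers the std. dev. of y
implied_sd_lr = rmse_lr / math.sqrt(1 - r2_lr)
implied_sd_rf = rmse_rf / math.sqrt(1 - r2_rf)

print(f"RMSE reduction: {rmse_reduction:.1%}")
print(f"Implied salary std. dev.: ${implied_sd_lr:,.0f} (LR), ${implied_sd_rf:,.0f} (RF)")
```

The Random Forest cuts RMSE by roughly 14% relative to linear regression, and both rows imply a test-set salary standard deviation of about $44,600, so the two metrics are mutually consistent.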

8 Key Takeaways and Recommendations

8.1 Summary of Findings

Our analysis of business analytics, data science and machine learning job postings reveals several important patterns:

Role Distribution: Business Analytics dominates (35% of jobs), while ML and DS remain smaller but specialized segments.
Job Segmentation: Six distinct clusters reveal clear differences in pay, experience, and hybrid skill mixes.
Salary Drivers: Experience is the strongest factor (49% of feature importance), with remote work and technical skills adding incremental impact.
Role Differentiation: ML/DS roles are highly distinct, with classification accuracy of 99.95% separating them from BA roles.

8.2 Recommendations for Job Seekers

For Career Advancement:

Gain experience - it’s the single biggest salary driver (49% importance)
Remote work flexibility - BA/ML/DS roles pay well even when remote, showing that onsite presence is not necessary for competitive salaries.
Learn practical tools: Data analysis (4.3%), Tableau (3.7%), AWS (3.6%), SQL (3.5%), Statistics (3.0%), Python (3.0%)
General technical roles (Cluster 1) pay the highest ($145K) and form the second-largest cluster (10,189 jobs)

For Business Analytics Path:

Highest volume opportunity: 10,831 BA roles identified (35% of job market)
Core BA skills: SQL, Tableau/Power BI, data visualization, data analysis
Best BA cluster: Cluster 3 (6,573 jobs at $109K) with 95% BA roles
Hybrid advantage: Many BA roles overlap with DS (39% in Cluster 3), so learning Python/statistics opens DS opportunities

For Transitioning to ML/Data Science:

ML path (3,226 roles): Most specialized and competitive - requires TensorFlow, PyTorch, Deep Learning, NLP
DS path (2,877 roles): Requires R or Python + Statistics + multiple DS tools (Pandas, NumPy, Scikit-learn)
Pure ML roles (Cluster 4): Only 77 jobs at $140K - highly specialized
The 99.95% classification accuracy shows these roles need very specific skill combinations

For Maximizing Opportunities:

Highest pay at scale: Cluster 1 (10,189 jobs at $145K) - general technical roles, only 17% are BA/ML/DS roles
Most jobs and entry-level: Cluster 2 (13,313 jobs at $101K) - 29% BA/ML/DS, lowest experience requirement (2.0 years)
BA opportunities: Cluster 3 (6,573 jobs at $109K) - 99% are BA/ML/DS roles (95% BA, 39% DS overlap)
Remote work: Cluster 5 (73 jobs at $118K, 56% remote) - 96% hybrid BA/ML/DS roles
High-skill hybrid: Cluster 0 (583 jobs at $140K) - 90% BA/ML/DS (60% ML + 70% BA combination)
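
The cluster figures quoted in this section can be tabulated to check how they fit together. The sketch below assumes the six clusters partition all postings used in modeling; the job counts and rounded salaries are copied from the recommendations, and the variable names are illustrative.

```python
# Cluster sizes and average salaries ($K) as quoted in this section
clusters = {
    0: {"jobs": 583, "salary_k": 140},
    1: {"jobs": 10_189, "salary_k": 145},
    2: {"jobs": 13_313, "salary_k": 101},
    3: {"jobs": 6_573, "salary_k": 109},
    4: {"jobs": 77, "salary_k": 140},
    5: {"jobs": 73, "salary_k": 118},
}
ba_roles = 10_831  # BA role count quoted under the Business Analytics path

total_jobs = sum(c["jobs"] for c in clusters.values())
largest = max(clusters, key=lambda k: clusters[k]["jobs"])
highest_paid = max(clusters, key=lambda k: clusters[k]["salary_k"])
ba_share = ba_roles / total_jobs

print(f"Total postings across clusters: {total_jobs:,}")
print(f"Largest cluster: {largest}; highest-paid cluster: {highest_paid}")
print(f"BA share of postings: {ba_share:.1%}")
```

Under that assumption the clusters cover 30,808 postings, the BA share of about 35% matches the headline figure, and the largest cluster by count (Cluster 2) differs from the highest-paid one (Cluster 1).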

8.3 Limitations and Considerations

The analysis is based on job posting data which may not reflect actual hiring outcomes
Skill requirements in job posts may differ from day-to-day job responsibilities
Market conditions and geographic factors also influence salaries beyond just skills