Exploratory Data Analysis (EDA)

Visual and Statistical Insights from Job Market Data

Author

Affiliation

Anu Sharma, Cindy Guzman, Gavin Boss

Boston University

Published

October 9, 2025

1 Overview

This Exploratory Data Analysis (EDA) examines job postings from Lightcast to uncover salary trends, experience requirements, remote work dynamics, and skill demand across Business Analytics (BA), Data Science (DS), and Machine Learning (ML) roles. Lightcast is a leading labor market analytics provider, whose datasets are widely used to study real-time hiring trends and skills demand (Lightcast (2024)). The analysis prepares the dataset for modeling and provides statistical and visual insights that guide the regression and feature engineering stages of the project.

Specifically, this section covers:

Data preparation and cleaning – handling duplicates, renaming key columns, converting data types, and computing an Average_Salary variable
Salary distribution and outliers – visualizing the spread of compensation and identifying high-paying ML and senior roles
Experience and salary relationships – analyzing how required years of experience influence pay across remote and hybrid work types
Role-based salary comparison – comparing median salaries for BA, DS, and ML roles
Remote work trends – exploring how work flexibility impacts compensation
Top skill frequencies – identifying the most in-demand technical and analytical skills in job descriptions
Feature correlations – assessing relationships between numeric variables such as salary and experience to inform model selection

Together, these insights establish a clear understanding of labor market patterns and help define which features are most predictive for the upcoming salary regression models.

2 Load and Prepare Data

Code

import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.figure_factory as ff
import plotly.io as pio
import plotly.colors as pc
import json
import re
from collections import Counter

pio.templates.default = "plotly_white"
pio.renderers.default = "iframe_connected"

# === Load data from CSV ===
df = pd.read_csv("data/lightcast_job_postings.csv", low_memory=False)
# print(f"Dataset loaded: {len(df):,} rows, {len(df.columns)} columns")

# --- Detect & drop duplicate columns ---
# --- Detect & fully clean duplicate-like columns ---
# Normalize column names: strip whitespace and hidden characters
df.columns = df.columns.str.strip().str.replace(r"\s+", " ", regex=True)

# Collapse exact duplicates after cleanup
before_cols = len(df.columns)
df = df.loc[:, ~df.columns.duplicated()]
after_cols = len(df.columns)

# print(f" Cleaned column names: removed {before_cols - after_cols} duplicate(s).")
# print("Unique columns now:", len(df.columns))

# --- Convert numeric columns safely ---
for col in ["SALARY_FROM", "SALARY_TO", "MIN_YEARS_EXPERIENCE", "MAX_YEARS_EXPERIENCE"]:
    if col in df.columns:
        df[col] = pd.to_numeric(df[col], errors="coerce")

# --- Compute average salary (avoid string concat) ---
if {"SALARY_FROM", "SALARY_TO"}.issubset(df.columns):
    df["Average_Salary"] = df[["SALARY_FROM", "SALARY_TO"]].mean(axis=1, skipna=True)

# --- Clean and rename (safely) ---
rename_map = {
    "REMOTE_TYPE_NAME": "REMOTE_GROUP",
    "STATE_NAME": "STATE",
    "LOT_V6_OCCUPATION_GROUP_NAME": "ROLE_GROUP"
}

# Only rename columns that won't create duplicates
for old, new in rename_map.items():
    if old in df.columns and new not in df.columns:
        df.rename(columns={old: new}, inplace=True)
    elif old in df.columns and new in df.columns:
      # print(f" Skipping rename '{old}' → '{new}' to avoid duplicate column name.")
      pass


# --- Drop invalid rows early ---
if "Average_Salary" in df.columns:
    df = df[df["Average_Salary"].notna() & (df["Average_Salary"] > 0)]

# --- Downsample if dataset is large ---
if len(df) > 5000:
    df = df.sample(5000, random_state=42)

# print(f" Loaded {len(df)} rows safely with {len(df.columns)} unique columns.")

3 Data Preparation and Cleaning

The dataset used for this analysis was sourced from Lightcast job postings, containing thousands of listings across Business Analytics (BA), Data Science (DS), and Machine Learning (ML) roles. To ensure data quality and consistency:

Column Normalization: Extra spaces and hidden characters were stripped from column names to avoid mismatches and duplicates.

Duplicate Removal: Identical columns were collapsed to retain only unique fields.

Type Conversion: Numeric columns such as SALARY_FROM, SALARY_TO, MIN_YEARS_EXPERIENCE, and MAX_YEARS_EXPERIENCE were coerced into numeric format, with non-numeric values safely converted to NaN.

Average Salary Calculation: A new feature, Average_Salary, was computed as the mean of the salary range for each posting to simplify analysis.

Column Renaming: Key columns were standardized (REMOTE_TYPE_NAME → REMOTE_GROUP, STATE_NAME → STATE, LOT_V6_OCCUPATION_GROUP_NAME → ROLE_GROUP) for clarity.

Invalid Data Removal: Rows with missing or non-positive salaries were dropped.

Downsampling: For performance, the dataset was randomly reduced to 5,000 representative rows, preserving the statistical diversity of the original population.

This preprocessing established a clean, consistent dataset suitable for visualization and modeling.

Code

# Filter valid salaries
df_salary = df[df["Average_Salary"] > 0]

# --- Base histogram (same as your original) ---
fig = px.histogram(
    df_salary,
    x="Average_Salary",
    nbins=40,
    color_discrete_sequence=["#B03A73"], 
    title="Distribution of Average Salaries"
)

# --- Compute log overlay using numpy ---
log_salaries = np.log(df_salary["Average_Salary"])
counts, bins = np.histogram(log_salaries, bins=40)
bin_centers = (bins[:-1] + bins[1:]) / 2
log_curve_x = np.exp(bin_centers)

# Get reference histogram counts to scale properly
linear_counts, _ = np.histogram(df_salary["Average_Salary"], bins=40)
scale_factor = max(linear_counts) / max(counts)
log_curve_y = counts * scale_factor

# --- Add overlay line ---
fig.add_trace(go.Scatter(
    x=log_curve_x,
    y=log_curve_y,
    mode="lines",
    line=dict(color="#880E4F", width=3),  # deeper magenta for contrast
    name="Log-transformed Overlay"
))

# --- Keep your layout exactly the same ---
fig.update_layout(
    title=dict(
        text="<b>Distribution of Average Salaries (with Log Overlay)</b>",
        x=0.5,
        xanchor="center",
        font=dict(size=20)
    ),
    xaxis_title="<b>Average Salary (USD)</b>",
    yaxis_title="<b>Number of Job Postings</b>",
    template="plotly_white",
    width=900,
    height=550,
    bargap=0.05,
    font=dict(size=14),
    margin=dict(l=60, r=40, t=80, b=60),
    legend=dict(title="Legend", orientation="h", y=-0.2, x=0.3)
)

fig.update_xaxes(showgrid=True, gridcolor="lightgray", zeroline=False)
fig.update_yaxes(showgrid=True, gridcolor="lightgray", zeroline=False)

fig.show()

4 Salary Distribution

The salary distribution is right-skewed, indicating that most job postings fall within the $80K–$150K range. A smaller number of positions extend above $200K, reflecting higher-paying senior and Machine Learning roles. This pattern aligns with national data showing that advanced technical and AI-related positions command higher compensation due to specialized skill requirements (Bureau of Labor Statistics (2024); Bone, Ehlinger, and Stephany (2025)). Such variation underscores the growing salary dispersion across digital and data-driven occupations (Mäkelä and Stephany (2024)).

Code

# Filter out invalid or missing values
df_exp = df[(df["Average_Salary"] > 0) & (df["MIN_YEARS_EXPERIENCE"] >= 0)].copy()

# Combine '[None]' and 'Not Remote' into 'Onsite'
df_exp["REMOTE_GROUP"] = df_exp["REMOTE_GROUP"].replace({
    None: "Onsite",
    "[None]": "Onsite",
    "Not Remote": "Onsite"
})

#  Add gentle jitter to make overlapping points clearer 
jitter_strength = 0.2
df_exp["MIN_YEARS_EXPERIENCE_JITTER"] = df_exp["MIN_YEARS_EXPERIENCE"] + np.random.uniform(
    low=-jitter_strength, high=jitter_strength, size=len(df_exp)
)
df_exp["Average_Salary_JITTER"] = df_exp["Average_Salary"] + np.random.uniform(
    low=-2000, high=2000, size=len(df_exp)
)

# Use a rainbow-like palette manually 
rainbow_colors = pc.sample_colorscale("Rainbow", [0.0, 0.25, 0.5, 0.75, 1.0])

#  Create scatter plot with rainbow gradient colors 
fig = px.scatter(
    df_exp,
    x="MIN_YEARS_EXPERIENCE_JITTER",
    y="Average_Salary_JITTER",
    color="REMOTE_GROUP",
    trendline="ols",
    title="Salary vs. Minimum Experience by Remote Type",
    color_discrete_sequence=rainbow_colors,  
    opacity=0.7,
    height=550
)

fig.update_layout(
    title=dict(
        text="<b>Salary vs. Minimum Experience by Remote Type</b>",
        x=0.5,
        xanchor="center",
        font=dict(size=20)
    ),
    xaxis_title="<b>Minimum Years of Experience</b>",
    yaxis_title="<b>Average Salary (USD)</b>",
    template="plotly_white",
    width=900,
    height=550,
    font=dict(size=14),
    legend_title_text="<b>Remote Work Type</b>",
    margin=dict(l=60, r=40, t=80, b=60)
)

# Final touch: consistent visual polish
fig.update_traces(marker=dict(size=6))
fig.update_xaxes(showgrid=True, gridcolor="lightgray", zeroline=False)
fig.update_yaxes(showgrid=True, gridcolor="lightgray", zeroline=False)

fig.show()

5 Salary vs Experience

The scatterplot shows a clear positive relationship between experience and salary across all remote work types. As minimum years of experience increase, average salary consistently rises, supporting labor economics findings that work experience contributes directly to wage growth (Mincer (1974); Bureau of Labor Statistics (2024)). Among the three groups, Remote and Hybrid Remote roles generally track above Not Remote positions, suggesting that employers may offer higher pay for flexible or remote-friendly jobs (McKinsey & Company (2022)). This mirrors recent research showing that post-pandemic flexibility often correlates with higher total compensation, especially for data professionals (Glassdoor Economic Research (2024)).

Code

#  Filter valid salary data
df_roles = df[df["Average_Salary"] > 0].copy()

#  Use a solid rainbow palette
rainbow_colors = pc.sample_colorscale("Rainbow", [0.0, 0.2, 0.4, 0.6, 0.8, 1.0])

#  Create the boxplot
fig = px.box(
    df_roles,
    x="ROLE_GROUP",
    y="Average_Salary",
    color="ROLE_GROUP",
    color_discrete_sequence=rainbow_colors,  
    title="Salary Comparison Across Role Categories",
    points="outliers",   # show only outliers for clarity
    height=550
)

#  Layout polish
fig.update_layout(
    title=dict(
        text="<b>Salary Comparison Across Role Categories</b>",
        x=0.5,
        xanchor="center",
        font=dict(size=20)
    ),
    xaxis_title="<b>Role Group</b>",
    yaxis_title="<b>Average Salary (USD)</b>",
    template="plotly_white",
    width=900,
    height=550,
    font=dict(size=14),
    legend_title_text="<b>ROLE_GROUP</b>",
    margin=dict(l=60, r=40, t=80, b=120)
)

#  Axis & tick styling
fig.update_xaxes(
    tickangle=30,
    tickfont=dict(size=12),
    showgrid=False
)
fig.update_yaxes(showgrid=True, gridcolor="lightgray", zeroline=False)

fig.show()

6 Role-Based Salary Trends

The boxplot shows clear salary variation across professional role groups. Network and Systems Engineering roles offer the highest median salaries and the widest overall pay range, consistent with labor market reports highlighting elevated compensation for infrastructure and AI-related talent (Lightcast (2024)). In contrast, Business Analysis and Marketing Specialist roles cluster around lower medians, reflecting standardized pay structures common in business-facing positions (Glassdoor Economic Research (2024)). Overall, the trend highlights that technical specialization and data infrastructure expertise command higher pay within analytics fields (Bone, Ehlinger, and Stephany (2025)).

Code

#  Copy and clean up data
df_remote = df[df["Average_Salary"] > 0].copy()

#  Combine '[None]' and 'Not Remote' into 'Onsite'
df_remote["REMOTE_GROUP"] = df_remote["REMOTE_GROUP"].replace({
    None: "Onsite",
    "[None]": "Onsite",
    "Not Remote": "Onsite"
})

#  Define custom order for clarity
remote_order = ["Onsite", "Hybrid Remote", "Remote"]

#  Use a solid rainbow palette
rainbow_colors = pc.sample_colorscale("Rainbow", [0.0, 0.5, 1.0])

#  Build polished boxplot
fig = px.box(
    df_remote,
    x="REMOTE_GROUP",
    y="Average_Salary",
    color="REMOTE_GROUP",
    category_orders={"REMOTE_GROUP": remote_order},
    color_discrete_sequence=rainbow_colors,
    title="Salary Distribution by Remote Work Type",
    points="outliers",  # show only outliers
    height=550
)

#  Aesthetic refinements
fig.update_layout(
    title=dict(
        text="<b>Salary Distribution by Remote Work Type</b>",
        x=0.5,
        xanchor="center",
        font=dict(size=20)
    ),
    xaxis_title="<b>Remote Work Type</b>",
    yaxis_title="<b>Average Salary (USD)</b>",
    template="plotly_white",
    width=900,
    height=550,
    font=dict(size=14),
    legend_title_text="<b>Remote Work Type</b>",
    legend=dict(
        orientation="v",
        yanchor="top",
        y=0.98,
        xanchor="right",
        x=1.12,
        font=dict(size=12)
    ),
    margin=dict(l=60, r=120, t=80, b=100)
)

#  Axis polish
fig.update_xaxes(
    tickangle=0,
    showgrid=False,
    tickfont=dict(size=13)
)
fig.update_yaxes(showgrid=True, gridcolor="lightgray", zeroline=False)

fig.show()

7 Remote Work vs Salary Trends

The salary distribution varies noticeably by remote work type. Remote roles show the highest median salaries and a slightly wider interquartile range, indicating greater earning potential and variability among remote positions. Hybrid Remote jobs follow closely, while Onsite roles exhibit lower medians and tighter spreads. This mirrors workforce evidence that remote-capable roles—particularly in analytics and technology—carry a wage premium and higher demand (McKinsey & Company (2022); Lightcast (2024)). The findings align with broader trends emphasizing flexibility and digital collaboration as valued aspects of compensation (Glassdoor Economic Research (2024)).

Code

#  Use the detected skill column
skills_column = "COMMON_SKILLS_NAME"  # replace if another had more data

#  Flatten and clean skills
skills_flat = [
    s.strip().replace('"', '')  # remove quotation marks
    for sublist in df[skills_column].dropna().astype(str).str.split(',')
    for s in sublist if s.strip()
]

#  Count top 15 skills
skill_counts = pd.DataFrame(
    Counter(skills_flat).most_common(15),
    columns=["Skill", "Count"]
)

#  Generate a rainbow color palette for the bars
rainbow_colors = pc.sample_colorscale("Rainbow", [i / 14 for i in range(15)])

#  Bar chart
fig = px.bar(
    skill_counts,
    x="Skill",
    y="Count",
    title="Top 15 Most Frequent Skills",
    color="Skill",
    color_discrete_sequence=rainbow_colors
)

#  Clean layout
fig.update_layout(
    title=dict(
        text="<b>Top 15 Most Frequent Skills</b>",
        x=0.5,
        xanchor="center",
        font=dict(size=22)
    ),
    xaxis_title="<b>Skill</b>",
    yaxis_title="<b>Number of Job Postings</b>",
    template="plotly_white",
    width=900,
    height=550,
    font=dict(size=14),
    margin=dict(l=60, r=40, t=80, b=120)
)

#  Axis styling
fig.update_xaxes(
    tickangle=35,
    tickfont=dict(size=12),
    showgrid=False,
    tickvals=list(range(len(skill_counts))),
    ticktext=[skill.replace(" (Programming Language)", "") for skill in skill_counts["Skill"]]
)
fig.update_yaxes(showgrid=True, gridcolor="lightgray", zeroline=False)

fig.show()

8 Top Skills Frequency

The top skills highlight a blend of technical, analytical, and interpersonal competencies valued across data-driven roles. Communication and Data Analysis appear most frequently, underscoring the importance of translating insights into business impact. SQL, Python, and Microsoft Excel remain core technical skills, while Project Management and Leadership emphasize strategic coordination (DataCamp (2025); Lightcast (2024)). This distribution reflects the “T-shaped” professional model, where deep technical knowledge is complemented by broad problem-solving and collaboration abilities (Bone, Ehlinger, and Stephany (2025); Mäkelä and Stephany (2024)).

Code

#  Compute correlation matrix
corr_cols = ["Average_Salary", "MIN_YEARS_EXPERIENCE", "MAX_YEARS_EXPERIENCE"]
corr = df[corr_cols].corr().round(2)

#  Create a rainbow colorscale
rainbow_colorscale = pc.make_colorscale(pc.sample_colorscale("Rainbow", [0.0, 0.25, 0.5, 0.75, 1.0]))

#  Create heatmap
fig = ff.create_annotated_heatmap(
    z=corr.values,
    x=corr.columns.tolist(),
    y=corr.columns.tolist(),
    colorscale=rainbow_colorscale,
    showscale=True,
    annotation_text=corr.values,
    font_colors=["black"],
    hoverinfo="none"
)

#  Layout adjustments
fig.update_layout(
    title=dict(
        text="<b>Feature Correlation Matrix</b>",
        x=0.5,
        xanchor="center",
        font=dict(size=20)
    ),
    template="plotly_white",
    width=700,
    height=500,
    margin=dict(l=100, r=100, t=120, b=80), 
    font=dict(size=14)
)

#  Axis polish
fig.update_xaxes(
    side="bottom",
    tickangle=35,
    tickfont=dict(size=12),
    title_standoff=10
)
fig.update_yaxes(
    tickfont=dict(size=12),
    autorange="reversed"
)

fig.show()

9 Feature Correlation Analysis Heatmap

This correlation analysis evaluates the linear relationships between Average Salary, Minimum Years of Experience, and Maximum Years of Experience. The results show that salary is moderately correlated with both experience metrics — 0.51 with minimum and 0.58 with maximum years of experience. This indicates that as experience increases, compensation tends to rise, though not perfectly linearly (Bureau of Labor Statistics (2024); Mincer (1974)). The strong correlation (1.0) between minimum and maximum experience suggests these variables are closely tied — employers typically define experience ranges that scale together (Lightcast (2024)).

Overall, the heatmap confirms that experience is a meaningful predictor of salary, supporting its inclusion as a key numerical feature in subsequent regression modeling (Glassdoor Economic Research (2024); McKinsey & Company (2022)).

References

Bone, M., E. G. Ehlinger, and F. Stephany. (2025): “Skills or degree? The rise of skill-based hiring for AI and green jobs,” Technological Forecasting and Social Change, 214, 124042.

Bureau of Labor Statistics. (2024): “Occupational Employment and Wage Statistics,”https://www.bls.gov/oes/.

DataCamp. (2025): “Machine Learning Courses | DataCamp,”

Glassdoor Economic Research. (2024): “Data Science Salaries and Market Trends,”

Lightcast. (2024): “Global Labor Market Insights Report,”https://lightcast.io/resources.

Mäkelä, E., and F. Stephany. (2024): “Complement or substitute? How AI increases the demand for human skills,” arXiv preprint arXiv:2412.19754,.

McKinsey & Company. (2022): The State of Remote Work: Productivity, Pay, and Preferences,McKinsey Global Institute.

Mincer, J. (1974): Schooling, Experience, and Earnings, National Bureau of Economic Research.