Visual and Statistical Insights from Job Market Data
Author
Affiliation
Anu Sharma, Cindy Guzman, Gavin Boss
Boston University
Published
October 9, 2025
1 Overview
This Exploratory Data Analysis (EDA) examines job postings from Lightcast to uncover salary trends, experience requirements, remote work dynamics, and skill demand across Business Analytics (BA), Data Science (DS), and Machine Learning (ML) roles. Lightcast is a leading labor market analytics provider, whose datasets are widely used to study real-time hiring trends and skills demand (Lightcast (2024)). The analysis prepares the dataset for modeling and provides statistical and visual insights that guide the regression and feature engineering stages of the project.
Specifically, this section covers:
Data preparation and cleaning – handling duplicates, renaming key columns, converting data types, and computing an Average_Salary variable
Salary distribution and outliers – visualizing the spread of compensation and identifying high-paying ML and senior roles
Experience and salary relationships – analyzing how required years of experience influence pay across remote and hybrid work types
Role-based salary comparison – comparing median salaries for BA, DS, and ML roles
Remote work trends – exploring how work flexibility impacts compensation
Top skill frequencies – identifying the most in-demand technical and analytical skills in job descriptions
Feature correlations – assessing relationships between numeric variables such as salary and experience to inform model selection
Together, these insights establish a clear understanding of labor market patterns and help define which features are most predictive for the upcoming salary regression models.
2 Load and Prepare Data
Code
import pandas as pdimport numpy as npimport plotly.express as pximport plotly.graph_objects as gofrom plotly.subplots import make_subplotsimport plotly.figure_factory as ffimport plotly.io as pioimport plotly.colors as pcimport jsonimport refrom collections import Counterpio.templates.default ="plotly_white"pio.renderers.default ="iframe_connected"# === Load data from CSV ===df = pd.read_csv("data/lightcast_job_postings.csv", low_memory=False)# print(f"Dataset loaded: {len(df):,} rows, {len(df.columns)} columns")# --- Detect & drop duplicate columns ---# --- Detect & fully clean duplicate-like columns ---# Normalize column names: strip whitespace and hidden charactersdf.columns = df.columns.str.strip().str.replace(r"\s+", " ", regex=True)# Collapse exact duplicates after cleanupbefore_cols =len(df.columns)df = df.loc[:, ~df.columns.duplicated()]after_cols =len(df.columns)# print(f" Cleaned column names: removed {before_cols - after_cols} duplicate(s).")# print("Unique columns now:", len(df.columns))# --- Convert numeric columns safely ---for col in ["SALARY_FROM", "SALARY_TO", "MIN_YEARS_EXPERIENCE", "MAX_YEARS_EXPERIENCE"]:if col in df.columns: df[col] = pd.to_numeric(df[col], errors="coerce")# --- Compute average salary (avoid string concat) ---if {"SALARY_FROM", "SALARY_TO"}.issubset(df.columns): df["Average_Salary"] = df[["SALARY_FROM", "SALARY_TO"]].mean(axis=1, skipna=True)# --- Clean and rename (safely) ---rename_map = {"REMOTE_TYPE_NAME": "REMOTE_GROUP","STATE_NAME": "STATE","LOT_V6_OCCUPATION_GROUP_NAME": "ROLE_GROUP"}# Only rename columns that won't create duplicatesfor old, new in rename_map.items():if old in df.columns and new notin df.columns: df.rename(columns={old: new}, inplace=True)elif old in df.columns and new in df.columns:# print(f" Skipping rename '{old}' → '{new}' to avoid duplicate column name.")pass# --- Drop invalid rows early ---if"Average_Salary"in df.columns: df = df[df["Average_Salary"].notna() & (df["Average_Salary"] >0)]# --- Downsample if dataset is large ---iflen(df) >5000: df = df.sample(5000, random_state=42)# print(f" Loaded {len(df)} rows safely with {len(df.columns)} unique columns.")
3 Data Preparation and Cleaning
The dataset used for this analysis was sourced from Lightcast job postings, containing thousands of listings across Business Analytics (BA), Data Science (DS), and Machine Learning (ML) roles. To ensure data quality and consistency:
Column Normalization: Extra spaces and hidden characters were stripped from column names to avoid mismatches and duplicates.
Duplicate Removal: Identical columns were collapsed to retain only unique fields.
Type Conversion: Numeric columns such as SALARY_FROM, SALARY_TO, MIN_YEARS_EXPERIENCE, and MAX_YEARS_EXPERIENCE were coerced into numeric format, with non-numeric values safely converted to NaN.
Average Salary Calculation: A new feature, Average_Salary, was computed as the mean of the salary range for each posting to simplify analysis.
Column Renaming: Key columns were standardized (REMOTE_TYPE_NAME → REMOTE_GROUP, STATE_NAME → STATE, LOT_V6_OCCUPATION_GROUP_NAME → ROLE_GROUP) for clarity.
Invalid Data Removal: Rows with missing or non-positive salaries were dropped.
Downsampling: For performance, the dataset was randomly reduced to 5,000 representative rows, preserving the statistical diversity of the original population.
This preprocessing established a clean, consistent dataset suitable for visualization and modeling.
Code
# Filter valid salariesdf_salary = df[df["Average_Salary"] >0]# --- Base histogram (same as your original) ---fig = px.histogram( df_salary, x="Average_Salary", nbins=40, color_discrete_sequence=["#B03A73"], title="Distribution of Average Salaries")# --- Compute log overlay using numpy ---log_salaries = np.log(df_salary["Average_Salary"])counts, bins = np.histogram(log_salaries, bins=40)bin_centers = (bins[:-1] + bins[1:]) /2log_curve_x = np.exp(bin_centers)# Get reference histogram counts to scale properlylinear_counts, _ = np.histogram(df_salary["Average_Salary"], bins=40)scale_factor =max(linear_counts) /max(counts)log_curve_y = counts * scale_factor# --- Add overlay line ---fig.add_trace(go.Scatter( x=log_curve_x, y=log_curve_y, mode="lines", line=dict(color="#880E4F", width=3), # deeper magenta for contrast name="Log-transformed Overlay"))# --- Keep your layout exactly the same ---fig.update_layout( title=dict( text="<b>Distribution of Average Salaries (with Log Overlay)</b>", x=0.5, xanchor="center", font=dict(size=20) ), xaxis_title="<b>Average Salary (USD)</b>", yaxis_title="<b>Number of Job Postings</b>", template="plotly_white", width=900, height=550, bargap=0.05, font=dict(size=14), margin=dict(l=60, r=40, t=80, b=60), legend=dict(title="Legend", orientation="h", y=-0.2, x=0.3))fig.update_xaxes(showgrid=True, gridcolor="lightgray", zeroline=False)fig.update_yaxes(showgrid=True, gridcolor="lightgray", zeroline=False)fig.show()
4 Salary Distribution
The salary distribution is right-skewed, indicating that most job postings fall within the $80K–$150K range. A smaller number of positions extend above $200K, reflecting higher-paying senior and Machine Learning roles. This pattern aligns with national data showing that advanced technical and AI-related positions command higher compensation due to specialized skill requirements (Bureau of Labor Statistics (2024); Bone, Ehlinger, and Stephany (2025)). Such variation underscores the growing salary dispersion across digital and data-driven occupations (Mäkelä and Stephany (2024)).
Code
# Filter out invalid or missing valuesdf_exp = df[(df["Average_Salary"] >0) & (df["MIN_YEARS_EXPERIENCE"] >=0)].copy()# Combine '[None]' and 'Not Remote' into 'Onsite'df_exp["REMOTE_GROUP"] = df_exp["REMOTE_GROUP"].replace({None: "Onsite","[None]": "Onsite","Not Remote": "Onsite"})# Add gentle jitter to make overlapping points clearer jitter_strength =0.2df_exp["MIN_YEARS_EXPERIENCE_JITTER"] = df_exp["MIN_YEARS_EXPERIENCE"] + np.random.uniform( low=-jitter_strength, high=jitter_strength, size=len(df_exp))df_exp["Average_Salary_JITTER"] = df_exp["Average_Salary"] + np.random.uniform( low=-2000, high=2000, size=len(df_exp))# Use a rainbow-like palette manually rainbow_colors = pc.sample_colorscale("Rainbow", [0.0, 0.25, 0.5, 0.75, 1.0])# Create scatter plot with rainbow gradient colors fig = px.scatter( df_exp, x="MIN_YEARS_EXPERIENCE_JITTER", y="Average_Salary_JITTER", color="REMOTE_GROUP", trendline="ols", title="Salary vs. Minimum Experience by Remote Type", color_discrete_sequence=rainbow_colors, opacity=0.7, height=550)fig.update_layout( title=dict( text="<b>Salary vs. Minimum Experience by Remote Type</b>", x=0.5, xanchor="center", font=dict(size=20) ), xaxis_title="<b>Minimum Years of Experience</b>", yaxis_title="<b>Average Salary (USD)</b>", template="plotly_white", width=900, height=550, font=dict(size=14), legend_title_text="<b>Remote Work Type</b>", margin=dict(l=60, r=40, t=80, b=60))# Final touch: consistent visual polishfig.update_traces(marker=dict(size=6))fig.update_xaxes(showgrid=True, gridcolor="lightgray", zeroline=False)fig.update_yaxes(showgrid=True, gridcolor="lightgray", zeroline=False)fig.show()
5 Salary vs Experience
The scatterplot shows a clear positive relationship between experience and salary across all remote work types. As minimum years of experience increase, average salary consistently rises, supporting labor economics findings that work experience contributes directly to wage growth (Mincer (1974); Bureau of Labor Statistics (2024)). Among the three groups, Remote and Hybrid Remote roles generally track above Not Remote positions, suggesting that employers may offer higher pay for flexible or remote-friendly jobs (McKinsey & Company (2022)). This mirrors recent research showing that post-pandemic flexibility often correlates with higher total compensation, especially for data professionals (Glassdoor Economic Research (2024)).
Code
# Filter valid salary datadf_roles = df[df["Average_Salary"] >0].copy()# Use a solid rainbow paletterainbow_colors = pc.sample_colorscale("Rainbow", [0.0, 0.2, 0.4, 0.6, 0.8, 1.0])# Create the boxplotfig = px.box( df_roles, x="ROLE_GROUP", y="Average_Salary", color="ROLE_GROUP", color_discrete_sequence=rainbow_colors, title="Salary Comparison Across Role Categories", points="outliers", # show only outliers for clarity height=550)# Layout polishfig.update_layout( title=dict( text="<b>Salary Comparison Across Role Categories</b>", x=0.5, xanchor="center", font=dict(size=20) ), xaxis_title="<b>Role Group</b>", yaxis_title="<b>Average Salary (USD)</b>", template="plotly_white", width=900, height=550, font=dict(size=14), legend_title_text="<b>ROLE_GROUP</b>", margin=dict(l=60, r=40, t=80, b=120))# Axis & tick stylingfig.update_xaxes( tickangle=30, tickfont=dict(size=12), showgrid=False)fig.update_yaxes(showgrid=True, gridcolor="lightgray", zeroline=False)fig.show()
6 Role-Based Salary Trends
The boxplot shows clear salary variation across professional role groups. Network and Systems Engineering roles offer the highest median salaries and the widest overall pay range, consistent with labor market reports highlighting elevated compensation for infrastructure and AI-related talent (Lightcast (2024)). In contrast, Business Analysis and Marketing Specialist roles cluster around lower medians, reflecting standardized pay structures common in business-facing positions (Glassdoor Economic Research (2024)). Overall, the trend highlights that technical specialization and data infrastructure expertise command higher pay within analytics fields (Bone, Ehlinger, and Stephany (2025)).
Code
# Copy and clean up datadf_remote = df[df["Average_Salary"] >0].copy()# Combine '[None]' and 'Not Remote' into 'Onsite'df_remote["REMOTE_GROUP"] = df_remote["REMOTE_GROUP"].replace({None: "Onsite","[None]": "Onsite","Not Remote": "Onsite"})# Define custom order for clarityremote_order = ["Onsite", "Hybrid Remote", "Remote"]# Use a solid rainbow paletterainbow_colors = pc.sample_colorscale("Rainbow", [0.0, 0.5, 1.0])# Build polished boxplotfig = px.box( df_remote, x="REMOTE_GROUP", y="Average_Salary", color="REMOTE_GROUP", category_orders={"REMOTE_GROUP": remote_order}, color_discrete_sequence=rainbow_colors, title="Salary Distribution by Remote Work Type", points="outliers", # show only outliers height=550)# Aesthetic refinementsfig.update_layout( title=dict( text="<b>Salary Distribution by Remote Work Type</b>", x=0.5, xanchor="center", font=dict(size=20) ), xaxis_title="<b>Remote Work Type</b>", yaxis_title="<b>Average Salary (USD)</b>", template="plotly_white", width=900, height=550, font=dict(size=14), legend_title_text="<b>Remote Work Type</b>", legend=dict( orientation="v", yanchor="top", y=0.98, xanchor="right", x=1.12, font=dict(size=12) ), margin=dict(l=60, r=120, t=80, b=100))# Axis polishfig.update_xaxes( tickangle=0, showgrid=False, tickfont=dict(size=13))fig.update_yaxes(showgrid=True, gridcolor="lightgray", zeroline=False)fig.show()
7 Remote Work vs Salary Trends
The salary distribution varies noticeably by remote work type. Remote roles show the highest median salaries and a slightly wider interquartile range, indicating greater earning potential and variability among remote positions. Hybrid Remote jobs follow closely, while Onsite roles exhibit lower medians and tighter spreads. This mirrors workforce evidence that remote-capable roles—particularly in analytics and technology—carry a wage premium and higher demand (McKinsey & Company (2022); Lightcast (2024)). The findings align with broader trends emphasizing flexibility and digital collaboration as valued aspects of compensation (Glassdoor Economic Research (2024)).
Code
# Use the detected skill columnskills_column ="COMMON_SKILLS_NAME"# replace if another had more data# Flatten and clean skillsskills_flat = [ s.strip().replace('"', '') # remove quotation marksfor sublist in df[skills_column].dropna().astype(str).str.split(',')for s in sublist if s.strip()]# Count top 15 skillsskill_counts = pd.DataFrame( Counter(skills_flat).most_common(15), columns=["Skill", "Count"])# Generate a rainbow color palette for the barsrainbow_colors = pc.sample_colorscale("Rainbow", [i /14for i inrange(15)])# Bar chartfig = px.bar( skill_counts, x="Skill", y="Count", title="Top 15 Most Frequent Skills", color="Skill", color_discrete_sequence=rainbow_colors)# Clean layoutfig.update_layout( title=dict( text="<b>Top 15 Most Frequent Skills</b>", x=0.5, xanchor="center", font=dict(size=22) ), xaxis_title="<b>Skill</b>", yaxis_title="<b>Number of Job Postings</b>", template="plotly_white", width=900, height=550, font=dict(size=14), margin=dict(l=60, r=40, t=80, b=120))# Axis stylingfig.update_xaxes( tickangle=35, tickfont=dict(size=12), showgrid=False, tickvals=list(range(len(skill_counts))), ticktext=[skill.replace(" (Programming Language)", "") for skill in skill_counts["Skill"]])fig.update_yaxes(showgrid=True, gridcolor="lightgray", zeroline=False)fig.show()
8 Top Skills Frequency
The top skills highlight a blend of technical, analytical, and interpersonal competencies valued across data-driven roles. Communication and Data Analysis appear most frequently, underscoring the importance of translating insights into business impact. SQL, Python, and Microsoft Excel remain core technical skills, while Project Management and Leadership emphasize strategic coordination (DataCamp (2025); Lightcast (2024)). This distribution reflects the “T-shaped” professional model, where deep technical knowledge is complemented by broad problem-solving and collaboration abilities (Bone, Ehlinger, and Stephany (2025); Mäkelä and Stephany (2024)).
This correlation analysis evaluates the linear relationships between Average Salary, Minimum Years of Experience, and Maximum Years of Experience. The results show that salary is moderately correlated with both experience metrics — 0.51 with minimum and 0.58 with maximum years of experience. This indicates that as experience increases, compensation tends to rise, though not perfectly linearly (Bureau of Labor Statistics (2024); Mincer (1974)). The strong correlation (1.0) between minimum and maximum experience suggests these variables are closely tied — employers typically define experience ranges that scale together (Lightcast (2024)).
Bone, M., E. G. Ehlinger, and F. Stephany. (2025): “Skills or degree? The rise of skill-based hiring for AI and green jobs,”Technological Forecasting and Social Change, 214, 124042.
Bureau of Labor Statistics. (2024): “Occupational Employment and Wage Statistics,”https://www.bls.gov/oes/.