Data Cleaning

Comprehensive Data Cleaning & Exploratory Analysis of Job Market Trends

Authors: Anu Sharma, Cindy Guzman, Gavin Boss

Affiliation: Boston University

1 Compute the Duration of Job Postings

This step ensures that the date columns are pulled in correctly. Dates are parsed in the M/d/yyyy format, the POSTED and EXPIRED columns are kept distinct, and any row missing a DURATION value is filled with the number of days between POSTED and EXPIRED (MODELED_DURATION is filled the same way from MODELED_EXPIRED).

Code
#| label: duration
#| include: true

from pyspark.sql.functions import col, to_date, datediff, when

# --- Step 1: Parse the date columns using the M/d/yyyy format ---
date_cols = ["POSTED", "EXPIRED", "MODELED_EXPIRED"]
for colname in date_cols:
    df = df.withColumn(colname, col(colname).cast("string"))
    df = df.withColumn(colname, to_date(col(colname), "M/d/yyyy"))

# --- Step 2: Compute durations safely ---
df = df.withColumn(
    "DURATION",
    when(col("DURATION").isNull(), datediff("EXPIRED", "POSTED")).otherwise(col("DURATION"))
).withColumn(
    "MODELED_DURATION",
    when(col("MODELED_DURATION").isNull(), datediff("MODELED_EXPIRED", "POSTED")).otherwise(col("MODELED_DURATION"))
)
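
A quick sanity check can confirm that the dates parsed and the fallback durations were filled. This check is a suggestion for testing, not part of the original pipeline.

Code
# Suggested sanity check (not part of the original pipeline): rows where
# a date failed to parse or DURATION is still null after the fallback.
df.filter(col("POSTED").isNull() | col("DURATION").isNull()) \
  .select("POSTED", "EXPIRED", "DURATION") \
  .show(5, truncate=False)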

2 Column Cleaning

This step keeps only the columns that could potentially be of use. It is not every column in the raw data; we cut out the columns we knew would not be relevant to the datasets we were going to incorporate (a sketch of that pruning follows the cleaning code below). The list could have been refined even further to shrink the dataset, but because we were able to reduce how heavily the EC2 instance used the data frames when loading and reading the data, we kept these columns cleaned in case we needed them.

Code
# Imports
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, regexp_replace

# All columns that are being cleaned
text_columns = [
    "TITLE_RAW", "BODY", "SKILLS_NAME",
    "SPECIALIZED_SKILLS_NAME",
    "CERTIFICATIONS_NAME",
    "COMMON_SKILLS_NAME",
    "SOFTWARE_SKILLS_NAME", "URL",
    "EDUCATION_LEVELS_NAME", "LIGHTCAST_SECTORS_NAME",
    "CIP6_NAME", "CIP4_NAME", "CIP2_NAME",
    "TITLE_NAME", "TITLE_CLEAN",
    "COMPANY_NAME", "COMPANY_RAW",
    "ONET", "ONET_NAME", "ONET_2019", "ONET_2019_NAME",
    "SOC_2021_2_NAME", "SOC_2021_3_NAME", "SOC_2021_4_NAME", "SOC_2021_5_NAME",
    "LOT_CAREER_AREA_NAME", "LOT_OCCUPATION_NAME", "LOT_SPECIALIZED_OCCUPATION_NAME",
    "LOT_OCCUPATION_GROUP_NAME", "LOT_V6_SPECIALIZED_OCCUPATION_NAME",
    "LOT_V6_OCCUPATION_NAME", "LOT_V6_OCCUPATION_GROUP_NAME", "LOT_V6_CAREER_AREA_NAME",
    "NAICS_2022_6_NAME"
]

# clean text columns
def clean_text_columns(df, columns):
    """
    Cleans text-heavy columns by:
    - Removing brackets, quotes, and newlines
    - Replacing multiple spaces with single space
    - Standardizing commas with proper spacing
    """
    pattern_cleanup = r'[\[\]\n{}"]'  # remove brackets, newlines, braces, quotes
    for c in columns:
        df = df.withColumn(c, regexp_replace(col(c), pattern_cleanup, ""))
        df = df.withColumn(c, regexp_replace(col(c), r'\s{2,}', ' '))  # multiple spaces → 1 space
        df = df.withColumn(c, regexp_replace(col(c), r'\s*,\s*', ', '))  # standardize commas
    return df

# Apply cleaning
df = clean_text_columns(df, text_columns)

# Preview cleaned text (testing only; not rendered for the project)
# df.select(text_columns).show(5, truncate=100)
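
Below is a minimal sketch of the column pruning mentioned above. The names in irrelevant_cols are hypothetical placeholders; the exact columns we removed depended on the raw Lightcast export.

Code
# Hypothetical sketch of the column pruning described above. The names in
# irrelevant_cols are placeholders, not the exact columns we removed.
irrelevant_cols = ["SOURCE_TYPES", "LAST_UPDATED_DATE"]  # hypothetical names
df = df.drop(*[c for c in irrelevant_cols if c in df.columns])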

3 Targeted Column-Specific Cleaning

This code targets the EDUCATION_LEVELS, LOCATION, and REMOTE_TYPE_NAME columns, cleaning each one so the data frames follow the naming conventions needed for the datasets.

Code
#| label: targeted_cleaning
#| include: true

from pyspark.sql.functions import col, regexp_extract, when, regexp_replace

# Education levels → keep digits
df = df.withColumn("EDUCATION_LEVELS", regexp_extract("EDUCATION_LEVELS", r'(\d+)', 1))

# Location cleanup
df = df.withColumn("LOCATION", regexp_replace(col("LOCATION"), r"\s*\n\s*", " "))
df = df.withColumn("LOCATION", regexp_replace(col("LOCATION"), r"[{}]", ""))

# Standardize remote work labels
df = df.withColumn(
    "REMOTE_TYPE_NAME",
    when(col("REMOTE_TYPE_NAME").isin("[None]", "Not Remote") | col("REMOTE_TYPE_NAME").isNull(), "On-Site")
    .when(col("REMOTE_TYPE_NAME") == "Hybrid Remote", "Hybrid")
    .when(col("REMOTE_TYPE_NAME") == "Remote", "Remote")
    .otherwise(col("REMOTE_TYPE_NAME"))
)
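
To verify the standardization, the label counts can be inspected. This check is a suggestion for testing rather than code from the original run.

Code
# Suggested check (not from the original run): confirm REMOTE_TYPE_NAME
# collapsed to the expected labels (On-Site, Hybrid, Remote).
df.groupBy("REMOTE_TYPE_NAME").count().orderBy("count", ascending=False).show()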

4 Save the Cleaned Dataset

This saves the cleaned dataset under our /data folder, the same place we keep lightcast.csv. The cleaned data frame is stored there so it can be pulled back later when it is time to use the data for building models.
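
Below is a minimal sketch of the save step, assuming the cleaned frame is written as a single CSV under data/. The output folder name lightcast_cleaned and the coalesce-to-one-file choice are assumptions, not confirmed details of the pipeline.

Code
# Sketch of the save step. The output path is an assumption; adjust it to
# match the project layout.
(
    df.coalesce(1)                       # collapse to a single CSV part file
      .write.mode("overwrite")
      .option("header", True)
      .csv("data/lightcast_cleaned")     # hypothetical output folder under /data
)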