Overview
Before building automated pipelines, it helps to understand SageMaker’s individual building blocks. This project exercises all four core SageMaker job types using a Walmart retail sales dataset and SageMaker’s built-in XGBoost algorithm. Each job type covers a distinct phase of the ML workflow and runs on managed, ephemeral compute, so there are no servers to provision or maintain.
| Job Type | Purpose |
|---|---|
| Processing Job | Data prep, feature engineering, evaluation |
| Training Job | Model fitting |
| Batch Transform Job | Offline inference on large datasets |
| Hyperparameter Tuning Job | Automated hyperparameter search |
The Dataset
The dataset consists of three CSV tables from Walmart historical sales data (features, weekly sales, and store metadata), which are merged and engineered into a regression dataset for predicting weekly store sales.
Processing Job
A ScriptProcessor runs a custom preprocessing script in a managed container. The script merges the three tables, creates time features, one-hot encodes store type, splits into train/validation/test, and writes each split back to S3.
from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput

# `role` is the IAM execution role ARN, assumed to be defined earlier.
processor = ScriptProcessor(
    image_uri="246618743249.dkr.ecr.us-west-2.amazonaws.com/sagemaker-scikit-learn:1.2-1-cpu-py3",
    command=["python3"],
    role=role,
    instance_count=1,
    instance_type="ml.t3.medium",
)

processor.run(
    code="processing_script.py",
    # Each input is downloaded from S3 into its own directory in the container.
    inputs=[
        ProcessingInput(source="s3://bucket/raw/features.csv",
                        destination="/opt/ml/processing/input/features"),
        ProcessingInput(source="s3://bucket/raw/sales.csv",
                        destination="/opt/ml/processing/input/sales"),
        ProcessingInput(source="s3://bucket/raw/stores.csv",
                        destination="/opt/ml/processing/input/stores"),
    ],
    # Whatever the script writes to each source directory is uploaded to S3.
    outputs=[
        ProcessingOutput(output_name="train",
                         source="/opt/ml/processing/output/train",
                         destination="s3://bucket/data/train"),
        ProcessingOutput(output_name="validation",
                         source="/opt/ml/processing/output/validation",
                         destination="s3://bucket/data/validation"),
        ProcessingOutput(output_name="test",
                         source="/opt/ml/processing/output/test",
                         destination="s3://bucket/data/test"),
    ],
)
Key constraint: SageMaker’s built-in XGBoost expects CSV with no header row and the target variable in the first column.
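To make the constraint concrete, here is a minimal standard-library sketch of the last steps such a preprocessing script might perform: deriving time features, one-hot encoding store type, and writing rows with the target first and no header. The column names (Date, Type, Size, Weekly_Sales), the store types A/B/C, and the helper names are assumptions for illustration, not taken from the project's actual script, which would more likely use pandas.

```python
import csv
import io
from datetime import date

# Hypothetical column names and categories; adjust to the real merged table.
TARGET = "Weekly_Sales"
STORE_TYPES = ("A", "B", "C")


def engineer_row(row):
    """Turn one merged record (a dict of strings) into numeric features."""
    d = date.fromisoformat(row["Date"])
    features = {
        "year": d.year,
        "month": d.month,
        "week": d.isocalendar()[1],  # ISO week number
        "size": int(row["Size"]),
    }
    # One-hot encode the store type: exactly one of these columns is 1.
    for t in STORE_TYPES:
        features[f"type_{t}"] = 1 if row["Type"] == t else 0
    return features


def to_xgboost_csv(rows):
    """Serialise rows as header-less CSV with the target in column 0."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row in rows:
        features = engineer_row(row)
        writer.writerow([float(row[TARGET]), *features.values()])
    return buf.getvalue()


# Made-up example records, not real Walmart data.
sample = [
    {"Date": "2012-02-10", "Type": "A", "Size": "151315", "Weekly_Sales": "24924.5"},
    {"Date": "2012-02-17", "Type": "B", "Size": "202307", "Weekly_Sales": "46039.49"},
]
csv_text = to_xgboost_csv(sample)
print(csv_text)
```

Because the feature dict is built in a fixed order, every row gets the same column layout, which matters since a header-less file carries no column names to fall back on.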
Training Job
Estimator spins up a managed training cluster, pulls the XGBoost container, trains on the S3 data, and writes the model artefact (.tar.gz) back to S3.
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

estimator = Estimator(
    image_uri="246618743249.dkr.ecr.us-west-2.amazonaws.com/sagemaker-xgboost:1.5-1",
    role=role,
    instance_count=1,
    instance_type="ml.m5.large",
    use_spot_instances=True,
    max_run=300,   # seconds of training time allowed
    max_wait=600,  # seconds including the wait for spot capacity; must be >= max_run
    hyperparameters={
        "objective": "reg:squarederror",
        "max_depth": 10,
        "eta": 0.1,
        "num_round": 200,
    },
    enable_sagemaker_metrics=True,
)

# Channel names "train" and "validation" are what the XGBoost container expects.
estimator.fit({
    "train": TrainingInput("s3://bucket/data/train/", content_type="text/csv"),
    "validation": TrainingInput("s3://bucket/data/validation/", content_type="text/csv"),
})
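enable_sagemaker_metrics=True turns on SageMaker’s metrics time series for the job. The built-in XGBoost container reports train:rmse and validation:rmse during training, and these are also published to CloudWatch, so a widening gap between the two curves makes overfitting easy to spot.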
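As a rough illustration of what to look for in those curves, the sketch below takes per-round training and validation RMSE values and reports the round with the lowest validation error, plus whether validation error has risen since then while training error kept falling. The function name, the tolerance parameter, and the numbers are hypothetical; in practice the series would come from the job's logs or metrics.

```python
def overfitting_report(train_rmse, val_rmse, tolerance=0.0):
    """Summarise two equal-length RMSE series, one value per boosting round."""
    if len(train_rmse) != len(val_rmse) or not train_rmse:
        raise ValueError("series must be non-empty and of equal length")

    # Round (0-based) at which validation error was lowest.
    best_round = min(range(len(val_rmse)), key=val_rmse.__getitem__)

    # Overfitting signal: after the best round, validation error ends higher
    # (beyond the tolerance) while training error ends lower.
    val_rose = val_rmse[-1] > val_rmse[best_round] + tolerance
    train_fell = train_rmse[-1] < train_rmse[best_round]

    return {
        "best_round": best_round,
        "best_val_rmse": val_rmse[best_round],
        "final_gap": val_rmse[-1] - train_rmse[-1],
        "overfitting": val_rose and train_fell,
    }


# Made-up curves: training error keeps dropping, validation turns upward.
train = [900.0, 700.0, 550.0, 450.0, 380.0]
val = [950.0, 780.0, 700.0, 720.0, 760.0]
report = overfitting_report(train, val)
print(report)
```

XGBoost's early_stopping_rounds hyperparameter automates a similar check during training, so a manual inspection like this is mainly useful for understanding a finished run.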
Batch Transform Job
Transformer runs inference over the full test set without a persistent endpoint — ideal for scheduled batch scoring.
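# estimator.transformer() builds a Transformer from the trained model,
# so sagemaker.transformer.Transformer does not need to be imported here.
transformer = estimator.transformer(
    instance_count=1,
    instance_type="ml.m5.large",
    output_path="s3://bucket/predictions/",
    strategy="SingleRecord",  # send one record per request
    assemble_with="Line",     # write one prediction per output line
)

transformer.transform(
    data="s3://bucket/data/test/test_no_label.csv",
    content_type="text/csv",
    split_type="Line",        # split the input file on newlines
)
transformer.wait()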
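The test file must have no header and no target column, only the feature columns in the same order used for training. At inference time the XGBoost container treats every column as a feature, so a leftover label column would shift the inputs and corrupt the predictions.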
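One way to prepare the label-free file and later score the results is sketched below with the standard library only. It assumes the test split follows the same layout as training (target in the first column, no header) and that the transform output contains one prediction per line in input order. The function names and sample values are made up for illustration.

```python
import csv
import io
import math


def split_labels(csv_text):
    """Separate column 0 (the target) from the feature columns."""
    labels, feature_rows = [], []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip blank lines
        labels.append(float(row[0]))
        feature_rows.append(row[1:])
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(feature_rows)
    return labels, buf.getvalue()


def parse_predictions(out_text):
    """Read one float per non-empty line of the transform output."""
    return [float(line) for line in out_text.splitlines() if line.strip()]


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("label and prediction counts differ")
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Made-up test split (target first) and made-up model output.
test_csv = "100.0,1,0,5\n200.0,0,1,7\n"
labels, features_only = split_labels(test_csv)
predictions = parse_predictions("103.0\n196.0\n")
score = rmse(labels, predictions)
print(features_only, score)
```

Batch Transform names each output object after its input with a .out suffix, so for the call above the predictions would be expected at s3://bucket/predictions/test_no_label.csv.out.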
Hyperparameter Tuning Job
HyperparameterTuner runs training jobs in parallel across a search space, optimising a defined objective metric. Bayesian optimisation is used by default: results from completed trials inform the configurations chosen for later ones.
from sagemaker.inputs import TrainingInput
from sagemaker.tuner import HyperparameterTuner, IntegerParameter, ContinuousParameter

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation:rmse",
    objective_type="Minimize",
    # Ranges listed here override the fixed values set on the estimator.
    hyperparameter_ranges={
        "max_depth": IntegerParameter(3, 12),
        "eta": ContinuousParameter(0.01, 0.3),
        "alpha": ContinuousParameter(0, 20),
        "colsample_bytree": ContinuousParameter(0.3, 1.0),
    },
    max_jobs=20,           # total trials
    max_parallel_jobs=4,   # trials running at the same time
    strategy="Bayesian",
)

tuner.fit({
    "train": TrainingInput("s3://bucket/data/train/", content_type="text/csv"),
    "validation": TrainingInput("s3://bucket/data/validation/", content_type="text/csv"),
})
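After completion, tuner.best_training_job() returns the name of the best training job. tuner.best_estimator() attaches an Estimator to that job, from which the tuned hyperparameters (hyperparameters()) and the model artefact location (model_data) can be read.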
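The same selection can be reproduced by hand from the per-job summaries, which is useful when inspecting runner-up configurations. The sketch below assumes a list of dicts shaped like the TrainingJobSummaries entries returned by boto3's list_training_jobs_for_hyper_parameter_tuning_job. The function name, job names, and metric values are invented.

```python
def rank_tuning_jobs(summaries, minimize=True):
    """Sort completed trials from best to worst objective value."""
    scored = [
        s for s in summaries
        if s.get("TrainingJobStatus") == "Completed"
        and "FinalHyperParameterTuningJobObjectiveMetric" in s
    ]
    return sorted(
        scored,
        key=lambda s: s["FinalHyperParameterTuningJobObjectiveMetric"]["Value"],
        reverse=not minimize,
    )


# Invented summaries; real ones would come from the SageMaker API.
summaries = [
    {"TrainingJobName": "hpt-001", "TrainingJobStatus": "Completed",
     "TunedHyperParameters": {"max_depth": "6", "eta": "0.12"},
     "FinalHyperParameterTuningJobObjectiveMetric":
         {"MetricName": "validation:rmse", "Value": 2150.0}},
    {"TrainingJobName": "hpt-002", "TrainingJobStatus": "Completed",
     "TunedHyperParameters": {"max_depth": "9", "eta": "0.05"},
     "FinalHyperParameterTuningJobObjectiveMetric":
         {"MetricName": "validation:rmse", "Value": 1980.0}},
    {"TrainingJobName": "hpt-003", "TrainingJobStatus": "Stopped",
     "TunedHyperParameters": {"max_depth": "12", "eta": "0.3"}},
]

ranked = rank_tuning_jobs(summaries)
best = ranked[0]
print(best["TrainingJobName"], best["TunedHyperParameters"])
```

Trials that were stopped or never reported a final metric are filtered out first, so an incomplete job cannot be picked as the winner.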
Results
- All four job types successfully exercised on the same dataset end-to-end
- Spot instances on training and HPT reduced compute spend by ~65%
- HPT found a configuration that improved validation RMSE by ~12% over the manual baseline
- Model artefacts, logs, and metrics all versioned in S3/CloudWatch with no manual bookkeeping
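Both percentages above are relative reductions against a baseline. The sketch below shows the arithmetic, assuming spot savings are measured as billable seconds versus total training seconds (the BillableTimeInSeconds and TrainingTimeInSeconds fields that DescribeTrainingJob reports) and that the RMSE gain compares the tuned model with the manual one. The input numbers are placeholders chosen to land near the reported figures, not the project's measurements.

```python
def relative_reduction(baseline, value):
    """Fraction by which `value` is lower than `baseline` (0.25 means 25%)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 1.0 - value / baseline


# Placeholder figures for illustration only.
training_seconds = 1000   # wall-clock training time
billable_seconds = 350    # time actually charged under spot pricing
spot_saving = relative_reduction(training_seconds, billable_seconds)

baseline_rmse = 2500.0    # manually chosen hyperparameters
tuned_rmse = 2200.0       # best tuning job
rmse_gain = relative_reduction(baseline_rmse, tuned_rmse)

print(f"spot saving: {spot_saving:.0%}, RMSE improvement: {rmse_gain:.0%}")
```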
Tech Stack
- Compute: Amazon SageMaker (Processing, Training, Transform, HPT jobs)
- Algorithm: SageMaker built-in XGBoost
- Storage: Amazon S3
- Monitoring: Amazon CloudWatch Metrics
- Language: Python (sagemaker SDK, boto3)