Data Engineer

Washington, DC

Colorectal Cancer Alliance
TITLE: Data Engineer

ORGANIZATION: Colorectal Cancer Alliance

LOCATION: Washington DC based, Hybrid

POSITION TYPE: Full-Time, Exempt

REPORTS TO: Chief Data & Analytics Officer

COMPENSATION: $130,000-$150,000 annual salary; Healthcare benefits are available for this role.

ORGANIZATION OVERVIEW:

The Colorectal Cancer Alliance is a national nonprofit organization committed to ending colorectal cancer within our lifetime. We help patients, families, survivors, and caregivers navigate diagnosis and treatment options, connect them with those who can share experiences and knowledge, and identify resources to meet their needs. We partner with healthcare professionals and social influencers to raise awareness of preventative screening, and we collaborate with researchers to better understand the disease and fund critical research. Our efforts are urgent, effective, and efficient because we believe that tomorrow can't wait.

POSITION OVERVIEW:

At the Colorectal Cancer Alliance, we are building an innovative, patient-centric, data-driven precision oncology platform to transform the future of colorectal cancer awareness, care, and research. Our core systems - BlueHQ, BlueLake, and K-SPY - are designed to empower patients, caregivers, healthcare providers, and researchers through scalable, interoperable, and AI-ready architectures.

The Data Engineer will lead the design, development, and optimization of cloud-based data pipelines that fuel our real-world data and precision oncology platforms - BlueHQ, BlueLake, and K-SPY.

This role focuses on extracting, transforming, harmonizing, and delivering high-quality clinical, patient-reported, and engagement data from sources such as REDCap, Salesforce NPC, AWS services, and external health systems into our federated research and analytics environments.

The ideal candidate brings deep technical expertise in AWS-native data engineering, a strong foundation in HIPAA-compliant data workflows, and a passion for enabling longitudinal patient journeys, clinical trial matching, and real-world evidence generation through structured, governed, and scalable datasets.

You will collaborate closely with platform engineers, navigators, and research stakeholders to build the data foundation supporting advanced analytics, AI/ML models, patient navigation and care, and future patient-centered discoveries.

POSITION RESPONSIBILITIES:

Key responsibilities include, but are not limited to:
  • Design, build, and optimize scalable data pipelines to extract, transform, and load (ETL/ELT) data from REDCap, Salesforce NPC, AWS-based platforms (HealthLake, Redshift, S3), and external partners into BlueLake and downstream analytics environments.
  • Develop and maintain APIs, connectors, and integration workflows for seamless real-time or near-real-time data movement across federated systems, supporting zero-copy data federation (Athena, Redshift Spectrum) and other diverse data flows.
  • Model and harmonize clinical, patient-reported, navigation, trial, and real-world data into semantically enriched datasets aligned to FHIR, OMOP, RDF, and internal Single Canonical Form (SCF) standards.
  • Collaborate closely with Data Managers, Systems Engineers, and Platform Architects to operationalize metadata-driven architectures, event-driven data contracts, and interoperable patient-centric frameworks.
  • Partner with metadata governance systems (DataHub or similar) to maintain data lineage, provenance, semantic clarity, and compliance across the platform ecosystem.
  • Support early-stage design, storage, and querying of linked knowledge graphs (e.g., RDF/SPARQL) connecting patients, biomarkers, navigation events, trials, and outcomes.
  • Implement dbt models, CI/CD pipelines, and version control practices for structured, auditable data transformations ready for analytics, AI/ML models, and predictive patient navigation.
  • Build robust validation, monitoring, and data quality frameworks to ensure data integrity, reliability, and compliance with HIPAA, GDPR, 21 CFR Part 11, and IRB-approved protocols.
  • Support longitudinal tracking of patient journeys, including survivorship, recurrence, biomarker evolution, navigation milestones, and real-world outcomes, across federated multi-source environments.
  • Develop scalable, audit-friendly processes for ongoing real-world data (RWD) collection, multi-source harmonization, semantic enrichment, and analytical enablement.
  • Maintain detailed technical documentation for pipelines, transformations, and governed data assets to ensure operational transparency, reproducibility, and audit readiness.
  • Collaborate actively across navigation, research, analytics, development, and data governance teams to ensure the interoperability, usability, and strategic advancement of the organization's data infrastructure.
REQUIRED QUALIFICATIONS:
  • Minimum of 5 years of experience in data engineering, ETL/ELT pipeline development, and cloud-based data environments.
  • Active AWS Certified Specialty certification required.
  • Strong proficiency with AWS-native data services, including AWS Glue, Redshift (or Redshift Spectrum), S3, Lake Formation, and Athena; familiarity with AWS HealthLake and SageMaker is preferred.
  • Proven expertise in building scalable ETL/ELT pipelines using Python, SQL, and dbt (data build tool).
  • Experience integrating structured and semi-structured data sources (e.g., REDCap, Salesforce, CSVs, FHIR JSON) into centralized or federated repositories.
  • Solid understanding of data modeling (normalized, star, and snowflake schemas), semantic enrichment principles, and zero-copy/federated architecture approaches.
  • Hands-on knowledge of metadata-driven pipeline design, data governance concepts (lineage, cataloging, privacy, security), and regulatory compliance frameworks (HIPAA, GDPR, 21 CFR Part 11).
  • Familiarity with healthcare and research interoperability standards such as FHIR, OMOP, HL7, or RDF/semantic web technologies.
  • Strong experience with data observability and with monitoring and validating data quality across pipelines (completeness, consistency, accuracy).
  • Comfort with API-first architectures, including RESTful APIs and GraphQL-based data interactions.
  • Experience with real-time or event-driven data ingestion using tools such as Kafka, Kinesis, or AWS EventBridge.
  • Excellent technical writing, documentation, and communication skills.
PREFERRED QUALIFICATIONS:
  • AWS Certified Cloud Practitioner and Solutions Architect (Associate or Professional) certifications.
  • Deep familiarity with REDCap database structures, MySQL backends, and API-driven data extraction workflows.
  • Experience supporting clinical trial data management, patient registries, or real-world evidence (RWE) studies.
  • Hands-on experience implementing metadata cataloging platforms such as Atlan, DataHub, or Amundsen, and designing metadata-driven ingestion frameworks.
  • Knowledge of healthcare and clinical research data standards including CDISC, OMOP, FHIR, and NCIT ontologies.
  • Expertise working with de-identified and limited datasets under HIPAA Safe Harbor and Expert Determination methodologies.
  • Familiarity with Salesforce Nonprofit Cloud and MuleSoft-mediated system integrations.
  • Experience designing and maintaining knowledge graphs or semantic integration platforms (e.g., AWS Neptune, Stardog).
  • Proficiency building federated patient cohorts across multi-source environments to support clinical research and RWE generation.
  • Exposure to event-driven architectures (e.g., AWS EventBridge) and serverless data ingestion patterns.
  • Experience preparing datasets for machine learning applications (e.g., SageMaker Feature Store pipelines).
  • Strong proficiency with CI/CD practices for data workflows (e.g., dbt Cloud, GitHub Actions).

SALARY RANGE:

Competitive nonprofit salary, typically ranging from $130,000 to $150,000, based on experience and qualifications.

STATEMENT OF NON-DISCRIMINATION: The Colorectal Cancer Alliance does not discriminate on the basis of race, color, gender, disability, age, religion, sexual orientation, nationality, or ethnicity. We are strongly committed to hiring a diverse and multicultural staff and encourage applications from all backgrounds.

HOW TO APPLY: To apply, please complete the application in our ADP Workforce Now application portal.

To see all employment opportunities at the Alliance, please click here to be directed to our careers site.

If you encounter any issues with this application, please contact us at
Date Posted: 10 May 2025