Linux Engineer

Bethesda, Maryland

CCS Global Tech
Job Expired - Click here to search for similar jobs
Responsibilities:
Manage and maintain Linux servers (Red Hat/CentOS, Ubuntu) in a multi-enclave enterprise environment. Provide technical support, administration, and monitoring of Linux systems and NVIDIA DGX-1 and A100 servers across physical and virtual environments. Troubleshoot hardware and software issues, including server failures, network connectivity problems, and application errors. Implement security updates, patches, and configurations to harden systems and protect against vulnerabilities. Monitor system performance and resource utilization, identifying and resolving bottlenecks. Automate system administration tasks using scripting languages such as Bash and Python.

DevOps and Configuration Management:
Use DevOps tools (Ansible, Salt, GitLab) to automate configuration management, software updates, and system maintenance. Maintain and improve system availability through proactive monitoring and automation. Collaborate with developers and hardware architects to debug issues, define new requirements, and optimize workflows.

Resource Management:
Monitor the resource management system (Slurm) to keep resource allocation efficient and aligned with organizational priorities. Work directly with users and management to plan and allocate resources effectively. Communicate clearly and proactively regarding resource availability and scheduling.

Incident Response and Support:
Provide technical support to users, troubleshooting issues and resolving incidents in a timely manner. Analyze recurring problems and implement solutions to prevent recurrence. Document incident resolution steps and contribute to root cause analysis efforts. Participate in an on-call rotation to provide 24/7/365 support during outages and emergencies.

Qualifications:
Bachelor's degree in Computer Science or a related field and 6+ years of relevant experience (additional experience may be considered in lieu of a degree). 2+ years of experience administering Linux servers (Red Hat/CentOS, Ubuntu). Hands-on experience troubleshooting server hardware failures. Proficiency with configuration management tools (Ansible, Salt). Strong understanding of networking services (DNS, NFS, LDAP, DHCP). Experience with shell scripting and/or Python for automation. Knowledge of Linux security best practices. Excellent troubleshooting and problem-solving skills. Strong communication and interpersonal skills. Ability to work independently and as part of a team. DoD 8570.11 IAT Level II certification (Security+ CE, CCNA-Security, GSEC, or SSCP) and an appropriate computing environment (CE) certification.

Preferred:
Experience with container technologies (Docker, Kubernetes). Familiarity with monitoring tools (Prometheus/Grafana). Knowledge of distributed resource scheduling systems (Slurm, LSF). Experience with CUDA and GPU-accelerated computing systems. Basic understanding of deep learning frameworks and algorithms.
Date Posted: 01 April 2025