Linux Systems Engineer - Hybrid onsite - W2 w/referral - Needham, MA
Hourly pay range:
Benefits - health insurance, dental insurance, 401k, PTO
Visa types accepted: USC/GC
Principal Duties and Responsibilities
- Cluster and Systems Administration: Manage and administer production systems used by researchers and Research Centers.
- Ansible Automation: Refactor code to deploy and maintain systems and applications using Ansible templates.
- Analyze the results of server monitoring and implement changes to improve performance, processing and utilization. Propose, maintain and enforce policies, practices and security procedures.
- Work with users to deploy required applications, including Docker/Singularity applications.
- Analyze and resolve customer and technical problems, including cluster scheduling parameter tuning, memory/CPU contention, and scientific application compilation and run-time issues.
- Develop and maintain system documentation as well as user-facing knowledge base articles and how-to guides.
- Evaluate, select and deploy hardware and/or cloud solutions for research scientific computing. This includes CPU and GPU-based compute, high speed networking and data storage.
- Comfortable working within an Agile team (Scrum).
Qualifications
- BA/BS degree in engineering, a quantitative field, or system administration required, or an equivalent combination of skills and experience.
- Minimum of 5 years of experience with systems administration in Linux environments for a scientific domain, including NVIDIA GPU implementations.
- 3+ years of experience with automation and configuration management using Ansible.
- 3+ years of Docker and Kubernetes experience.
- A combination of education and experience may be substituted for requirements.
- Demonstrated ability to administer up to several hundred Linux servers in an on-premises environment.
- Hands-on experience writing and maintaining Ansible code. Strong skills writing Linux shell scripts (Bash).
- Experience with monitoring software such as Open XDMoD or Prometheus.
- Experience with server deployment technologies (Kickstart, PXE, IPMI).
- Understanding of DHCP, DNS, TCP/IP, NFS, SMB and HTTP network protocols.
- Strong verbal and written communication skills, including the ability to write clear technical documentation.
- High level of initiative and eagerness to learn new technologies.
- Familiarity with information technology security and data privacy considerations applicable to a healthcare environment is advantageous.
- Knowledge of HPC job scheduling platforms like LSF or Slurm.
- Experience with Git and Jira tools.
- Ability to multitask and prioritize work requirements, keeping team and management informed.
- Experience with Kerberos authentication.
- Experience providing support to research investigators with diverse computing needs.