Site Reliability Engineer

  • Eden Prairie
  • C4 Technical Services
Location: Remote
PURPOSE:
This position requires detailed knowledge of operating systems, operational tools, networks, database management software, and other similar systems support software. It focuses primarily on information and document requirements for data, workflow, logical process, hardware and operating system environment, interfaces with other systems, internal and external checks and controls, and outputs. The engineer is embedded in development teams.
Site Reliability Engineers (SREs) are responsible for keeping all customer-facing services and other production systems running smoothly. SREs are a blend of pragmatic operators and software craftspeople who apply sound DevOps engineering principles, operational discipline, and mature automation to our operating environments and the applications' code base.
SREs specialize in systems (operating systems, storage subsystems, networking) while implementing best practices for availability, reliability, and scalability, with varied interests in algorithms and distributed systems. The team's experience feeds back into other Engineering groups within the company and helps drive operational maturity.
JOB RESPONSIBILITIES:
  • Apply broad and deep knowledge of constructing, refining, and maintaining cloud services to meet business requirements.
  • Execute CI/CD pipelines to deploy application code to our production environment.
  • Build and monitor our production systems using monitoring and logging tools such as Elastic, Prometheus/Grafana, or tools with similar capabilities.
  • Debug production issues across services and the technology stack through to remediation, following the incident management process.
  • Define SLAs/SLOs for applications, support adherence to them, and measure compliance.
  • Plan for future growth by working with application teams to ensure that scalability needs are considered both in the cloud and on-prem.
  • Build a culture of shift-left quality, working with Product teams and DevOps to implement automation, monitoring, and self-healing approaches.
  • Collaborate with Quality Engineering on application performance management, supporting monitoring, APM tests, capacity management, and auto-scaling solutions.
  • Focus on automating operational processes using tools like Ansible to remove manual effort.
  • Execute Terraform scripts to rebuild infrastructure as needed.
  • With only general direction and guidance, conduct special assignments of a highly complex and technical nature, and develop alternative strategies that have an important effect on planning or meeting objectives.
  • Document study findings and prepare recommendations for implementing new systems, procedures, or organizational changes. Report findings to the principal engineer, staff, and management as required.
  • Perform other duties as assigned.
JOB REQUIREMENTS:
Experience:
  • 4-6 years of relevant work experience in SRE/DevOps preferred
  • Knowledge of modern software development processes such as Agile Scrum, Kanban, or Scrumban
Knowledge:
  • Must have strong computer skills within stated area of engineering expertise and must be proficient in use of Microsoft Office applications
Skills/Abilities:
  • Excellent written and verbal communication skills, strong customer focus and demonstrated ability to work in geographically dispersed teams
  • Ability to manage competing priorities while working on concurrent projects and/or simultaneous support tasks across multiple business or technical units
  • Logical thought process and ability to learn new systems, concepts, and procedures
  • Must be able to mentor junior engineers in new systems, concepts, and technical procedures
  • Provide broad and deep technical advice and serve as a technical training resource to management and staff.
  • Exercise judgment within broadly defined practices and policies in selecting methods, techniques, and evaluation criteria for obtaining results.
  • Work with software vendors to resolve issues and implement recommended system and application changes.
  • Apply knowledge-based technical expertise in decision making and in resolving problems that are highly complex and technical in nature.
  • Must exemplify excellent written and verbal communication skills.
  • Must have experience with CI/CD pipelines that deploy to production systems.
  • Must be able to make accurate decisions related to task delegation and provide leadership in filling project and/or support team responsibilities
  • Must be available to work off hours or shifts as required for 24/7/365 support.
