Site Reliability Engineer

Terminal

Software Engineering

Posted on Apr 24, 2026

About VEG

hold your horses—something's not right

As a veterinarian, Dr. David Bessler knew something was wrong with the ER experience. A major piece was missing: the customer's feelings. They were left out in a lobby, kept in the dark about treatment options, and then hit with surprise fees.

working like a dog, but excited as a puppy

Enter VEG. Along with VEG co-founder David Glattstein and a dedicated team, Dr. Bessler created the VEG experience, which puts people and their pets first. Customers are met at the door and that door is open 24/7, even on holidays. And our staff is trained to treat any emergency—from vomiting to surgery.

keeping the flock together

If pets could talk, they'd tell you to stay with them. That's why at VEG, our open floor plan lets you see everything and participate in your pet's care. In fact, you can stay with your pet the entire time, even through surgery.

welcoming scaredy-cats

Want to hold your pet during treatment? YES! We do whatever it takes to take the scary out of a stressful situation. With your pet more at ease, we can have an open conversation about diagnosis, treatment options, and costs.


About The Role

ABOUT VEG

In 2014, VEG was born with a mission to help people and their pets when they need it most by challenging norms and fixing the ER experience. Since then, we’ve expanded rapidly, with hospitals nationwide open 24/7/365, and created an ER experience that focuses on what our pets and pet parents really need. We’ve done the same for our people (VEGgies), finding a way to say YES so they are empowered to achieve great things, grow in unexpected ways, and find a place where they truly belong.

We’re rethinking emergency care from every angle, from how we run our hospitals to how we support the people working inside them. That’s where our headquarters team comes in. Whether building technology to make our hospitals more efficient, recruiting and growing incredible VEGgies, or bringing our brand to life through marketing, our VQ (VEG Headquarters) team makes it all possible, ensuring our hospitals and people have everything they need to help pets and their families. VEG is a 2025 and 2026 certified Great Place to Work®.

THE JOB

We are looking for a Senior Site Reliability Engineer who understands that at VEG, "reliability" is a medical necessity: if our proprietary platform, DogByte, goes down, a pet's life could be at risk. You will be the primary lead for our platform's resilience, transforming our infrastructure into a self-healing system that empowers our medical teams to provide 24/7/365 life-saving care. You will spend your time bridging the gap between high-level architectural strategy and hands-on technical "surgery," ensuring our engineering teams can build at pace while the foundation remains rock-solid.

You will evolve and strengthen an existing system that must meet the demands of VEG’s hospital expansion, ensuring our infrastructure never limits our ability to open new hospitals or provide medical care. You will own the ongoing stability of DogByte, scaling it from its current state into a robust enterprise platform where one hospital's traffic is isolated and does not impact another's experience.


What You’ll Do

● Formulate short- and long-term strategies to ensure DogByte withstands year-over-year volume increases, specifically solving for hospital-to-hospital traffic isolation
● Work with engineers to ensure data flows, from client to API to database, are configured for high-concurrency and maximum reliability
● Build automated processes to handle high-traffic spikes and automatically remediate common system errors
● Set up monitoring and alerting to identify latency throughout the stack and resolve issues before they impact hospital operations
● Establish and meet SLOs for high availability, ensuring our engineers can build products without worrying if the system can support them


What You’ll Bring

● Bachelor’s Degree preferred or equivalent experience
● 5+ years in SRE/DevOps roles, expertly handling high-concurrency environments
● Deep understanding of the AWS ecosystem managed entirely through Infrastructure as Code
● Expertise in traffic management, including load balancing techniques, Nginx configuration, and autoscaling to handle volatile patterns
● Technical leadership in observability, establishing the tracing frameworks and monitoring required to diagnose latency issues and ensure high availability across the entire request lifecycle
● You have direct experience with technologies relevant to our technical stack, which currently includes: AWS ECS, Terraform, Nginx, PostgreSQL (RDS), Python