Principal Site Reliability Engineer
UiPath
Location
Tokyo
Employment Type
Full time
Location Type
On-site
Department
Engineering
Life at UiPath
The people at UiPath believe in the transformative power of automation to change how the world works. We’re committed to creating category-leading enterprise software that unleashes that power.
To make that happen, we need people who are curious, self-propelled, generous, and genuine. People who love being part of a fast-moving, fast-thinking growth company. And people who care—about each other, about UiPath, and about our larger purpose.
Could that be you?
Why not join us at the cutting edge of "Agentic"?
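UiPath has long supported efficiency gains and transformation at Japanese companies through end-to-end business automation. Our focus now is "agentic automation": bringing AI agents, RPA robots, and people together to automate work across the entire enterprise safely and reliably.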
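UiPath's Japan entity (UiPath株式会社) has been elevated to a region reporting directly to headquarters and, under a strategy that positions Japan as a top-priority market, aims to deliver solutions from Japan to the world. UiPath is looking for people who are curious, self-driven, and quick to act. We need colleagues who enjoy the speed and change of business, care for one another, and keep growing together. Join UiPath to make agentic automation a reality and transform society together.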
Role Overview
This is a high-impact, principal-level role designed for an engineer who excels in the "heat of the moment". Operating with a high degree of autonomy, you will take operational leadership to restore the stability of UiPath’s large-scale distributed services, blending deep technical SRE expertise with the authoritative presence of an Incident Commander.
You will partner closely with platform, infrastructure, and application teams globally to improve service availability, reduce operational toil, and ensure our systems scale reliably under real-world load and failure conditions.
You will act as the Japan regional owner for SRE standards and maintain a close partnership and functional alignment with UiPath’s Global SRE organization. You will also own service reliability, observability, automation, and continuous improvement initiatives for the region.
You will report primarily to the Senior Director of Japan and functionally to the Vice President of SRE, who is based in the U.S. You will also act in a managerial capacity, with another team member reporting to you.
What You’ll Be Working On
1. Incident Command & Tactical Response
• Lead Incident Command: Act as the primary Incident Commander for high-stakes technical events. Establish command and control, orchestrate cross-functional response efforts (Compute, Network, Storage, Database), and maintain a common operating picture for all stakeholders.
• Live Site Troubleshooting: Serve as a key escalation point for complex issues. Use your deep understanding of service topology and dependencies to diagnose "grey failure" and resolve disruptions promptly.
• Executive Communication: Own the communication life cycle. Deliver real-time, executive-level briefings during active incidents, translating technical jargon into clear business impact and recovery timelines for leadership.
2. Prevention & Reliability Engineering
• Post-Incident Evolution: Lead thorough retrospectives and RCAs. Beyond just documenting what happened, you will drive and influence the discovery and implementation of automated self-healing solutions to ensure the same issue never occurs twice.
• Observability: Define, track, and improve service health by promoting well-designed SLIs and SLOs. Influence and implement proactive monitoring, dashboards, and early-warning alerts to identify performance bottlenecks before they trigger an incident.
• Toil Automation: Design and implement automation to reduce manual intervention during incidents and routine operations. Apply engineering rigor to operational workflows to eliminate repetitive and error-prone tasks.
• Service Resilience: Know how to test service behavior under load, including degradation modes, scaling characteristics, and dependency failures. Ensure backup, restore, and disaster recovery capabilities are implemented, tested, and maintained.
3. Service Design & Cross-functional Leadership
• Architectural Partnership: Partner with development teams to champion high availability and readiness of the services and promote best practices on reliability, resilience, and operability.
• Team Mentorship: Advocate for SRE best practices. Mentor and support other engineers, helping raise the overall incident response and reliability maturity of the organization.
What You’ll Bring to the Team
• Experience: 7+ years in SRE, Cloud Operations, or a related technical field, with at least 3 years in a lead responder or command-oriented role.
• Command Presence: Demonstrated ability to remain calm, focused, and decisive under extreme pressure. You can lead a room of diverse stakeholders and drive technical conversations to successful outcomes.
• Forensics & Investigation: Skill in analyzing system artifacts, network data, and performance dashboards to guide a multidisciplinary audience toward the likely root-cause areas of service failures.
• Technical Breadth: Strong proficiency in Python or Go and a holistic understanding of distributed systems, Kubernetes, and cloud infrastructure (preferably Azure).
• Observability Expertise: Deep experience with Prometheus/Grafana, OpenTelemetry, or an equivalent third-party observability stack.
• Availability: Willingness to participate in the on-call rotation as an Incident Commander for high-severity issues.
Nice to have
• Command Frameworks: Familiarity with structured command systems (such as the Incident Command System - ICS) used in crisis management.
• LLM Ops: Experience using LLMs or AI-driven detection systems to solve reliability and capacity challenges in GPU-heavy, high-performance computing environments.
• AI Tooling: Experience championing AI tools and LLM-powered agents to improve SRE pillars, including but not limited to reducing operational toil.
• Event-Driven Remediation: Proven history of building "self-healing" infrastructure via Terraform, Azure Service Operator, or an equivalent solution.
Working Hours & Language Skills
• Working Hours: The role follows a standard work schedule starting at 8:00 a.m. Flexibility may be required to support on-call rotations and respond to incidents, particularly those affecting customers in Japan.
• Language Skills: Strong proficiency in English for effective communication with global functional team members, combined with Japanese proficiency to clearly convey incident details, root causes, and remediation plans to customers and local stakeholders in the Japanese market.
Maybe you don’t tick all the boxes above—but still think you’d be great for the job? Go ahead, apply anyway. Please. Because we know that experience comes in all shapes and sizes—and passion can’t be learned.
Many of our roles allow for flexibility in when and where work gets done. Depending on the needs of the business and the role, the number of hybrid, office-based, and remote workers will vary from team to team. Applications are assessed on a rolling basis and there is no fixed deadline for this requisition. The application window may change depending on the volume of applications received or may close immediately if a qualified candidate is selected.
We value a range of diverse backgrounds, experiences, and ideas. We pride ourselves on our diverse and inclusive workplace, which provides equal opportunities to all persons regardless of age, race, color, religion, sex, sexual orientation, gender identity and expression, national origin, disability, neurodiversity, military and/or veteran status, or any other protected class. Additionally, UiPath provides reasonable accommodations for candidates on request and respects applicants' privacy rights. To review these and other legal disclosures, visit our privacy policy.