Utkarsh Kamthankar

AI Benchmark Task Engineer

Remote, India · 3+ yrs exp · 82 · Excellent

About

Analytical AI Task Engineer combining an advanced academic foundation in Data Science with 3.5+ years of professional experience. Expert in designing and authoring high-quality multi-agent benchmark tasks that evaluate the analytical reasoning, coordination, and execution capabilities of advanced AI systems. Proficient in SQL and Python (pandas, NumPy) for in-depth data analysis, scripting, and writing precise oracle logic. Proven ability to curate real-world datasets and create realistic synthetic datasets from messy multi-source files (CSV, JSON, logs, vendor assessments). Comfortable using Docker to build reproducible evaluation environments similar to those of SWE-bench and Terminal-Bench.

Skills & Expertise (10)

Python · Advanced · 8.4/10 · 3.5 Years Exp

Pandas · Advanced · 8.2/10 · 3.5 Years Exp

NumPy · Advanced · 8.0/10 · 3.5 Years Exp

Advanced SQL · Advanced · 8.0/10 · 3.5 Years Exp

Docker · Intermediate · 7.6/10 · 3.5 Years Exp

Statistical concepts · Intermediate · 7.4/10 · 3.5 Years Exp

Anomaly Detection · Intermediate · 7.2/10 · 3.5 Years Exp

Debugging · Intermediate · 6.8/10 · 3.5 Years Exp

Analytical Reasoning · Intermediate · 6.5/10 · 3.5 Years Exp

Dockerfiles

Work Experience

AI Benchmark Task Engineer (Multi-Agent Systems)

Xlairs

Oct 2025 - Present

Design and author multi-agent benchmark tasks centered on complex data analysis workflows, testing how effectively AI systems cross-reference data and execute statistical computation. Write precise oracle logic and Python verification scripts that validate specific, verifiable analytical conclusions rather than generic summaries. Review task performance signals to ensure strong separation between weaker and stronger agentic systems across evaluation suites. Refine benchmark tasks continuously to improve determinism, clarity, difficulty, and scoring quality for leading foundation model companies.

Data Analyst & Python Specialist

Micro1

Aug 2025 - Oct 2025

Analyzed large, messy, multi-source datasets (CSVs, JSON files, survey results, and financial documents) to formulate non-trivial analytical questions with clear, specific answers. Created realistic synthetic datasets and curated real-world-style datasets across domains such as finance, operations, and security analysis. Used SQL and Python (pandas, NumPy) to build workflows that require contradiction detection and anomaly identification.

Data Scientist & AI Evaluator

Senquire Analytics

Jan 2025 - Oct 2025

Developed detailed decomposition guides that split analytical work across specialist sub-agents (e.g., financial, technical, security, or operations analysts). Created reproducible evaluation environments using Python and Docker, including writing Dockerfiles, building container images, and debugging secure execution sandboxes. Applied statistical concepts (averages, distributions, outliers, and correlations) to benchmark LLM data analysis capabilities.

AI Data Engineer

Wipro LineCraft.AI

Mar 2024 - Oct 2024

Extracted and structured data from messy logs and vendor assessments, standardizing inputs to measure how effectively AI models perform complex analytical workflows. Maintained deep familiarity with AI coding benchmark environments (SWE-bench, Terminal-Bench) to align internal testing methodologies with frontier LLM evaluation standards. Cross-referenced multiple sources and applied statistical reasoning to establish baseline ground truths for internal machine learning algorithms.

Junior Data Analyst

Shri Samarth Tools

Dec 2022 - Mar 2024

Extracted and analyzed operational reports using SQL and Python to identify statistical anomalies and cross-reference messy transactional data. Authored reproducible data workflows that verified specific analytical conclusions for senior management, replacing generic, unactionable summaries. Built a solid foundation in data analysis by tackling unstructured real-world datasets, consistently delivering verifiable outcomes under strict operational deadlines.