E0 (clean environment) completion rate (%) by industry category.
Columns: #, Model, Avg, Agri, Biz, Comm, Edu, Hlth, Ind, Pub, Sci, Tech, Trans
Columns: #, Model, E0, E1, E2, E3, Rob.
Key Findings
#1
No Single Model Dominates
Each model has a distinct occupational capability profile. GPT-5.2 leads overall (79.6%), but is outperformed in specific industries.
#2
Implicit Faults Are Hardest
Implicit data degradation (E2: 53.4%) is harder than explicit errors (E1: 62.6%) and mixed faults (E3: 54.4%) because it produces no overt error signal.
#3
Scaling & Reasoning Help
Larger models, newer generations, and higher reasoning effort consistently improve performance. GPT-5.2 gains 27.5 points going from minimum to maximum reasoning effort.
#4
Agent ≠ Simulator
Strong agents are not necessarily strong environment simulators. Simulator quality is critical to the reliability of evaluation based on Language World Models (LWMs).
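To make Finding #2 concrete, the short sketch below computes the percentage-point gaps between the three fault environments from the averages quoted there. The dictionary and helper names are illustrative only and are not part of OccuBench.

```python
# Average completion rates (%) quoted in Finding #2.
avg_completion = {"E1": 62.6, "E2": 53.4, "E3": 54.4}


def gap(a: str, b: str) -> float:
    """Percentage-point gap between environments a and b, rounded to one decimal."""
    return round(avg_completion[a] - avg_completion[b], 1)


# The environment with the lowest average completion rate is the hardest one.
hardest = min(avg_completion, key=avg_completion.get)

print(f"hardest environment: {hardest}")
print(f"E1 - E2 gap: {gap('E1', 'E2')} pts")
print(f"E3 - E2 gap: {gap('E3', 'E2')} pts")
```

On these figures, implicit faults (E2) sit 9.2 points below explicit errors (E1) and only 1.0 point below mixed faults (E3).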
📋 Task Examples
OccuBench covers 100 professional task scenarios across 10 industries. Each scenario includes domain-specific tools that agents must use to complete real-world tasks.
🌾 Agriculture & Environment · Wildlife Conservation
Endangered Species Monitoring
Investigate early warning alerts in the Northern Buffer Zone to identify threats to harvest-ready crops, execute deterrence protocols, and file an incident report with economic loss estimates.
Investigate the 'Unauthorized Entry' signal originating from the Western Corridor's automated sensor network to verify the safety of Rhino R-102. You must determine the exact survey coordinates by identifying the specific station ID associated with the alert in the system logs and retrieving its location from the equipment registry. Conduct the visual verification using drone D-01, accounting for Heavy Rain that limits visibility to 0.8km and accelerates power drain, and ensure the drone returns to its base at [-3.05, 37.35] before its 35% battery capacity is exhausted.
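The battery constraint in this task can be read as a round-trip feasibility check. The sketch below is one way an agent might reason about it, not the benchmark's simulator logic. It assumes the base coordinates are [latitude, longitude]; the station coordinates, drain rate, and rain multiplier are all hypothetical placeholders, since the task requires retrieving the real values from the logs and the equipment registry.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def round_trip_check(base, target, battery_pct, drain_pct_per_km, weather_multiplier=1.0):
    """Return (feasible, battery_needed_pct) for flying base -> target -> base."""
    needed = 2 * haversine_km(base, target) * drain_pct_per_km * weather_multiplier
    return needed <= battery_pct, needed


base = (-3.05, 37.35)      # from the task text, assumed to be (lat, lon)
station = (-3.10, 37.40)   # hypothetical station location
feasible, needed = round_trip_check(
    base, station,
    battery_pct=35.0,        # from the task text
    drain_pct_per_km=1.5,    # hypothetical drain rate
    weather_multiplier=1.3,  # hypothetical Heavy Rain penalty
)
print(f"needs {needed:.1f}% of battery, feasible: {feasible}")
```

Raising the weather multiplier or choosing a more distant station flips the check to infeasible, which is the trade-off the task asks the agent to respect.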
💼 Business & Enterprise · Insurance
Insurance Claim Adjudication
Resolve a hurricane catastrophe claim by verifying policy provisions, reviewing incident details and supporting documents, setting internal reserves, and executing the final disbursement.
Resolve Claim CLM-CAT-9021 for Robert Miller following the Hurricane Helios event (PCS-24). The adjudication must proceed through the correct catastrophe response workflow, ensuring the 3,200 sq ft primary residence is triaged appropriately and the claimant's displacement is addressed via the policy's living expense provisions. You must complete the entire lifecycle to a final payout, ensuring all financial decisions are authorized according to the specific indemnity limits and policy constraints defined in the operational handbook.
🛒 Commerce & Consumer · Food Safety
Food Recall Traceability
Identify all consumers impacted by a Class I Listeria contamination, trace the supply chain, send emergency notifications, process batch refunds, and enforce terminal lockouts.
Your objective is to mitigate a food safety crisis involving Roma Tomatoes (Lot BCH-TOM-101) which tested positive for Salmonella. You must determine the correct recall scope by reviewing the facility's processing and sanitation logs to decide if 'Bracketing' is required for adjacent lots. Once the scope is defined, you are responsible for ensuring every affected unit reaches a terminal safety status of FC_Destroyed, Carrier_Intercepted, or Consumer_DIF. For units within the logistics network, you must evaluate the real-time delivery status to choose the most effective intervention; specifically, you must account for the risk of 'Ghost Deliveries' when deciding whether to utilize Carrier Delivery Intercepts or jump directly to Omnichannel Electronic Data Notifications. The task is successful only when 100% of the yield from all implicated batches is reconciled through verified safety dispositions.
🏥 Healthcare & Life Sciences · Pharmacy
Pharmacy Prescription Verification
Execute comprehensive clinical verification for a prescription: check patient data, assess drug interactions, verify therapeutic appropriateness, confirm inventory stock, and finalize the dispensing workflow.
Complete the verification process for prescription RX-7721 and record a final dispensing decision in the system. Your objective is to ensure the safety and legality of the order by evaluating the patient's opioid history, calculating the daily Morphine Milligram Equivalents (MME), and investigating any geographical or clinical red flags. You must adhere to the 'Corresponding Responsibility' doctrine, meaning the dispensing decision must be justified by the data retrieved from the PDMP and patient profile. The task is complete once the verification_workflow_step is moved to a terminal state (APPROVED or REFUSED) and the rationale for the decision is documented in the session logs.
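The daily MME calculation mentioned in this task follows a simple pattern: strength per dose times doses per day times a drug-specific conversion factor, summed over all active opioid orders. The sketch below illustrates that arithmetic only. The conversion factors are commonly published oral values included as assumptions and should be checked against current clinical guidance; the example orders are hypothetical and are not the contents of RX-7721.

```python
from dataclasses import dataclass

# Assumed oral conversion factors (illustrative; verify before any real use).
MME_FACTORS = {
    "morphine": 1.0,
    "hydrocodone": 1.0,
    "oxycodone": 1.5,
    "codeine": 0.15,
}


@dataclass
class OpioidOrder:
    drug: str
    strength_mg: float    # milligrams per dose
    doses_per_day: float


def daily_mme(orders):
    """Sum of strength x doses/day x conversion factor over all orders."""
    return sum(o.strength_mg * o.doses_per_day * MME_FACTORS[o.drug] for o in orders)


# Hypothetical patient profile, not taken from the benchmark.
orders = [
    OpioidOrder("oxycodone", strength_mg=10, doses_per_day=4),   # 10 * 4 * 1.5 = 60
    OpioidOrder("hydrocodone", strength_mg=5, doses_per_day=2),  # 5 * 2 * 1.0 = 10
]
total = daily_mme(orders)
print(f"daily MME: {total}")
```

An unknown drug name raises a KeyError here rather than defaulting to a factor, so a missing conversion cannot silently understate the total.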
🎓 Education & Culture · Journalism
News Source Fact Checking
Generate a Defensive Publication Package: search archived corpus, verify claims against a truth registry, check litigation history, archive evidence with permanent links, and issue a no-surprises letter.
Achieve a Corroboration Vector Count of at least 3 to verify the systemic grade-change fraud at Lincoln High School. You are following a tip from 'Source Alpha' who describes a scheme they internally call 'Operation Golden Diploma.' To ensure source safety and prevent legal retaliation, you must adhere to these strict constraints: do not use the specific phrase 'Operation Golden Diploma' or any direct synonyms for it in your public records requests or communications with district officials; do not contact or interview any current school district staff members who are listed as 'Active' in the personnel directory due to CBA retaliation risks; and do not include or reference any unredacted student Personally Identifiable Information (PII) in your reporting. The task is complete when you provide a final verification report that identifies the FOIA-acquired metadata logs and a retired V2 validator who corroborates the findings.
⚙️ Industrial & Engineering · Manufacturing
Quality Defect Root Cause Analysis
Identify the failure mechanism in a TCU unit: fetch field telemetry, run non-destructive analysis and environmental stress tests, calculate acceleration parameters, and submit the final investigation report.
Identify the specific hardware failure and the corresponding corrective action for the microbial excursion detected in Formulation Tank FORM-B-102. You must determine the appropriate investigative pathway—either focusing on Clean-In-Place (CIP) fluid mechanics or Sterilize-In-Place (SIP) thermal performance—by first characterizing the specific contaminant found in the lab reports. Your investigation must reconcile why the automated SCADA systems reported nominal process parameters despite the persistent contamination. The final output must pinpoint the exact failed mechanical component and the specific engineering remediation required to restore validated status. Do not perform invasive physical inspections until electronic data systems have been fully audited.
🏛️ Public Service & Governance · Disaster Management
Wildfire Evacuation Coordination
Coordinate evacuation of 1,500 visitors from a wildfire zone: assess environmental data, deploy fire suppression assets, dispatch evacuation fleets, verify shelter clearance, and set official sector status.
A Level 2 evacuation alert has been issued for the facility located in the sector currently threatened by the highest fire intensity. You are tasked with successfully relocating the entire population of that specific facility to the designated secure receiving center. You must identify and utilize transport assets with specialized security reinforcement for the KSFA and Maximum-security cohorts, ensuring the mandatory 2:1 Inmate-to-Staff Ratio (ISR) and tactical support are present. The facility's physical Medication Administration Records (MARs) must be accounted for and loaded onto the first departing secure vehicle. You are required to confirm that the chosen transit corridor maintains safe visibility thresholds and is clear of active fire fronts before the convoy departs. Success is defined by a 100% Chain of Custody (CoC) integrity score and the complete clearance of the facility before environmental conditions become unsurvivable.
🔬 Science & Research · Astronomy
Telescope Observation Scheduling
Complete a 400-second science integration for a stellar target: check the LGS queue, synchronize the clock, prepare the telescope, configure the laser and adaptive optics system, and perform the integration.
Acquire a high-resolution 300-second science frame of the binary system 'HD-1415-B' (Coordinates: Az 85.0, Alt 40.0). Note that the telescope is currently positioned at these exact coordinates but is tracking the nearby calibrator 'HD-1415-8' for sensor balancing; you must ensure the tracking system is correctly identifying 'HD-1415-B' before the observation begins. For valid spectral analysis, the CCD must be cooled to exactly -25.0°C prior to exposure. Since the storage system is nearly at capacity with only 1.2 GB remaining, you have exactly one attempt to capture the target correctly before the disk is exhausted.
💻 Technology & IT · DevOps
CI/CD Pipeline Failure Recovery
Recover a failed CI/CD pipeline caused by a security violation: inspect stage logs, check credential status, find exposed secrets, remediate and rotate credentials, scrub repo history, and retry.
Restore the Fraud Detection service to a stable state by resolving the silent failure affecting the canary deployment of version v1.5. You must evaluate the monitoring metrics to confirm the failure type, completely shift production traffic away from the problematic candidate, revert the deployment manifests in the workspace to align with the stable version (v1.4) to maintain GitOps consistency, and update the model registry to ensure the failed version is blacklisted from future production use. The task is complete when the stable version handles all traffic and the system metadata reflects the rejection of the candidate model.
🚚 Transportation & Logistics · Maritime
Ship Cargo Loading Optimization
Transfer a 460-ton heavy lift unit onto a vessel: check cargo specs and vessel limits, assess hatch details and crane inventory, simulate the dynamic lift, configure reinforcement, manage ballast, and execute the operation.
Identify and secure the 'HT-Series' primary transformer specifically assigned for the North Atlantic winter route. The correct unit is designated as the 'Heavy-Lift' variant with a verified weight of exactly 182 metric tons. To counteract the anticipated 7.5 m/s² transverse accelerations caused by the vessel's stiff 3.0m GM, you must utilize the specific friction-enhancing material from the dock that provides a coefficient (μ) of exactly 0.43. Ensure the unit is loaded onto an available deck grid coordinate such that the final ship stability index remains within the safe operating range of 0.82 to 0.95 and the total weight capacity is not exceeded.
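The role of the friction coefficient in this task can be illustrated with a simplified sliding balance. The sketch below compares the transverse inertial force on the unit with the friction available from the dock material, using the weight, acceleration, and μ from the task text. It is a rough flat-deck model under stated assumptions: it ignores vertical accelerations, lashing geometry, and tipping, and it is not the benchmark's stability computation.

```python
G = 9.81  # gravitational acceleration in m/s^2


def securing_balance(mass_t, transverse_acc, mu):
    """Return (sliding_force, friction_force, residual_force) in newtons.

    Simplified flat-deck model: the residual is the share of the sliding
    force that friction alone does not cover (to be taken up by lashings).
    """
    mass_kg = mass_t * 1000.0
    sliding_force = mass_kg * transverse_acc
    friction_force = mu * mass_kg * G
    residual_force = max(0.0, sliding_force - friction_force)
    return sliding_force, friction_force, residual_force


# Values from the task text: 182 t unit, 7.5 m/s^2 transverse acceleration, mu = 0.43.
sliding, friction, residual = securing_balance(182.0, 7.5, 0.43)
print(f"sliding force:  {sliding / 1000:.1f} kN")
print(f"friction force: {friction / 1000:.1f} kN")
print(f"residual force: {residual / 1000:.1f} kN")
```

Under these assumptions friction covers roughly 768 kN of the 1,365 kN sliding force, leaving about 597 kN for other securing measures. A material with a lower μ would leave a larger residual.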
📝 Citation
If you find OccuBench useful, please cite our paper.
@article{hu2026occubench,
title={OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language World Models},
author={Xiaomeng Hu and Yinger Zhang and Fei Huang and Jianhong Tu and Yang Su and Lianghao Deng and Yuxuan Liu and Yantao Liu and Dayiheng Liu and Tsung-Yi Ho},
journal={arXiv preprint arXiv:2604.10866},
year={2026}
}