OccuBench

Evaluating AI Agents on Real-World Professional Tasks via Language World Models

100 Task Scenarios
10 Industry Categories
65 Specialized Domains
382 Eval Instances
15 Models Evaluated

🏆 Leaderboard

E0 (clean environment) completion rate (%) by industry category.

# Model Avg Agri Biz Comm Edu Hlth Ind Pub Sci Tech Trans

Key Findings

#1

No Single Model Dominates

Each model has a distinct occupational capability profile. GPT-5.2 leads overall (79.6%), but is outperformed in specific industries.

#2

Implicit Faults Are Hardest

Implicit data degradation (E2: 53.4%) is harder than both explicit errors (E1: 62.6%) and mixed faults (E3: 54.4%), because implicit faults carry no overt error signal.
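
To make the comparison concrete, here is a minimal sketch that ranks the three fault settings using only the completion rates quoted above. The variable names are illustrative and are not part of OccuBench.

```python
# Completion rates (%) for the three fault settings, as quoted in this finding.
rates = {"E1": 62.6, "E2": 53.4, "E3": 54.4}

# The hardest setting is the one with the lowest completion rate.
hardest = min(rates, key=rates.get)

# Gap in percentage points between each setting and the hardest one.
gaps = {name: round(rate - rates[hardest], 1) for name, rate in rates.items()}

print(hardest)
print(gaps)
```

On these figures, E2 sits 9.2 points below E1 but only 1.0 point below E3, so mixed faults are nearly as hard as purely implicit ones.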

#3

Scaling & Reasoning Help

Larger models, newer generations, and higher reasoning effort consistently improve performance. GPT-5.2 gains 27.5 points from minimum to maximum reasoning effort.

#4

Agent ≠ Simulator

Strong agents are not necessarily strong environment simulators. Simulator quality is critical for LWM-based evaluation reliability.

📋 Task Examples

OccuBench covers 100 professional task scenarios across 10 industries. Each scenario includes domain-specific tools that agents must use to complete real-world tasks.
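
One way to picture a scenario and its tool set is as a small record plus a check that an agent only calls tools the scenario declares. The sketch below is an assumption for illustration: the `Scenario` class, the `undeclared_calls` helper, the sample trajectory, and the `reboot_sensor` call are all hypothetical and do not describe OccuBench's actual data format or harness. The tool names are those listed for the first example on this page.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Scenario:
    """Hypothetical container for one task scenario (not OccuBench's real schema)."""
    industry: str
    domain: str
    name: str
    tools: Tuple[str, ...]


def undeclared_calls(scenario: Scenario, trajectory: List[str]) -> List[str]:
    """Return, in order, the tool calls that the scenario does not declare."""
    allowed = set(scenario.tools)
    return [call for call in trajectory if call not in allowed]


scenario = Scenario(
    industry="Agriculture & Environment",
    domain="Wildlife Conservation",
    name="Endangered Species Monitoring",
    tools=(
        "get_alert_summary",
        "get_environment_and_sensor_status",
        "get_available_drones",
        "deploy_drone_recon",
        "get_behavioral_history",
        "execute_deterrence",
        "calculate_economic_loss",
        "submit_incident_report",
    ),
)

# Hypothetical agent trajectory; "reboot_sensor" is deliberately not a declared tool.
trajectory = ["get_alert_summary", "deploy_drone_recon", "reboot_sensor", "submit_incident_report"]
bad_calls = undeclared_calls(scenario, trajectory)
print(bad_calls)
```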

🌾 Agriculture & Environment / Wildlife Conservation

Endangered Species Monitoring

Investigate early warning alerts in the Northern Buffer Zone to identify threats to harvest-ready crops, execute deterrence protocols, and file an incident report with economic loss estimates.

get_alert_summary, get_environment_and_sensor_status, get_available_drones, deploy_drone_recon, get_behavioral_history, execute_deterrence, calculate_economic_loss, submit_incident_report
Investigate the 'Unauthorized Entry' signal originating from the Western Corridor's automated sensor network to verify the safety of Rhino R-102. You must determine the exact survey coordinates by identifying the specific station ID associated with the alert in the system logs and retrieving its location from the equipment registry. Conduct the visual verification using drone D-01, accounting for Heavy Rain that limits visibility to 0.8km and accelerates power drain, and ensure the drone returns to its base at [-3.05, 37.35] before its 35% battery capacity is exhausted.
💼 Business & Enterprise / Insurance

Insurance Claim Adjudication

Resolve a hurricane catastrophe claim by verifying policy provisions, reviewing incident details and supporting documents, setting internal reserves, and executing the final disbursement.

get_claim_summary, get_incident_details, fetch_policy_provisions, check_document_registry, fetch_document_content, upload_claim_document, set_internal_reserves, update_claim_status, execute_disbursement
Resolve Claim CLM-CAT-9021 for Robert Miller following the Hurricane Helios event (PCS-24). The adjudication must proceed through the correct catastrophe response workflow, ensuring the 3,200 sq ft primary residence is triaged appropriately and the claimant's displacement is addressed via the policy's living expense provisions. You must complete the entire lifecycle to a final payout, ensuring all financial decisions are authorized according to the specific indemnity limits and policy constraints defined in the operational handbook.
🛒 Commerce & Consumer / Food Safety

Food Recall Traceability

Identify all consumers impacted by a Class I Listeria contamination, trace the supply chain, send emergency notifications, process batch refunds, and enforce terminal lockouts.

get_supply_chain_trace, get_impacted_transactions, get_customer_profile, send_emergency_notification, process_batch_refund, enforce_terminal_lockout, update_recall_registry
Your objective is to mitigate a food safety crisis involving Roma Tomatoes (Lot BCH-TOM-101) which tested positive for Salmonella. You must determine the correct recall scope by reviewing the facility's processing and sanitation logs to decide if 'Bracketing' is required for adjacent lots. Once the scope is defined, you are responsible for ensuring every affected unit reaches a terminal safety status of FC_Destroyed, Carrier_Intercepted, or Consumer_DIF. For units within the logistics network, you must evaluate the real-time delivery status to choose the most effective intervention; specifically, you must account for the risk of 'Ghost Deliveries' when deciding whether to utilize Carrier Delivery Intercepts or jump directly to Omnichannel Electronic Data Notifications. The task is successful only when 100% of the yield from all implicated batches is reconciled through verified safety dispositions.
🏥 Healthcare & Life Sciences / Pharmacy

Pharmacy Prescription Verification

Execute comprehensive clinical verification for a prescription: check patient data, assess drug interactions, verify therapeutic appropriateness, confirm inventory stock, and finalize the dispensing workflow.

get_prescription_details, get_patient_data, get_drug_info, update_prescription_details, check_clinical_safety, check_inventory_stock, get_workflow_status, update_workflow_step, finalize_verification
Complete the verification process for prescription RX-7721 and record a final dispensing decision in the system. Your objective is to ensure the safety and legality of the order by evaluating the patient's opioid history, calculating the daily Morphine Milligram Equivalents (MME), and investigating any geographical or clinical red flags. You must adhere to the 'Corresponding Responsibility' doctrine, meaning the dispensing decision must be justified by the data retrieved from the PDMP and patient profile. The task is complete once the verification_workflow_step is moved to a terminal state (APPROVED or REFUSED) and the rationale for the decision is documented in the session logs.
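The daily MME calculation mentioned in this instance is conventionally the daily dose in milligrams multiplied by a drug-specific conversion factor. The sketch below is a minimal illustration under stated assumptions: the conversion factors are the commonly published CDC oral values for these four drugs, while the prescription (oxycodone 10 mg, four times a day) is hypothetical, since the page does not show the contents of RX-7721. It is not the benchmark's scoring logic and not clinical guidance.

```python
# Oral morphine milligram equivalent (MME) conversion factors for a few common opioids.
# Assumption: values follow the commonly published CDC table; verify before any real use.
MME_FACTORS = {
    "morphine": 1.0,
    "hydrocodone": 1.0,
    "oxycodone": 1.5,
    "codeine": 0.15,
}


def daily_mme(drug: str, strength_mg: float, doses_per_day: float) -> float:
    """Daily MME = strength per dose (mg) x doses per day x conversion factor."""
    return strength_mg * doses_per_day * MME_FACTORS[drug]


# Hypothetical prescription, NOT taken from RX-7721: oxycodone 10 mg, 4 doses per day.
example_mme = daily_mme("oxycodone", 10, 4)
print(example_mme)
```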
🎓 Education & Culture / Journalism

News Source Fact Checking

Generate a Defensive Publication Package: search archived corpus, verify claims against a truth registry, check litigation history, archive evidence with permanent links, and issue a no-surprises letter.

get_environment_metadata, search_archived_corpus, verify_with_truth_registry, get_litigation_history, perma_archive_evidence, issue_no_surprises_letter, submit_publication_package
Achieve a Corroboration Vector Count of at least 3 to verify the systemic grade-change fraud at Lincoln High School. You are following a tip from 'Source Alpha' who describes a scheme they internally call 'Operation Golden Diploma.' To ensure source safety and prevent legal retaliation, you must adhere to these strict constraints: do not use the specific phrase 'Operation Golden Diploma' or any direct synonyms for it in your public records requests or communications with district officials; do not contact or interview any current school district staff members who are listed as 'Active' in the personnel directory due to CBA retaliation risks; and do not include or reference any unredacted student Personally Identifiable Information (PII) in your reporting. The task is complete when you provide a final verification report that identifies the FOIA-acquired metadata logs and a retired V2 validator who corroborates the findings.
⚙️ Industrial & Engineering / Manufacturing

Quality Defect Root Cause Analysis

Identify the failure mechanism in a TCU unit: fetch field telemetry, run non-destructive analysis and environmental stress tests, calculate acceleration parameters, and submit the final investigation report.

fetch_field_telemetry, list_laboratory_inventory, calculate_acceleration_parameters, register_baseline_signature, perform_non_destructive_analysis, execute_environmental_stress_test, diagnose_failure_mechanism, submit_investigation_report
Identify the specific hardware failure and the corresponding corrective action for the microbial excursion detected in Formulation Tank FORM-B-102. You must determine the appropriate investigative pathway—either focusing on Clean-In-Place (CIP) fluid mechanics or Sterilize-In-Place (SIP) thermal performance—by first characterizing the specific contaminant found in the lab reports. Your investigation must reconcile why the automated SCADA systems reported nominal process parameters despite the persistent contamination. The final output must pinpoint the exact failed mechanical component and the specific engineering remediation required to restore validated status. Do not perform invasive physical inspections until electronic data systems have been fully audited.
🏛️ Public Service & Governance / Disaster Management

Wildfire Evacuation Coordination

Coordinate evacuation of 1,500 visitors from a wildfire zone: assess environmental data, deploy fire suppression assets, dispatch evacuation fleets, verify shelter clearance, and set official sector status.

get_environmental_data, get_shelter_info, get_resource_inventory, update_route_status, deploy_fire_suppression_assets, deploy_manual_alert_team, dispatch_evacuation_fleet, verify_hospitality_clearance, set_official_sector_status
A Level 2 evacuation alert has been issued for the facility located in the sector currently threatened by the highest fire intensity. You are tasked with successfully relocating the entire population of that specific facility to the designated secure receiving center. You must identify and utilize transport assets with specialized security reinforcement for the KSFA and Maximum-security cohorts, ensuring the mandatory 2:1 Inmate-to-Staff Ratio (ISR) and tactical support are present. The facility's physical Medication Administration Records (MARs) must be accounted for and loaded onto the first departing secure vehicle. You are required to confirm that the chosen transit corridor maintains safe visibility thresholds and is clear of active fire fronts before the convoy departs. Success is defined by a 100% Chain of Custody (CoC) integrity score and the complete clearance of the facility before environmental conditions become unsurvivable.
🔬 Science & Research / Astronomy

Telescope Observation Scheduling

Complete a 400-second science integration for a stellar target: check the LGS queue, synchronize the clock, prepare the telescope, configure the laser and adaptive optics system, and perform the integration.

get_lgs_queue, get_target_coordinates, get_active_prm_windows, sync_clock, prepare_telescope, set_laser_shutter, set_ao_system, perform_integration
Acquire a high-resolution 300-second science frame of the binary system 'HD-1415-B' (Coordinates: Az 85.0, Alt 40.0). Note that the telescope is currently positioned at these exact coordinates but is tracking the nearby calibrator 'HD-1415-8' for sensor balancing; you must ensure the tracking system is correctly identifying 'HD-1415-B' before the observation begins. For valid spectral analysis, the CCD must be cooled to exactly -25.0°C prior to exposure. Since the storage system is nearly at capacity with only 1.2 GB remaining, you have exactly one attempt to capture the target correctly before the disk is exhausted.
💻 Technology & IT / DevOps

CI/CD Pipeline Failure Recovery

Recover a failed CI/CD pipeline caused by a security violation: inspect stage logs, check credential status, find exposed secrets, remediate and rotate credentials, scrub repo history, and retry.

get_pipeline_metadata, get_stage_logs, inspect_credential_status, read_repository_file, find_credential_occurrences, remediate_and_rotate, scrub_repository_history, trigger_pipeline_retry
Restore the Fraud Detection service to a stable state by resolving the silent failure affecting the canary deployment of version v1.5. You must evaluate the monitoring metrics to confirm the failure type, completely shift production traffic away from the problematic candidate, revert the deployment manifests in the workspace to align with the stable version (v1.4) to maintain GitOps consistency, and update the model registry to ensure the failed version is blacklisted from future production use. The task is complete when the stable version handles all traffic and the system metadata reflects the rejection of the candidate model.
🚚 Transportation & Logistics / Maritime

Ship Cargo Loading Optimization

Transfer a 460-ton heavy lift unit onto a vessel: check cargo specs and vessel limits, assess hatch details and crane inventory, simulate the dynamic lift, configure reinforcement, manage ballast, and execute the operation.

get_cargo_details, get_vessel_operational_limits, get_hatch_details, get_crane_inventory, get_stability_status, get_ballast_inventory, simulate_dynamic_lift, configure_hatch_reinforcement, execute_ballast_transfer, execute_heavy_lift_operation
Identify and secure the 'HT-Series' primary transformer specifically assigned for the North Atlantic winter route. The correct unit is designated as the 'Heavy-Lift' variant with a verified weight of exactly 182 metric tons. To counteract the anticipated 7.5 m/s² transverse accelerations caused by the vessel's stiff 3.0m GM, you must utilize the specific friction-enhancing material from the dock that provides a coefficient (μ) of exactly 0.43. Ensure the unit is loaded onto an available deck grid coordinate such that the final ship stability index remains within the safe operating range of 0.82 to 0.95 and the total weight capacity is not exceeded.
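The figures in this instance allow a rough sliding check: compare the transverse inertial force on the unit with the friction the stated material could supply. The sketch below is a simplified first-order estimate, assuming a flat deck, gravity of 9.81 m/s², and no vertical acceleration or lashing geometry. It is not the benchmark's simulator and not a substitute for a proper cargo securing calculation.

```python
G = 9.81                  # gravitational acceleration in m/s^2 (assumed value)

mass_kg = 182 * 1000      # 182 metric tons, from the task text
mu = 0.43                 # friction coefficient, from the task text
a_transverse = 7.5        # transverse acceleration in m/s^2, from the task text

# Force trying to slide the unit sideways.
transverse_force_n = mass_kg * a_transverse

# Maximum friction force available on a flat deck (simplification).
friction_force_n = mu * mass_kg * G

# Whatever friction cannot hold must be taken by lashings or other restraints.
residual_force_n = max(0.0, transverse_force_n - friction_force_n)

print(round(transverse_force_n), round(friction_force_n), round(residual_force_n))
```

Under these assumptions friction alone covers a little over half of the transverse load (about 768 kN of 1,365 kN), which leaves roughly 597 kN for additional securing.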

📝 Citation

If you find OccuBench useful, please cite our paper.

@article{hu2026occubench,
  title={OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language World Models},
  author={Xiaomeng Hu and Yinger Zhang and Fei Huang and Jianhong Tu and Yang Su and Lianghao Deng and Yuxuan Liu and Yantao Liu and Dayiheng Liu and Tsung-Yi Ho},
  journal={arXiv preprint arXiv:2604.10866},
  year={2026}
}