Human vs AI Multi-Solution Reasoning

100 main puzzles
5 puzzle types
170 module trials
Exact solver checked

Human Collection Plan

Main participants do all 100 main puzzles; module participants are separate and condition-balanced.

Cohort | Participant IDs | What each person sees | Reason
Main | M0001, M0002, ... | 125 recruited; retained target 100; 3 practice + all 100 canonical puzzles + 3 checks | Clean per-puzzle human solution distributions
Modules | E0001, E0002, ... | 500 recruited; each gets 3 practice + 50 units / 61-62 paid trials + 2 checks | One condition per unit; pair and transfer sequences stay adjacent
  • Each displayed trial targets 100 counted records after exclusions.
  • The backend enforces globally unique usernames, supports same-PID resume, and stores platform IDs when provided.
  • Wrong submit attempts give feedback and stay on the same puzzle.
  • Skip and timeout count for the trial, but participants with five or more skip/timeout records are excluded and backfilled.
  • Participants may pause only between puzzles and must finish the assigned set.
  • Exports include raw, retained, primary 100, excluded, quota, payment, bonus CSV, audit log, timing, assignment IDs, and break totals.
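
The exclusion rule in the list above can be sketched as a simple count over trial records. This is a minimal illustration, not the backend's actual logic: the `(participant_id, outcome)` record shape, the outcome labels `"skip"` and `"timeout"`, and the function name `excluded_participants` are all assumptions made for the example.

```python
from collections import Counter

# From the plan: five or more skip/timeout records triggers exclusion.
MAX_SKIP_TIMEOUT = 5


def excluded_participants(records):
    """Return the IDs of participants who should be excluded and backfilled.

    `records` is an iterable of (participant_id, outcome) pairs. The outcome
    labels "skip" and "timeout" are hypothetical, not the real export schema.
    """
    counts = Counter(
        pid for pid, outcome in records if outcome in {"skip", "timeout"}
    )
    return {pid for pid, n in counts.items() if n >= MAX_SKIP_TIMEOUT}


records = (
    [("M0001", "skip")] * 3
    + [("M0001", "timeout")] * 2   # 5 skip/timeout records in total: excluded
    + [("M0002", "skip")] * 4      # 4 skip/timeout records: retained
    + [("M0002", "correct")] * 10
)
print(excluded_participants(records))
```

Skips and timeouts are pooled into one count here because the rule is stated over "skip/timeout records" together; the records of an excluded participant would then be dropped from the counted set and the slot reopened for backfill.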

Human Protocol

Timing, skip, exclusion, and payment rules shown inside the dashboard.

  • Each main participant does 3 practice puzzles, 3 attention checks, and then all 100 paid main puzzles.
  • Each module participant does 3 practice puzzles, 2 attention checks, and the full 50-unit battery with one condition per family/block.
  • Skip is disabled for the first 90 seconds of each trial; each trial times out at 120 seconds.
  • Wrong submit shows feedback and does not end the trial.
  • Skip and timeout count for quota, but participants with at least 5 such records are excluded and backfilled.
  • Participants can pause between puzzles, but must finish the assigned set.
  • Main pay is $12.00 base plus $0.10 per correct paid puzzle, estimated 60-75 minutes; module pay is $9.00 base plus $0.10 per correct paid puzzle, estimated 40-55 minutes.
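
The timing and payment rules above reduce to a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the boundary behaviour (skip becoming available at exactly 90 seconds, timeout firing at exactly 120 seconds) is an assumption that the protocol text does not pin down.

```python
from decimal import Decimal

# Values taken from the protocol text.
BASE_PAY = {"main": Decimal("12.00"), "module": Decimal("9.00")}
BONUS_PER_CORRECT = Decimal("0.10")
SKIP_UNLOCK_S = 90   # skip is disabled before this many seconds
TIMEOUT_S = 120      # the trial ends at this many seconds


def total_pay(cohort, n_correct_paid):
    """Base pay plus the per-puzzle bonus for correct paid puzzles."""
    return BASE_PAY[cohort] + BONUS_PER_CORRECT * n_correct_paid


def skip_enabled(elapsed_s):
    """Assumed boundary: skip unlocks at exactly SKIP_UNLOCK_S seconds."""
    return SKIP_UNLOCK_S <= elapsed_s < TIMEOUT_S


def timed_out(elapsed_s):
    """Assumed boundary: the trial times out at exactly TIMEOUT_S seconds."""
    return elapsed_s >= TIMEOUT_S


print(total_pay("main", 80))     # 12.00 + 80 * 0.10
print(total_pay("module", 62))   # 9.00 + 62 * 0.10
```

`Decimal` is used instead of `float` so that amounts written to a payment export stay exact to the cent.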

Backend Status

Live SQLite collection state when served with study_server.py.


Research Questions

Distribution, preference, presentation, and context effects.

Distribution

  1. Are human and AI solution distributions different?
  2. Which group is more concentrated or more uniform?
  3. Do differences grow when search pressure is higher?
  4. Do differences depend on puzzle type?

Preference

  1. Do humans and AI prefer simpler-looking valid answers?
  2. Do visual tasks show corner, edge, or center bias?
  3. Do symbolic and visual tasks produce different preference patterns?

Presentation

  1. Does reformulating the same puzzle change the distribution?
  2. Do irrelevant visual cues shift choices?

Context

  1. Are answers to two related consecutive puzzles independent?
  2. Does puzzle A prime a strategy used on puzzle B?
  3. Does the same priming effect appear for humans and image-input AI?

Dataset Design

How each research question maps to a dataset, manipulation, and measurement.

Question | Dataset / module | Manipulation | Main measurement
Human vs AI distribution | Main dataset | Same puzzle shown to humans and AI repeatedly | Solution-ID frequencies, entropy, KL/JS distance
Puzzle-type dependence | Main dataset | 24, shortest path, grid placement, Minesweeper-lite, Mini Sudoku | Effect size by puzzle type
Search pressure | Main dataset features | Different solution counts and solver-derived pressure features | Distribution gap vs pressure bucket
Simple-answer preference | Main dataset features | Each solution has proxy simplicity / salience features | Probability of choosing lower-cost or more salient answers
Spatial bias | Spatial module | Original vs left-right mirror, top-bottom mirror, and combined mirror | Distribution equivariance plus left/right and top/bottom shift
Formulation sensitivity | Formulation module | 24-point original number order vs shuffled displayed number order | Distribution change over abstract expression IDs
Irrelevant cue effect | Cue module | Mixed valid/invalid border-only highlight sets; no generic main CUE rows | Probability mass on highlighted valid answers
Related-puzzle independence | Pair module | No-prime B vs unrelated A->same B vs related A->B; Sudoku uses mirror/180-degree transforms, sometimes with one local row/column swap | Conditional dependence after subtracting B-alone and sequential baselines
Strategy transfer | Transfer module | Single-answer A primes one strategy; B has multiple valid answers plus wrong candidates | Target-strategy choice in B with vs without prime
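
The main-dataset measurements in the first row (solution-ID frequencies, entropy, and KL/JS distance) can be sketched as follows. This is one standard way to compute them, not necessarily the project's analysis code; the function names, the solution IDs `s1` to `s3`, and the counts are made up for illustration.

```python
import math
from collections import Counter


def to_distribution(solution_ids, support):
    """Relative frequency of each solution ID, in the order given by `support`."""
    counts = Counter(solution_ids)
    total = sum(counts[s] for s in support)
    return [counts[s] / total for s in support]


def entropy(p):
    """Shannon entropy in bits; lower means more concentrated choices."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def kl(p, q):
    """KL divergence in bits; requires q > 0 wherever p > 0."""
    return sum(x * math.log2(x / y) for x, y in zip(p, q) if x > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits, bounded in [0, 1]."""
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def js_distance(p, q):
    """Square root of the JS divergence, which is a proper metric."""
    return math.sqrt(js_divergence(p, q))


# Hypothetical responses to one puzzle with three valid solutions.
human = ["s1"] * 50 + ["s2"] * 30 + ["s3"] * 20
ai = ["s1"] * 90 + ["s2"] * 10
support = sorted(set(human) | set(ai))

p_human = to_distribution(human, support)
p_ai = to_distribution(ai, support)
print(entropy(p_human), entropy(p_ai), js_distance(p_human, p_ai))
```

In this example the AI samples never produce `s3`, so the plain KL divergence from the human to the AI distribution would be undefined, while the JS distance stays finite because each distribution is compared against their mixture. That is one reason to report JS alongside KL when some solutions have zero observed mass in one group.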

Current Structure

Automatic checks for counts and solution constraints.


Module Counts

Cue, spatial, pair, transfer, and formulation module sizes.

Quick Visual Checks

Representative cue, spatial, pair, transfer, and formulation examples.

Still To Decide

Open design choices before the main experiment.

  • How many human participants and AI samples per puzzle are needed for stable distribution estimates.
  • Which contrasts are primary before looking at results.
  • How to calibrate human difficulty proxies after pilot timing and error data.
  • Whether 24-point should be reduced or kept as a smaller symbolic baseline.