2026-04-01

# Very simple automated redaction script

Disclaimer
 - **Quick Note**
 - **Contains politically sensitive info**

## What?

 - Redact sensitive details from documents before transmitting or publishing them. This could be useful for whistleblowers, sources or hackers who redact material themselves before leaking it, and for journalists or lawyers who redact on their behalf.

## About script

 - Steps
   - Runs tesseract OCR locally on the input image to get words with their bounding boxes
   - Uses locally running gpt-oss-20b via llama.cpp to decide which of these words need to be redacted
   - Uses imagemagick to draw black rectangles over the selected boxes
 - Fully open source and runs fully locally. Only the initial download of repos and model weights needs an internet connection.
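
The steps above pass tab-separated rows from one tool to the next. As a minimal sketch of the middle step, the snippet below swaps the model for a fixed name list. This is a stand-in, not what the script does, but it lets the rest of the pipeline be exercised without llama.cpp running. `filter_rows`, the name list and the sample rows are all hypothetical.

```shell
#!/usr/bin/env bash

# Hypothetical stand-in for the model step: keep word-level TSV rows
# (level 5) whose text field (column 12) appears in a comma-separated list.
filter_rows() {
  awk -F '\t' -v names="$1" '
  BEGIN {
    n = split(names, list, ",")
    for (i = 1; i <= n; i++) wanted[list[i]] = 1
  }
  $1 == 5 && ($12 in wanted)
  '
}

# Made-up rows in the 12-column layout that tesseract emits in TSV mode.
ocr=$(printf '5\t1\t1\t1\t1\t1\t100\t50\t40\t12\t96.5\tAlice\n5\t1\t1\t1\t1\t2\t150\t50\t30\t12\t95.0\tsaid\n')

kept=$(printf '%s\n' "$ocr" | filter_rows "Alice,Bob")
printf '%s\n' "$kept"
```

Only the row for `Alice` survives the filter, so only that word would be passed on for redaction.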

## Probably don't use this in production

 - No error handling
 - High-stakes redaction should ideally be done on an airgapped Tails machine by a group that is socially isolated from the rest of society.
 - This won't run on Tails securely:
   - [tesseract is included in debian but not on Tails](https://packages.debian.org/stable/graphics/tesseract-ocr) - could be insecure. Tails OS does not include any AI model yet.
   - [llama.cpp is not included even in debian](https://tracker.debian.org/pkg/llama.cpp) - almost certainly insecure, and the most important concern of the three. Tails OS does not include any AI model yet.
   - [imagemagick is included in debian but not on Tails](https://packages.debian.org/stable/imagemagick) - imagemagick is probably secure, but it hasn't been vetted by the Tails OS team, so they recommend installing it as "additional software" if required. Tails OS includes GIMP and Inkscape instead.
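
The list above notes that there is no error handling. One small first step, sketched under the assumption that the four external tools the script calls (`tesseract`, `jq`, `curl`, `magick`) should all be on `PATH`, is to check for them before starting. `missing_tools` is a hypothetical helper, not part of the script.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the name of every given command that is not on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Warn on stderr if any tool the pipeline relies on is absent.
missing=$(missing_tools tesseract jq curl magick)
if [ -n "$missing" ]; then
  printf 'missing tools:\n%s\n' "$missing" >&2
fi
```

This only catches missing binaries. It does not check that the llama.cpp server is reachable or that the model returned well-formed rows.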

## Script

(tested on 2026-03-31 on macOS)

```
#!/usr/bin/env bash

{
  printf '%s\n\n' 'Only output those entries whose text contains the name of a person or organisation. Each line of output should contain exactly 12 tab-separated fields.'
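  # Tesseract TSV has 12 columns: level, page_num, block_num, par_num,
  # line_num, word_num, left, top, width, height, conf, text
  tesseract "$1" - tsv quiet
} |
# Echo the prompt and OCR rows to stderr for inspection
tee /dev/stderr |
# Wrap the whole text as the "input" field of the request body
jq -Rs '{model:"gpt-oss-20b",input:.}' |
curl -s -H 'Content-Type: application/json' -H 'Authorization: Bearer no-key' --data-binary @- http://127.0.0.1:8080/v1/responses |
# Pull the reply text out of the response JSON
jq -r '.output_text // ([.output[] | select(.type == "message") | .content[] | select(.type == "output_text") | .text] | join(""))' |
# Echo the rows the model selected to stderr
tee /dev/stderr |
# Turn each selected word row (level 5) into a draw command
awk -F '\t' '
BEGIN {
  print "fill black"
  print "stroke none"
}
$1 == 5 && $7 != "" && $8 != "" && $9 != "" && $10 != "" {
  # left and top plus width and height give the inclusive bottom-right corner
  x1 = $7
  y1 = $8
  x2 = $7 + $9 - 1
  y2 = $8 + $10 - 1
  print "rectangle " x1 "," y1 " " x2 "," y2
}
' | magick "$1" -draw @- png:- > "$2"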
```
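
To see what the awk stage hands to `magick`, here is a small sketch that runs the same box arithmetic on one hand-written row. The row values are made up, and `tsv_to_draw` is a hypothetical wrapper name, not something the script defines.

```shell
#!/usr/bin/env bash

# Hypothetical wrapper around the same awk logic as the script's last stage.
tsv_to_draw() {
  awk -F '\t' '
  BEGIN {
    print "fill black"
    print "stroke none"
  }
  $1 == 5 && $7 != "" && $8 != "" && $9 != "" && $10 != "" {
    print "rectangle " $7 "," $8 " " ($7 + $9 - 1) "," ($8 + $10 - 1)
  }
  '
}

# Made-up word row: left=100, top=50, width=40, height=12.
sample=$(printf '5\t1\t1\t1\t1\t1\t100\t50\t40\t12\t96.5\tAlice\n')

draw=$(printf '%s\n' "$sample" | tsv_to_draw)
printf '%s\n' "$draw"
# prints:
# fill black
# stroke none
# rectangle 100,50 139,61
```

The `- 1` makes the bottom-right corner inclusive: a box 40 pixels wide starting at x=100 ends at x=139.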

## Future directions

 - I expect that if I fail to stop the race to ASI in the next 1-2 years, there will eventually be models that can do redaction zero-shot, image-to-image, instead of needing a pipeline like this. I tried deepseek-OCR-2, Meta SAM3 and qwen-3-vl-8B; they are good at OCR but bad at giving accurate bounding boxes.
