2026-04-01

Very simple automated redaction script

Disclaimer

Quick Note
Contains politically sensitive info

What?

Redact sensitive details from documents before transmitting or publishing them. This could be useful for whistleblowers or sources redacting documents themselves before leaking them, for hackers doing the same, or for journalists and lawyers redacting on behalf of these people.

About script

Steps
- Runs tesseract OCR locally on the input image to get the words with their bounding boxes
- Uses gpt-oss-20b, running locally via llama.cpp, to reason about which of these words need to be redacted
- Uses imagemagick to draw black rectangles over the selected boxes
Fully open source and runs fully locally. Only the initial step of downloading the repos and model weights needs an internet connection.
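
To make the first step more concrete, here is a minimal sketch of what the OCR output looks like and how a word row maps to a box. It assumes tesseract's TSV column order (level, page_num, block_num, par_num, line_num, word_num, left, top, width, height, conf, text). The function names `word_boxes` and `sample_tsv` and the sample rows are made up for illustration; no real document is involved.

```shell
#!/usr/bin/env bash
# Sketch: list word-level rows from tesseract-style TSV with their box corners.
# Assumed columns: level page_num block_num par_num line_num word_num
#                  left top width height conf text

word_boxes() {
  awk -F '\t' '
    # Level 5 marks a single word; the header and empty-text rows are skipped.
    $1 == 5 && $12 != "" {
      # Print: text, top-left corner, inclusive bottom-right corner.
      printf "%s\t%d,%d\t%d,%d\n", $12, $7, $8, $7 + $9 - 1, $8 + $10 - 1
    }
  '
}

# Hypothetical sample: a header, one line-level row (level 4), and two words.
sample_tsv() {
  printf 'level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\tleft\ttop\twidth\theight\tconf\ttext\n'
  printf '4\t1\t1\t1\t1\t0\t10\t20\t200\t30\t-1\t\n'
  printf '5\t1\t1\t1\t1\t1\t10\t20\t80\t30\t96.5\tAlice\n'
  printf '5\t1\t1\t1\t1\t2\t100\t20\t110\t30\t95.1\tcalled\n'
}

sample_tsv | word_boxes
```

Only the two level-5 rows come through, each with the corner coordinates that a rectangle would cover. The model's job in the second step is to pick which of these rows name a person or organisation.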

Probably don't use this in production

No error handling.
High-stakes redaction should ideally be done on an airgapped Tails machine by a group that is socially isolated from the rest of society.
This won't run securely on Tails:
- tesseract is included in Debian but not in Tails. It could be insecure. Tails OS does not include any AI model yet.
- llama.cpp is not included even in Debian. It is almost certainly insecure, and this is the most important concern. Tails OS does not include any AI model yet.
- imagemagick is included in Debian but not in Tails. It is probably secure, but it hasn't been vetted by the Tails OS team, so they recommend installing it as "additional software" if required. Tails OS includes GIMP and Inkscape instead.

Script

(tested on 2026-03-31 on macOS)

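#!/usr/bin/env bash
# Usage: pass the input image as $1 and the output PNG path as $2.
# Expects a llama.cpp server serving gpt-oss-20b on 127.0.0.1:8080.

# Build the prompt: the instruction followed by the OCR result as TSV.
{
  printf '%s\n\n' 'Only output those entries whose text contains the name of a person or organisation. Each line of output should contain exactly 12 tab-separated fields.'
  tesseract "$1" - tsv quiet
} |
# Log the prompt to stderr for inspection.
tee /dev/stderr |
# Wrap the prompt in a JSON request body and send it to the local server.
jq -Rs '{model:"gpt-oss-20b",input:.}' |
curl -s -H 'Content-Type: application/json' -H 'Authorization: Bearer no-key' --data-binary @- http://127.0.0.1:8080/v1/responses |
# Extract the reply text, falling back to joining the message parts.
jq -r '.output_text // ([.output[] | select(.type == "message") | .content[] | select(.type == "output_text") | .text] | join(""))' |
# Log the model reply to stderr as well.
tee /dev/stderr |
# Turn each word-level row (level 5) into an ImageMagick rectangle primitive.
awk -F '\t' '
BEGIN {
  print "fill black"
  print "stroke none"
}
$1 == 5 && $7 != "" && $8 != "" && $9 != "" && $10 != "" {
  # Columns 7-10 are left, top, width, height; convert to inclusive corners.
  x1 = $7
  y1 = $8
  x2 = $7 + $9 - 1
  y2 = $8 + $10 - 1
  print "rectangle " x1 "," y1 " " x2 "," y2
}
' | magick "$1" -draw @- png:- > "$2"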
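
The script trusts whatever rows the model returns. One possible guard, which is not part of the script above, is to keep only the model output lines that appear verbatim in the original OCR TSV, so an invented or altered row cannot draw a box in the wrong place. The sketch below assumes both outputs have been saved to files. The function name `keep_known_rows` and the demo rows are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: drop model output lines that do not appear verbatim in the OCR TSV.

keep_known_rows() {
  # $1: file holding the original tesseract TSV
  # $2: file holding the model output
  awk '
    # First file: remember every OCR row exactly as written.
    FILENAME == ARGV[1] { seen[$0] = 1; next }
    # Second file: print only rows that were present in the OCR output.
    ($0 in seen)
  ' "$1" "$2"
}

# Hypothetical demo data.
ocr="$(mktemp)"
model="$(mktemp)"
{
  printf '5\t1\t1\t1\t1\t1\t10\t20\t80\t30\t96.5\tAlice\n'
  printf '5\t1\t1\t1\t1\t2\t100\t20\t110\t30\t95.1\tcalled\n'
} > "$ocr"
{
  printf '5\t1\t1\t1\t1\t1\t10\t20\t80\t30\t96.5\tAlice\n'   # genuine row
  printf '5\t1\t1\t1\t1\t9\t0\t0\t9999\t9999\t90\tBob\n'      # invented row
} > "$model"

keep_known_rows "$ocr" "$model"
rm -f "$ocr" "$model"
```

The trade-off is that a genuine row the model reformatted slightly would also be dropped, which means a missed redaction. If something like this were used, it would probably be worth printing the discarded lines to stderr and checking them by hand.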

Future directions

I expect that if I fail to stop the race to ASI in the next 1-2 years, there will eventually be models that can do redaction zero-shot, image-to-image, instead of needing a pipeline like this. I tried deepseek-OCR-2, Meta SAM3, and qwen-3-vl-8B; they're good at OCR but bad at giving accurate bounding boxes.
