Redact sensitive details from documents before transmitting or publishing them. This could be useful for whistleblowers or other sources redacting their own material before leaking it, for hackers doing the same, or for journalists and lawyers redacting on behalf of these people.
About script
Steps
Uses a locally running Tesseract to OCR the input image and get words with bounding boxes
Uses a locally running gpt-oss-20b (via llama.cpp) to reason about which of these words need to be redacted
Uses ImageMagick to black out the selected boxes
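Because step 2 asks the model to echo back OCR rows, one cheap sanity check is to keep only the lines that appear verbatim in Tesseract's output, so that a hallucinated row can never produce a box. This is a minimal sketch, assuming the OCR TSV has first been saved to a file. The function name keep_known_rows and the file name ocr.tsv are hypothetical and not part of the script on this page.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the original script).
# Reads candidate rows on stdin and prints only those that match
# a whole line of the OCR TSV file given as $1.
keep_known_rows() {
  # -F: fixed strings, -x: whole-line match, -f: read patterns from file.
  # grep exits 1 when nothing matched, which is acceptable here;
  # any other non-zero status (e.g. unreadable file) is passed on as failure.
  grep -Fx -f "$1" || [ "$?" -eq 1 ]
}

# Possible placement, assuming the OCR output was saved as ocr.tsv:
#   ... | jq -r '...' | keep_known_rows ocr.tsv | awk ...
```

A row dropped by this filter is a word that does not get redacted, so anything it removes is worth logging and checking by hand.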
Fully open source and runs fully locally. Only the initial step of downloading the repos and model weights needs an internet connection.
Probably don't use this in production
No error handling
High-stakes redaction should ideally be done on an air-gapped Tails machine by a group that is socially isolated from the rest of society.
ImageMagick is packaged in Debian but is not included in Tails. It is probably secure, but it hasn't been vetted by the Tails team, so they recommend installing it as "additional software" if required. Tails includes GIMP and Inkscape instead.
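Since the script has no error handling, a wrapper could at least fail early when one of the tools it calls is missing. This is a minimal sketch of such a pre-flight check. The function name missing_cmds is hypothetical, and the list of tools in the usage comment is read off the script on this page.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check (not in the original script).
# Prints the name of every listed command that is not on PATH
# and returns non-zero if at least one is missing.
missing_cmds() {
  local rc=0 cmd
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      printf '%s\n' "$cmd"
      rc=1
    fi
  done
  return "$rc"
}

# Possible use at the top of a script:
#   missing_cmds tesseract jq curl awk magick >&2 || exit 1
```

This only covers missing tools. It does not check that the llama.cpp server is running or that the model's reply is well formed.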
Script
(tested on 2026-03-31 on macOS)
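#!/usr/bin/env bash
# Arguments: $1 = input image, $2 = output PNG
{
  # Prompt, followed by the Tesseract TSV output (12 columns per row).
  printf '%s\n\n' 'Only output those entries whose text contains the name of a person or organisation. Each line of output should contain exactly 12 tab-separated fields.'
  tesseract "$1" - tsv quiet
} |
  # Echo the full prompt to stderr for inspection.
  tee /dev/stderr |
  # Wrap the prompt in a JSON request body.
  jq -Rs '{model:"gpt-oss-20b",input:.}' |
  # Send the request to the local llama.cpp server.
  curl -s -H 'Content-Type: application/json' -H 'Authorization: Bearer no-key' --data-binary @- http://127.0.0.1:8080/v1/responses |
  # Extract the text of the reply.
  jq -r '.output_text // ([.output[] | select(.type == "message") | .content[] | select(.type == "output_text") | .text] | join(""))' |
  # Echo the selected rows to stderr for inspection.
  tee /dev/stderr |
  # Turn each selected word row into an ImageMagick draw command.
  awk -F '\t' '
    BEGIN {
      print "fill black"
      print "stroke none"
    }
    # Level 5 rows are words; $7 = left, $8 = top, $9 = width, $10 = height.
    $1 == 5 && $7 != "" && $8 != "" && $9 != "" && $10 != "" {
      x1 = $7
      y1 = $8
      x2 = $7 + $9 - 1
      y2 = $8 + $10 - 1
      print "rectangle " x1 "," y1 " " x2 "," y2
    }
  ' | magick "$1" -draw @- png:- > "$2"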
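To see what the awk stage emits, it helps to run the same arithmetic on a hand-written row. This is a minimal sketch, assuming Tesseract's 12-column TSV layout (level, page_num, block_num, par_num, line_num, word_num, left, top, width, height, conf, text). The function name rows_to_draw_commands and the sample values are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the same awk logic as the script above.
# Reads TSV rows on stdin and prints ImageMagick draw commands.
rows_to_draw_commands() {
  awk -F '\t' '
    BEGIN { print "fill black"; print "stroke none" }
    # Only word-level rows (level 5) with a complete bounding box.
    $1 == 5 && $7 != "" && $8 != "" && $9 != "" && $10 != "" {
      # Bottom-right corner is inclusive, hence the -1.
      print "rectangle " $7 "," $8 " " ($7 + $9 - 1) "," ($8 + $10 - 1)
    }
  '
}

# Sample word: left=100, top=50, width=40, height=12
printf '5\t1\t1\t1\t1\t1\t100\t50\t40\t12\t96\tAlice\n' | rows_to_draw_commands
# prints:
# fill black
# stroke none
# rectangle 100,50 139,61
```

A box 40 pixels wide that starts at x=100 covers pixels 100 through 139, which is why the corner is computed as left + width - 1.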
Future directions
I expect that if I fail to stop the race to ASI in the next 1-2 years, there will eventually be models that can do redaction zero-shot, image-to-image, instead of using a pipeline like this. I tried deepseek-OCR-2, Meta SAM3, and qwen-3-vl-8B; they are good at OCR but bad at giving accurate bounding boxes.