my_research/distributed_hosting_of_leaked_documents.html
2025-05-30
Distributed hosting of leaked documents
Disclaimer
- This document was written quickly and contains opinions I may revise as I get new information.
- This document contains politically sensitive info.
Update
- In the specific context of AI whistleblowers, I now consider most of the ideas in the original document low priority. Please see the other document instead.
- This document describes a platonic ideal more than something I will build immediately. When I first wrote it, I was thinking mostly about theory, not practice.
- This document assumes 3 stages. In practice, all 3 stages are rarely needed.
- In practice, sometimes someone powerful wants to suppress information and many others want to read it. In this case, one can skip the medium-attention stage and go directly to the high-attention stage.
- Example: Many high-profile leaks
- In practice, sometimes no one powerful wants to suppress the information. In this case, one can skip the low-attention stage and go directly to the medium-attention stage.
- Example: Most information currently on the internet
- In practice, sometimes someone powerful wants to suppress information but it is not immediately obvious who wants to read it or how to reach them. This is the case where all 3 stages are most likely to be useful. I need to research this more.
- Example: whistleblowers who are ignored at first and taken seriously later, crime-related evidence that the offender wants to suppress but the general public does not care about.
Main
What?
- This document describes how to set up distributed hosting and transmission of documents leaked by whistleblowers, in a way that reduces personal risk for everyone involved in the process.
- (If you squint hard, this document is also a blueprint for an internet with no delete button and by extension a society with no delete button. Once some information has been leaked it stays in public view forever.)
Why?
- Typically, whistleblowing (such as with WikiLeaks or the Snowden leaks) incurs significant personal risk.
- Reducing personal risk to whistleblowers makes whistleblowing much more likely whenever an org does not have the complete trust of all its members. This forces such orgs to pay a secrecy tax (in Assange's words).
- I have my own personal view about which orgs I would most like to enable whistleblowing on, but this will be general-purpose infra that anyone can use to whistleblow on any org.
Summary
- do SecureDrop / Signal but with increased security and >1000 servers all run by independent actors, and multiple independent dev teams
- do Internet Archive / CommonCrawl but also crawl rate-limited/banned stuff (like leaked/banned/copyrighted documents, and social media websites), also do >1000 crawls all run by independent actors. also some of these actors share the LLM embeddings.
Potential problems with distributing leaked documents
- Low-attention on the documents. Military-grade security. Documents circulated by people with technical skills who are willing to run servers and maintain opsec as a part-time job.
- (If nobody is trying to suppress the publishing of these documents, can skip this stage and directly post to clearnet. See next bullet point for this.)
- Ideally thousands of server operators exist. Some of them can choose special roles for themselves, such as "redaction specialist" and "publisher". They can use a public track record to prove to whistleblowers and other operators that they can be trusted with the role.
- Whistleblower sends documents to an operator via SecureDrop or similar system or via hard disk dead drop. If redaction is required and they can't do it themselves for whatever reason, they send it to an operator who is a "redaction specialist" and has a good reputation.
- IMO PGP + airgap + dead drop may offer more privacy than PGP + airgap + Tor http request, as of 2025. This is my personal bias and could change in future if physical-world data acquisition increases (CCTV, drones, gigapixel cameras on aircraft).
- I'm not very happy with some of the design choices made by SecureDrop. I'm looking into alternate solutions. It's possible I don't understand all of their choices. I have written a proper criticism of SecureDrop in a separate document.
- PROBLEM: convince thousands of people to become operators of SecureDrop or similar system (most important)
- PROBLEM: good infra, protocols, and incentives to coordinate dead drops don't exist. This is especially true when a drop crosses a large geographic distance and requires multiple hops.
- This operator does redaction of any sensitive metadata or information, if required. They perform another hop here and send the documents to many other operators in the network using the same system.
- PROBLEM: need public guidelines on redaction, so anyone can do it. This ideally ensures there are thousands of potential operators right from the start.
- If any operator thinks the documents are not spam, they can attach a proof-of-work hash and resend the documents to many operators in the network using the same system.
- PROBLEM: need standard protocol for proof-of-work hashes. These could be static strings attached to documents, or generated at request-response time. (Tor, Brave, Proton all have separate implementations and they're all low difficulty hashes.)
- Eventually one of the operators who is a "publisher" hosts the documents on a clearnet webserver for the public. This operator also posts a link to this webserver on a hard-to-censor social media platform such as 4chan or Rumble.
- PROBLEM: need guidelines on which social media platforms are hard to censor in each country.
- Medium-attention on the documents. Low security. Documents circulated by people with technical skills but not much free time.
- (If the documents are sufficiently important, a popular media org can publish them on their server, allowing the documents to skip this stage and directly go to high-attention.)
- Mirror a searchable version of docs to thousands of servers immediately
- It is important that automated mirroring happens before any humans read the content on the operator's clearnet server. Whoever first posts the document to clearnet is an obvious target for anyone who wants to take the documents down.
- Several mutually incompatible protocols exist for mirroring specific information. Example: torrent, Ethereum blob data, IPFS/Filecoin, archive blockchain.
- Protocols to crawl and mirror the entire internet are still in development though. Example: WARC format, Apache Hadoop, the Internet Archive's crawlers, CommonCrawl's crawlers
- PROBLEM: need open source web crawler to crawl entire internet including any leaked docs/videos, and torrent links containing leaked docs/videos.
- OR: PROBLEM: need a standard protocol to only crawl websites and torrents that claim to have leaked docs on them (maybe they include a special flag in their readme/robots.txt, and proof-of-work hash to prove not spam.)
- PROBLEM: need open source plaintext extraction and embedding generation so that along with the raw html crawls (WARC), the plaintext and embeddings are also circulated in the same torrent. need standardised format (WARC-parquet?) that keeps some metadata just like WARC keeps metadata.
- High-attention on the documents. No security. Documents circulated by anyone.
- A popular media house publishes the documents to increase public attention
- Popular media house will do document verification. I'm assuming they won't face any significant challenge with this. May require metadata of the documents (how to get this??) or contacting the org whose docs got leaked.
- Popular media house will use embedding search functionality already provided, to figure out what is important to raise attention for.
- High-attention hard-to-censor social media, where the general public can discuss the documents
- PROBLEM: need open source crawling of all social media, and mirroring of those crawls
- I think actually doing distributed social media is too hard. The complexity of the app ensures that the software developers who write it are politically co-optable. What's easier is distributed crawling and mirroring of a centralised site, so people in future can still view the consensus reached by users of the social media. If the site ever gets taken down, someone can get a new server running (it does not have to carry the content of the old one).
- Which social media are high attention and hard-to-censor varies by country.
Summary of potential solutions
- persuade thousands of people to become operators of SecureDrop or similar system (most important)
- coordination for hard disk dead drops, including multi-hop hard disk dead drops
- proof-of-work hashes to prevent spam on the operators
- redaction guidelines
- open source web crawling
- flags and proof-of-work to only crawl some websites
- crawl and mirror leaked docs. crawl and mirror social media discussions.
- open source plaintext extraction, embedding generation
- standardise format to share extracted plaintext and embeddings
- guidelines for latest hard-to-censor high-attention social media
- to publish torrent link, maybe raw docs, and social media discussions
- guidelines must be country-wise and include legal considerations. always use a social media platform based in a country different from the one where the leak happened.
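The plaintext extraction item above can be sketched with the Python standard library alone. This is a rough illustration under stated assumptions, not a proposed standard: the record fields (`url`, `fetched_at`, `raw_sha256`, `text`) are hypothetical names chosen here, and the output is one JSON object per page, not the WARC-parquet format the document leaves open. Embedding generation is left out because it depends on a model choice the document does not make.

```python
import hashlib
import json
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping script/style/noscript contents."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text runs that are outside skipped elements.
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_record(url: str, fetched_at: str, html: str) -> dict:
    """Build one record: extracted plaintext plus metadata tying it to the raw crawl."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return {
        "url": url,                # hypothetical field names, see lead-in
        "fetched_at": fetched_at,
        # Hash of the raw HTML, so a mirror can match text to the WARC payload.
        "raw_sha256": hashlib.sha256(html.encode("utf-8")).hexdigest(),
        # Each text run goes on its own line; inline markup also splits lines.
        "text": "\n".join(parser.parts),
    }


# Usage: emit one JSON line per crawled page.
page = "<html><head><style>p{}</style></head><body><h1>Title</h1><p>Body text</p></body></html>"
print(json.dumps(extract_record("http://example.invalid/doc", "2025-05-30T00:00:00Z", page)))
```

Keeping a hash of the raw page next to the extracted text lets anyone holding the same torrent check that the plaintext was derived from the circulated crawl and not swapped in later.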
IMPORTANT: Need feedback from people who have actually worked with whistleblowers, to validate all hypotheses listed above.