Because of FOIA Enron will Live Forever
or, how Siri got started by learning about insider trading
I started this substack with the intention to have five general post categories: practice pointers, legal developments, history, funny anecdotes and some politics.
I provided a practice pointer or two on body worn video, and I even pulled a few developing legal cases off the docket, like San Juan County’s litigation here in Washington. However, except for an older post about MOVE, Philadelphia and the FBI, I have neglected my historical obligation. Let’s fix that.
Next spring marks the 24th anniversary of the cataclysmic demise of Enron Corp. A scandal of exceptional scope and impact, it was (at the time) the largest bankruptcy in American history. Enron’s business practices led to numerous individual criminal convictions and to the enactment of the Sarbanes-Oxley Act, and the scandal remains one of the most consequential corporate governance developments in history.
After the dust settled, the Federal Energy Regulatory Commission (FERC) made the controversial decision to post online more than 1.6 million e-mails that Enron executives sent and received from 2000 through 2002. Citing public outcry, Congressional hearings and FOIA’s mandate to preemptively make documents available should a government agency believe them to be of particular public interest, FERC simply put them on a server and let the world see what they could see. FERC eventually culled the trove to remove the most sensitive and personal data (see PDF). Even so, the “Enron e-mail corpus,” as the cleaned-up version is now known, remains the largest public domain database of real e-mails in the world—by far.
As for Apple’s Siri, the DARPA-funded CALO project, which stands for “Cognitive Assistant that Learns and Organizes,” used the Enron dataset to train a prototype of the personal assistant that would later become the disembodied voice we all know.
A research ecosystem still hums around the disclosure because there is nothing else like it in the public domain. If it didn’t exist, research into business e-mails could be done only by people with access to big corporate or government servers. A short list of universities that maintain research sites on the corpus includes:
University of Colorado, Boulder
etc. etc.
It’s safe to say that these emails will outlive us all, and if not for FERC’s decision we might not have them at all.
I use the dataset myself professionally when I run test environments for the City of Seattle’s forensic software. In its raw form, the Enron corpus is a vast set of folders containing 2.2 gigabytes of messages in MBOX format, all kept individually and numbered sequentially. Although FERC removed the home folders of those who explicitly requested it and redacted some messages that were similarly flagged by their authors or recipients, the dataset is perfect for a sort of ‘in situ’ evaluation of our software’s capabilities.
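For anyone who wants to try something similar, here is a minimal Python sketch of pulling a size-limited random sample out of a local copy of the corpus. It is an illustration, not the workflow used at the City. It assumes each file under the root folder holds exactly one message; if your copy is a true multi-message mbox file, the standard library's `mailbox.mbox` class would be the better fit. The function names and the `max_bytes` budget are hypothetical.

```python
import random
from email import policy
from email.parser import BytesParser
from pathlib import Path


def iter_message_paths(root):
    """Yield every regular file under root, sorted so runs are repeatable."""
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            yield path


def sample_messages(root, max_bytes, seed=0):
    """Pick message files at random while keeping their total size within max_bytes."""
    paths = list(iter_message_paths(root))
    random.Random(seed).shuffle(paths)  # fixed seed gives the same sample each time
    chosen, total = [], 0
    for path in paths:
        size = path.stat().st_size
        if total + size > max_bytes:
            continue  # this file would bust the budget; a smaller one may still fit
        chosen.append(path)
        total += size
    return chosen


def read_headers(path):
    """Parse one file as a single message (assumption) and return a few headers."""
    with open(path, "rb") as handle:
        msg = BytesParser(policy=policy.default).parse(handle, headersonly=True)
    return {"from": msg["From"], "to": msg["To"], "subject": msg["Subject"]}
```

Calling `sample_messages("path/to/corpus", max_bytes=5_000_000)` would return roughly five megabytes of message files, and `read_headers` can then be used to spot-check what landed in the sample. The path is a placeholder.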
And I’m not alone. Researchers at Sam Houston State (go Bearkats) used the Enron dataset in 2023 to evaluate natural language processing. They observed that by “using Latent Dirichlet Allocation and solving the information retrieval problem via finding document similarities in the topic space rather than doing it in the corpus vocabulary space” they received more relevant search results. In other words, they wanted to weight word clusters (“topics”) over sheer overlap of words across the email or document (“vocabulary”), which is particularly tricky when everyone is informal. It’s doubly tricky when we are talking about 2.2 gigabytes of informal emails, like the Enron corpus.
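To make the topic-versus-vocabulary distinction concrete, here is a toy Python sketch. It is not the researchers' method and it does not fit an LDA model. The hand-written `WORD_TOPICS` table is a hypothetical stand-in for the word-to-topic weights that LDA would learn from the corpus, and the sample sentences are invented. The only point is to show how two emails with no words in common can still score as similar once they are compared by topic.

```python
import math
from collections import Counter

# Hypothetical word-to-topic weights. A real LDA model would learn these
# from the corpus; they are written by hand here purely for illustration.
WORD_TOPICS = {
    "reschedule": {"scheduling": 1.0},
    "call": {"scheduling": 1.0},
    "meeting": {"scheduling": 1.0},
    "tomorrow": {"scheduling": 1.0},
    "gas": {"trading": 1.0},
    "price": {"trading": 1.0},
}


def tokenize(text):
    """Lowercase and split on whitespace (no stemming or punctuation handling)."""
    return text.lower().split()


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def vocabulary_vector(text):
    """Represent a text by raw word counts (vocabulary space)."""
    return Counter(tokenize(text))


def topic_vector(text):
    """Represent a text by summed topic weights of its known words (topic space)."""
    topics = Counter()
    for word in tokenize(text):
        for topic, weight in WORD_TOPICS.get(word, {}).items():
            topics[topic] += weight
    return topics


query = "can we reschedule the call"
same_topic = "lets move our meeting to tomorrow"  # shares no words with the query
other_topic = "call me about the gas price"       # shares "call" and "the"

for label, doc in [("same topic", same_topic), ("other topic", other_topic)]:
    vocab_score = cosine(vocabulary_vector(query), vocabulary_vector(doc))
    topic_score = cosine(topic_vector(query), topic_vector(doc))
    print(f"{label}: vocabulary={vocab_score:.2f} topic={topic_score:.2f}")
```

In vocabulary space the scheduling email scores zero against the query because the two share no words, while the gas-price email gets credit for "call" and "the". In topic space the ranking flips, which is the kind of improvement the quoted passage describes.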
As for me, I leave the language model / AI stuff to them. Usually I need just a few emails. Sometimes I need megabytes. I’ll never need gigabytes (at least, I hope not). Either way, I am sure that once I am gone and buried there will be some litigation that finds a few hundred copies of old Enron emails somewhere in our systems.
I’m glad you enjoyed reading, and thank you for pointing out the missing link!
Kenneth Lay’s testimony was a little before my time, but I had a professor who’d show it at the end of the semester years ago. I don’t know who told Lay to go the woe-is-me route, but in hindsight it’s hilarious, right up there with Spencer Treadwell.
You have an engaging writing style for these topics. And I really like your attention to the sourcing. It’s a very scholarly and neat way of presenting your research. I remember when the Enron scandal broke because I was working in the private sector as a database programmer at the time and the repercussions of its collapse reverberated throughout all levels of industry. I also recall Kenneth Lay testifying before Congress about how poor he now was, which became fuel for political cartoonists and late-night comedians for months afterwards. (One note, the University of Colorado, Boulder, was the only institution in the article that was missing a hyperlink.)