eDiscovery's "Unitization" Discourse and the Covid Vaccine
Texas' Northern District enters the fray
On January 3, 2023, the Freedom Coalition of Doctors for Choice submitted a FOIA request to the CDC seeking:
All data obtained from v-safe users/registrants from the free text fields within the v-safe program for COVID-19 vaccines and the registrant code associated with each free text field/entry. (Note that all records from pre-populated fields, other than registrant code, can be excluded).
(my emphasis)
The request sought expedited processing based on a “compelling need” to inform the public. It also sought a waiver of fees pursuant to 5 U.S.C. § 552(a)(4)(A)(iii) on the basis that “disclosure of the [requested] information is in the public interest because it is likely to contribute significantly to public understanding of the operations or activities of the government[.]”
The CDC declined to fulfill the request, stating that “the agency lacks the resources to manually review the data collected.” The CDC also expressed skepticism, in its FOIA interim responses and later in court, that the Freedom Coalition had a compelling need, although the CDC did not charge any fees in denying the request.
In response, the Coalition filed a complaint that then became Freedom Coal. of Doctors for Choice v. Ctrs. for Disease Control & Prevention, 2:23-CV-102-Z (N.D. Tex. Jan 05, 2024). The motions practice hit all the typical notes. However, the CDC (through its U.S. Attorney) did file an additional motion to dismiss, claiming that the Freedom Coalition was incorporated in Amarillo, TX, only because it was venue shopping. The Northern District of Texas didn’t buy it, but the idea is provocative.
The crux of the FOIA issue was that these “free-text response entries” included unsolicited personally identifiable information (PII), like names, birthdates, and Social Security numbers, and the CDC lacked resources to manually review the data to segregate the non-exempt portions. FOIA requires that “[a]ny reasonably segregable portion of a record shall be provided to any person requesting such record after deletion of the portions which are exempt . . . .” 5 U.S.C. § 552(b). The CDC said that the data was not “reasonably segregable” because people who took the vaccine and accessed this specific software tool wrote in PII, and it was shoulder-to-shoulder with disclosable text.
The same thing happens all the time with the City of Seattle. I recall one memorable instance where the City collected feedback about a specific Seattle park. The topic was pickleball v. tennis. Many people put down their SSN, their home address, their family members, and a bunch of other PII as “evidence” that they weren’t a bot, that they had a unique connection to the neighborhood, that they had experience with the park, and so on. The City never asked for the PII, just as the CDC never did, and people put it in anyway.
When the results came out people wanted to double-check the answers. Tennis people couldn’t believe other people liked pickleball. Pickleball people couldn’t believe other people liked tennis. The manual effort to scrub out people’s PII was far in excess of any benefit to the citizens requesting the data, but we did it.
To illustrate our shared problem, the City of San Francisco’s public disclosure records center makes available public record requests with the records that were produced. This is me, a guest user, navigating to their public record portal and searching for “SSN.” Someone was not clever in requesting records, and they inadvertently shared PII with anyone who looks.
Unitization
The District Court sided with the Coalition. It reasoned that the CDC’s argument had a crucial unitization problem. The Court wrote:
[T]he Court must compare units of measurement, not merely naked numerals. Courts have considered myriad units of measure in the FOIA context. See, e.g., Long v. Immigr. & Customs Enf’t, 149 F. Supp. 3d 39, 56 (D.D.C. 2015) (considering “1.8 million songs on an iPod”); Goland, 607 F.2d at 353 (considering “84,000 cubic feet of documents”). Each of Defendants’ asserted cases considered a voluminous number of pages. However, Defendants report the free text entries in terms of characters. See, e.g., ECF No. 30 at 11. The free-text entries were limited to 250 characters each. For comparison, X (a/k/a “Twitter”) permits most users to tweet up to 280 characters. Thus, the parties functionally dispute Defendants’ ability to review and redact 7.8 million tweets, not pages.
“Unitization” means the assembly of a set. Document unitization is the process of determining where a document begins (its first page) and ends (its last page), with the goal of accurately describing what a “unit” was as it was received by the party and how it was kept in the ordinary course of business by the document’s custodian. Unitization includes Logical Unitization, such as family relationships like attachments, and Physical Unitization, such as where each document begins and ends.
As evidenced, once the CDC (or more specifically, the U.S. Attorney’s Office for the Northern District of Texas) lost the unitization argument, it effectively lost the case. It could not demonstrate why the court ought to evaluate each free-text form as a type of page rather than a type of tweet.
In the physical world, unitization is often unexamined and uncomplicated. Before electronically stored information, “unitization” was a twenty-dollar word for a one-dollar solution. If a business used physical paper documents, as in the 80s, then a single unit could be readily identified by staples, paper clips, or manila folders. In theory, discovery was straightforward. Both parties exchanged the paper documents, and if there were any nagging unitization issues, they were solvable.
Electronically stored information (‘ESI’) disturbed that often unexamined assumption that documents were documents and papers were papers and pages were pages. As the CDC is discovering, the unitization argument becomes half the battle.
The District Court doesn’t cite to the case because it’s from another district, but I keep Blankenship v. Fox News Network in my back pocket. In Blankenship, one defendant demanded production of ESI in TIFF image format with load files, and the others demanded that production be made natively with metadata, inclusive of attachments and complete email chains (i.e., the logical unitization we spoke about above). Blankenship v. Fox News Network, LLC, Civil Action No. 2:19-cv-00236 (S.D.W. Va. June 14, 2021). Long story long, plaintiff chose to print the responsive documents and then scan the paper into a PDF file. Taking these steps not only stripped all metadata from the production but also sacrificed document unitization (i.e., three 10-page emails were scanned as a single 30-page PDF).
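To make that loss concrete, here is a minimal Python sketch of physical unitization. It assumes someone can still supply the page on which each document begins, which is roughly the boundary information a load file carries. The unitize function and its inputs are my own hypothetical illustration, not anything from the Blankenship record.

```python
def unitize(total_pages: int, first_pages: list[int]) -> list[tuple[int, int]]:
    """Return (first_page, last_page) for each document in a scanned bundle.

    Assumes pages are numbered from 1 and that first_pages lists the page on
    which each document begins (hypothetical load-file-style boundary data).
    """
    starts = sorted(set(first_pages))
    if not starts or starts[0] != 1 or starts[-1] > total_pages:
        raise ValueError("boundaries must start at page 1 and fit within the bundle")
    # Each document ends on the page before the next one begins;
    # the last document runs to the end of the bundle.
    ends = [s - 1 for s in starts[1:]] + [total_pages]
    return list(zip(starts, ends))


# With boundaries, the 30-page scan resolves back into three 10-page emails.
print(unitize(30, [1, 11, 21]))  # [(1, 10), (11, 20), (21, 30)]
# Without them, the receiving party sees one undifferentiated 30-page "document".
print(unitize(30, [1]))          # [(1, 30)]
```

The arithmetic is trivial once the boundaries exist. The hard part is that a print-and-scan production throws the boundaries away, and nothing in the PDF itself can reliably bring them back.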
The district court in Blankenship speculated about why plaintiff was so unconcerned with unitization, but no final conclusions were entered into the record. What was entered into the record were the huge sanctions and findings of discovery violations. The district court reasoned that from the outset there was a pattern of inappropriate discovery behavior.
To turn our attention back to the Northern District of Texas, the District Court flatly rejected the CDC’s argument because of how far technology has come in creating segregable units of information, applying redactions, and then delivering them to the end user. District courts are finally catching up to the tectonic shift in how units of information work. When one side tries to cut the Gordian Knot and make unitization the opposing party’s problem, either by summarily sending off massive PDFs or by flatly refusing to do the work, that argument scans as a long-winded discovery violation to the district court.
However, the present case does not entail particularly complex redactions. Rather, the redacted information is part-and-parcel of many automated programs utilized by law firms to screen large quantities of documents during discovery. Defendants may deploy automated review and redaction of the free-text responses, significantly reducing the workload for Defendants’ analysts. Indeed, the data is already stored in digital form. See ECF No. 29 at 13–14 (explaining how V-safe data is stored and transmitted for review).
[….]
Moreover, some 20 years after Public Citizen, Inc., the technology for automated document review has advanced to largely nullify concerns about manual review and even simple search parameters like name, birthdates, social security numbers, phone numbers, and email addresses — the types of PII at issue here. The automated processes acknowledged by both parties, expressly contemplated by FOIA, and mandated by E-FOIA, are capable of substantially reducing the costs and time required to review and redact for exempted PII.
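For the technically curious, here is a minimal Python sketch of the kind of pattern-based screening the Court is describing. It is my own illustration, not the CDC’s tooling or any vendor’s product. The patterns, the redact function, and the sample entry are all hypothetical, and they cover only formats that regular expressions can reasonably catch (U.S.-style Social Security numbers, phone numbers, and email addresses). Names and birthdates generally need dictionary- or model-based detection plus human spot checks.

```python
import re

# Hypothetical, deliberately simple patterns. Real review platforms use far
# more robust detection; these exist only to show the shape of the technique.
PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"(?<!\d)(?:\(\d{3}\)\s?|\d{3}[-.\s])\d{3}[-.\s]\d{4}(?!\d)"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def redact(entry: str) -> tuple[str, list[str]]:
    """Replace each pattern match with a labeled marker.

    Returns the redacted text and the list of PII types that were found,
    so that flagged entries can be routed to a human for a second look.
    """
    found = []
    for label, pattern in PII_PATTERNS.items():
        entry, count = pattern.subn(f"[{label} REDACTED]", entry)
        if count:
            found.append(label)
    return entry, found


# A made-up free-text entry, well under the 250-character limit.
sample = "Sore arm. My SSN is 123-45-6789, call 806-555-0142 or email jane@example.com"
text, found = redact(sample)
print(text)
print(found)
```

A pass like this runs over millions of short entries in minutes on ordinary hardware. The analysts’ time then goes to the entries the patterns flag or cannot resolve, rather than to reading every entry from scratch.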
The evolution of technology across the last twenty years is evident in the district court’s citations. Back in the day, and this is the point of the district court’s cite to Public Citizen, Inc., district courts would defer to agencies’ evaluations of technology. In Public Citizen, the court essentially gave the Department of Education brownie points for using automation because automation was at the time so new. The fact that DOE was able to search without manually reviewing loan applications captivated the opinion, which talked more about the technical aspects of the search than the rest of the case’s background combined.
Over the last ten years, courts have become less and less inclined to defer to agencies. I see a pattern of district courts putting a burden of persuasion on agencies to not just use technology but use it effectively. See also Freedom Watch, Inc. v. Nat'l Sec. Agency, 783 F.3d 1340 (D.C. Cir. 2015). As the district court here proves, the answer is no longer that automation is good (which is hardly controversial) but that automation is required (more controversial). If an agency does not have an automated solution, the answer seems to be that this is the agency’s problem.
Part of this is, of course, not legal so much as it is common sense. I recall that not so long ago, when Jitesh Shetty and Jafar Adibi from the University of Southern California processed Enron’s emails, my teacher at the time was half in awe. 1.6 million e-mails, imagine that! These days that would be a fun thing for nerdy blogs like this to talk about, but it would not make the same ripples.
Because of FOIA Enron will Live Forever
After the dust settled, the Federal Energy Regulatory Commission (FERC) made the controversial decision to post online more than 1.6 million e-mails that Enron executives sent and received from 2000 through 2002. Citing public outcry, Congressional hearings, and FOIA’s mandate to preemptively make documents available should a government agency believe them to be of particular public interest, FERC simply put them on a server.
One last point: the CDC’s FOIA Office filed an affidavit stating that the request was also unreasonable because, regardless of segregability, the FOIA Office had only thirteen analysts. The District Court, however, was unmoved.
While neither Plaintiff nor this Court dispute the Defendants’ alleged allocation of FOIA staff, “the number of resources an agency dedicates to such requests does not dictate the bounds of an individual’s FOIA rights.” Pub. Health & Med. Pros. for Transparency, 2023 WL 3335071, at *2 (citing Open Am. v. Watergate Special Prosecution Force, 547 F.2d 605, 621 (D.C. Cir. 1976) (Leventhal, J., concurring)).
The district court might be unnecessarily combative. The cite to Watergate is provocative given the many less exciting cases where agencies have raised (and lost on) the resource allocation issue. Even still, the dicta (and accompanying affidavit) are further proof that agencies do not have the resources. Even a small plaintiff’s law firm that mostly chases after ambulances would have more staff on call to review documents.
I also uploaded the full FOIA request, and accompanying letter, here: https://ufile.io/hh0rnzx7
Yeah, I’m unmoved also especially as the government had the internet before ‘the average Janice Q Public’ and principles of ‘records management’ have existed since Benjamin Franklin wrote the 1st Farmers Almanac!
I also live in Seattle and a block from a park’s pickleball court. I was surprised to learn from the estimable Capitol Hill blog that pickleball noise is a real issue for folks across the city who live within a block of a court. This whole topic that you are exploring is eye opening. Thank you, neighbor!