E-discovery and the public interest

May 8, 2013 Cathy O'Neil, mathbabe

Today I want to bring up a few observations and concerns I have about the emergence of a new field in machine learning called e-discovery. It’s the algorithmic version of discovery, so I’ll start there.

Discovery is part of the process in a lawsuit where relevant documents are selected, pored over, and then handed to the other side. Nowadays, of course, there are more and more documents, almost all electronic, typically including lots of e-mails.

If you’re talking about a big lawsuit, there could be literally millions of documents to wade through, which is slow and incredibly expensive for humans to do. Enter the algorithm.

With advances in Natural Language Processing (NLP), a machine algorithm can sort emails or documents by topic (after getting the documents into machine-readable form, cleaning, and deduping) and can in general do a pretty good job of figuring out whether a given email is “relevant” to the case.
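To make that concrete, here is a minimal sketch of the kind of pipeline described above: normalize the text, drop duplicates, then score relevance with a model trained on hand-labeled examples. Everything in it is an assumption made for illustration: the toy documents, the names `tokenize`, `dedupe`, and `NaiveBayesRelevance`, and the choice of naive Bayes itself. It is not a description of how any commercial e-discovery product works.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def dedupe(docs):
    """Keep the first copy of each document, comparing normalized tokens."""
    seen, unique = set(), []
    for doc in docs:
        key = " ".join(tokenize(doc))
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


class NaiveBayesRelevance:
    """Tiny multinomial naive Bayes over word counts, add-one smoothing."""

    def fit(self, docs, labels):
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = Counter(labels)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(tokenize(doc))
        self.vocab = set(self.word_counts[True]) | set(self.word_counts[False])
        return self

    def log_score(self, doc, label):
        """Log prior plus smoothed log likelihood of the known words."""
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        total_words = sum(self.word_counts[label].values())
        for word in tokenize(doc):
            if word in self.vocab:  # words never seen in training are skipped
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total_words + len(self.vocab)))
        return score

    def is_relevant(self, doc):
        return self.log_score(doc, True) > self.log_score(doc, False)


# Hypothetical hand-reviewed seed set: (document text, is it relevant?)
reviewed = [
    ("Pricing terms for the beer distribution merger", True),
    ("Merger timeline and market share of imports", True),
    ("Lunch on Friday? The new taco place is open", False),
    ("Reminder: parking garage closed this weekend", False),
]
model = NaiveBayesRelevance().fit(
    [text for text, _ in reviewed], [label for _, label in reviewed]
)

# The second document is a near-verbatim copy and is dropped by dedupe().
unreviewed = dedupe([
    "Draft of the merger pricing memo",
    "draft of the MERGER pricing memo!",
    "Friday lunch plans",
])
predictions = [(doc, model.is_relevant(doc)) for doc in unreviewed]
```

With a seed set this small the model is only matching a handful of words, which is the point of the sketch: whatever the model learns comes entirely from which documents the human reviewers marked as relevant.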

And this is already happening – the Wall Street Journal recently reported that the Justice Department allowed e-discovery for a case involving the merger of two beer companies. From the article:

With the blessing of the Justice Department’s antitrust division, the lawyers loaded the documents into a program and manually reviewed a batch to train the software to recognize relevant documents. The manual review was repeated until the Justice Department and Constellation were satisfied that the program could accurately predict relevance in the rest of the documents. Lawyers for Constellation and Crown Imports used software developed by kCura Corp., which lists the Justice Department as a client.

In the end, Constellation and Crown Imports turned over hundreds of thousands of documents to antitrust investigators.
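The article does not say how the two sides decided the program was accurate enough. One common way to check a classifier against human review is to compare its predictions with a hand-reviewed sample and compute precision and recall. The sketch below assumes that approach; the function names, the sample flags, and the 0.8 thresholds are all hypothetical.

```python
def precision_recall(predicted, actual):
    """Compare predicted relevance flags against human review flags."""
    pairs = list(zip(predicted, actual))
    true_pos = sum(1 for p, a in pairs if p and a)
    false_pos = sum(1 for p, a in pairs if p and not a)
    false_neg = sum(1 for p, a in pairs if a and not p)
    # Precision: of the documents flagged relevant, how many really were.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    # Recall: of the truly relevant documents, how many were flagged.
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


def good_enough(predicted, actual, min_precision=0.8, min_recall=0.8):
    """Hypothetical stopping rule: both metrics must clear a threshold."""
    precision, recall = precision_recall(predicted, actual)
    return precision >= min_precision and recall >= min_recall


# Hypothetical sample: model output versus what the lawyers decided by hand.
model_says = [True, True, False, True, False, False]
lawyers_say = [True, False, False, True, True, False]

precision, recall = precision_recall(model_says, lawyers_say)
keep_training = not good_enough(model_says, lawyers_say)
```

Low recall is the number the other side should worry about, since it measures relevant documents that were never turned over.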

Here are some of my questions/concerns:

These algorithms are typically not open source – companies like kCura make good money doing these jobs.
That means that they could be wrong, possibly in subtle ways.
Or maybe not so subtle ways: maybe they’ve been trained to find documents that are both “relevant” and “positive” for a given side.
In any case, the laws of this country will increasingly depend on a black box algorithm that is not accessible to the average citizen.
Is that in the public’s interest?
Is that even constitutional?

Categories: data science, modeling, open source tools

Comments (11)

Marcos

May 8, 2013 at 9:42 am

It looks like the dark side of the 4-color theorem proof … The algos could be fooled, of course; perhaps the best choice is to go open source and have different teams and approaches (Kaggle-like) to take a shot at it, but that raises the question of confidentiality (see CME x CFTC regarding HFT research).

Dylan

May 8, 2013 at 10:47 am

AIUI, it’s not really an “algorithm”. For each discovery case, they make queries, and then repeatedly adjust the queries based on reviewing the results. The result is a collaboration between the human and the computer, and the humans are ultimately liable for the results.

It’s also been going on for years, not sure why it’s in the news now.

I have a friend in NYC who does exactly this work, if you really want to find out more.

- Cathy O'Neil, mathbabe
  
  May 8, 2013 at 10:48 am
  
  It’s most definitely an algorithm! It’s just a supervised algorithm.
  
  I’d love to interview your friend – my personal email is cathy.oneil at gmail dot com, if you wouldn’t mind introducing us.
  
  Thanks!
  
- Shannon Brown
  
  May 9, 2013 at 10:11 am
  
  This illustrates the deep confusion over emerging trends. Yes, older techniques use an iterative keyword-search winnowing process (much like the card game Go Fish) and might not be fully algorithmic methodologies. These techniques have been around for years and are based on basic regular expressions, indexing, or simple information retrieval. Granted, some might implement more advanced capabilities derived from natural language processing (NLP), ranking, and information retrieval, but at heart, this is probably what your friend uses. These older techniques certainly have a role in eDiscovery analysis but might not, alone, be adequate–or efficient.
  
  There is an entirely different method based on advanced machine learning, neural network, and natural language processing algorithms–an algorithmic system. An attorney-expert reviews a seed-set of documents and then trains a model based on the expert seed set analysis. That model is then applied to unseen documents in the document corpus to classify documents–relevant, non-relevant, privileged, non-privileged. Iteration might be required during training (although some new systems help minimize iteration by allowing the computer to test hundreds or thousands of models for the best results) to identify the best parametric options, but the iteration here is related to the training process–not necessarily to the result set as with the older technologies. These newer algorithmic systems focus on classifying and winnowing using more subtle techniques–e.g., to retain the fidelity of the materials and the completeness of the final dataset.
  
  But note that neither of these techniques (nor others) are the whole story. The newer algorithms are classification based whereas the older algorithms are regex based. That means these techniques can be intertwined in a project. But trying to use any of these tools without full knowledge of their advantages and limitations can cause problems or aberrant results. (Think of this as using a screwdriver to drive nails or a hammer to drive screws. Yes, the wrong tool might eventually achieve the task, but do you really want the consequential damage as well?)
  
Higby

May 8, 2013 at 11:04 am

Constitutionality would be based on rules of evidence for the local jurisdiction.

Public interest is another matter entirely, given the state of the court system and corporatism.

You probably want to read the chapters on how the corporations gained control of the patent courts in David F. Noble’s America by Design: Science, Technology, and the Rise of Corporate Capitalism (1977), especially the role of Henry Pritchett, erstwhile president of MIT, and others.

djorgovski

May 8, 2013 at 11:40 am

How is this different from using Google? It, too, uses algorithms trained by humans and optimized by your own queries to sift through lots and lots and lots of documents, and the results can be cherry-picked at the end. I bet that lawyers and their assistants do a lot more web searching than shuffling through paper records and books as is.

Even without computers, legalese can be a pretty effective black box. This sounds simply like a more efficient version of the existing practice.

I am not a lawyer, but it seems to me that this is most likely legal and constitutional, and, more importantly, it is inevitable in the exponential information growth world.

JR

May 8, 2013 at 11:45 am

The algorithms could certainly be wrong, as could human reviewers. Is there some reason to believe that the algorithms are more likely to be wrong than human reviewers?

Algorithms, like human reviewers, may be biased. For example, the human document reviewers would certainly desire to disclose less rather than more, depending on who they worked for. I’m not sure that an algorithm as opposed to a human makes it more of an issue. The issue is outsourcing work to third parties, but intense, yet sporadic work is always likely to be outsourced.

As for your later statements about the laws of this country increasingly depending on a black box algorithm – maybe at some point, but not as a result of this.

(From the point of view of the development of the law, this is less significant than the development of Lexis and Westlaw. If an opinion weren’t included in Lexis and Westlaw, at this point, for precedential purposes, it might effectively not exist; of course, this was also true, in a different way, in the print era, so I am not sure it makes a real difference.)

(With respect to constitutionality, I am not exactly sure what concern you are raising.)

With regard to whether it is in the public interest, the real alternative, I suppose would be to have the DOJ develop its own system. It is hard to say whether that would be better or not.

- Higby
  
  May 8, 2013 at 11:49 am
  
  Or scholar.google.com — with some pretty powerful legal options that bring a standard case law library right to your fingertips.
  
David Wees

May 8, 2013 at 1:37 pm

Do the people who rely on experts to analyze DNA evidence really understand how it was produced? Is the difference that the process for analyzing DNA is completely public so that anyone who wanted to could potentially challenge the evidence?

albrt

May 9, 2013 at 2:45 am

First, thank you for using and spelling “pored” correctly.

I’m a lawyer. Discovery is driven by the parties in all the US jurisdictions I know about. Each side decides for itself what they want to investigate, what they want to ask the other side about, and what they want to do with the results. You can ask the other side to produce a million documents, and then you can pay a million dollars for your lawyers to go through every page very carefully or you can leave the documents in boxes at a warehouse and never look at them at all. Both of those things actually happen, but usually it’s something in between.

This is generally true even if one side is the government, although there are exceptions. For example, sometimes government officials can force regulated entities to go through their own documents and produce reports, and then the government decides how closely to check the reports.

So the people on each side can choose to use computers to check the documents if they think that’s the best way to get the information they need at the lowest cost, and then they present whatever they find to the judge or jury. In some other systems the judge actually investigates, but in our system the judge does not.

That is the adversarial system we have. It is very expensive and inefficient, but it has the advantage of giving parties the opportunity to spend as much time and money as they want trying to make their points. Eventually one side wins or they settle. The result is based partly on who is right, partly on who had the best lawyers, and partly on unpredictable consequences of choices made along the way. It is probably better than a system of trial by combat or dunking.

Savanarola

May 9, 2013 at 3:48 am

The thing is, you were always reliant on the honesty and competence of the other side in answering discovery. Absent a subpoena, you never get to go over there and look through their documents yourself – your side issues explicit instructions about what you want, sight unseen and without knowing what kinds of documents they have and how they keep them – and they decide what to give you. And if you think they didn’t do a good/honest job, you file a motion to compel and fight about it.

I used to try antitrust cases, and the documents are indeed staggering. It is exactly the range of what is relevant in an antitrust case that makes defendants and plaintiffs alike balk at wading in. I wonder how the algorithm handles privilege: do live people (possibly in Bangladesh at this point) review the documents identified for production after the algorithm runs its course?

Regardless of who decides what to produce, or how they do it, my job is always to look at the documents and see not only the pattern of what is there, but the pattern of what is not. I’m no more concerned about a computer making the first cull than a bunch of first year associates who can’t even run the copier yet.
