E-discovery and the public interest (part 2)

May 9, 2013 Cathy O'Neil, mathbabe

Yesterday I wrote a short post about my concerns regarding the emerging field of e-discovery. As usual the comments were amazing and informative. By the end of the day I realized I needed to make a much more nuanced point here.

Namely, I see a tacit choice being made, probably by judges or court-appointed “experts”, on how machine learning is used in discovery, and I think that the field could get better or worse. I think we need to urgently discuss this matter, before we wander into a crazy place.

And to be sure, the current discovery process is fraught with opacity and human judgment, so complaining about those features being present in a machine learning version of discovery is unreasonable – the question is whether it’s better or worse than the current system.

Making it worse: private code, opacity

The way I see it, if we allow private companies to build black box machines that we can’t peer into, nor keep track of as they change versions, then we’ll never know why a given set of documents was deemed “relevant” in a given case. We can’t, for example, check to see if the code was modified to be more friendly to a given side.

Competition for clients is a healthy response to this new revenue source. Beyond that, though, the resulting feedback loop will likely be a negative one, whereby private companies use the cheapest version they can get away with to achieve the best results (for their clients) that they can argue for.

Making it better: open source code, reproducibility

What we should be striving for is to use only open source software, saved in a repository so we can document exactly what happened with a given corpus and a given version of the tools. There will still be an industry to clean the data and feed in the documents, train the algorithm (whilst documenting how that works), and interpret the results. Data scientists will still get paid.

In other words, instead of asking for interpretability, which is a huge ask considering the massive scale of the work being done, we should, at the very least, be able to ask for reproducibility of the e-discovery, as well as transparency in the code itself.

Why reproducibility? Because then we can go back in time, or rather scholars can, and test how things might have changed if a different version of the code were used, for example. This could create a feedback loop crucial to improving the code itself over time, and to improving best practices for using that code.
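
To make this concrete, here is a minimal sketch in Python of the kind of record that could be saved alongside each e-discovery run. Everything in it is an assumption made for illustration: the function names (`corpus_fingerprint`, `make_run_record`, `same_inputs`), the record fields, and the sample documents, version string, and threshold are all hypothetical, and no real e-discovery tool is being described. The idea is only that a corpus hash plus a tool version plus the parameters is enough to tell, later, whether two runs are comparable.

```python
import hashlib
import json


def corpus_fingerprint(documents):
    """Return a SHA-256 digest over all documents, independent of input order."""
    digests = sorted(
        hashlib.sha256(doc.encode("utf-8")).hexdigest() for doc in documents
    )
    return hashlib.sha256("\n".join(digests).encode("utf-8")).hexdigest()


def make_run_record(documents, tool_version, parameters, relevant_ids):
    """Bundle what is needed to re-run and compare an e-discovery pass.

    All field names are hypothetical; a real system would pick its own schema.
    """
    return {
        "corpus_sha256": corpus_fingerprint(documents),
        "tool_version": tool_version,  # e.g. a commit hash from the open source repository
        "parameters": parameters,
        "relevant_ids": sorted(relevant_ids),
    }


def same_inputs(record_a, record_b):
    """True when two runs used the same corpus, tool version, and parameters."""
    keys = ("corpus_sha256", "tool_version", "parameters")
    return all(record_a[k] == record_b[k] for k in keys)


# Hypothetical example data, not taken from any real case.
docs = ["memo about pizza delivery", "quarterly earnings email"]
record = make_run_record(docs, "abc123", {"threshold": 0.5}, [0])
print(json.dumps(record, indent=2, sort_keys=True))
```

If two records agree on `same_inputs` but disagree on `relevant_ids`, something outside the documented inputs changed, which is exactly the kind of thing a black box would hide. If they differ only in `tool_version`, a scholar has a clean comparison between two versions of the code on the same corpus.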

Comments (6)

Zathras

May 9, 2013 at 10:17 am

I still do not see the issue here. The algorithms are not for the court, they are for each side. In any case, both sides can work with whatever algorithms they want. If one side’s discovery algorithm is only looking for documents that benefit it, the other side will do the same. It’s an adversarial system, after all. Just because a document is deemed by an algorithm to be relevant does not mean it will be found as such by the court; the judge will make his or her own admissibility determination.

- Cathy O'Neil, mathbabe
  
  May 9, 2013 at 10:19 am
  
  how will the judge do that?
  
  - Zathras
    
    May 9, 2013 at 10:22 am
    
    The judge will look at any document one side or the other wants to introduce into evidence. That’s it. These algorithms are just used for one side or the other to look for possible documents. They don’t decide what actually goes into evidence.
    
    By way of background, while I am not in the e-discovery field, I am a former attorney who now does analytics, so I have an interest in the field.
    
    - Cathy O'Neil, mathbabe
      
      May 9, 2013 at 10:49 am
      
      Good points. I’m thinking mostly of when the two sides in question have asymmetric resources.
      
      So, say I’m a big fancy law firm, so I have the fanciest black box algo. You are a poor lawyer, and you ask me for all the docs relevant to pizza. I can subtly alter my algorithm to give you, first of all, anything vaguely related to pizza or, second of all, anything related to pizza that makes my case stronger. Or some combination. In the end, how are you going to figure out what happened? And how is the judge?
      
      Now you might say that this kind of behavior is illegal, which it is, but the point is you’re not really doing it; your personal black box “objective” algorithm provider is, which makes it a kind of control fraud, much harder to detect and prosecute.
      
      If the algos are open source, it will be easier for the poorer side to use them and to keep up technologically. And it would make those open source algos better.
      
      What do you think?
      
    - albrt
      
      May 9, 2013 at 12:14 pm
      
      What Zathras says is correct – the judge is not relying on these systems to determine what evidence is relevant in the legal sense. The parties are just using the system to save themselves time and money by focusing on the most relevant documents in the ordinary sense.
      
      Having a well-understood system can benefit the parties if it means they can agree how to process documents, but one side does not get to decide unilaterally what type of searching is sufficient. I have been in situations where I told the other side I personally did a search for a particular word in Gmail or Outlook and then produced all the emails that came up. If the other side doesn’t agree that’s good enough, or if they think I’m lying, or if they think I don’t know how to do a proper search, then we either talk about it or fight about it.
      
      Big corporations used to bury the important documents in millions of irrelevant documents. In that situation, having these computer applications helps the little guy find the important ones, or figure out whether the big guy produced what he was supposed to produce.
      
      You are welcome to email me if you have questions, although I’m on jury duty today so I’m not in the office.
      
    - Zathras
      
      May 9, 2013 at 2:12 pm
      
      There are two parts of the discovery process you mention here. First is the production of the documents. As albrt mentions, parties do enormous document dumps and let people sort through them. However, these algorithms (so far) play a pretty minor role here–an enormous dump is done. The second part is the other side’s analysis of the documents. Here there might be an issue with asymmetrical resources. Access to the algorithms is generally too expensive for small players, while for large players it can significantly cut costs. But even this part is not a huge issue, since they just would have the resources to do it the more expensive, person-intensive way, and would do so. Besides, the small players in litigation aren’t going to have the pile of documents to produce anyway, so the larger side doesn’t need it anyway. These battles are typically fought in cases where it is big player vs. big player, so resource asymmetry is generally not a huge issue.
      