
Guest post: the age of algorithms

March 21, 2017

Artie has kindly allowed me to post his thoughtful email to me regarding my NYU conversation with Julia Angwin last month.

This is a guest post by Arthur Doskow, who is currently retired but remains interested in the application and overapplication of mathematical and data-oriented techniques in business and society. Artie has a BS in Math and an MS that is technically in Urban Engineering, but the coursework was mostly in Operations Research. He spent the largest part of his professional life working for a large telco (that need not be named) on protocols, interconnection testing and network security. He is a co-inventor on several patents. He also volunteers as a tutor.

Dear Dr. O’Neil and Ms. Angwin,

I had the pleasure of watching the livestream of your discussion at NYU on February 15. I wanted to offer a few thoughts. I’ll try to be brief.

  1. Algorithms are difficult, and the ones that were discussed were being asked to make difficult decisions. Although it was not discussed, it would be a mistake to assume a priori that there is an effective mechanized and quantitative process by which good decisions can be made with regard to any particular matter. If someone cannot describe in detail how they would evaluate a teacher, or make a credit decision or a hiring decision or a parole decision, then it’s hard to imagine how they would devise an algorithm that would reliably perform the function in their stead. While it seems intuitively obvious that there are better teachers and worse teachers, reformed convicts and likely recidivist criminals and other similar distinctions, it is not (or should not be) equally obvious that the location of an individual on these continua can be reliably determined by quantitative methods. Reliance on a quantitative decision methodology essentially replaces a (perhaps arbitrary) individual bias with what may be a reliable and consistent algorithmic bias. Whether or not that represents an improvement must be assessed on a situation by situation basis.
  2. Beyond this stark “solvability” issue, of course, are the issues of how to set objectives for how an algorithm should perform (this was discussed with respect to the possible performance objectives of a parole evaluation system) and the devising, validating and implementing of a prospective system. This is a significant and demanding set of activities for any organization, but the alternative of procuring an outsourced “black box” solution requires, at the least, an understanding and an assessment of how these issues were addressed.
  3. If an organization is considering outsourcing an algorithmic decision system, the RFP process offers an invaluable opportunity to learn and assess how a proposed system is designed and how it will work – What inputs does it use? How does its decision engine operate? How has it been validated? How does it handle particular test cases? Where has it been used? To what effect? Etc. Organizations that do not take advantage of an RFP process to ask these detailed questions and demand thorough and responsive answers have only themselves to blame.
  4. While a developers’ code of ethics is certainly a good thing, the development, marketing and support of a proposed solution is a shared task for which all members of the team must share responsibility – coders, system designers and specifiers, testers, marketers, trainers, support staff, executives. There is no single point of responsibility that can guarantee either a correct or an ethical implementation. Perhaps, in the same way that a CEO must personally sign off on all financial filings, the CEO of a company offering an evaluative system should be required to sign off on the legality, effectiveness and accuracy of claims made regarding the system.
  5. Software contracts are notoriously developer-friendly, basically absolving the developer of all possible consequences arising out of the use of their product. This needs to change, particularly in the case of systems sold as “black box” solutions to a purchaser’s needs, and contracts should be negotiated in which the developer retains significant responsibility and liability.
  6. As I think was pointed out, there is a broad range of analysis and modeling techniques, ranging from expert systems that seek to encode human knowledge to heuristic learning systems such as neural nets. While heuristic systems have the potential to ferret out non-intuitive relationships, their results obviously require a much higher degree of scrutiny. Part of me wonders how IBM and Watson would do at developing decision systems.
  7. Extensive testing and analysis should be required before any system “goes live”. It is disappointing to hear that “algorithm auditing” does not seem to be a thriving business, and, depending on the definition of “algorithm auditing”, I may be suggesting even more. Perhaps “algorithm testing” would be a more attractive sounding service name. Beyond requiring an analytical assessment of underlying data requirements and assessment algorithms, systems should be tested using an extensive set of test cases. Test cases should be assessed in advance by other (e.g., human expert) means, and system results should be examined for plausibility. Another set of test cases should assess performance under extreme (e.g., best-case, worst-case) scenarios as a sanity check. Another possibility is “side by side” testing, in which the system “shadows” the current implementation, either concurrently or retrospectively, and the results are compared (a sketch of this appears after this list).
  8. Psychological and other pre-employment tests, described in Weapons of Math Destruction, are problematic in two ways. First is whether it is appropriate to conduct them at all, and second is whether they are effective in their stated purpose (i.e., to select the best prospective employees, or those best matched to the position in question). Certainly, competency testing is an appropriate part of candidate selection, but whether psychological characteristics are a component of competency is arguable, at best. At the very least, however, such testing should be assessed as to whether it predicts what it claims to predict, and whether that characteristic is emblematic of work effectiveness. How to conduct such testing would require some creativity. Testing could be conducted on an “incoming class” of employees, whether prior to hiring, or after hiring with the test results being sequestered (neither reported to company management nor used in any evaluation process). After some period (1–2 years), the qualitative measures of employee performance and effectiveness could be compared to the sequestered test results and examined for correlation (a rough sketch of such a comparison appears after this list). Another possibility would be to identify a disinterested company with employees performing similar work. (By disinterested, I mean disinterested in using the evaluative test in question.) Employees of that company could be asked to undergo “risk free” testing, with results again being sequestered from their employer. The quantitative test results could then be compared to the qualitative measures of employee performance and effectiveness used by that employer. Whatever one thinks of such testing, as Weapons of Math Destruction correctly points out, to the extent that it is used, efforts should be made to test and improve its efficacy. To the extent that such testing is promoted by an outside party, that party should be ready, willing and able to demonstrate observed effectiveness.
  9. An interesting alternative to a proprietary black box system would be what might be called a meta-system, a configurable engine which would allow its procurer to specify the inputs, weightings and the manner in which they are used to formulate a decision, perhaps offering a drag-and-drop software interface to specify the decision algorithm. Such a system would leave the fundamentals of the decision algorithm design to the purchasing company, but simply facilitate its implementation (a toy version of this idea appears after this list).
  10. One must always be cautious about the possibility of inherent bias in data. As a simple example, recidivism is most easily estimated by the proportion of released convicts who are re-arrested. But if recidivism is actually defined by the percentage of released convicts who return to criminal life, then the estimate is likely skewed in several ways. Some recidivists will be caught; others will not. For example, some types of crime are more heavily investigated than others, leading to higher re-arrest rates. Further, even among perpetrators of the same crime, investigation and enforcement may well be targeted more to some areas than to others (the small simulation after this list shows how this skews the estimate).
  11. As was pointed out during the discussion, being fair and being humane may cost money. And this is the real issue with many algorithms. In economists’ terms, the inhumanity associated with an algorithm could be referred to as an externality. Optimization has its origins in solutions to problems in the inanimate world: how to inspect mass-produced parts for flaws, how to cut a board to obtain the most salable pieces of lumber, how to minimize the lengths of circuit traces on a PC board. There were problems that touched on human behavior, such as scheduling and traveling-salesman-type problems, but not to the extent that they ignored humane considerations. We are now at the point where human beings are compared to poisonous Skittles, and where life-altering decisions of great import (hiring, firing, parole, assessment, scheduling, etc.) are subjected to optimization processes, often of questionable validity, which objectify people, view them as resources or threats, and give little or no consideration to the very human consequences of their deployment. Assuming that your good work can drive a consensus on this point, there is a fork in the road as to how it can be addressed. One way would be to attempt to incorporate humane costs, benefits and constraints into the models being deployed and optimize on that basis. The other is to stand back, monitor applications for their human costs, and attempt to address them iteratively. Or, as Yogi said, you can come to the fork and take it.
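
To make the side-by-side testing of point 7 concrete, here is a minimal sketch of running a candidate decision system in “shadow” mode against the current process and comparing the outcomes. The function names, decision rules and agreement threshold are illustrative assumptions, not a description of any real product.

```python
# Minimal sketch of "shadow" (side-by-side) testing: the candidate algorithm
# scores the same cases as the current process, but its output is only recorded
# and compared, never acted on. All names and the threshold are illustrative.

def shadow_test(cases, current_decision, candidate_decision, min_agreement=0.9):
    """Run both decision procedures on the same cases and report disagreements."""
    disagreements = []
    for case in cases:
        old = current_decision(case)    # decision actually used today
        new = candidate_decision(case)  # candidate system, run silently alongside
        if old != new:
            disagreements.append((case, old, new))
    agreement = 1 - len(disagreements) / len(cases)
    print(f"Agreement rate: {agreement:.1%} ({len(disagreements)} disagreements)")
    if agreement < min_agreement:
        print("Candidate diverges too often; review disagreements before go-live.")
    return disagreements

# Hypothetical usage:
# shadow_test(last_year_cases, current_decision=board_decision, candidate_decision=model_decision)
```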
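
Point 8’s sequestered-score study amounts to a simple correlation check between test scores held back at hiring time and later performance ratings. A rough sketch, with invented field names and numbers:

```python
# Sketch of the sequestered-score study from point 8: pre-employment test scores
# are held back at hiring, then correlated with performance ratings 1-2 years
# later. Data and field names are invented for illustration.
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

employees = [
    {"test_score": 72, "rating_after_2y": 3.4},
    {"test_score": 55, "rating_after_2y": 4.1},
    {"test_score": 90, "rating_after_2y": 3.0},
    {"test_score": 64, "rating_after_2y": 3.8},
]
r = pearson([e["test_score"] for e in employees],
            [e["rating_after_2y"] for e in employees])
print(f"Correlation between sequestered test score and later rating: r = {r:.2f}")
# A correlation near zero would suggest the test does not predict what it claims to.
```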
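
The configurable “meta-system” of point 9 could be as simple as a weighted-score engine whose inputs, weights and threshold are declared by the purchasing organization rather than baked into a vendor’s black box. A toy sketch, with every field and number invented:

```python
# Toy version of the configurable decision engine from point 9: the purchaser,
# not the vendor, declares which inputs count, how they are weighted, and where
# the cutoff sits. Every name and number here is invented for illustration.

config = {
    "weights": {"years_experience": 2.0, "skills_test": 1.5, "interview_score": 1.0},
    "threshold": 20.0,
}

def decide(applicant, config):
    """Weighted sum of the configured inputs, compared against the configured cutoff."""
    score = sum(config["weights"][field] * applicant.get(field, 0)
                for field in config["weights"])
    return ("advance" if score >= config["threshold"] else "reject"), score

decision, score = decide({"years_experience": 4, "skills_test": 6, "interview_score": 5}, config)
print(decision, score)  # every part of this decision rule is open to inspection and change
```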
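
And point 10’s worry about re-arrest as a proxy for recidivism can be shown with a small simulation: if two groups reoffend at the same rate but are policed at different intensities, the re-arrest data alone make one group look far more “recidivist” than the other. The figures are made up purely to show the mechanism.

```python
# Toy illustration of the proxy problem in point 10: identical true reoffense
# rates, different probabilities of being caught, hence different observed
# "recidivism". All figures are invented.
true_reoffense_rate = 0.30

groups = {
    "heavily policed area": {"released": 1000, "p_caught_if_reoffend": 0.8},
    "lightly policed area": {"released": 1000, "p_caught_if_reoffend": 0.4},
}

for name, g in groups.items():
    reoffenders = true_reoffense_rate * g["released"]
    rearrested = reoffenders * g["p_caught_if_reoffend"]
    observed = rearrested / g["released"]
    print(f"{name}: true reoffense rate {true_reoffense_rate:.0%}, observed re-arrest rate {observed:.0%}")
# Both groups truly reoffend at 30%, but the data show 24% versus 12%.
```
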
  1. March 21, 2017 at 5:06 pm

    This is very sound advice.


  2. Steven McKay
    March 22, 2017 at 9:51 am

    Regarding point #4, in my experience there usually is a single person responsible for the results generated by implementing an algorithm. Such people worry a lot about how the algorithm will perform in the market versus as designed, and they often pay particular attention to how consumers will react. I really like the idea of creating some guidelines – even a checklist – for what keeps an algorithm relatively ethical.


  3. March 24, 2017 at 5:36 pm

    I agree that a substantial test engineering effort should be a strict prerequisite before any algorithm or algorithm implementation is given real responsibility. There is a long and established tradition of doing this for systems which might overtly endanger people’s lives, such as air traffic control, airborne and space controls, medical devices, and electrical generating station controls.

    Key to this activity is including the test organization in the writing of formal requirements for the system under development. This is essential if the requirements are to be objectively testable, and testable within a reasonable budget. For example, I was once a test engineer for a system whose requirements document demanded it achieve a classification success rate having a couple of 9’s past the decimal point. I indicated to management that such a testing programme would likely go on for years. They relented.

    This is particularly crucial if the system under test is trained using machine learning (“ML”) techniques rather than developed as more-or-less conventional software. In the latter case, components are open to inspection. While proprietary systems developed with ML might have diagnostic and visualization components which provide insight into their behavior and why they do what they do, much of the power of ML-derived systems comes from the fact that a great deal can be specified without the attention to detail conventional software demands. That, however, does not build confidence when the system under test is being used to make important decisions. If anything, an ML-derived system demands an even more challenging test programme, including fuzz testing of scenarios. Test plans and the measured performance of the system ought to be available to customers for inspection, perhaps under NDA.
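
    A crude illustration of what fuzz testing of scenarios might look like for a black-box scoring model follows; the scoring interface, the perturbation size and the tolerance are assumptions for the sake of the sketch, not taken from any real system.

    ```python
    # Rough sketch of fuzzing a black-box scorer: perturb realistic seed cases
    # slightly and flag inputs where the output swings implausibly. The scoring
    # interface and tolerances are assumed for illustration.
    import random

    def fuzz_test(score, seed_cases, n_variants=100, noise=0.05, max_jump=0.2):
        """Return (original, perturbed) pairs where a small input change moved the score a lot."""
        failures = []
        for case in seed_cases:
            base = score(case)
            for _ in range(n_variants):
                perturbed = {k: v * (1 + random.uniform(-noise, noise))
                             if isinstance(v, (int, float)) else v
                             for k, v in case.items()}
                if abs(score(perturbed) - base) > max_jump:
                    failures.append((case, perturbed))
        return failures  # small input changes should not produce wild score changes
    ```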

    Ultimately, as the Columbia Accident Investigation Board (“CAIB”) declared, the test organization needs to be independent of the organization responsible for product delivery, a recommendation which is seldom followed, even in aerospace testing today.

    Finally, models of the deployed system ought to have enough diagnostic recording built in so performance in the field can be monitored while in use, and guidelines for protecting private information for these purposes devised and built into the procurement contracts. Only in that way can the developers know whether the product is being used outside of its testing domain or is erring in some way. Perhaps liability for gross deviations ought to be clearly and indivisibly tied to the developer of the product.
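
    As a rough idea of the kind of diagnostic recording described above, a deployed scorer might log each case it sees and flag inputs that fall outside the ranges covered during testing. The field names, ranges and interface are invented for illustration.

    ```python
    # Sketch of in-field monitoring: log inputs as they are scored and warn when
    # live data drifts outside the domain covered by the test programme.
    tested_ranges = {"age": (18, 75), "income": (0, 250_000)}  # ranges exercised in testing

    log = []

    def score_and_monitor(case, score):
        """Score a case, record it, and warn if any field is outside the tested domain."""
        out_of_range = [k for k, (lo, hi) in tested_ranges.items()
                        if not lo <= case.get(k, lo) <= hi]
        log.append({"case": case, "score": score(case), "out_of_range": out_of_range})
        if out_of_range:
            print(f"Warning: fields {out_of_range} fall outside the tested domain.")
        return log[-1]["score"]
    ```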

