Takeaways from talks
It occurred to me that just watching one talk after another and saying "hey, that sounds useful" every now and then isn't doing all that much good. A couple of weeks later it will be forgotten. A better strategy seems to be to watch half as many talks, but spend another hour condensing the takeaways from each: takeaways that could be read in a couple of minutes on a dozen occasions over the next decade.
I'm not trying to make these notes accessible to anybody who reads them. On the contrary, the aim is for them to be as brief as possible, grounded in things that I know I'll never forget. But I think they might still be interesting and readable for my anonymous brothers and token sisters, sitting within their foreign neighbourhoods, combining pairs of ideas to forge gleaming new ones.
The five tribes of machine learning by Pedro Domingos
- Find it via the ACM Learning Webinars site.
- The talk was on 24 November 2015.
Central table of the talk:
| Tribe | Origins | Master algorithm |
|---|---|---|
| Symbolists | Logic, philosophy | Inverse deduction |
| Connectionists | Neuroscience | Backpropagation |
| Evolutionaries | Evolutionary biology | Genetic programming |
| Bayesians | Statistics | Probabilistic inference |
| Analogisers | Psychology | Kernel machines |
- Inverse deduction is new to me. I interpret it as a search for a cause that produces a given outcome (all encoded in terms of names and tuples of names, which are rules, like in Prolog). A bit like a logical jigsaw puzzle, where we already know all the rules that might possibly matter, but the trick is to find explanations.
- Backpropagation is a way to solve the credit assignment problem. Which artificial neurons are responsible for the result? Then we can adjust them in case of error. It's the backpropagation of error.
- Genetic programming is a process to evolve programs as trees.
- Probabilistic inference, based on Bayes' Rule: getting the posterior as a result of updating the prior with the likelihood.
- Kernel machines are based on finding the nearest neighbour for a given example. Some of the examples support the boundary, which is an implicit hyperplane in a vector space. The rest are discarded.
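A minimal sketch of the nearest-neighbour idea behind that last bullet (a plain 1-NN classifier, not a full kernel machine; the points and labels are invented for illustration):

```python
# Plain 1-nearest-neighbour classifier: classify a query point by the
# label of the closest stored example (illustrative 2-D data).
def nearest_neighbour(examples, query):
    # examples: list of ((x, y), label) pairs
    def sq_dist(p, q):
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    point, label = min(examples, key=lambda e: sq_dist(e[0], query))
    return label

data = [((0.0, 0.0), "a"), ((1.0, 1.0), "b"), ((0.2, 0.1), "a")]
print(nearest_neighbour(data, (0.9, 0.8)))  # closest example is (1.0, 1.0) -> "b"
```

A kernel machine goes further: it keeps only the examples that support the boundary and weighs them with a kernel, but the "compare against stored examples" core is the same.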
Key problem solved by each:
- Symbolists — knowledge composition.
- Connectionists — credit assignment.
- Evolutionaries — structure discovery.
- Bayesians — uncertainty.
- Analogisers — similarity.
While my impression was that Hofstadter uses the word "analogy" in a much broader sense than vectors in a vector space, it's interesting to compare SVMs with deep nets. The nets keep something like partial imprints of examples, while SVMs keep a subset of unaltered examples. Is there something mid-way between the two? Wouldn't you sometimes wish to keep some outlier examples in your memory, and guard them from being blended into the common mix, while still letting mainstream examples contribute by dragging your representational stereotypes a bit in their direction? I guess that's what the idea of pooling largely addresses, actually.
To build up some structure of relationships among the five domains, I would first of all discard symbolists as too crisp to help initially, but they become useful after all the others are done. That is, when we have dredged up tens of thousands of symbols and commonsense rules using other methods, and certified each of them to be absolutely correct and not overgeneralised. Symbolic logic comes into play when we are faced with problems that humans consider hard, such as: design an algorithm to count particular combinatorial objects, given all sorts of theorems you know about various combinatorial objects, algorithms, divide-and-conquer strategies, etc. This requires quality hypotheses to be generated, new symbols to be defined automatically when a potentially useful pattern is recognised by the machine in its experiments. So, I think that after all of that machinery is evolved and learnt by other methods, it is this crisp symbolic manipulation which will produce the really amazing results, but I doubt it can bootstrap itself in most domains of interest.
Of the four left, evolution stands out as a great way to arrange functional building blocks, as long as there aren't too many involved at a time. And, more importantly, the goal should be relatively easy to reach. But when we think of evolution against deep nets, kernel machines and probabilistic methods, for me the last three are intertwined, while evolution stands apart and is orthogonal. That is: it should be possible to apply the idea of the genetic algorithm to all of those, whenever the conditions are favourable for it to make progress. If we have something that has parameters or structure, and works for most combinations of them (or at least a fair share of the parameter or structure space), then evolution or other space search techniques (simulated annealing, particle swarm optimisation, to name the trivially-implementable ones) are well worth a try.
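A sketch of that mutate-and-keep-if-better idea, the simplest member of the family (a (1+1)-style evolution strategy over a parameter vector; the quadratic fitness function is an invented stand-in for whatever the favourable conditions provide):

```python
import random

# Toy objective: maximise the negative squared distance to a target vector.
def fitness(params):
    target = [3.0, -1.0, 2.0]
    return -sum((p - t) ** 2 for p, t in zip(params, target))

# (1+1)-style search: perturb the current best, keep the candidate
# only if it improves fitness, discard it otherwise.
def evolve(steps=5000, step_size=0.1, seed=0):
    rng = random.Random(seed)
    best = [0.0, 0.0, 0.0]
    for _ in range(steps):
        candidate = [p + rng.gauss(0, step_size) for p in best]
        if fitness(candidate) > fitness(best):
            best = candidate
    return best

print(evolve())  # drifts towards [3.0, -1.0, 2.0]
```

Simulated annealing differs only in sometimes accepting worse candidates; a full genetic algorithm adds a population and crossover. All of them only need the fitness landscape to be climbable for most parameter combinations.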
Many years ago Marvin Minsky spoke at Irvine, I think, where, among many other ideas, he compared logic and neural networks. Paraphrasing: logic is appropriate in contexts where there are a few causes, each with an important effect. An analogy: a chair is supported by four legs, and each is crucially important. Neural nets work well where there are many causes, each with little effect. For example: a coat is supported on a tabletop by thousands of individual fibres, and if hundreds of them were removed or more were added, that wouldn't make much difference. And that's how I used to think about these things for a long time. But deep hierarchies, such as in deep learning (with alternating match and pool layers) or what we're seeing in the visual cortex, are a lot more like logic; it's just that there is a lot of it, and it's not built by hand. Small receptive fields at the bottom produce holistic (within the confines of the receptive field) classifications that are most consistent with the input patterns they have observed. There aren't that many causes. At the next layer these slightly more meaningful interpretations (e.g. "north-east edge", instead of a vector of pixels) get fused into curves, and, higher up, into patterns, and so on. So, while, indeed, the top classification "a bird" will be stable if some of the pixels at the bottom become occluded by a leaf, this doesn't happen in one step, with the hypothesis of "a bird" getting less weight, but is the result of quite a few chairs that stand on other chairs collapsing within the whole multi-level hierarchy. Still, not enough for the whole interpretation to break down. (Also, and beside the point, if the entire bird hides behind a leaf, and we can only see its tail, then the mountain of chairs collapses, except the one responsible for the tail part.
And we have already bound it to the bird, so while this shape is quite ambiguous in itself, and could have been part of many other possible objects, our initial interpretation is locked in. Perhaps because we hold on to the highest "bird" chair, while all the others fall away from underneath it, until we are sure that the bird flew away. And while that chair stands, it prevents the "long thin straight rough-ended shape" chair from swaying towards an interpretation other than "bird's tail".)
As above, I think that kernel machines and deep nets are two examples of the same paradigm. But what about probabilistic inference? Well, firstly, we're talking about graphical models: what other probabilities does it make sense to learn, other than the probabilistic relationships between pairs of random variables? And when we apply logarithms, the product of non-negative probabilities turns into additions and subtractions of log-probabilities, which can take on positive or negative values, which makes them similar to neural nets. Clearly, however, they impose a particular nonlinearity from Bayes' Rule, while in neural nets a range of nonlinearities is used, typically justified by an analogy with neurons. It is known from Perceptrons that a net can't work without a nonlinearity, but I don't know of any theory-based arguments supporting one particular nonlinearity over all others. From this discussion, it would seem like a good idea to have nonlinearities that would make it easy to implement the basic logic gates: AND, OR, XOR, NAND, NOR, XNOR, as well as the asymmetric ones ANDN, ORN, etc. (though these names are rarely used, and the gates don't seem to have any others). Inverters are central to logic design, and you can build anything out of NAND gates, but not out of AND gates. Though we basically have inverters — they are the negative weights. The pattern-matching stages are the AND stages, and the pooling ones are the OR stages.
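The gate analogy can be made concrete with a single threshold unit: one set of weights each for AND, OR and NAND, with the negative weights playing the role of the inverters (the particular weights here are my own toy choices, not from the talk):

```python
# A single threshold neuron: fires iff the weighted sum exceeds the bias.
def unit(weights, bias):
    return lambda a, b: int(weights[0] * a + weights[1] * b > bias)

AND  = unit(( 1,  1),  1.5)  # fires only on (1, 1)
OR   = unit(( 1,  1),  0.5)  # fires on anything except (0, 0)
NAND = unit((-1, -1), -1.5)  # negative weights act as inverters

for a in (0, 1):
    for b in (0, 1):
        print(a, b, AND(a, b), OR(a, b), NAND(a, b))

# No single unit of this kind can compute XOR: its truth table is not
# linearly separable, which is the Perceptrons result mentioned above.
```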
It would be interesting to see how deep nets on the one hand, and graphical models on the other, canonically tackle a given simple nontrivial problem, each independently. Then we could compare how they work. Graphical models are probably more disciplined, while neural nets are more liberal and flexible. But the argument is that they are solving different problems: uncertainty and credit assignment. Suppose we are missing data, such as a missing letter in a word. Then we are definitely talking about probabilities. Or if a glyph is so blurred that we're assigning probabilities to the alternatives 'q', 'g' and '9'. That's probability too. But credit assignment is about such questions as "what do we call a chair?". In this context we wish to avoid overfitting and underfitting, but we might have perfect data. So, a semi-circle has zero probability of being a circle, but it might have quite some "circleness", which would feed into circle detector neurons, but that would be counteracted by some other input, such as cross-inhibition from a semi-circle detector. So, to my mind, the probability domain definitely has strong similarities to the neural nets domain, but if we treat it strictly as probability, then the area of application seems relatively narrow, because in most problems that we wish to solve we can always get enough information to know things for sure, with probability = 1 (e.g. distinguishing bread from cheese, given a very clear view). But as with symbolic learning, graphical models become indispensable when it comes to tricky uncertainty problems, such as in card games.
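The blurred-glyph case can be worked through with Bayes' Rule directly (the priors and likelihoods below are invented purely for illustration):

```python
# Posterior over glyph hypotheses for a blurred character, via Bayes' Rule:
# P(glyph | image) is proportional to P(image | glyph) * P(glyph).
priors = {"q": 0.2, "g": 0.3, "9": 0.5}       # e.g. from symbol frequencies
likelihoods = {"q": 0.6, "g": 0.3, "9": 0.1}  # how well each explains the blur

unnormalised = {h: likelihoods[h] * priors[h] for h in priors}
total = sum(unnormalised.values())
posterior = {h: p / total for h, p in unnormalised.items()}
print(posterior)  # 'q' overtakes '9' once the image evidence is factored in
```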
Anyway, I think this idea is brilliant, comparing the big established domains of machine learning, each with its own long history, major victories and global communities. I would probably have subconsciously avoided embarking on such an exercise myself, because it would seem that an enormous number of reservations have to be made in order to make sure that the choice of terminology does justice to everybody. (Such as: there is a lot more to connectionism than backpropagation.) But then we would never get anywhere if we couldn't work at different levels of abstraction.
My own conclusion is that the way to the holy grail, which is a common-sense machine in a restricted domain ("artificial general restricted intelligence"), is via deep hierarchies and evolutionary+ structure/parameter search strategies, as well as a great deal of inimitable human intelligence.
Steve McConnell: Stranger than fiction — case studies in software engineering judgment
- Find the video and the slides via the ACM Learning Webinars site.
- 28 January 2015 talk.
- Knowledge. Recall. Example: recall solutions used successfully for the task in the past.
- Comprehension = first level of understanding. Simple manipulations with knowledge. Example: explain why Scrum is not a design approach. Recall relevant details and fuse them together. Paraphrase, give your own description of something.
- Application. Example: use geometry methods to find how much paint you need for a basketball court.
- Analysis. Critical thinking. Example: follow a piece of code and figure out that it has an off-by-one error. Break a complex task into subroutines in a neat way.
- Evaluation / Judgement. Weighing things up without knowing what's salient a priori. Figuring it out. Example: given two approaches to solve a problem, which is best and why? Predict the likelihood of success of a project plan. Draw a holistic conclusion.
- Synthesis. Creative thinking. Given nontrivial requirements, build up a solution using tools you know. Build a team that will work well, given individual strengths and weaknesses. Generate a hypothesis.
Analysis is an over-developed muscle of technical staff. It's what distinguishes them from "ordinary people". Analysis paralysis: the critics in the mind savage new tentative ideas. Forest vs trees.
Decision tree diagram. Analysis: going deep down a path. Judgement: weighing up different subtrees.
People tend to adopt a favourite side early, which derails the rest of judgement activity.
Poor business judgement is typical. Good judgement is rare.
Four factors model to aid judgement:
- Size. Anti-economy of scale: trying a project 10× as large is >10× as hard. Schedule. Enough planning, given size? Appropriate disciplined QA, given size?
- Uncertainty. In: requirements; technology; planning.
- Defects. Are defects being handled properly, with appropriately disciplined QA?
- Human variation. Skills, experience, including technology; business area; dealing with uncertainty; quality level; management. Motivation.
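The anti-economy of scale in the Size bullet can be illustrated with a COCOMO-style power law, effort proportional to size^b with b > 1 (the coefficients here are illustrative, not McConnell's):

```python
# Diseconomy of scale: effort grows faster than linearly with size.
# COCOMO-style power law with an illustrative exponent b = 1.2.
def effort(size_kloc, a=3.0, b=1.2):
    return a * size_kloc ** b

small, large = effort(10), effort(100)  # a 10x larger project...
print(round(large / small, 1))          # ...takes about 15.8x the effort
```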
Case studies: each gets red/amber/green on a traffic light for Size, Uncertainty, Defects, Human variation.
Case study: the State of Oregon received a $300M Federal grant for a state-wide system, while at the same time the nation-wide equivalent, Healthcare.gov, was funded at $100M! Even so, the Oregon project was scrapped and they are now using Healthcare.gov. Why? Chaotic work arrangements: no skilled engineering manager, build process not well defined, no peer review arrangements, source control arrangements lacking, etc. These are all common sense, as all software engineers get exposed to them during training. [Is that something we would expect from Oracle?]
Case study: budget and schedule blow-outs. Ends up in court, with only $150k recovered by client out of $2.8M spent. Different teams for each phase. Seventeen binders of use-cases produced by one — found to be unusable by the next. Staff flying from Chicago to Seattle every week. Tried to switch methodology mid-way: from Rational Unified Process to Extreme Programming!
- Biggest red light was Human variation. Others were alright.
Antidote: "Assume it will fail, and prove to me that it will work".
Case study: consolidate displays showing status info, 1 year, $2M. Intact 11-person team. Started with extensive prototyping. User demands turned an initial 2-message, 4-display requirements system into a 57-message, 35-display system. That's big uncertainty, but mitigated by the upfront focus on it. Defects approach: each component perfected before moving on. Project status and tasks displayed in a graphic format. How did management use the status? To seek out project risks. A big success of a project.
Common theme: in large failed projects, the basic project dynamics were not obvious to the people involved, even in hindsight. [Well, there is a panoply of factors involved; everything seemed to go wrong; what's most salient?]
Refine Agile's attack on the waterfall: what's bad is design for speculative requirements. (Contrast with Big Design Up Front, which is not necessarily bad.) YAGNI = You Ain't Gonna Need It.
Software professionals tend to "flip the bozo bit" on higher-level management readily: i.e. classify them as bozos, therefore not take them seriously. [The claim is that high-level managers are selected based on their synthesis and judgement skills, rather than analysis. Where is Politics in Bloom's taxonomy anyway?]
What I think about all this
The underlying claim is that high scores on the Four Factors are highly correlated with project success. The relationship is of the AND type: one bad factor can sink the whole project.
Now, Defects is not something I expected to see as a first-class issue. The other three are obviously central, but Defects? I guess that defects not being handled properly is something that can cripple a project that is otherwise in great shape, and so this has been included as a primary factor.
Another common theme I am seeing in the case studies is that software contractors manage to capture tons of funding, regardless of delivering rubbish. Salaries can't be taken back. Contracts filled with uncertainty become soft and construable by lawyers into any form, using logic games detached from reality. The world looks different out of a wooden courtroom. How else can we explain the resolution of the case study? A client hires a contractor who doesn't deliver. So the court puts aside this blatantly obvious central issue? "Let's focus on the fine print instead. Maybe there's some awkward wording in there that we can latch onto." I don't know the actual details, but what could it possibly be, other than this? ... Or maybe the issue is more complex, ... if we look at the details carefully? Who should get the benefit of the doubt (uncertainty)? Theoretically, there could be provisions for the contractor to fool around. How do you define what's expected of them in this unique case? It's a legal document... Anyway, this is something we're stuck with, and even if there is a simple solution, it's not obvious. What's worse, the only way to convince people that a solution will work is to pilot it for an extensive length of time. And we don't have a culture of social experiments. We're too precious for that.
I would say the client's people must be integrated into the contractor's teams and given power, even though they are not expert software engineers.
By the way, I think this talk would have universal appeal:
- It would appeal to technical audiences, because it makes good points that ring true to their experience, yet they've probably never thought about things in this particular way, which makes the ideas useful.
- I think it would also appeal to wider audiences who don't really get what's being said. That's due to the engaging style and tempo, and a speaker with a common-sense personality who can point out ludicrously egregious failures without being insulting. People will feel they get it, even if they don't, because the speaker gives the ideas short, bright names (and colour codes), attaches meaning to the names by dwelling on each and illustrating it with examples; also grounding things with case studies; boiling things down to nontrivial simplicity.