## Ideas for two thesis problems in data science

**Natural Language Processing on MathOverflow**

You know about MathOverflow? It’s a site where grad students in math (or anyone) can pose questions for other people to answer. There are lots of uninteresting, unanswered questions (such as questions that are too easy, which the asker should be able to look up), some really popular ones, and some really dumb ones. Sometimes there are interesting ones.

Here’s a thesis idea: come up with a metric for “interestingness” and try to forecast the interestingness of a question from its language. Might as well also try to forecast its popularity while you’re at it. That way, if you make a good model, some of the more interesting questions will get higher in the queue and people will have a better time on the site.
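To make the idea concrete, here is a very rough baseline sketch. Everything in it is an assumption for illustration: the `interestingness` formula (votes plus answers, per view) is just one hypothetical metric, the two training questions are invented toy data, and the "model" simply averages the scores of words it has seen before. A real thesis would want far better features and a proper learner.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def interestingness(votes, answers, views):
    """Hypothetical metric: engagement per view (an assumption, not a standard)."""
    return (votes + answers) / max(views, 1)


def fit_word_scores(questions):
    """questions is a list of (text, score) pairs.

    Returns the mean score of the questions each word appears in,
    plus the overall mean score as a fallback.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for text, score in questions:
        for word in set(tokenize(text)):
            totals[word] += score
            counts[word] += 1
    word_means = {word: totals[word] / counts[word] for word in totals}
    global_mean = sum(score for _, score in questions) / len(questions)
    return word_means, global_mean


def predict(text, word_means, global_mean):
    """Average the scores of known words; fall back to the global mean."""
    known = [word_means[w] for w in set(tokenize(text)) if w in word_means]
    return sum(known) / len(known) if known else global_mean


# Invented toy data, purely for illustration.
training = [
    ("Is every compact group a limit of Lie groups", interestingness(40, 5, 900)),
    ("How do I compute this integral for my homework", interestingness(1, 1, 400)),
]
word_means, global_mean = fit_word_scores(training)
print(predict("compact Lie groups", word_means, global_mean))
print(predict("integral homework", word_means, global_mean))
```

Swapping in a different definition of `interestingness` (or a separate popularity target) only changes the scores fed to `fit_word_scores`; the forecasting step stays the same.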

**Genealogy graphs in different fields**

You know about the Mathematics Genealogy Project? It shows everyone with a Ph.D. in math and considers them to be “descended” from their advisor in a family-tree-like structure. For example, I’m here, and if I go up through my ancestors in 7 steps I get to Jacobi. Actually there are lots of ways to go up since a bunch of people have more than one advisor: I’m also 7 steps away from Poisson, 8 from Lagrange and Laplace, and 9 from Euler. This is probably not because I’m so cool but because there just weren’t many mathematicians back then; probably most people are descended from Euler. And because we have this cool data set we can see if that’s true!
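Counting steps to an ancestor, and checking what share of people descend from one person, are both simple graph searches. Here is a minimal sketch, assuming the data has already been loaded into a dictionary mapping each person to a list of advisors. The names `advisors`, `ancestor_distances` and `fraction_descended_from` are hypothetical, and the tiny graph is made up; it is not real genealogy data.

```python
from collections import deque


def ancestor_distances(advisors, person):
    """Breadth-first search upward: fewest advisor steps to each ancestor."""
    dist = {person: 0}
    queue = deque([person])
    while queue:
        current = queue.popleft()
        for advisor in advisors.get(current, []):
            if advisor not in dist:  # first visit is the shortest path
                dist[advisor] = dist[current] + 1
                queue.append(advisor)
    del dist[person]  # a person is not their own ancestor
    return dist


def fraction_descended_from(advisors, ancestor):
    """Share of everyone else in the graph who has `ancestor` above them."""
    people = set(advisors) | {a for advs in advisors.values() for a in advs}
    others = people - {ancestor}
    hits = sum(1 for p in others if ancestor in ancestor_distances(advisors, p))
    return hits / len(others) if others else 0.0


# Made-up toy graph: D has two advisors, F has none on record.
advisors = {"D": ["B", "C"], "B": ["A"], "C": ["A"], "E": ["C"], "F": []}
print(ancestor_distances(advisors, "D"))
print(fraction_descended_from(advisors, "A"))
```

Because people can have more than one advisor, the search keeps only the first (shortest) route it finds to each ancestor, which matches the "7 steps to Jacobi" style of counting.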

Here’s what I think someone should do, besides visualizing this graph in an awesome way (which by itself would be really cool, has anyone done that?). They should draw the graph for other fields as well and try to see if there are graph properties that characterize mathematics as distinct from other disciplines like Physics or Law or History.
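As a starting point for comparing fields, one could compute the same handful of summary numbers for each field's graph. The sketch below is only an illustration under stated assumptions: the choice of properties (number of connected components, share of people with more than one advisor) is a guess at what might be informative, the function name `graph_summary` is hypothetical, and both toy graphs are invented.

```python
def graph_summary(advisors):
    """Summarize an advisor graph given as {person: [advisors]}."""
    people = set(advisors) | {a for advs in advisors.values() for a in advs}

    # Treat advisor links as undirected to count connected components.
    neighbours = {p: set() for p in people}
    for student, advs in advisors.items():
        for advisor in advs:
            neighbours[student].add(advisor)
            neighbours[advisor].add(student)

    seen = set()
    components = 0
    for start in people:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        stack = [start]
        while stack:  # depth-first walk over one component
            current = stack.pop()
            for other in neighbours[current]:
                if other not in seen:
                    seen.add(other)
                    stack.append(other)

    advised = [advs for advs in advisors.values() if advs]
    multi = sum(1 for advs in advised if len(advs) > 1)
    return {
        "people": len(people),
        "components": components,
        "multi_advisor_share": multi / len(advised) if advised else 0.0,
    }


# Invented toy graphs standing in for two fields.
field_one = {"D": ["B", "C"], "B": ["A"], "C": ["A"]}
field_two = {"X": ["W"], "Z": ["Y"]}
print(graph_summary(field_one))
print(graph_summary(field_two))
```

A field with many small components would look fragmented into separate schools, while a field with one big component and lots of shared advisors would look tightly knit.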

Interesting questions. Any metric for MathOverflow would obviously have to take into account the current “score” of the question poser. I remember one very popular question with many reads, comments and answers. The first comment said something like, “This doesn’t really belong on MO, and if you weren’t XXX, I would vote to close.”

As to genealogy, I wonder if the math graph is more highly connected than History, say. We don’t tend to have so many schisms, I would guess. I am always curious about features that distinguish math. I alternate between thinking that math must be like other areas of academia, and then realizing that it really is different – a conclusion I reach again and again.

To distinguish popularity (interesting to most) from your “interesting” (interesting to mathbabe), you might simply record the sources of the “like” flags. I think Facebook did this, last time I looked. Then track the likers you like and follow them to their preferences.
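A bare-bones version of "follow the likers you like" might look like the sketch below. It assumes the like flags are available as a dictionary from liker to the questions they flagged; the names `likes`, `trusted` and `recommend`, and all of the data, are hypothetical.

```python
from collections import Counter


def recommend(likes, trusted, already_seen=()):
    """Rank unseen questions by how many trusted likers flagged them."""
    counts = Counter()
    for liker in trusted:
        for question in likes.get(liker, ()):
            if question not in already_seen:
                counts[question] += 1
    return [question for question, _ in counts.most_common()]


# Invented like data: who flagged which question.
likes = {"ann": ["q1", "q2"], "bob": ["q2", "q3"], "cat": ["q4"]}
trusted = ["ann", "bob"]
print(recommend(likes, trusted))
print(recommend(likes, trusted, already_seen={"q2"}))
```

Questions liked only by people outside the trusted list never show up, which is the point: popularity among everyone is kept separate from what one particular reader finds interesting.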

This isn’t quite the algorithmic forecasting you had in mind, since it requires a few drones reading everything and flagging. But outside of tags and categories, I’ll bet that any such algorithm would go one of two ways. Either it would be as cumbersome as a parallel processing learner, which would take a lot of time to learn, would learn imperfectly, would fail to forecast a lot of new interesting topics until it learnt them, and would have to depend on some initial drones anyway. Or it would predict too poorly to be worthwhile, not just directing you to dull topics sometimes, but also directing you away from some of the most interesting.

Actually, the really “easy” questions are rarely ignored. If they are extremely easy, they get closed, but if they are “easy” (meaning roughly “can be looked up in a textbook”) but reasonable, then they definitely get answered.

Mathbabe said: “…draw the graph for other fields as well and try to see if there are graph properties that characterize mathematics as distinct from other disciplines like Physics or Law or History.”

Could such a graph show the strong links of Physics and Chemistry, and the recently strengthening links of Biology, to the firmer foundations of Mathematics? If so, could such a graph expose the fallacy that Economics, Law, History, Philosophy, Psychology and all other sociological, anthropomorphic disciplines have much of anything to do with Science, much less Mathematics?

Your suggestion is actually a good idea, Lou, for some areas of the social sciences, but not all, and I get a little sensitive on this subject.

I agree that the social sciences often, all too often, step beyond math, but most natural sciences have to as well, since they are empirical, not, like math, axiomatic. Nonetheless, Chomsky’s first results were proofs in math, before he imported them into linguistics. Years ago, my prof in computer theory thought Chomsky was a mathematician before he learned he was a linguist. Most “theoretical” linguists trace back to Chomsky, so math would show up in two or three nodes on the tree no matter where you started in syntax. The Dutch consider semantics applied mathematical logic. Learnability and natural language processing are both math-heavy (as Cathy’s subject heading implies).

The most suspect of all the social sciences (except maybe psych?) is economics, and it is the most heavily mathematical. So there’s plenty of math in the social sciences, on the one hand, and on the other, a reliance on math doesn’t solve their empirical failures. Even political science uses game theory, and political science is more partisanship than science. Logic is usually housed in Philosophy departments. That’s where I studied Gödel’s theorem: not many numbers in it, but it’s not finger painting. On the other hand, there is nothing mathematical in Derrida, and philosophers have to entertain him while he’s in the house too.

History doesn’t seem mathematical, so far as I can see. But even it has room for creative reasoning and juggling of analytical tools.

I might be botching this quote (couldn’t find it via a quick Google search)… Francis Ysidro Edgeworth: “History, being one of the biological sciences, must, in its final formulation, be mathematical.”

I wouldn’t be so quick to conclude that there’s nothing mathematical in Derrida either…

Even novelists use math — David Foster Wallace. Every field fcuks math and enjoys it. Maybe the question is, why doesn’t math solve their troubles? Lies, damned lies and…

That reminds me of an idea I had. At a colloquium years ago an Ed person noted that students who really understood order of operations wrote “2 x + 5 x^2” rather than “2 x+5x ^2”. I’ve exaggerated the spacing, but the point is that the way people space their writing reveals their conception of the subject. I wonder if you could do something with ML analysing handwriting for those kinds of patterns: would people space “smid gen” more often than “sm idgen”? And could knowing this improve OCR, or uncover some interesting properties about the way people conceptualise things?

