
The Unreasonable Syntactic Expressivity of RNNs

Author: freediver

Score: 134

Comments: 34

Date: 2020-11-05 04:37:52


________________________________________________________________________________

albertzeyer wrote at 2020-11-05 12:55:01:

This is quite interesting.

In the beginning, there is this statement (assumption):

when processing natural language, we can bound the maximum number of things we’ll need to store on the stack

Why is that? Is it because there is some limit N such that every natural sentence is shorter than N words, and thus the stack depth is bounded too? But that's not strictly true, just highly unlikely, right? In principle, for any sentence of length N, I can always add a word and get length N+1. In the same way, I can always make the stack deeper.

This discussion reminds me of the general question of whether RNNs are Turing complete. There is a famous paper stating that they are (On the Computational Power of Neural Nets, Siegelmann & Sontag, 1992). However, that result relies on infinite-precision rational numbers for weights and activations. Once you have finite precision (standard floats), the network has finite memory; stack depths are then bounded, of course, and it is no longer Turing complete.

Some interesting discussion about this is also here (own question, and also has some own answer, among others):

https://stackoverflow.com/questions/2990277/how-useful-is-tu...

canjobear wrote at 2020-11-05 16:48:50:

> Is it because there is a limit N and all natural sentences ever are less than N words, and thus you also have a bound on the stack depth?

It's true that you can't really find an N such that a natural-language sentence with stack depth N is fine but one with stack depth N+1 is ungrammatical. But in practice, the stack depth of real sentences is small [1], and sentences that require deep stacks quickly become impossible for humans to understand.

For example,

Stack depth 1: The rat ate the cheese.

Stack depth 2: The rat [that the cat chased] ate the cheese.

Stack depth 3: The rat [that the cat [that the teacher liked] chased] ate the cheese.

Stack depth 4: The rat [that the cat [that the teacher [that the student hated] liked] chased] ate the cheese.

English speakers usually can't handle depth 3 or higher.

This kind of phenomenon has caused a lot of controversy in linguistics. There does seem to be some sense in which the stack-depth-4 sentence is grammatical English, even though basically no one can understand it. But for the purposes of practical NLP systems, this phenomenon means that if you can model a stack depth of, say, 10, then you'll be able to model almost all naturally occurring sentences (a toy depth counter is sketched below).

[1]

https://www.jstor.org/stable/40057996
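To make the bracket notation concrete, here is a toy depth counter (an illustrative sketch, not code from the thread):

```python
# Each "[" pushes an embedded clause, each "]" pops one; the sentence's
# stack depth is the deepest the stack ever gets, counting the main
# clause itself as depth 1.

def stack_depth(sentence: str) -> int:
    depth, max_depth = 0, 0
    for ch in sentence:
        if ch == "[":
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch == "]":
            depth -= 1
    return max_depth + 1  # +1 for the main clause

print(stack_depth("The rat ate the cheese."))                        # 1
print(stack_depth("The rat [that the cat chased] ate the cheese."))  # 2
print(stack_depth(
    "The rat [that the cat [that the teacher liked] chased] ate the cheese."
))                                                                   # 3
```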

dragontamer wrote at 2020-11-05 18:01:54:

But English speakers make games and songs out of this practice.

> There's a fleck on the speck on the tail on the frog on the bump on the branch on the log in the hole in the bottom of the sea.

Which is a proper grammatical sentence. The key to the song is building it up: it starts with "There's a hole in the bottom of the sea" and builds the sentence up from there. Indeed, having a grossly complicated sentence is the "fun" of the song, but that doesn't change the fact that it's grammatically correct (and actually understood by the listeners / singers).

skulk wrote at 2020-11-05 18:23:52:

The examples in GP deepen and pop the stack before the sentence is over. Your example only deepens the stack; it's like tail recursion vs. non-tail recursion. I think that's what makes it a special case of deeply nested grammatical structure that's easy to understand.

There are no weird sequences of words like "hated liked chased" that require backtracking to parse.
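A toy illustration of that contrast (hypothetical code; "push" marks a new embedded subject, "pop" its verb finally attaching):

```python
# Center embedding piles up unattached subjects until all the verbs arrive
# at once ("hated liked chased"); the right-branching song attaches each
# phrase immediately, so the stack never grows.

def max_pending(events):
    stack, worst = [], 0
    for e in events:
        if e == "push":              # a new embedded subject begins
            stack.append(e)
            worst = max(worst, len(stack))
        else:                        # "pop": its verb attaches
            stack.pop()
    return worst

# Depth-4 center embedding: all pushes, then all pops.
print(max_pending(["push"] * 4 + ["pop"] * 4))  # 4

# The song: every phrase is attached (popped) as soon as it is read.
print(max_pending(["push", "pop"] * 9))         # 1
```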

canjobear wrote at 2020-11-06 02:42:46:

Exactly, "a fleck on the speck on the tail on the frog on the bump on the branch on the log in the hole in the bottom of the sea" only has stack depth 1 as long as you can do a kind of tail call optimization (aka left-corner parsing with an arc-eager transition, see

https://www.aclweb.org/anthology/C92-1032.pdf

)

pixl97 wrote at 2020-11-05 18:13:55:

Is it only understood because of the particular build-up?

Anything could be explained if it's broken into a ton of steps, with each step adding a new level, but very little everyday language works like that.

jhardy54 wrote at 2020-11-05 16:03:51:

> Once you have finite precision (standard floats), it has finite memory. Then stack depths are bounded of course, and also it is not Turing complete anymore.

My understanding is that Turing machines assume infinite memory; could you help me understand how this differs? I'm guessing I'm not understanding your point, but if "finite memory" is a disqualifier, then Turing machines aren't Turing complete.

albertzeyer wrote at 2020-11-05 16:24:51:

Yes, a Turing machine (TM), or any machine as powerful as one, needs infinite memory. A Turing machine always has infinite memory (the tape is infinite); this is an essential property of a TM. Once you have only a finite tape, it is strictly less powerful. That means any model or machine with finite memory is not as powerful as a Turing machine, i.e. not Turing complete.

So, (standard) RNNs with finite precision (standard floats) have finite memory, and thus are not Turing complete.

This paper (On the Computational Power of Neural Nets) shows that RNNs with infinite precision are Turing complete. But in practice, we never model RNNs that way. There are other ways to add external (conceptually infinite) memory to an RNN, e.g. Neural Turing Machines (NTM) and the Differentiable Neural Computer (DNC).
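A back-of-the-envelope version of the finite-memory argument (illustrative numbers, assuming float32 activations):

```python
# With finite precision, the hidden state can only take finitely many values,
# so the RNN is equivalent to some (astronomically large) finite-state machine.

def max_rnn_states(hidden_units: int, bits_per_unit: int = 32) -> int:
    """Upper bound on the number of distinct hidden states."""
    return 2 ** (bits_per_unit * hidden_units)

print(max_rnn_states(1))                  # 2**32: ~4.3 billion states for one float32 unit
print(max_rnn_states(100).bit_length())   # a 3201-bit number of states: huge, but finite
```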

gizmo686 wrote at 2020-11-05 16:10:09:

I believe the insight is that, for practical purposes, you can restrict yourself to a bounded stack, under the assumption that with a large enough constant you will never run into an input that overruns your limit.

thesz wrote at 2020-11-05 08:08:38:

No word about convolutional sequence modeling:

https://openreview.net/pdf?id=rk8wKk-R-

(PDF)

It is even more unreasonably effective, outperforming the already unreasonably effective recurrent neural networks on a reasonably spectacular number of tasks.

Der_Einzige wrote at 2020-11-05 12:24:54:

When will this be reviewed? Says it's still under review at ICLR...

eli_gottlieb wrote at 2020-11-05 14:29:33:

Looks like it ended up as an ICLR 2018 workshop paper.

https://www.iclr.cc/Conferences/2018/Schedule?showEvent=511

jerome-jh wrote at 2020-11-05 08:29:19:

I read it as: RNN's (well actually FSTM's) can simulate a type of FSM's with a memory efficiency not thought possible.

kubanczyk wrote at 2020-11-05 09:12:31:

Don't you think that the plural nominatives without an apostrophe are less ambiguous? I mean _RNNs_ or _FSTMs_.

It's the OED's and CMoS's recommendation:

https://english.stackexchange.com/a/56010/15831

jerome-jh wrote at 2020-11-05 10:33:24:

OK, thanks for the advice. I will do that from now on. I notice there is an exception for single letters and single-digit numbers.

canjobear wrote at 2020-11-05 15:58:51:

_RNNs_ looks to me like an acronym with a lowercase part, like "RNN simple" or something. The apostrophe is less ambiguous in this context.

7373737373 wrote at 2020-11-05 12:45:49:

I'm surprised that genetic algorithms on resource-bounded general purpose virtual machines have not surpassed neural nets in their problem solving ability.

Is the computational advantage of neural nets on GPUs so dramatic?

cpgxiii wrote at 2020-11-05 15:52:04:

It comes down to gradients. When you have useful gradients to work with, backprop on a GPU is vastly faster and more directed than GAs. One could uncharitably say that all of the impressive results from NNs in the last decade are the result of this ability to throw large amounts of data into highly parallel training, rather than of new theory about the expressiveness and capabilities of NNs.

When you don't have or can't use a gradient, GAs become the go-to tool for optimization (if only because you don't have much in the way of other options).

king_magic wrote at 2020-11-05 12:48:42:

GAs can be way, way, way more compute intensive as your population size increases.

That said, I use GAs and find them to be an incredibly useful tool for various optimization tasks; just generally not for learning the weights of NNs, where a GPU + backprop reigns supreme (or at least works well enough, and much faster than a GA).

pchiusano wrote at 2020-11-05 17:32:59:

You might be interested in "compact GA", which only requires keeping the % of 1s (or 0s) at each bit position (a minimal sketch follows below).

I suspect that with some more engineering and attention from people doing ML, GA-style algorithms can be made just as memory- and space-efficient as gradient methods, while giving better results and being more widely applicable.

Here's a post on this:

http://pchiusano.github.io/2020-10-12/learning-without-a-gra...
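A minimal sketch of the compact GA update (with a stand-in onemax fitness; all names here are illustrative, not from the linked post):

```python
import random

def onemax(bits):                    # stand-in fitness: count the 1s
    return sum(bits)

def compact_ga(n_bits=20, virtual_pop=50, steps=2000):
    p = [0.5] * n_bits               # the whole "population": % of 1s per bit
    for _ in range(steps):
        a = [random.random() < q for q in p]   # sample two individuals
        b = [random.random() < q for q in p]
        win, lose = (a, b) if onemax(a) >= onemax(b) else (b, a)
        for i in range(n_bits):
            if win[i] != lose[i]:    # nudge each differing bit toward the winner
                p[i] += 1 / virtual_pop if win[i] else -1 / virtual_pop
                p[i] = min(1.0, max(0.0, p[i]))
    return p

print([round(q, 2) for q in compact_ga()])  # probabilities drift toward 1.0
```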

woah wrote at 2020-11-05 16:24:21:

With a genetic algorithm you have to try a bunch of weight variations and see which one works best. With backpropagation you can calculate one derivative and find out which variation will work best in one step. It’s hugely more efficient.
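A toy contrast on a one-dimensional loss (illustrative only):

```python
import random

f = lambda w: (w - 3.0) ** 2         # toy loss with minimum at w = 3

# GA-style: evaluate many variations per step, keep the best.
w = 0.0
for _ in range(100):
    w = min((w + random.gauss(0, 0.5) for _ in range(20)), key=f)  # 20 evals

# Gradient descent: one analytic derivative per step.
v = 0.0
for _ in range(100):
    v -= 0.1 * 2 * (v - 3.0)         # df/dw = 2(w - 3)

print(round(w, 2), round(v, 2))      # both end near 3.0; the GA used ~20x the evaluations
```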

pchiusano wrote at 2020-11-05 18:02:38:

GA-style search isn't actually taking multiple samples to decide on a new point to move to; it's taking multiple samples to decide on a smaller _region_ to focus on. This can be more efficient than backprop, depending on how "easy" it is to tell via sampling which subregion has better performance.

Here's a post on this:

http://pchiusano.github.io/2020-10-12/learning-without-a-gra...

Traditional GAs have a practical problem for training large models: keeping a large population of model weights in memory is not feasible. For example, if you had to keep 1000 variations of the GPT-3 weights in memory during training, that would be no good. Though people have ideas for how to solve that as well (again, see the post I linked).
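The arithmetic behind that objection, assuming float32 weights:

```python
params = 175e9                  # GPT-3 parameter count
bytes_per_param = 4             # float32 (an assumption; fp16 would halve this)
population = 1000               # candidate models held at once
total_bytes = params * bytes_per_param * population
print(f"{total_bytes / 1e12:.0f} TB")   # ~700 TB just to hold the population
```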

gmadsen wrote at 2020-11-05 14:13:05:

Yes, extremely.

Parallelism like that isn't possible with GAs.

Der_Einzige wrote at 2020-11-05 17:19:32:

You can use genetic algorithms for training the weights of a neural network. This is called neuroevolution, and it's competitive in reinforcement learning, because gradients are harder to come by there...
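A minimal neuroevolution sketch, as a (1+1)-style hill climber on a stand-in XOR task (gradient-free by construction):

```python
import math, random

def forward(w, x1, x2):              # tiny 2-2-1 tanh network, weights in a flat list
    h1 = math.tanh(w[0]*x1 + w[1]*x2 + w[2])
    h2 = math.tanh(w[3]*x1 + w[4]*x2 + w[5])
    return math.tanh(w[6]*h1 + w[7]*h2 + w[8])

XOR = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]

def fitness(w):                      # negative squared error: higher is better
    return -sum((forward(w, a, b) - y) ** 2 for a, b, y in XOR)

best = [random.gauss(0, 1) for _ in range(9)]
for _ in range(2000):
    child = [wi + random.gauss(0, 0.1) for wi in best]   # mutate
    if fitness(child) >= fitness(best):                  # select, no gradients
        best = child

print(round(-fitness(best), 3))      # error never increases; no backprop involved
```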

GreenHeuristics wrote at 2020-11-05 07:17:25:

What's unreasonable about it?

marton78 wrote at 2020-11-05 07:27:29:

It's a homage to

https://en.wikipedia.org/wiki/The_Unreasonable_Effectiveness...

The title has become a meme, just like "... considered harmful".

kubanczyk wrote at 2020-11-05 09:20:07:

Actually, it's a homage to "The Unreasonable Effectiveness of Recurrent Neural Networks" which is a homage to "The Unreasonable Effectiveness of Mathematics in the Natural Sciences".

Nowado wrote at 2020-11-05 10:25:47:

Meme chains are definitely progress compared to eponyms, but personally I hoped we'd make a slightly bigger jump this time.

ve55 wrote at 2020-11-05 07:26:22:

It would be better read as something like 'unreasonably surprising'.

neolog wrote at 2020-11-05 07:22:52:

Please add an RSS feed to your blog.

pedro1976 wrote at 2020-11-05 08:25:11:

As an interim solution, you can use this feed [0], which is generated by rss-proxy [1].

[0]

https://rssproxy.migor.org/api/feed?url=https%3A%2F%2Fnlp.st...

[1]

https://github.com/damoeb/rss-proxy

adamnemecek wrote at 2020-11-05 07:40:19:

The ad nauseam repetition of memes like "considered harmful" or "unreasonable..." is too much.

FeepingCreature wrote at 2020-11-05 07:43:14:

I think you mean "Unreasonable repetition of 'unreasonable' and 'considered harmful' considered harmful."

After all, self-referentiality is all you need.

dunefox wrote at 2020-11-05 12:23:52:

I actually like this kind of title, so... `"Unreasonable repetition of 'unreasonable' and 'considered harmful' considered harmful." considered harmful.`