eli5 Entropy in syntax and Spearman’s rank correlation coefficient?

Hi,

I’m doing a school project where we compare the entropy rate of a language’s word order (a measure of how free its word order is) with its number of speakers. Our hypothesis is that a lower number of speakers leads to greater freedom/complexity in word order.

We’re comparing these two variables with Spearman’s rank correlation coefficient, but I don’t understand how it works. If we get a result that is closer to -1, as I understand it, that would mean that if one of the variables decreases, then the other one decreases as well.

If that’s right, a strong leaning towards -1 would indicate that a smaller number of speakers correlates with lower freedom of word order? But that doesn’t make sense.

Am I getting something wrong?

2 Answers

Anonymous 0 Comments

With correlation in general, a coefficient > 0 means that higher values of variable A are associated with higher values of variable B, and a coefficient < 0 means that higher values of variable A are associated with lower values of variable B.

>If we get a result that is closer to -1, as I understand it, that would mean that if one of the variables decreases, then the other one decreases as well.

So this is wrong: both variables decreasing together is what a *positive* coefficient describes.

If your variables are “number of speakers” and “freedom of word order”, then a correlation of -1 (< 0) indicates that as one goes up the other goes down, and vice versa.

But most of all, you should just plot the data, and the relationship should become clearer. Statistics are for formal hypothesis testing, not so much for gaining initial insights into your data. A plot will also show you weird outliers and such that you should consider when interpreting the statistics.

E: specifically, Spearman *rank* correlation looks at whether a high rank in variable A is associated with a high rank in variable B. Rank is a value’s place in the sorted list of all values for that variable. For instance, if you have four languages A through D with the following (totally made up) speaker counts…

– A: 3 billion

– B: 10 thousand

– C: 5 million

– D: 2

Then the ranks of A, B, C, D are 4, 2, 3, 1 respectively (1 being the smallest). You would rank the freedom of word order values the same way, and the test then calculates how well the ranks line up between the two. This makes the test *nonparametric*, because it doesn’t look at the actual values but only their order (A’s 3 billion speakers does not factor into the result beyond just being the biggest value of the bunch).
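To make the ranking step concrete, here is a minimal Python sketch using the made-up speaker counts above. The `freedom` values are purely hypothetical numbers invented for illustration, the helper names `ranks` and `spearman` are my own, and the shortcut formula used here only holds when there are no tied values.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n (largest). Assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman's rho via the no-ties shortcut: 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Languages A, B, C, D with the made-up speaker counts from above.
speakers = [3_000_000_000, 10_000, 5_000_000, 2]
# Hypothetical word-order entropy values (invented for this example).
freedom = [0.2, 0.9, 0.4, 0.7]

speaker_ranks = ranks(speakers)  # [4, 2, 3, 1]
freedom_ranks = ranks(freedom)   # [1, 4, 2, 3]
rho = spearman(speakers, freedom)
print(speaker_ranks, freedom_ranks, round(rho, 3))
```

With these invented numbers the coefficient comes out strongly negative (about -0.8): the most widely spoken language has the lowest freedom rank, and the less spoken ones mostly rank higher. Changing 3 billion to 30 billion would not change the result, since only the ordering matters.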

Anonymous 0 Comments

If your two variables are “freedom of word order” and “number of speakers”, your rank-correlation coefficient would tell you:

* For a coefficient near +1, languages that are highly ranked in one are highly ranked in the other. That is, more free word order correlates with more speakers.

* For a coefficient near 0, there’s little relationship between the two, at least in terms of ranking.

* For a coefficient near -1, highly-ranked items in one are **lowly** ranked in the other. In other words, highly order-free languages are *less* spoken, and highly spoken languages are less order-free.

This is similar behavior to the more familiar Pearson (“r”) correlation coefficient; it just doesn’t depend on there being a linear relationship. Any consistently increasing or decreasing relationship gives a coefficient near +1 or -1.
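As a rough illustration of that difference, the sketch below compares the two coefficients on invented data where `y` always falls as `x` rises, but not along a straight line. The data and the helper names (`pearson`, `ranks`, `spearman`) are assumptions for this example only; Spearman is computed here as Pearson applied to the ranks, which is the standard definition when there are no ties.

```python
from math import sqrt


def pearson(x, y):
    """Pearson's r: covariance divided by the product of the spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """Rank values from 1 (smallest) to n (largest). Assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Invented data: y strictly decreases as x increases, but very non-linearly.
x = [1, 2, 3, 4, 5, 6]
y = [10.0 ** -v for v in x]

r = pearson(x, y)
rho = spearman(x, y)
print(f"Pearson r = {r:.2f}, Spearman rho = {rho:.2f}")
```

Here Spearman reports a perfect negative relationship, because the ordering is exactly reversed, while Pearson is noticeably weaker because the points do not lie on a line.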