Surely any language with a finite longest word can be made regular: build an automaton whose start state has transitions to 26 states, one per letter, have each of those states go to another 26 states, and so on, with states going to a looping non-final state whenever no word can begin with the letters read so far. Then make every state that ends on a word final.
8 Answers
The English language is regular if you consider it as a set of single words. However, English is more than a set of words in a dictionary. English grammar is the non-regular part. Given a paragraph, there is no DFA deciding whether it is a well-written paragraph in the English language. Of course, a DFA can say whether each word is an English word or not, but it cannot judge whole paragraphs.
Expansion of my comments to narek Bojikian's answer:
When people talk about natural languages such as English not being regular, they're usually talking on the level of grammar (syntax) rather than individual words. For instance, English has centre embeddings: you can build sentences of the form "the mouse escaped", "the mouse the cat chased escaped", "the mouse the cat the man owned chased escaped" that are grammatical and are arbitrarily long. These are similar to the language of matched parentheses "()", "(())" etc., and hence irregular for the same reasons.
However we can't conclude that English is irregular just from the fact that it has an irregular subset (for instance the language of all strings of parentheses does too, but that's clearly regular). The valid argument comes from the fact that all grammatical English strings of the form "(the [noun])*[verb]*" have matching numbers of nouns and verbs. The set of all strings of the form "(the [noun])*[verb]*" (call it $C$) is clearly regular, therefore if English were regular then the intersection of English and $C$ would be too. But the intersection is the irregular set of centre embeddings, hence English can't be regular.
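The intersection argument above can be sketched in Python, assuming a made-up toy vocabulary (the `NOUNS` and `VERBS` sets and the helper names are hypothetical, not from any real grammar). Membership in $C$ is an ordinary regular expression over token classes; the extra check that the noun and verb counts agree is the part that no finite automaton can do for unbounded depth.

```python
import re

# Hypothetical toy vocabulary, chosen only for illustration.
NOUNS = {"mouse", "cat", "man"}
VERBS = {"escaped", "chased", "owned"}


def encode(sentence):
    """Map each token to one symbol: t = 'the', n = noun, v = verb, x = other."""
    symbols = []
    for token in sentence.lower().split():
        if token == "the":
            symbols.append("t")
        elif token in NOUNS:
            symbols.append("n")
        elif token in VERBS:
            symbols.append("v")
        else:
            symbols.append("x")
    return "".join(symbols)


# C = (the [noun])* [verb]* : a regular set, so a plain regex is enough.
C = re.compile(r"(?:tn)*v*")


def in_C(sentence):
    """True if the sentence has the shape (the [noun])* [verb]*."""
    return C.fullmatch(encode(sentence)) is not None


def is_centre_embedding(sentence):
    """True if the sentence is in C and has equally many nouns and verbs."""
    code = encode(sentence)
    return C.fullmatch(code) is not None and code.count("n") == code.count("v")
```

Here `in_C("the mouse the cat escaped")` holds but `is_centre_embedding` rejects it, because two nouns are followed by only one verb. The sketch only compares counts; it makes no attempt to judge whether a sentence is meaningful.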
One could argue that this argument isn't really correct: in practice no one ever uses centre embeddings with a depth of more than three, so the set of centre embeddings that actually exist is regular. This is a valid point, and is one of many reasons that the Chomsky hierarchy isn't always relevant to linguistics. But there are still a lot of situations where it is useful: for example, saying that natural languages are context-free and then considering parsing them with pushdown automata suggests some possible models for how humans parse language.
It doesn't really matter whether English (either words, or complete sentences) is a regular or say context free language or not. What matters is that it is very, very hard to produce a state engine or a grammar of a reasonable size whose language is reasonably close to the English language.
But let’s say we are told that English sentences are not allowed to have more than a billion letters. So it is a regular language. Does that knowledge gain us anything? No, it doesn’t.
(Making my comment into an answer as requested by Ben I.) English is so nebulous that it is not a regular language for the trivial reason that it is not even a well-defined language. It is just like asking why mathematics is not a regular language.
This may seem like a non-answer, and to some extent it is, because every question regarding English depends on the definition of "English", and it is undeniable that nobody can ever define "English" 100% precisely. Sure, if you choose to define English as some language that is generated by a context-free grammar, then of course you can prove things about it. For example, if English permits arbitrarily deep center embedding, then you might be able to use the pumping lemma to show that it is not regular, but necessarily the proof will depend on the exact definition of English.
In any case, it occurs to me that you may have a misunderstanding about the term "word". In normal (non-technical) usage, an English word is a string of letters (sometimes including hyphens too). In theoretical CS, "word" simply means a string of symbols. If we treat English as a set of grammatically valid sentences, then each English sentence is made up of English words, spaces and punctuation, but these words are not the same as the words of a formal grammar that generates English (even if there is one).
Using the terms as they are used in formal computer science, it is indeed true that any language with a finite longest word is regular. Here a language is a set of words over an alphabet, and a word over an alphabet is a sequence of symbols from the alphabet.
So if the language "English" is the set of words over a 26-symbol alphabet whose membership is determined by e.g. having a definition in the Oxford English Dictionary, then English is a regular language.
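As a minimal sketch of why a finite word list gives a regular language, here is a trie used as a DFA in Python. The three-word list stands in for a real dictionary, and the names `build_dfa` and `accepts` are made up for illustration. Each trie node is a state, a missing transition plays the role of the dead state, and a state is final exactly when a word ends there.

```python
def new_state():
    """A DFA state: a 'final' flag plus outgoing transitions keyed by letter."""
    return {"final": False, "next": {}}


def build_dfa(words):
    """Build a trie-shaped DFA that accepts exactly the given finite word list."""
    start = new_state()
    for word in words:
        state = start
        for ch in word:
            # Reuse the existing transition or create a fresh state.
            state = state["next"].setdefault(ch, new_state())
        state["final"] = True
    return start


def accepts(start, s):
    """Run the DFA on s; a missing transition is the implicit dead state."""
    state = start
    for ch in s:
        if ch not in state["next"]:
            return False
        state = state["next"][ch]
    return state["final"]
```

For example, with `dfa = build_dfa(["cat", "cats", "dog"])`, the call `accepts(dfa, "cats")` succeeds while `accepts(dfa, "ca")` fails, because the state reached after "ca" is not final.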
However, in the context in which the question arose the language "English" was probably intended to be the set of words over an alphabet which also includes space and punctuation marks, with membership determined by something like "being an utterance which a native speaker might make". Formalising that membership would be a challenge in itself. Is "Colourless green ideas sleep furiously." a word in English? What about "'Twas brillig."? But the odds are that it would require some notion of backreference to ensure correct correspondence of gender, plurality, or even name in the presence of anaphora or cataphora.
I note, as an aside, that I am given to understand that Swiss German, considered at the level of the grammatical sentence, has a correspondence mechanism which makes it not even context-free.
English is not a regular language not because it's irregular, but because it's not a language.
(In the formal sense, that is.)
The common claim from CS/formal-logic folks that English (or any other human language) is not regular is based on nestability of grammatical constructs. Same idea as "language of balanced-parentheses strings is not regular", clearly provable by the pumping lemma. The problem here is that arbitrarily-nested grammatical constructs are not members of the language, because beyond some (indefinite but clearly existent) limit they're no longer intelligible.
Rather, English (and other human languages) aren't regular languages because they're not languages. There is no rigorous definition of what is or is not a member of the language. For some strings of words, most users of the language are in agreement as to whether or not they're in the language, but for plenty of others, nobody agrees, and for many, even most individuals are not sure.
English, at the word level (no sentences, let alone grammar, punctuation or paragraphs), is regular provided we assume there is a finite number of words (no matter whether there are 1 million, 2 million, 10 million or even more, as long as it is finite). If nothing else, this is due to the fact that all finite languages are regular. It is fairly easy to construct an NFA with a start state and an epsilon transition to the start state of every word. Each of these states would then have a single character to move it to the next state, and any other moves go to a dead state. Of course this isn't the best possible construction, and there are ways to group together common suffixes with epsilon transitions from valid prefixes (e.g. skiing is valid, kayaking is valid, doging is not), but that is all irrelevant. Also, it is possible to write a regular expression for English words. Simply union together all the words, e.g. airplane + avocado + ... + ball + ... + zebra + ... It would be a hell of a long expression, but it is valid nonetheless.
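The "union of all the words" expression can be sketched with Python's `re` module, which writes union as `|` rather than `+`. The four-word list is only a stand-in for a real dictionary, and `union_regex` and `is_word` are hypothetical helper names.

```python
import re

# Stand-in for a full dictionary; any finite list works the same way.
WORDS = ["airplane", "avocado", "ball", "zebra"]


def union_regex(words):
    """Compile word1|word2|...; re.escape guards against special characters."""
    return re.compile("|".join(re.escape(w) for w in words))


ENGLISH_WORD = union_regex(WORDS)


def is_word(s):
    """True if s is exactly one of the listed words."""
    return ENGLISH_WORD.fullmatch(s) is not None
```

With this list, `is_word("avocado")` succeeds and `is_word("zebras")` fails. Using `fullmatch` matters here: a plain `search` would also accept any string that merely contains a listed word.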
English at the sentence level (never mind syntax, punctuation, sentence order etc.) is not regular, and this is easy to show using the pumping lemma. Consider some story, novel, article etc. that is larger than the number of states in our supposed DFA. Even if there are a billion states, surely we can find some English text with more characters than that. Now, we must be able to find some subsection that can be repeated twice, thrice ... infinitely many times and the piece still remains in valid English. It is nearly impossible to find this (even if y is a space, two spaces are considered a typo). Of course there are obscure examples that are debatable (eg. The tree grew bigger and bigger, The tree grew bigger and bigger and bigger ... ). It is debatable here how many times repeating "and bigger" would be acceptable. Even if the answer is infinite, it still does not mean English is regular.
The pumping lemma doesn't say there merely exists some w longer than p that can be split into xyz, it says that EVERY single w longer than p must satisfy this condition. Surely I can find another piece written in valid English that is long enough, but does not contain any of the obscure debatable examples (like "and bigger"). Applying the pumping lemma here clearly shows English is not regular.
Interesting conversation.
I think that the grammar of English is not regular, because the grammar was defined the way programming languages are defined.
But natural spoken English (or any other language) is regular.
The example that breaks regularity is $\{a^n b^n \mid n \ge 0\}$, but I think that no human being speaks English so well that he/she can say a phrase similar to $a^{1000000} b^{1000000}$, and there is no human being who could say whether the person who said it was correct or not.
So I think that $L = \{a^n b^n \mid n < 1000000\} \cup a^{1000000} a^* b^{1000000} b^*$ is a regular language that will cover any real English conversation in this context.
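To see that a language of this shape is regular, here is a small Python sketch that builds the regular expression for a given bound. It uses a tiny bound instead of 1000000 so the pattern stays readable, and the function name `bounded_anbn_regex` is made up for illustration.

```python
import re


def bounded_anbn_regex(bound):
    """Regex for {a^n b^n | n < bound} united with a^bound a* b^bound b*."""
    # One alternative per exact depth below the bound: a{n}b{n}.
    exact = [f"a{{{n}}}b{{{n}}}" for n in range(bound)]
    # At or beyond the bound the counts are no longer compared.
    overflow = f"a{{{bound},}}b{{{bound},}}"
    return re.compile("|".join(exact + [overflow]))


L3 = bounded_anbn_regex(3)
```

With a bound of 3, `L3.fullmatch("aabb")` matches and `L3.fullmatch("aab")` does not, while `"aaabbbb"` is accepted because both runs have reached the bound. The pattern is a finite union, so it is regular, and it grows with the bound.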
So, grammatical English is not regular. But natural spoken English is regular.
IMHO