Sentence Segmentation

Posted on April 19, 2015 by Brian Jaress

Tags: code, python

Getting a computer to figure out where English sentences begin and end hasn’t been perfectly solved. There are libraries that do it, but they all seem focused on getting incrementally closer to how humans handle rare, ambiguous, or grammatically incorrect cases (and publishing academic papers along the way).

As a small part of a fun side project that’s just getting started, I needed sentence breaking specifically for well-edited text. For example, if someone forgot to capitalize the start of their sentence, I’d rather just move on than try to compensate with something that could misfire on valid sentences but makes a great research paper.

Besides, the whole project is for fun, and writing this part sounded fun.

Code

Here’s the core of the result in Python:

def ends_sentence(word, next_word):
    return (punctuated_as_end(word) and capitalized(next_word) and
            not (capitalized(word) and abbreviated_name(word)) and
            not acronym(word))
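A predicate like this gets applied to each adjacent pair of words. The driver below is only a sketch of how that might look, not the code from the project: `split_sentences` and `naive_ends_sentence` are hypothetical names, and the stand-in predicate is deliberately cruder than `ends_sentence` so that the block runs on its own.

```python
def split_sentences(paragraph, ends_sentence):
    """Group whitespace-separated words into sentences.

    `ends_sentence(word, next_word)` decides whether a sentence
    boundary falls between two adjacent words.
    """
    words = paragraph.split()
    sentences = []
    current = []
    for word, next_word in zip(words, words[1:]):
        current.append(word)
        if ends_sentence(word, next_word):
            sentences.append(" ".join(current))
            current = []
    if words:
        # The last word has no successor, so it always closes a sentence.
        current.append(words[-1])
        sentences.append(" ".join(current))
    return sentences


def naive_ends_sentence(word, next_word):
    # Hypothetical stand-in: terminal punctuation followed by a capital.
    return word[-1] in ".!?" and next_word[:1].isupper()


if __name__ == "__main__":
    text = "Soon he became dizzy and faint. He wept and wailed to himself."
    for sentence in split_sentences(text, naive_ends_sentence):
        print(sentence)
```

The stand-in splits "Mr. Smith left." after "Mr.", which is the kind of mistake the extra checks in `ends_sentence` are there to avoid.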

The basic approach was to spend a weekend turning the crank of test-driven development (TDD), starting with simple test cases, moving on to more sophisticated cases, then finally the test cases in How to Split Sentences by actual experts Olexiy Sliusarenko & Vsevolod Dyomkin.¹

You can check the full code to see all the tests and which of them actually pass (there’s no perfect solution, after all). The main helpers are below:

import re

def punctuated_as_end(word):
    # Look at the run of non-word characters at the end of the word, if any.
    try:
        punct_suffix = re.findall(r"\W+$", word)[-1]
    except IndexError:
        return False
    return not END_PUNCTUATION.isdisjoint(frozenset(punct_suffix))

def capitalized(word):
    # Truthy when the word's letters start with a capital.
    return re.match(r"[A-Z][^A-Z]*", letters(word))

def acronym(word):
    # Capitals and periods only, ending in a period (e.g. "U.S.A.").
    return re.match(r"[A-Z.]+\.$", word)

def abbreviated_name(word):
    ltrs = letters(word)
    if re.match(r"[A-Z]*$", ltrs):
        return True # initial
    return ltrs in ABBREVIATED_NAMES

As sometimes happens with TDD, the whole is more accurate than its parts. For example, abbreviated_name calls “I” an abbreviated name because a lone capital letter looks like an initial, but the other checks, for things like a period and the next word being capitalized, usually catch that.

As a final, non-TDD touch, the lookup sets were beefed up with obvious alternatives. (For example, there’s only a test for “Mr.,” but ABBREVIATED_NAMES has “Mrs.” as well.)
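To try the pieces above together, here is a condensed, self-contained sketch. The post does not show `letters`, `END_PUNCTUATION`, or `ABBREVIATED_NAMES`, so the definitions given for them here are assumptions made purely for illustration. The helpers are also tightened to return plain booleans, which the originals do not do.

```python
import re

# Assumptions: the post does not show these three definitions, so the
# values and the body of `letters` below are illustrative guesses.
END_PUNCTUATION = frozenset(".!?")
ABBREVIATED_NAMES = frozenset({"Mr", "Mrs", "Ms", "Dr"})


def letters(word):
    # Assumed helper: keep only ASCII letters, dropping punctuation.
    return re.sub(r"[^A-Za-z]", "", word)


def punctuated_as_end(word):
    # True when the trailing non-word characters include end punctuation.
    match = re.search(r"\W+$", word)
    return bool(match) and not END_PUNCTUATION.isdisjoint(match.group())


def capitalized(word):
    return bool(re.match(r"[A-Z][^A-Z]*", letters(word)))


def acronym(word):
    return bool(re.match(r"[A-Z.]+\.$", word))


def abbreviated_name(word):
    ltrs = letters(word)
    # All-capital letters are treated as an initial.
    return bool(re.match(r"[A-Z]*$", ltrs)) or ltrs in ABBREVIATED_NAMES


def ends_sentence(word, next_word):
    return (punctuated_as_end(word) and capitalized(next_word) and
            not (capitalized(word) and abbreviated_name(word)) and
            not acronym(word))
```

Under these assumed definitions, `abbreviated_name("I")` is true, yet `ends_sentence("I", "went")` is still false because “I” carries no end punctuation. Likewise `ends_sentence("Mr.", "Smith")` and `ends_sentence("Mrs.", "Jones")` are false, while `ends_sentence("faint.", "He")` is true.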

Sample Output

Common practice in natural language processing is to judge the code by running it on huge, tagged corpora, but that requires access to those corpora.

Instead, here’s output from running the code on some novels from Project Gutenberg. Below are excerpts from each novel’s output, with links to the whole output. Paragraph breaks were preserved, and the line breaks were changed only within each paragraph.

"What do I hear?
You, my dear master! you in this terrible plight!
What misfortune has happened to you?
Why are you no longer in the most magnificent of castles?
What has become of Miss Cunegonde, the pearl of girls, and nature's masterpiece?"

– Candide

And meanwhile his hunger grew and grew.
The only relief poor Pinocchio had was to yawn; and he certainly did yawn, such a big yawn that his mouth stretched out to the tips of his ears.
Soon he became dizzy and faint.
He wept and wailed to himself:
"The Talking Cricket was right.
It was wrong of me to disobey Father and to run away from home.
If he were here now, I wouldn't be so hungry!
Oh, how horrible it is to be hungry!"

– The Adventures of Pinocchio

"To whom dost thou talk of alighting or sleeping?" said Don Quixote.
"Am I one of those knights who take repose in time of danger?
Sleep thou, who wert born to sleep, or do what thou wilt:
I shall act as becomes my profession."

– The History of Don Quixote de la Mancha

¹ Somehow that article came to me with a title calling them “Golden Rules,” which is catchy but not in the article itself.