Plotting the classics

Note

This page has content from the Plotting_the_Classics notebook of an older version of the UC Berkeley data science course. See the Berkeley course section of the license file.

In this example, we will explore statistics for Pride and Prejudice by Jane Austen. The text of any book can be read by a computer at great speed. Books published before 1923 are currently in the public domain, meaning that everyone has the right to copy or use the text in any way. Project Gutenberg is a website that publishes public domain books online. Using Python, we can load the text of these books directly from the web.

This example is meant to illustrate some of the broad themes of this text. Don’t worry if the details of the program don’t yet make sense. Instead, focus on interpreting the images generated below. Later sections of the text will describe most of the features of the Python programming language used below.

First, we read the text of the book into the memory of the computer.

We have taken the liberty of downloading a copy of the text from http://www.gutenberg.org/ebooks/42671.txt.utf-8 to the data directory of the textbook, but you can check that the file data/pride_and_prejudice.txt is the same as the copy you see on the web.

# Get the text for Pride and Prejudice.
# Don't worry about this code for the moment.
from pathlib import Path
book_text = Path('data/pride_and_prejudice.txt').read_text()

On the last line, Python reads the text of the book from our copy at data/pride_and_prejudice.txt, and gives that text a name (book_text). In Python, a name cannot contain any spaces, and so we will often use an underscore _ to stand in for a space. The = gives a name (on the left) to the result of some computation described on the right.

The # symbol starts a comment, which is ignored by the computer but helpful for people reading the code.
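To make names, the = sign, and comments concrete, here is a minimal sketch. The names first_line and line_length are made up for this illustration and are not used elsewhere on this page.

```python
# Everything after a # symbol is a comment; Python ignores it.
# The name first_line is hypothetical, chosen only for this illustration.
first_line = 'It is a truth universally acknowledged'

# The = gives the name on the left to the result of the computation on the right.
# Here the computation counts the characters in first_line.
line_length = len(first_line)
print(line_length)
```

Notice that both names use an underscore where we might otherwise have written a space.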

Now that we have the text attached to the name book_text, we can ask Python to show us how the text starts:

# Show the first 500 characters of the text
print(book_text[:500])

The Project Gutenberg eBook of Pride and Prejudice

This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org. If you are not located in the United States,
you will have to check the laws of the country where you are located
before using this eBook.

Tit
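The [:500] in the code above asks for the characters from the start of the text up to, but not including, position 500. Python counts characters, not words, which is why the output stops mid-word at "Tit". The sketch below shows the same idea on a much shorter string; the names sample_text and first_part are made up for this illustration.

```python
# A short stand-in string; the real book_text is far longer.
sample_text = 'The Project Gutenberg eBook of Pride and Prejudice'

# [:11] selects the characters from the start up to, but not including, position 11.
first_part = sample_text[:11]
print(first_part)
```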

You might want to check this is the same as the text you see by opening the URL in your browser: http://www.gutenberg.org/ebooks/42671.txt.utf-8

Now that we have the text in memory, we can start to analyze it. First we break the text into chapters. Don’t worry about the details of the code; we will cover these in the rest of the course.

# Break the text into Chapters
book_chapters = book_text.split('CHAPTER ')
# Drop the first "Chapter" - it's the Project Gutenberg header
book_chapters = book_chapters[1:]
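To see what these two steps do, here is a small sketch on a made-up text. The names toy_text, toy_pieces and toy_chapters, and the text itself, are invented for this illustration and are not the real book data.

```python
# A tiny made-up text standing in for the book; not the real data.
toy_text = 'Header text CHAPTER I. First words CHAPTER II. More words'

# split cuts the string at every occurrence of 'CHAPTER '
# and returns a list of the pieces in between.
toy_pieces = toy_text.split('CHAPTER ')

# The first piece is whatever came before the first 'CHAPTER ',
# so we drop it and keep the rest.
toy_chapters = toy_pieces[1:]
print(toy_chapters)
```

The piece we drop here plays the role of the Project Gutenberg header in the real text.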

We can show the first half-line or so for each chapter, by putting the chapters into a table. You will see these tables or data frames many times during this course.

# Show the first few words of each chapter in a table.
import pandas as pd
pd.DataFrame(book_chapters, columns=['Chapters'])

	Chapters
0	I.\n\n\nIt is a truth universally acknowledged...
1	II.\n\n\nMr. Bennet was among the earliest of ...
2	III.\n\n\nNot all that Mrs. Bennet, however, w...
3	IV.\n\n\nWhen Jane and Elizabeth were alone, t...
4	V.\n\n\nWithin a short walk of Longbourn lived...
...	...
56	XV.\n\n\nThe discomposure of spirits, which th...
57	XVI.\n\n\nInstead of receiving any such letter...
58	XVII.\n\n\n"My dear Lizzy, where can you have ...
59	XVIII.\n\n\nElizabeth's spirits soon rising to...
60	XIX.\n\n\nHappy for all her maternal feelings ...

61 rows × 1 columns

This is your first view of a data frame. Ignore the first column for now; it is just a row number. The second column shows the first few characters of the text of each chapter. The text starts with the chapter number in Roman numerals. You might want to check the text from the link above to reassure yourself that this comes from the text we downloaded. Next you see some odd characters with backslashes, such as \n. These represent new lines, or paragraph marks. Last, you see the beginning of the first sentence of the chapter.
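The sketch below shows why a new line appears as \n in the table. The name toy_chapter and its contents are made up for this illustration; they only imitate the start of a chapter.

```python
# A made-up string standing in for the start of a chapter; \n marks a new line.
toy_chapter = 'I.\n\nIt is a truth'

# print displays each \n as an actual line break.
print(toy_chapter)

# repr shows the same string with each \n written out, as in the table.
print(repr(toy_chapter))

# Splitting at \n gives the three lines: 'I.', an empty line, and 'It is a truth'.
toy_lines = toy_chapter.split('\n')
print(toy_lines)
```

The table uses the written-out form so that each chapter fits on a single row.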
