Improving the Perseus Vocabulary Generator (Greek)

Perseus Tufts has a terrific lemmatizer, a program that parses the words in a text and organizes the unique words into an ordered list. The dictionary definitions and repeated words that the lemmatizer outputs, however, make its lists practically unusable for students who want to upload a vocabulary list into a flashcard program like Quizlet or Anki. Here I provide a guide to getting your computer to semi-automatically remove repeated or incorrect words and replace Perseus’ definitions with Logeion’s.

One caveat about this word list: one of the first things the lemmatizer does is remove the accents from the words, so when the accents are put back in at the end, the program produces every possible word with those letters. If the word βίος is in the passage, the vocabulary list will give both the word for life (βίος) and the word for bow (βιός) as separate entries, even though the word for bow never appeared in the text. This is not very common, though, and the student can either delete the extra word or keep it in their flashcard program, bearing in mind that the accent changes the meaning of the word.
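To see why the two words collide, here is a minimal Python sketch of accent stripping. It is not the code Perseus or the accent remover actually runs; the function names `strip_accents` and `group_by_bare_form` are my own, and the approach (Unicode decomposition, then dropping combining marks) is simply one common way to do it.

```python
import unicodedata
from collections import defaultdict


def strip_accents(word: str) -> str:
    """Remove accents, breathings, and other combining marks from a word."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", word)
    # Category "Mn" (nonspacing mark) covers the Greek accents and breathings.
    bare = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", bare)


def group_by_bare_form(words):
    """Map each accent-free spelling to the accented words that share it."""
    groups = defaultdict(list)
    for word in words:
        groups[strip_accents(word)].append(word)
    return dict(groups)


# Hypothetical three-word list: two of the words differ only by accent.
groups = group_by_bare_form(["βίος", "βιός", "λόγος"])
collisions = {bare: forms for bare, forms in groups.items() if len(forms) > 1}
print(collisions)
```

Once the accents are gone, βίος and βιός are the same string, so anything that later restores accents from that string has no way of knowing which of the two was in the passage.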

Make a copy of this Excel spreadsheet; you will use it later. Then go to Perseus’ vocabulary generator for Greek, found here, and select the text you want a vocabulary list for. I like to set it to “Alphabetical Order” and to include ALL words.

Select the whole page (Command-A), copy it, and paste it into the “Sandbox” tab of your copy of the Excel spreadsheet.

Delete any extra stuff on the spreadsheet from the Perseus website, such as columns that include stats, dictionary definitions (we are going to add new dictionary definitions later), etc. Delete everything except the Greek vocabulary list.

Once you have also removed all the extra formatting (bold, italics, and so on) and set the font and size you want, select the list (Command-A), copy it, and paste it into this accent remover.

Copy the result and paste it into the spreadsheet. This time, instead of pasting into the tab at the bottom named “Sandbox,” paste it into the tab named “TEXT,” under the column named “TITLE.”

Once you paste the text into the “TITLE” column, the next column will fill in automatically. Some cells will say #N/A; ignore them for now.

Make sure the formula in column B covers the whole word list. If it doesn’t, as seen below, click the corner of the box and drag it down.
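The spreadsheet's formula is not reproduced in this guide, so the following is only a guess at what it does: look each accent-free word up in a table of definitions and show #N/A when there is no match. The Python sketch below mirrors that behavior with a plain dictionary. The function name `look_up` and the two sample glosses are illustrative and are not taken from Logeion.

```python
NOT_FOUND = "#N/A"  # what the spreadsheet shows when a lookup fails


def look_up(bare_words, definitions):
    """Return one definition per word, or "#N/A" when the word is missing."""
    return [definitions.get(word, NOT_FOUND) for word in bare_words]


# Hypothetical definitions table keyed by accent-free spelling.
definitions = {
    "λογος": "word, speech",
    "θεος": "god",
}

# The last entry is deliberately absent from the table.
column_b = look_up(["θεος", "λογος", "ξξξ"], definitions)
print(column_b)
```

This is also why the formula has to cover every row: a word with no formula next to it gets no definition at all, not even #N/A.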

Add back in the accented words from Sandbox. Paste that list into the column named “FRONT.” NB: the columns “SHORT_DEFINITION” and “FRONT” will become your flashcards in Anki.

Now it’s time to remove those pesky duplicates, particularly those #N/As. Highlight columns A, B, and C. Click Data, then Data Cleanup, then Remove Duplicates. Select only Column B and click Remove Duplicates.
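For anyone curious what this step does under the hood, here is a rough Python equivalent: keep the first row for each distinct value in column B and drop the rest. The function name `remove_duplicates` and the sample rows are invented for illustration, and I am assuming the columns run TITLE, SHORT_DEFINITION, FRONT from left to right.

```python
def remove_duplicates(rows, key_index=1):
    """Keep only the first row for each distinct value in one column.

    key_index=1 means column B, counting columns from zero.
    """
    seen = set()
    unique = []
    for row in rows:
        key = row[key_index]
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Invented sample rows: (TITLE, SHORT_DEFINITION, FRONT).
rows = [
    ("λογος", "word, speech", "λόγος"),
    ("λογος", "word, speech", "λόγος"),
    ("ζευς", "#N/A", "Ζεύς"),
    ("ερμης", "#N/A", "Ἑρμῆς"),
    ("θεος", "god", "θεός"),
]

deduped = remove_duplicates(rows)
for row in deduped:
    print(row)
```

Because only the first occurrence of each value is kept, a single #N/A row survives in this sketch; it can be deleted by hand.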

Download the “TEXT” spreadsheet as a CSV file and save it to your Desktop. Then, in Anki, make a new deck and name it after the text you are using (in this case, “Lucian’s Deorum Concilium”). Use whatever naming system you wish!
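If you ever want to build the CSV file without the spreadsheet, Python's standard `csv` module can write one. This is only a sketch: the column order (TITLE, SHORT_DEFINITION, FRONT), the function name `rows_to_csv`, and the file name in the comment are my assumptions, and the sample rows are made up.

```python
import csv
import io

HEADER = ["TITLE", "SHORT_DEFINITION", "FRONT"]  # assumed column order


def rows_to_csv(rows):
    """Return the rows as comma-separated text, with a header line first."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)  # quotes any field that contains a comma
    writer.writerow(HEADER)
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample rows; note the comma inside the first definition.
sample_rows = [
    ("λογος", "word, speech", "λόγος"),
    ("θεος", "god", "θεός"),
]
csv_text = rows_to_csv(sample_rows)
print(csv_text)

# To save it, write the text out as UTF-8 so the Greek survives, e.g.:
# with open("deck.csv", "w", encoding="utf-8", newline="") as f:
#     f.write(csv_text)
```

The quoting matters because many definitions contain commas of their own; without it, a definition like "word, speech" would be split into two columns on import.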

Once you open up Anki, click File, then Import, and select the CSV file you want to import.

Click the tab that says “semicolon” and change it to “comma.” Set the deck to the one you made previously (in this case, “Lucian’s Deorum Concilium”). Then make sure that the Front setting corresponds to “FRONT” and the Back setting to “SHORT_DEFINITION.” Make sure the existing-notes setting says “Update.” Then click Import. Go to “Browse” for that deck, delete any proper names, and you’re done!

In the near future, we will post a guide similar to this for Latin.
