Here’s how I do it (roughly based on the Distributed Proofreaders website).

I open up the PDF and the text document. I arrange the PDF so I can only see one column of text. I then position the text document next to the PDF. I zoom in on the PDF so that its text lines up with the text in the text document. I try to limit my view of the documents to about 30 lines. I found that with more text than that on screen, my eyes spent too much time wandering to find the next line.


Common errors to look out for:

1) The copyright symbol ‘©’ needs to be pasted back into its appropriate place. The OCR program either ignores the symbol or guesses the wrong character. With single film entries, the copyright symbol goes before the producer (production company) name. With episodes in a series, the symbol goes before the date.

2) Check the date. It should be in the format DateMonthYear. The Date and Year are numerical. The Month is a three-letter abbreviation. There are no spaces between date, month, or year. 29Jun59 and 13Oct66 are examples. Be aware that the OCR program often confuses numbers and letters.

3) Check the registration number. It should be in the format RN00000. RN is a two-letter registration code (LP, MP, MU, etc). The number is generally five digits. Occasionally, the number is four digits. There is no space between the registration code and number. LP29010 and MU7703 are examples. Be aware that the OCR program often confuses numbers and letters.

4) Fix line spacing. I’ve been leaving a blank line between film entries. With series or episodes, I don’t put a line break between episodes.

5) ‘See’ is listed below its entry. The OCR program occasionally subdivides a column into two columns and places “See” or an episode number below the entry.
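
The date and registration number formats in the list above are regular enough that a script could flag suspicious ones before you read a page. This is only a rough Python sketch, not something I use in the workflow described here. The function names are made up for illustration, and it rests on a few assumptions: the day may be one or two digits, the year is always two digits, the month abbreviations are the usual English ones (Jan, Feb, ... Dec), and the registration code is any two capital letters.

```python
import re

# Assumed month abbreviations; the source only shows Jun and Oct.
MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"

# Day (assumed 1-2 digits), three-letter month, two-digit year, no spaces.
DATE_RE = re.compile(r"(\d{1,2})(" + MONTHS + r")(\d{2})")

# Two capital letters (LP, MP, MU, ...) followed by four or five digits.
REG_RE = re.compile(r"[A-Z]{2}\d{4,5}")


def is_valid_date(token):
    """Return True if token looks like 29Jun59 (hypothetical helper)."""
    m = DATE_RE.fullmatch(token)
    return bool(m) and 1 <= int(m.group(1)) <= 31


def is_valid_reg_number(token):
    """Return True if token looks like LP29010 or MU7703 (hypothetical helper)."""
    return bool(REG_RE.fullmatch(token))


if __name__ == "__main__":
    # A few tokens, including typical OCR letter/number mix-ups.
    for token in ["29Jun59", "2gJun59", "LP29010", "LP2901O"]:
        ok = is_valid_date(token) or is_valid_reg_number(token)
        print(token, "ok" if ok else "CHECK")
```

Anything marked CHECK still has to be compared against the PDF by eye. The script only shows where the OCR output breaks the expected pattern, such as a letter O standing in for a zero. It cannot tell you what the correct value is.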
