Bittorrent and Miro, a better Distributed Proofreading

ian   May 24, 2010
If you spend some time in the ebook community, you inevitably run into Distributed Proofreaders, the collaborative proofreading group that supplies Project Gutenberg with high-quality text versions of public domain books. They are a small community of dedicated editors doing good work. Unfortunately, they are also becoming irrelevant to most of the issues in the field because their multi-layer workflow is simply too slow. When organizations like Google are releasing a million books at once, it is hard to stay relevant while struggling to complete your project’s 20,000th book, even if those books, unlike Google’s, are meticulously verified and formatted. Scale and quality both matter and, if we structure it right, we can rework our communal digitization projects to get both.
Currently, Distributed Proofreaders only releases books after spending weeks or months verifying that the text version matches the original page images. Industrial scanning efforts like Google Books and the Million Book Project generally skip verification entirely and distribute raw text versions alongside the photographic page images. This is perhaps the greatest key to their large size. Yes, they also paid for large-scale scanning, but scanning is easy compared to proofreading, and getting easier all the time. You can be sure that Google’s library would not be half so large if they had to pay for the kind of quality that Distributed Proofreaders provides. Unfortunately, if the price of this quality is having only thousands rather than millions of books, it is too high to continue paying.
I propose a middle road between the raw image release and the meticulous text one. What if we distributed raw image and unverified text files from day one, but built our distribution network so that everyone downloading a copy could upload corrections and share those corrections automatically with everyone else who has a copy?
If we did that, we could gain speed and scale while also building our community of contributors. Technologically, BitTorrent and a rich client like Miro would get us most of the way there. We would make each book into a Miro channel that people would subscribe to when downloading the book. Once the book is downloaded, we would need a book reading view that we could optimize for whatever common reader actions relate to proofreading. Things like spell check and revealing the text around a section to verify academic citations spring immediately to mind. The key is that corrections should come primarily from people’s normal interactions with the books they are interested in, no altruism or active volunteering necessary. Once people have corrected their local copies, the client sends those corrections back to the central server, where they can be sent out via RSS to everyone subscribed to that book’s channel.
As far as the user is concerned, she simply downloads the books she is interested in with her Miro-based library manager and either fixes errors as they bother her, or leaves them alone and watches the text gradually correct itself as other people interested in the same books notice and correct errors. If the errors are really frustrating, she can always fall back to reading the page images and be no worse off than if she were reading on Google Books or any other large page image-based digital library.
As far as the community is concerned, we get a larger pool of potential contributors because now everyone with a copy can contribute back, and people are able to contribute by sharing spare hard drive space and unused bandwidth rather than having to donate funds to pay for central hosting and distribution. There are plenty of people in the community who have no time or inclination to proofread but would gladly download some book images and leave a torrent running in the background to help share the files more widely.
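To make the correction flow a little more concrete, here is a minimal Python sketch of what a client might do when corrections arrive from a book’s channel. Everything in it is an assumption made for illustration: the `Correction` fields, the page-number-to-text dictionary, and the `apply_corrections` function are hypothetical and are not part of Miro, BitTorrent, or any existing tool. The one design idea it shows is that a correction records the text the reader originally saw, so a client can skip a correction that no longer matches its local copy instead of guessing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Correction:
    """One reader-submitted fix (hypothetical format, for illustration only)."""
    page: int          # page number in the scanned book
    offset: int        # character offset into that page's unverified text
    original: str      # the text the reader saw at that offset
    replacement: str   # what the reader changed it to


def apply_corrections(pages, corrections):
    """Return a new page dict with every applicable correction applied.

    `pages` maps page number -> text. A correction is applied only when
    the local text at its offset still equals `original`; anything else
    is skipped, since the local copy may already have been fixed.
    """
    fixed = dict(pages)
    # Work from the end of each page toward the start so that earlier
    # offsets stay valid even when a replacement changes the text length.
    for c in sorted(corrections, key=lambda c: (c.page, -c.offset)):
        text = fixed.get(c.page)
        if text is None:
            continue  # we do not have this page locally
        end = c.offset + len(c.original)
        if text[c.offset:end] != c.original:
            continue  # stale or conflicting correction; leave text alone
        fixed[c.page] = text[:c.offset] + c.replacement + text[end:]
    return fixed


pages = {1: "Tbe quick brown f0x"}
incoming = [
    Correction(page=1, offset=0, original="Tbe", replacement="The"),
    Correction(page=1, offset=16, original="f0x", replacement="fox"),
    Correction(page=1, offset=4, original="quack", replacement="quick"),  # stale
]
print(apply_corrections(pages, incoming)[1])
```

The sketch leaves out the network entirely. In the proposal, the list of incoming corrections would come from the book’s RSS feed, and corrections made locally would be sent back to the central server in the same shape.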
Making it easier to contribute increases the effectiveness of the project as a whole by helping make sure that all the people who care about a book have the opportunity to put their time into preserving that book. The more people care, the more work gets done. In two years of talking with people about my own book digitization projects, I have grown to have a healthy respect for how much people care about their own books and about preserving them, in whatever form. In the end, there are only two scalable digitization strategies: teach computers to read, or harness the passion people have for their books for the benefit of us all. A handful of highly organized editors like the Distributed Proofreaders community will always have its place, but they cannot handle the scale of this project alone. We should make sure they have some help. (Crossposted with bookliberator)