
Google Library Project - The Fine Print

As has been reported quite widely, Google has begun a massive digitization project with five libraries:

The total covered by existing agreements is said to be 15 million books. Each book is estimated to cost $10 to scan. Stanford's scanning unit is said to be able to do 100,000 pages a day. Oxford's scanning unit is said to be able to do 10,000 books per week. If all of them work at that speed, then by my math it will take a little over five years to scan them all. Similarly, the University of Michigan says the project will take six years.
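For anyone who wants to check the math, here is a rough sketch in Python. It assumes that "that speed" means Oxford's reported 10,000 books per week, that all five libraries scan in parallel at that rate, and that a year has 52 working weeks. None of those assumptions comes from Google or the libraries, and the names below are made up for illustration.

```python
# Back-of-the-envelope estimate of how long the scanning might take.
# Figures reported in the post:
TOTAL_BOOKS = 15_000_000        # books covered by existing agreements
COST_PER_BOOK_USD = 10          # estimated scanning cost per book
LIBRARIES = 5                   # participating libraries

# Assumptions (not from the source): every library matches Oxford's
# reported rate, and scanning runs 52 weeks a year.
BOOKS_PER_WEEK_PER_LIBRARY = 10_000
WEEKS_PER_YEAR = 52


def scan_estimate(total_books, libraries, books_per_week, weeks_per_year=WEEKS_PER_YEAR):
    """Return (weeks, years) needed if all libraries scan in parallel."""
    weeks = total_books / (libraries * books_per_week)
    return weeks, weeks / weeks_per_year


weeks, years = scan_estimate(TOTAL_BOOKS, LIBRARIES, BOOKS_PER_WEEK_PER_LIBRARY)
total_cost = TOTAL_BOOKS * COST_PER_BOOK_USD

print(f"{weeks:.0f} weeks, or about {years:.1f} years")
print(f"Estimated total cost: ${total_cost:,}")
```

Under those assumptions the estimate comes out at 300 weeks, a bit under six years, which is in the same range as Michigan's figure. A slower rate at any one library would stretch it further.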

Most agreements indicate that the hosting library will get a digital copy of their books, which apparently they will then host for their users. In addition, Google will throw all the books into its Google Print service.

Some books are already available through the service. For example, Books and Culture is an out-of-copyright book from 1896. Note that unlike a publisher-submitted book, you can easily link to or view any page: the cover, the University of Michigan bookplate, page 50, the U of M checkout slip, the back cover. You can also search the full text, which leads to a standard Google results page with links and snippets. Click on any of the links and the resulting page will highlight your search terms, just like Google Catalog.

Sadly, it seems the only thing not available is the full text of the books. However, it is pretty easy to get the underlying images of the pages (though not as easy as simply looking at the page, alas), so you could certainly OCR them yourself if you liked, although the result would likely not be as good as Google's work. Things look much worse for in-copyright books. For example, The Role of GATT in Relation to Trade and Development was published only in 1964 and is apparently still in copyright. One can thus only get back practically useless snippets while the fat cats at Google have the whole thing.

Fortunately, "Google is negotiating with various publishers to facilitate arrangements to make works more easily accessible while providing appropriate protections for copyright holders" for in-copyright library books. It will be interesting to see how much success they have. It's not clear how to search Google for just library books, or even just books, or to find out how many they have, but here are the handful I know about, all from U. of M. (books published after 1923 are copyrighted):

Do you hold the copyright on a book? Does your book have an ISBN? If you answered yes to both these questions, you don't have to wait for all this. You can simply sign up to Google Print, send Google a copy of your book, and they'll scan it in and OCR it for you for free! Then they'll send you checks with all the money your book makes through ads! So please do it! Please?

A closing thought. Much of the discussion around this endeavor has focused on its effect on the largely affluent and privileged children who go to the major universities from which the books are taken. Will they stop going to the library? Will they miss the smell of dead trees? Will they be able to do research more efficiently? With all due respect, this is the wrong group to think about. The real beneficiaries of this scanning should be the less fortunate people around the world who barely have access to a library, let alone a world-class one. Let us scan these books for them.

By Aaron Swartz (me@aaronsw.com) of Google Weblog
