Scraping the Gorilla – Picking data off of Amazon

Note: this is the first in a series of posts on Amazon sales data as gleaned from the Amazon website.


There is little doubt that indie* publishing has given up-and-coming writers new options. They can now choose not to battle gatekeepers at publishing houses or accept onerous publisher contracts with their low royalties and rigid publishing schedules. Whether indie publishing is inconveniencing the Big 6** publishing houses is debated. That debate requires data, which publishing companies are loath to reveal. Some of this data can, however, be gleaned from the 500-pound gorilla in book retail: Amazon.

Some questions that such data might answer:

  • How often do indie-published titles earn royalties comparable to those of Big 6 titles?
  • How many indie authors earn royalty incomes rivaling those of Big 6 authors?
  • How many authors can make a living from their writing?
  • What are the relative sales in each genre/subgenre?
  • How important are online customer reviews/ratings to midlist sales?
  • Do series outperform standalone titles?

To answer these and other questions, I set out to scrape data from Amazon’s website. There had been discussions on the internet about what Amazon’s book rank means in terms of sales. Many authors have posted their rank vs. sales figures, which makes it possible to roughly deduce sales from rank.
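
As a rough illustration of how author-reported rank and sales pairs can be turned into a rule of thumb, here is a minimal Python sketch that fits a single power law, sales = a * rank^b, by least squares in log-log space. The function name and the sample pairs are hypothetical placeholders, not real author reports, and a single power law is a simplification chosen only for illustration.

```python
import math

def fit_power_law(pairs):
    """Fit sales = a * rank**b by least squares on (log rank, log sales)."""
    xs = [math.log(rank) for rank, _ in pairs]
    ys = [math.log(sales) for _, sales in pairs]
    n = len(pairs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                      # slope in log-log space (negative)
    a = math.exp(mean_y - b * mean_x)  # implied sales at rank 1
    return a, b

# Made-up (rank, daily sales) pairs, standing in for figures authors post.
REPORTED = [(500, 120.0), (5_000, 20.0), (50_000, 3.0)]

a, b = fit_power_law(REPORTED)
estimate_at_rank_2000 = a * 2_000 ** b
```

With real reports the fit would only be as good as the honesty and timing of the posted numbers, and it would say little about ranks far outside the reported range.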

Early on, I discovered a project led by the much-admired and successful indie author Hugh Howey. This project sought many of the same answers and published not only reports on its findings but also the raw (although anonymized) data on individual titles. Data mining has always fascinated me, and one thing I’ve learned from many years of analysis is how easily data can be tortured into confessing, which is confirmation bias at work. I wanted to analyze the data myself, so I carried on writing scripts to pull and parse pages on around 60,000 titles. This included the top 100 titles in the various fiction categories (there are over 400 categories), as well as the complete corpus of around 200 authors that included the 100 top-ranked Amazon authors, authors on NPR’s 100 best SF list, and around 20 indie authors known to have enjoyed success. Although I am an indie myself, I did not set out to make a point about indie publishing.

Before I post data, I must warn of some important caveats.

  • Motion blur: My data is not a snapshot. Acquiring the data took weeks, so it doesn’t reflect a single point in time. Rankings changed as data was being collected. This shouldn’t affect the overall picture, but it makes the data somewhat internally inconsistent.
  • Rank to sales algorithm: The algorithm assumes that book rank is determined by unit sales (this metric has the strongest correlation with author rank; more on that in future posts) and is likely a moving average. The approximation I use involves seven equations (linear, power series, and exponentials) that span various regimes from #1 to #2,500,000. The estimate of gross sales integrated over all ranks compares reasonably well with published Amazon book sales figures (more on that later). Sales estimates will be least accurate for the top-ranked books, since #1 could mean anything from 1,000 to 10,000 sales a day, depending on buyer whim. My algorithm arbitrarily assumes that #1 in eBooks means sales of 4,000 per day, and assumes that eBooks outsell paper 1.5:1. Amazon does not rank audio books, so I cannot estimate their sales.
  • Behind the curtain: One hopes that Amazon calculates the various rankings based on simple metrics without manipulating those numbers for their own purposes. Who knows?
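
To make the rank-to-sales caveat more concrete, here is a minimal Python sketch of a piecewise rank-to-sales estimate. It is not the seven-equation approximation described above: it uses just three power-law segments joined at anchor points. Only the first anchor (rank #1 means 4,000 eBook sales a day) and the upper rank limit of 2,500,000 come from this post; every other number, and the names `ANCHORS` and `ebook_sales_per_day`, are hypothetical placeholders.

```python
import bisect
import math

# (rank, estimated daily eBook sales) anchor points.
# Only (1, 4000.0) is taken from the post; the others are placeholders.
ANCHORS = [
    (1, 4000.0),
    (100, 600.0),
    (10_000, 15.0),
    (2_500_000, 0.001),
]
ANCHOR_RANKS = [rank for rank, _ in ANCHORS]

def ebook_sales_per_day(rank):
    """Estimate daily sales by power-law interpolation between anchors."""
    if rank < 1:
        raise ValueError("rank must be 1 or greater")
    rank = min(rank, ANCHOR_RANKS[-1])  # clamp ranks beyond the last anchor
    # Index of the segment whose left anchor is at or below this rank.
    i = bisect.bisect_right(ANCHOR_RANKS, rank) - 1
    i = min(i, len(ANCHORS) - 2)
    (r0, s0), (r1, s1) = ANCHORS[i], ANCHORS[i + 1]
    # A straight line in log-log space is a power law in rank.
    exponent = math.log(s1 / s0) / math.log(r1 / r0)
    return s0 * (rank / r0) ** exponent

top_title = ebook_sales_per_day(1)
midlist_title = ebook_sales_per_day(25_000)
```

Changing the first anchor from 4,000 to 1,000 or 10,000 shows how sensitive the top-of-list estimates are to that one arbitrary choice.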

In subsequent posts I will try to guess at the answers to the questions I posed above, and to others that occur to me.

For perspective, consider the following estimates of daily sales and royalty for the top fiction titles:

| Title | Author | Publisher | Est. daily sales (units) | Est. daily royalty | Rating |
|---|---|---|---|---|---|
| The Fault in Our Stars | John Green | Big 6 | 6562 | $7,653 | 4.8 |
| Divergent | Veronica Roth | Big 6 | 5834 | $4,345 | 4.6 |
| Allegiant | Veronica Roth | Big 6 | 5493 | $6,601 | 3.3 |
| Insurgent | Veronica Roth | Big 6 | 5424 | $6,224 | 4.6 |
| Divergent Series Box Set | Veronica Roth | Big 6 | 5045 | $14,547 | 4.3 |
| I Am Livia | Phyllis T. Smith | Amazon | 4030 | $7,038 | 4.6 |
| The Goldfinch | Donna Tartt | Big 6 | 4001 | $6,470 | 3.8 |
| Bloodline | James Rollins | Big 6 | 3904 | $3,080 | 4.5 |
| Home to Stay | Terri Osburn | Amazon | 3779 | $6,600 | 4.1 |
| The Fixed Trilogy | Laurelin Paige | Self | 3709 | $1,378 | 4.7 |
| Plaster City | Johnny Shaw | Amazon | 3278 | $5,725 | 4.1 |
| The Way Life Should Be | Christina Baker Kline | Big 6 | 2906 | $4,390 | 4.4 |
| The Boleyn Inheritance | Philippa Gregory | Big 6 | 2902 | $3,369 | 4.4 |
| Missing You | Harlan Coben | Big 6 | 2824 | $5,723 | 4.3 |
| Beach Road | James Patterson | Big 6 | 2753 | $5,504 | 3.0 |
| NYPD Red 2 | James Patterson | Big 6 | 2671 | $4,119 | 4.5 |
| The Husband's Secret | Liane Moriarty | Big 6 | 2669 | $2,978 | 4.3 |
| Orphan Train | Christina Baker Kline | Big 6 | 2331 | $2,609 | 4.6 |
| I've Got You Under My Skin | Mary Higgins Clark | Big 6 | 2324 | $4,309 | 4.4 |
| Killing Ruby Rose | Jessie Humphries | Amazon | 2275 | $1,930 | 4.3 |
| The Alchemist | Paulo Coelho | Big 6 | 2109 | $1,966 | 4.2 |
| The Book Thief | Markus Zusak | Big 6 | 1833 | $1,759 | 4.6 |
| Shadow Spell | Nora Roberts | Large | 1760 | $2,313 | 4.5 |
| A Game of Thrones 5-Book Boxed Set | George R. R. Martin | Big 6 | 1684 | $5,362 | 4.5 |
| Modern Wicked Fairy Tales: Collection | Selena Kitt | Big 6 | 1649 | $279 | 4.2 |
| The Headmistress of Rosemere | Sarah E. Ladd | Big 6 | 1649 | $2,718 | 4.6 |
| Gone Girl | Gillian Flynn | Big 6 | 1437 | $2,037 | 3.8 |
| Tall, Dark, and Deadly 3 book box set | Lisa Renee Jones | Self | 1398 | $484 | 4.3 |
| The Collector | Nora Roberts | Big 6 | 1310 | $2,546 | 4.5 |
| Hidden | Catherine McKenzie | Big 6 | 1277 | $2,231 | 3.8 |
| The Invention of Wings | Sue Monk Kidd | Big 6 | 1244 | $2,479 | 4.6 |
| Little Girl Lost | Brian McGilloway | Big 6 | 1150 | $403 | 4.3 |
| Too Many Crooks Spoil the Broth | Tamar Myers | Small | 1148 | $193 | 4.0 |
| The Maze Runner | James Dashner | Big 6 | 1131 | $861 | 4.3 |

Note: Table data excludes audio. A “title” includes all text formats. “Self” means a publisher that has only one author; “Large” means a publisher with 20 or more. Royalties are based on estimated unit sales, price, and typical royalty percentage for publisher type and format. Again, keep in mind the caveat that unit sales figures (and thus estimated royalties) for the top ranks are quite speculative.
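
The royalty arithmetic described in the note can be sketched in a few lines of Python: for each text format of a title, multiply estimated daily units by price and by a royalty fraction that depends on publisher type and format, then add the formats together. The royalty fractions below are illustrative placeholders, not the percentages used for the table, and the names `ROYALTY_RATE`, `FormatSales`, and `daily_royalty` are hypothetical.

```python
from dataclasses import dataclass

# Placeholder royalty fractions keyed by (publisher type, format).
# These are assumptions for illustration only.
ROYALTY_RATE = {
    ("Self", "ebook"): 0.70,
    ("Self", "paper"): 0.40,
    ("Big 6", "ebook"): 0.175,
    ("Big 6", "paper"): 0.10,
}

@dataclass
class FormatSales:
    fmt: str              # "ebook" or "paper"
    units_per_day: float  # estimated from sales rank
    price: float          # list price in dollars

def daily_royalty(publisher_type, formats):
    """Sum units * price * royalty fraction over all formats of a title."""
    return sum(
        f.units_per_day * f.price * ROYALTY_RATE[(publisher_type, f.fmt)]
        for f in formats
    )

# Example with made-up numbers: one title sold in two formats.
example = daily_royalty(
    "Big 6",
    [FormatSales("ebook", 100, 10.00), FormatSales("paper", 50, 20.00)],
)
```

Because the unit estimate feeds straight into the product, any error in the rank-to-sales step carries over proportionally into the royalty figure.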

What did I learn from this? I was frankly surprised at the revenue a single title can generate, at least when it’s hot. I was also surprised that fiction outsold non-fiction handily, about 5:1. While the above table shows only fiction, there were seven non-fiction titles interspersed, the hottest being in the #5 spot. I was also surprised at the dominance of genre fiction.

Two indie titles make this list. That could be misleadingly low if others on the list started as indie but were picked up by a publishing company along the way.

I shouldn’t have been surprised by the four Amazon-published titles. Whether these were indies picked up by Amazon, or whether Amazon was one of the publishers queried, Amazon’s marketing clout on its own site may ensure the popularity of its own titles.


*As the term is used here, “indie” refers to writers who undertake to publish their work (ebook and/or print on demand) using online retailers without benefit of agents or publishing companies. They typically hire their own cover artists, editors, and formatters, and conduct their own marketing.

**I define Big 6 as the various publishing arms of the following six corporations: Hachette, Holtzbrinck/Macmillan, Penguin, HarperCollins, Random House, and Simon & Schuster. These companies comprise over 300 imprints. For purposes of calculation, I assume royalty arrangements are roughly the same across all imprints and their authors.


If the reader has suggestions for other interesting analyses, please leave them in the comments.



