There are several legitimate reasons for scraping Amazon data, as well as dubious and questionable ones, which this article will not cover. The legitimate reasons include keeping watch on pricing and other competitor data, with the goal of getting ahead of the competition.
Other businesses operating in the same industry as yours are most likely scraping Amazon for this data as well, so you will get left behind if you don’t employ the same tactics. Amazon may also be a major source of review scores if you aim to gather product review scores from around the Internet.
How does Amazon treat data scraping?
As mentioned earlier, there are also questionable reasons for scraping Amazon, which aren’t recommended. Scraping Amazon data is in itself against the e-commerce site’s terms of service, and using the data for illicit transactions can result not only in your IP being banned but also in legal action being taken against you.
Amazon started seriously enforcing its policies against data scraping in 2012, although many businesses had gotten away with it before then. Repercussions, however, have been limited to Amazon banning the IP addresses used in suspected data scraping.
Although the e-commerce giant has an automated system in place that detects data scrapers and bans their IPs, users have found that this system doesn’t block all data scraping activity. You can check out this 2014 thread on Amazon Seller Central to see that Amazon isn’t too tough in implementing its anti-scraping policy. Some data scrapers are banned, while others are able to get past Amazon’s automated system. Those that are banned are most likely engaged in bot-like activity, while the others are able to bypass the restriction by following certain strategies or guidelines, which will be tackled in this article.
- Hide behind proxy servers
- Private proxies are a lot better than public proxies
- Configure bots to imitate organic behavior
- Spoof your request headers
- Keep a list of URLs
Guidelines in scraping Amazon safely
The reason Amazon is not so strict with regard to implementing its anti-scraping policy is perhaps that it’s impractical to do so. When it bans a whole IP block, for instance, because some of the IP addresses on that block are suspected of data scraping, it runs the risk of banning a whole village of potential customers. This is not a good way to do business.
In addition, Amazon is a very large e-commerce website, receiving and processing millions of requests an hour. Filtering every single request would be too difficult, not to mention costly, so data scraping is still possible, and even tolerated to a certain degree, when these guidelines are followed.
Guideline 1: Hide behind proxy servers
Whether you’re scraping data from a website, purchasing limited releases of top brands, or testing an ad campaign, it makes sense to use proxy servers. Hundreds or thousands of requests coming from a single IP address will immediately flag you, resulting in an IP ban, so hiding behind proxy servers is necessary when scraping.
You can set up your own proxy server, of course, but it’s ultimately a lot cheaper and easier to let someone else manage it for you. There are several proxy providers out there, but be careful when selecting one, since some don’t really provide protection. LimeProxies is a good provider of proxies for Amazon scraping, as it offers features such as dedicated IPs and hundreds of subnets from different locations, and it works on all classified ad sites.
If Amazon detects abnormal activity coming from a proxy server, it blocks only the proxy and not your real IP address, so you can continue your data scraping mission using a different proxy server. This may sound tedious, because Amazon might detect the next proxy and ban it too, and the cycle won’t end until you run out of proxy servers. Also, large websites like Amazon have systems in place that automatically kick out anyone suspected of using a proxy, on the assumption that proxy users have malicious intentions.
However, the next guidelines will ensure that it will be difficult to detect and ban you when you use a proxy server.
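As a rough illustration of routing traffic through a proxy, here is a minimal sketch using Python’s standard urllib module. The proxy address and credentials are hypothetical placeholders, not a real endpoint; your provider will supply the actual values.

```python
import urllib.request

# Hypothetical proxy endpoint (assumption); replace with your provider's details.
PROXY_URL = "http://user:password@proxy.example.com:8080"


def build_proxy_opener(proxy_url: str) -> urllib.request.OpenerDirector:
    """Return an opener that sends HTTP and HTTPS traffic through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


opener = build_proxy_opener(PROXY_URL)

# To fetch a page through the proxy (not executed in this sketch):
# with opener.open("https://www.example.com/", timeout=10) as response:
#     html = response.read()
```

The target site then sees the proxy’s IP address instead of yours, so a ban falls on the proxy rather than on your own connection.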
Guideline 2: Private proxies are a lot better than public proxies
Why should you pay for private proxies when there are public and cost-free proxies available? Here are some reasons why private proxies are always a lot better than public proxies:
- Public proxies are easy to ban: As a matter of fact, public proxies have most likely been banned by websites like Amazon even before you’ve used them, because someone else may have used them for suspicious activities. These kinds of proxies can be used by anyone, which is why they are called “public” proxies. Because they are free, a lot of people use them. Serious marketers and businesses tend to steer clear of public proxies and use private proxies instead.
- Public proxies are laden with ads: Since users don’t pay anything to use them, public proxies earn money through ads. Your browsing and scraping are constantly interrupted by ads, which can be a nuisance since time is of the essence in such marketing activities.
- Public proxies have slow speed: Speed is the tradeoff for using something and not paying for it. Thousands of people can be using the same proxy server at a time which will definitely result in a slow connection. With private proxies, only a few users are on the same proxy server at a time, so speed will not be an issue. There are also providers that offer dedicated IPs, which means that no one else but you will be using the same IP addresses.
As you can see, private proxies should be the choice of business owners or marketers who are serious about data research and harvesting. You may have to spend some money, but you get security and reliability in return. Public proxies, on the other hand, do not cost a thing, but you pay in terms of slow speed. They are also not very reliable, since they can easily be banned or are already banned in some cases.
Guideline 3: Configure bots to imitate organic behavior
Amazon, as a large business, has invested a lot in their systems including those that detect bots. There is therefore no reason to take it lightly even when the company isn’t so strict in implementing their policy. Since you are most likely to use bots to scrape Amazon, make sure that you configure your scripts to make their commands look organic and more human.
Most marketers or scrapers use either scraper software or scripts, and there’s no issue with that. However, they often fail to configure the software or scripts so that the behavior they emulate is more human and less robot-like. Take for instance these scenarios:
- Sending too many requests in just a short amount of time: Scraper software executes multiple requests as quickly as it can by default. That means hundreds of requests in just a minute, and that will undoubtedly create a red flag with Amazon. It is impossible for humans to make that many requests in just a short period of time.
- Sending requests in a fixed time interval: Scraper scripts and software send the same requests again and again at a fixed time interval, say one request every second. Amazon has categorized behavior like this as bot-like, and will most definitely ban your proxy IP address in no time. Humans do not click through websites at a fixed time interval. Rather, requests made by humans are timed unpredictably and differently from the previous requests. Therefore, you have to customize the scraper software or script so that the time interval is not repetitive or robotic.
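Both scenarios above can be addressed by pausing for a random length of time between requests. The following is a minimal sketch in Python; the 2 to 8 second range is an arbitrary assumption chosen for illustration, and the `fetch` argument stands in for whatever function your scraper uses to download a page.

```python
import random
import time


def human_like_delay(min_seconds: float = 2.0, max_seconds: float = 8.0) -> float:
    """Pick a random pause length so requests are not evenly spaced."""
    return random.uniform(min_seconds, max_seconds)


def crawl(urls, fetch, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing a random interval between requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # No pause is needed before the very first request.
            sleep(human_like_delay())
        results.append(fetch(url))
    return results
```

Because each pause is drawn independently, the requests are neither fired in rapid succession nor spaced at a fixed rhythm. The `sleep` parameter exists only so the pause can be swapped out when testing.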
Guideline 4: Spoof your request headers
Every request you send, whether from your real IP address or a proxy IP, carries a user agent header, which contains information about the device you’re using. This information may include the browser you’re using and the language set on it, and even the operating system version and build of your device.
These are anonymized, of course, but the user agent of all requests coming from the same device will look the same even when you obfuscate your IP. Therefore, Amazon can still detect that a large number of requests are coming from a single machine just by looking at the user agent. Some proxy providers, such as LimeProxies, automatically disable these headers, making all traffic highly anonymous, so you will be better off using such providers.
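If you manage the headers yourself rather than relying on a provider, one option is to pick a different user agent for each request. The sketch below uses Python’s standard urllib module; the user agent strings and the Accept-Language value are illustrative assumptions, and a real list would need to be kept current.

```python
import random
import urllib.request

# Illustrative browser user agent strings (assumption); keep your own list up to date.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


def build_request(url: str) -> urllib.request.Request:
    """Build a request whose User-Agent is chosen at random from USER_AGENTS."""
    headers = {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": "en-US,en;q=0.9",
    }
    return urllib.request.Request(url, headers=headers)


# The request is only constructed here, not sent.
request = build_request("https://www.example.com/")
```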
Guideline 5: Keep a list of URLs
In a perfect world, your scraper software and scripts will run smoothly all throughout the duration of the procedure. However, there will be times when your software will crash, you’ll run out of space, or something else will happen that causes the scraping project to stop before it is completed. It is therefore necessary to keep a list of all URLs that have been crawled already, so that you can continue from where you left off. You can save a lot of time when you keep this list since you don’t have to start from the very beginning.
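One simple way to keep such a list is to append each URL to a text file as soon as it has been crawled, then skip those URLs on the next run. This is a minimal sketch; the one-URL-per-line file format and the function names are assumptions made for illustration.

```python
from pathlib import Path


def load_crawled(path: Path) -> set:
    """Return the set of URLs already recorded in the checkpoint file."""
    if not path.exists():
        return set()
    lines = path.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def mark_crawled(path: Path, url: str) -> None:
    """Append a URL to the checkpoint file right after it has been scraped."""
    with path.open("a", encoding="utf-8") as handle:
        handle.write(url + "\n")


def pending_urls(all_urls, path: Path) -> list:
    """Return the URLs that still need crawling, in their original order."""
    done = load_crawled(path)
    return [url for url in all_urls if url not in done]
```

Writing to the file after every URL, rather than once at the end, means a crash loses at most the page that was in progress.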
Bonus guidelines in scraping Amazon
Other things to keep in mind when scraping Amazon (or other websites, for that matter) are:
- Change or rotate proxy IPs: A good private proxy server provider also allows users to rotate IP addresses anytime, on demand or at a pre-set interval. With a changing proxy IP, your requests are spread out over different IP addresses, making them look less suspicious and more organic. For this reason, you would want to select a provider that allows for simultaneous usage of IP addresses.
- Do not copy scraped product descriptions: Product descriptions are among the information you’ll gather when you scrape Amazon. If you’re planning to use these on your own website, take note that Google will treat them as duplicate content, resulting in poor performance in the search results. Copying product descriptions word for word will hurt your SEO strategy, so avoid doing it.
- Log out of your Amazon account: In case you didn’t know, you have to log out of your Amazon account before you do any scraping, serious or not, small-scale or large-scale. This is a no-brainer, since your Amazon account could be banned if you don’t log out before you deploy your script or software.
There are also tools you can use to make scraping a lot easier. Octoparse and Scrapy (a Python framework) are two examples of tools that can help with scraping Amazon data, data mining, and information processing.
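For the first point above, rotating proxy IPs, a provider may handle rotation for you. If you hold a list of proxy addresses yourself, a minimal round-robin sketch in Python could look like this. The proxy addresses are hypothetical placeholders.

```python
import itertools
import urllib.request

# Hypothetical proxy pool (assumption); real addresses come from your provider.
PROXY_POOL = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
]


def rotating_openers(proxy_pool):
    """Yield (proxy, opener) pairs forever, cycling through proxy_pool in order."""
    for proxy in itertools.cycle(proxy_pool):
        handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
        yield proxy, urllib.request.build_opener(handler)


rotation = rotating_openers(PROXY_POOL)

# Take the next proxy before each request; after the last one it wraps around.
used = [next(rotation)[0] for _ in range(4)]
```

Each call to `next(rotation)` returns an opener bound to the next proxy in the pool, so consecutive requests leave from different IP addresses.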
Scraping Amazon safely
Websites as large as Amazon have improved their automated filtering systems to go beyond just the appearance of the request (IP address and user agent). They also look at the behavior of each request and can easily distinguish bot actions from human behavior. When they detect bot-like behavior, they will automatically ban your IP, no questions asked. To prevent this, use rotating proxies, hide or customize the user agent headers, and configure your scripts or software to mimic human behavior.
It is also important to scrape only what is necessary to save time and maximize speed. When your scraping involves too much data, the process will slow down and take longer to complete. Furthermore, when useless information is mixed in with the data you need, you will spend a lot of time filtering it out. It is therefore wise to limit your scraping to only what is necessary.
These guidelines were learned and collated after years of using proxy servers and scraping different websites. Data scraping has become a necessary step in conducting online business, no matter what industry. When done correctly and safely, all the data you harvest will contribute to the success of your online business.