This is a two-part series on crawling professional networks at scale. In this first part, we look at why professional networks are hard targets to crawl. In part two, I will dive into a technical tutorial, with demo code, on how you can crawl professional networks at scale.
Everybody wants a piece of the platform, especially since it keeps its data on a tight leash. Companies such as hiQ Labs have ended up in court over scraping it, but a judge ruled that it is [perfectly legal](https://www.theverge.com/2017/8/15/16148250/microsoft-the platform-third-party-data-access-judge-ruling) for companies to crawl the site.
Before we move on to a technical guide on how you can crawl the platform, let's understand why it is hard to crawl the platform at scale:
1. You need to be logged in to the platform to gain access to content
To view any profile on the platform, your crawler has to hold a logged-in session.
2. The platform requires JavaScript to render content on the page
The platform's pages load quickly, but the important data is fetched via API calls made by JavaScript after the page has loaded, so a plain HTTP fetch of the HTML misses it.
3. The platform blocks your IP when you crawl too much or too fast
An IP address is rate limited once it sends too many requests or sends them too quickly.
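
To make the third point concrete, here is a minimal sketch of how a crawler might slow down when it gets throttled. It is only an illustration under assumptions: `fetch_politely` and `backoff_delays` are hypothetical helper names, the `fetch` callable stands in for whatever HTTP client you use, and treating status 429 as the throttling signal is my assumption, not something the platform documents here.

```python
import time
from typing import Callable, Iterator, Tuple

# Assumption: HTTP 429 (Too Many Requests) marks a throttled response.
# The platform may signal rate limiting differently.
THROTTLED_STATUSES = {429}


def backoff_delays(base: float = 1.0, factor: float = 2.0,
                   cap: float = 60.0) -> Iterator[float]:
    """Yield exponentially growing delays (base, base*factor, ...), capped at `cap`."""
    delay = base
    while True:
        yield min(delay, cap)
        delay *= factor


def fetch_politely(fetch: Callable[[str], Tuple[int, str]], url: str,
                   max_attempts: int = 5,
                   sleep: Callable[[float], None] = time.sleep) -> Tuple[int, str]:
    """Call `fetch(url)` and retry with growing pauses while it is throttled.

    `fetch` is a hypothetical callable returning (status_code, body).
    Returns the last (status_code, body) seen.
    """
    delays = backoff_delays()
    status, body = fetch(url)
    for _ in range(max_attempts - 1):
        if status not in THROTTLED_STATUSES:
            break
        # Wait longer after every throttled response before trying again.
        sleep(next(delays))
        status, body = fetch(url)
    return status, body
```

Passing `sleep` in as a parameter keeps the sketch easy to try without actually waiting. Backing off like this only softens the rate limit for a single IP; it does not remove it.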
