The definitive guide to building your own Professional Networks Profile Scraper for 1M profiles (2022)

September 8, 2020

10 min read


Having built the early prototype for Enrich Layer API, which turns professional networks profiles into JSON, I learnt a little about how one might scrape public professional networks profiles at scale. In this tutorial, I will share my experience building a professional networks profile scraper that works in 2022, and I hope you will find it useful.

PS: You can turn user profiles into JSON with Enrich Layer API.

To put this tutorial in context, we will preface it with the problem of:

How to scrape 1 million user profiles, and then parse the HTML content into structured data?

Breaking down the problem:

  • How to crawl a million user profiles and fetch their on-page HTML content

  • How to parse the HTML content from a public professional networks profile to structured data

Part 1: How to scrape 1M public user profiles for HTML code

Before we embark on the quest to scrape a million profiles, let's start with crawling ten profiles. There are only two ways to crawl ten user profiles for scraping:

  • As a user logged into the platform. (A "logged in user")

  • Or, as a user that is not logged into the platform. (An "anonymous user.")

1A: Accessing user profiles as an anonymous user

It requires luck to access a professional networks profile without being logged into the platform.

In my experience, you might be able to access the first profile as an anonymous user if you have not recently clicked into any user profiles.

Even if you succeed in viewing a public profile anonymously on your first attempt, more likely than not you will be greeted with the dreaded Authwall on your second profile visit.

What is the Authwall and how do you circumvent it?

The Authwall exists to block web scraping from users who are not logged into the platform.

  • If you visit a public profile from a non-residential IP address, such as from a data center IP address, you will get the Authwall.

  • If you visit a public profile without any cookies in your browser session (aka incognito mode), you will get the Authwall.

  • If you are visiting a public profile from a non-major browser, you will get the Authwall.

  • If you are visiting a public profile multiple times, you will get the Authwall.

There are many reasons that you will be greeted with the Authwall when you are crawling anonymously. But there is one way you can reliably bypass it – crawl the platform as Googlebot. If you can access a public profile page on the platform from an IP address that belongs to Google, you can consistently fetch an available professional networks profile without the Authwall.
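At minimum, an anonymous crawler should detect when it has been walled off. Here is a minimal sketch, assuming the platform redirects blocked visitors to a URL containing `/authwall` – the marker is an observation, not a documented contract, so adjust it to whatever redirect you actually see:

```python
import requests

# Assumed marker: blocked visitors get redirected to a URL containing this path.
AUTHWALL_MARKER = '/authwall'

def fetch_public_profile(url, session=None):
    """Fetch a profile anonymously, raising if the Authwall intercepts us."""
    session = session or requests.Session()
    resp = session.get(url, allow_redirects=True, timeout=10)
    if AUTHWALL_MARKER in resp.url:
        raise RuntimeError('Blocked by the Authwall: %s' % resp.url)
    return resp.text
```

The `session` parameter lets you plug in a session that carries cookies or proxies, and also makes the logic testable without network access.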

What does an IP address from Google mean?

It is an IP address that reverse-resolves to *.googlebot.com. See this Google support page for a clear definition. And no, IP addresses from Google Cloud instances do not work.
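Google's documented verification is a forward-confirmed reverse DNS check: reverse-resolve the IP, check the hostname, then resolve the hostname forward and confirm it matches. A small sketch of that check – the resolver arguments exist only so the logic can be exercised without network access:

```python
import socket

def is_googlebot(ip,
                 gethostbyaddr=socket.gethostbyaddr,
                 gethostbyname=socket.gethostbyname):
    """Forward-confirmed reverse DNS check for Googlebot."""
    try:
        hostname = gethostbyaddr(ip)[0]          # reverse lookup
    except socket.herror:
        return False
    if not (hostname.endswith('.googlebot.com')
            or hostname.endswith('.google.com')):
        return False
    try:
        return gethostbyname(hostname) == ip     # forward confirmation
    except socket.gaierror:
        return False
```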

But there is one type of page on the platform that you can crawl without restrictions

Put yourself in the shoes of a platform executive. What makes you money? Profile data. Which is why the Authwall is used to lock up profile data.

What else makes the platform money? Jobs. The platform makes money when companies list jobs on it. These companies will return to professional networking platforms again and again if the platform succeeds at matching great candidates to their job postings.

Job postings on the platform are not blocked by the Authwall, to maximize page views.

1B: Accessing user profiles logged into the platform

You and I are probably not Googlers, which means we do not have access to the range of addresses belonging to Googlebot. But there is respite.

You can log into the platform to reliably access user profiles. However, as tempting as it may be, I highly recommend that you not use your personal professional networks profile to perform a bulk profile crawl for scraping purposes. You do not want your personal professional networks profile to be blocked.

And it will be blocked should you scrape past a certain threshold or when the platform detects abnormal (automated) behavior in your account.

But yes, log into your professional networks profile, and you can crawl ten profiles with no problems. And that brings me to the next section – getting from 10 profiles to 1M profiles.

Can I crawl 1M user profiles to scrape by creating many platform accounts?

It is only natural to believe that you can build a professional networks scraper if you manage a pool of disposable platform accounts. You are not wrong. Building a pool of workers with disposable accounts is indeed feasible – if and only if humans meticulously manage each account.

Once you begin automated crawls on any account, you will start encountering random reCAPTCHA challenges that keep the account locked until they are solved.

Each account in your scraping pool will also require a unique residential IP address. So the short answer is yes: you can crawl 1M user profiles with many accounts, each on its own residential IP address.
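The mechanics of such a pool can be sketched as a simple round-robin rotation, where each worker pairs one logged-in session with one residential proxy. The cookie names and proxy URLs below are placeholders, not the platform's real session cookies:

```python
import itertools
import requests

# Illustrative worker pool: each entry pairs one logged-in session's cookies
# with one residential proxy. All values here are placeholders.
WORKERS = [
    {'cookies': {'session': 'worker-a'},
     'proxy': 'http://user:[email protected]:8080'},
    {'cookies': {'session': 'worker-b'},
     'proxy': 'http://user:[email protected]:8080'},
]
_pool = itertools.cycle(WORKERS)

def fetch_with_next_worker(url, get=requests.get):
    """Fetch one URL using the next worker's cookies and residential proxy."""
    worker = next(_pool)
    return get(url,
               cookies=worker['cookies'],
               proxies={'http': worker['proxy'], 'https': worker['proxy']},
               timeout=10)
```

In practice you would also rate-limit each worker and retire accounts that hit reCAPTCHA challenges, which is where the human babysitting comes in.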

Recap: What you need to do to crawl 1M profiles

The first step to scraping is to get the HTML code of profiles at scale. In this article, we put a number to "scale": one million profiles. There are only a few ways to crawl 1M user profiles, and they are:

  • Access the platform from an IP address that resolves as Googlebot

  • Manage a large pool of workers, each logged into an individual platform account, with each account sitting on a unique residential IP address

  • Use Enrich Layer API – see the next section.

Using Enrich Layer API to enrich 1M user profiles

Enrich Layer is an offering we built that provides a managed service to turn professional networks profile URLs into structured JSON data.

If you ask me which is the best way to scrape user profiles, then I will tell you in a very biased way to use Enrich Layer's API. Specifically, the Person Profile Endpoint. Our Person Profile Endpoint takes a professional networks profile URL and returns you the structured data of the public profile.
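Calling such an endpoint is a single HTTP request. The sketch below is hedged: the endpoint URL, parameter name, and auth header are illustrative, so check the official API documentation for the real values (the `get` argument just makes the function testable offline):

```python
import requests

# Illustrative endpoint URL -- consult the Enrich Layer docs for the real one.
API_ENDPOINT = 'https://enrichlayer.com/api/v2/profile'

def get_person_profile(profile_url, api_key, get=requests.get):
    """Turn one profile URL into structured JSON via the Person Profile Endpoint."""
    resp = get(API_ENDPOINT,
               params={'url': profile_url},          # assumed parameter name
               headers={'Authorization': 'Bearer ' + api_key},
               timeout=30)
    resp.raise_for_status()
    return resp.json()  # structured profile data, like the blob shown in Part 2
```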

Part 2: I have HTML code of a profile page, how do I scrape content off it?

Now that you have 1M profiles, it is time to get the content out of the HTML code into structured data. To convert HTML pages to structured data is what I define as "parsing." Crawling profiles gets you a bunch of pages as HTML code. Parsing turns pages of HTML code into machine-readable structured data, like this:

```python
{'accomplishment_courses': [],
 'accomplishment_honors_awards': [
     {'description': 'Nanyang Scholarship recognizes students who excel '
                     'academically, demonstrate strong leadership potential, '
                     'and possess outstanding co-curricular records.\n',
      'issued_on': {'day': None, 'month': None, 'year': 2015},
      'issuer': 'Nanyang Technological University',
      'title': 'NANYANG Scholarship'},
     {'description': 'Awarded to students with exceptional results in '
                     'Physics and Mathematics',
      'issued_on': {'day': None, 'month': None, 'year': 2015},
      'issuer': 'Defence Science & Technology Agency',
      'title': 'Young Defence Scientist Programme (YDSP) Academic Award'},
     {'description': 'An annual competition to encourage the study and '
                     'appreciation of Physics as well as highlight Physics '
                     'talent.',
      'issued_on': {'day': None, 'month': None, 'year': 2012},
      'issuer': 'Institute of Physics Singapore',
      'title': 'Singapore Junior Physics Olympiad (Main Category) '
               'Honourable Mention'},
     {'description': 'Certificate awarded to student who topped the cohort '
                     'in all aspects of Science.',
      'issued_on': {'day': None, 'month': None, 'year': 2010},
      'issuer': 'Xinmin Secondary School',
      'title': 'Certificate of Excellence - Top in Science'},
     {'description': None,
      'issued_on': {'day': 1, 'month': 9, 'year': 2018},
      'issuer': 'Nanyang Technological University',
      'title': "Dean's List FY17/18"}],
 ...
 'volunteer_work': []}
```

Two ways to parse content from HTML code

There are two ways to scrape content from the HTML page, and the approach to take depends entirely on how the page is crawled.

Two factors decide which is the best method to use:

  • Is on-page javascript parsed before the HTML code of the profile page is collected?

  • Is the profile viewed as an anonymous user or as a user logged into the platform?

Method matrix for your reference

|                         | Anonymous user | Logged into the platform |
| ----------------------- | -------------- | ------------------------ |
| Javascript not rendered | Dom Scraping   | Code Chunk Scraping      |
| Javascript is rendered  | Dom Scraping   | Dom Scraping             |

Dom parsing

Dom parsing is the standard method that most developers use for web scraping. You find the data within fixed HTML tags on a page that is loaded and rendered. You can fetch most of the content of a profile page by traversing HTML tags, either via CSS selectors or XPath.

The problem is that the layout of the HTML pages changes often, and it varies by locale. A profile loaded in an Arabic locale will differ in layout from a profile loaded in English. Every time something changes, expect your scraper to break. Dom scraping is easy to implement but high-maintenance.
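A minimal Dom parsing sketch with beautifulsoup4 looks like this. The CSS class names are made up for illustration – the platform's real class names change often, which is precisely the fragility described above:

```python
from bs4 import BeautifulSoup

def parse_name_and_headline(html):
    """DOM-scrape the display name and headline from a profile page.

    The class names 'top-card__name' and 'top-card__headline' are
    illustrative placeholders, not the platform's actual markup.
    """
    soup = BeautifulSoup(html, 'html.parser')
    name = soup.select_one('h1.top-card__name')
    headline = soup.select_one('h2.top-card__headline')
    return (name.get_text(strip=True) if name else None,
            headline.get_text(strip=True) if headline else None)
```

The day the platform renames those classes, this function silently returns `None` values – hence the maintenance burden.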

Code Chunk Scraping

Code Chunk Scraping is a superior method reserved for profile pages fetched as a logged-in user, before javascript is rendered. It is a better method because it does not depend on the HTML dom structure – which means that page layout changes on the platform will not break this scraping method. Instead, it looks at the in-page data placed within `<code></code>` tags. These blobs of JSON data are used by the platform's javascript code to populate the page's dom elements. With the Code Chunk Scraping method, you traverse JSON objects instead of Dom elements.

Because the JSON blob is already structured, we do not have to tokenize strings to re-structure the data; we can return it as it is. That means you do not need to parse "12th March 2020" into a machine-readable Date object.
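The first step of Code Chunk Scraping can be sketched as collecting every JSON blob embedded in the page's `<code>` tags. This assumes the blobs are plain JSON text inside those tags, which is how the raw (pre-javascript) HTML carries them:

```python
import json
from bs4 import BeautifulSoup

def extract_code_chunks(html):
    """Collect every parseable JSON blob embedded in <code> tags."""
    soup = BeautifulSoup(html, 'html.parser')
    chunks = []
    for tag in soup.find_all('code'):
        try:
            chunks.append(json.loads(tag.get_text()))
        except (ValueError, TypeError):
            continue  # not every <code> tag holds JSON; skip the rest
    return chunks
```

From there, scraping is a matter of walking the returned dictionaries for the record types you want, as the patent example below does.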

To recap: the Code Chunk scraping method

  • is faster to crawl because you can skip Javascript parsing

  • breaks less due to on-page layout changes

  • but, requires you to be logged into the platform when fetching profiles

Here is an example of data traversal with the Code Chunk Scraping method to return Patent achievements from a user profile:

```python
def get_patents(data):
    patent_list = []
    for dic in Person._type_in_include_rows(
            data, 'com.the platform.voyager.dash.identity.profile.Patent'):
        # NOTE: the key names below are reconstructed for illustration and
        # may differ from the platform's current JSON schema.
        application_number = dic.get('applicationNumber')
        issued_on = None
        issued_on_dic = dic.get('issuedOn', {})
        if issued_on_dic:
            issued_on = Date(issued_on_dic.get('day'),
                             issued_on_dic.get('month'),
                             issued_on_dic.get('year'))
        patent_list += [Patent(
            application_number=application_number,
            issued_on=issued_on,
            title=dic.get('title'),
        )]
    return patent_list
```

So you want to build your own professional networks Profile Scraper

In this article, I explained that scraping user profiles is a two-step process.

The first step is to crawl user profiles and save the HTML code for further processing in the second step. The second step is to process the HTML code and turn raw HTML code into structured data that you can use in your application.

There are only two methods to crawl user profiles at scale – anonymously as Googlebot, or via a pool of workers logged into the platform with unique residential IP addresses. It is not impossible: you can get yourself 1M HTML files if you work within these limitations.

The next step is to process these 1M HTML files and turn them into structured data for your application. If you crawled the pages without rendering javascript but with an account logged into the platform, you should use the Code Chunk Scraping method, which is superior because it breaks far less often. Otherwise, you can perform regular scraping with your favorite Dom traversal library using the Dom Parsing method. (I recommend beautifulsoup4 if you are using Python.)

Even if you are a well-funded startup, it is not trivial to crawl profile data in scale. You need a secret weapon.

Enrich Layer is a managed enrichment service for professional networks profile URLs.

Just like how you have chosen AWS instead of building and colocating your server farms, dataset acquisition is a menial task best left as a managed service. I can only write this article in such detail because of the combined expertise of our entire development team and learned experience over the years.

Why crawl the platform yourself, when you can purchase an exhaustive (public) profile dataset loaded with data on user profiles in the US?

Why manage a professional networks profile scraper when you can use our API and get a professional networks profile as structured data for $0.01 per profile?

I would love to help your business integrate data at the core of your product. Send an email to [email protected], let me know how I can help you with your data needs, and let Enrich Layer be your secret weapon.

The tutorial is not complete without code samples.

In this article, I shared at a high level how you might scrape user profiles at scale. But a tutorial is not complete without code samples. In the follow-up article, I will release fully working code samples to complement this article. Please subscribe to Enrich Layer's mailing list [here](https://sendy.enrichlayer.com/subscription?) to be notified of the next article with code samples.