Tutorial: How to crawl Professional Networks with Python and `requests` library with code examples (Part 2)

May 22, 2019

This is the second part of a two-part series on crawling professional networks at scale. In an earlier article, we studied why professional networks are a hard target to crawl. In this follow-up, I will dive deep into a technical tutorial on how you can crawl professional networks at scale, with demo code.

Update 17th June 2020: Enrich Layer has released an API for crawling user profiles for $0.01 per profile. I highly recommend you take a look at their API: platform

In this tutorial, I will walk you through the code needed to get the full name from a person's professional network profile. While this tutorial focuses on only one profile, the same method can be scaled to as many asynchronous nodes as you want.

Setting up prerequisites

  • Python 3
  • requests
  • An Enrich Layer credential (username and password)

How to get an Enrich Layer credential

You can request a free trial Enrich Layer credential at Enrich Layer's website. However, with the trial credential, you are rate limited to 1 request every minute.

If you require a credential with higher rate limits, please send an email to [email protected]. You will be required to pay a trial fee for a trial key with higher rate limits.

1. Start with a professional network profile and make an Enrich Layer request

Let's start with a professional network profile, say Bill Gates's profile: https://www.professionalsocialnetwork.com/in/williamhgates/

We will use Enrich Layer's browser crawl because the platform's page requires JavaScript to render. Let's go into the Python code:

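```python
import requests
import json

# The endpoint and username were cut off in the original listing;
# the two placeholders below stand in for your own Enrich Layer values.
API_ENDPOINT = '...'

payload = {
    'id': 'bill-gates-crawl-id',
    'url': 'https://www.professionalsocialnetwork.com/in/williamhgates/',
    'type': 'browser',
    'headers': {'LANG': 'en'},  # a dict (key: value), not a set
}
response = requests.post(API_ENDPOINT, data=json.dumps(payload),
                         auth=('USERNAME', 'PASSWD'))
```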

Let's break this down

In the code snippet above, you are making an Enrich Layer request of type browser, which means it is a browser crawl request. A browser crawl request simulates opening the page in a real browser, with real user sessions.

The headers parameter in the payload dictionary ensures that the returned page is always in English, because not all of our nodes are located in English-speaking countries. It overwrites the LANG header in the request with the value en.

Now that we have crafted the payload, we send the request off by calling requests.post(). This makes an API request with the HTTP POST method to the Enrich Layer servers, which then forward the request to a randomly selected node.

All you have to do now is wait for a response.

I tried this, but the response is not a proper professional networks profile page

Not all nodes are logged into the platform. Please retry a few times until you get a positive result.
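The retry advice above can be sketched as a small loop. This is only an illustration under assumptions: `crawl_with_retries`, `fetch`, and `is_good` are hypothetical names, `fetch` stands for whatever function sends your Enrich Layer request, and `is_good` is whatever check you use to decide that the response is a real profile page. The 60-second default wait reflects the trial credential's limit of one request per minute.

```python
import time


def crawl_with_retries(fetch, is_good, max_attempts=5, wait_seconds=60):
    """Call fetch() until is_good(result) is true or attempts run out.

    fetch and is_good are supplied by the caller (hypothetical helpers,
    not part of any Enrich Layer client library).
    """
    for attempt in range(1, max_attempts + 1):
        result = fetch()
        if is_good(result):
            return result
        if attempt < max_attempts:
            # Respect the rate limit before hitting another random node.
            time.sleep(wait_seconds)
    # Every attempt landed on a node that gave an unusable page.
    return None
```

Since each attempt goes to a randomly selected node, a handful of attempts is usually a more sensible cap than retrying forever.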

The page loads, but the page is not rendered

On slower computers or internet connections, the AJAX calls made by the page's JavaScript on load take longer to complete. By default, the page has only 500 ms (half a second) to

  • Make AJAX requests to populate the page
  • Render the UI elements from those AJAX requests

so you should expect that results might be incomplete. To solve this problem, increase the value of dom_read_delay_ms from its default of 500 (ms) to 30000 (ms). This asks the browser to wait 30 seconds after the page has loaded (like jQuery's $(document).ready()).

Modifying payload to include dom_read_delay_ms parameter

```python
payload = {
    'id': 'bill-gates-crawl-id',
    'url': 'https://www.professionalsocialnetwork.com/in/williamhgates/',
    'type': 'browser',
    'headers': {'LANG': 'en'},
    'dom_read_delay_ms': 30000,
}
```

(Adding dom_read_delay_ms to the payload)
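If you build many payloads, it may help to wrap the dictionary in a function so the delay is a single argument. The sketch below is an assumption-laden convenience, not part of Enrich Layer's API: `build_payload` and `client_timeout_seconds` are hypothetical helpers, and the 30-second margin is an arbitrary choice.

```python
def build_payload(crawl_id, url, dom_read_delay_ms=500):
    """Build a browser crawl payload with the fields used in this tutorial."""
    return {
        'id': crawl_id,
        'url': url,
        'type': 'browser',
        'headers': {'LANG': 'en'},
        'dom_read_delay_ms': dom_read_delay_ms,
    }


def client_timeout_seconds(payload, margin_seconds=30):
    """Suggest a client-side timeout longer than the render delay.

    The margin is a guess; tune it to the latency you actually observe.
    """
    return payload['dom_read_delay_ms'] / 1000 + margin_seconds
```

requests.post() has no timeout by default, but if you do pass timeout=, it needs to be longer than the delay, otherwise the client gives up before the node has finished rendering.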

2. Use BeautifulSoup to extract full name from the HTML

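```python
from bs4 import BeautifulSoup

response_json = json.loads(response.text)  # name reconstructed; json and response come from step 1
soup = BeautifulSoup(response_json['data'], 'html.parser')  # parser choice assumed
h1 = soup.find_all("h1", class_="pv-top-card-section__name")[0]
print(h1.text)
```

Let's break down the code.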

In line 1, we import the BeautifulSoup class. We use BeautifulSoup to parse the HTML document retrieved from the Enrich Layer request, and also to navigate the DOM elements to extract relevant data.

In line 3, we unpack the response from requests, a JSON string, into a dictionary. The HTML document is contained in the data key of the dictionary, so we take that and initialize the BeautifulSoup object in line 4.

With the BeautifulSoup object initialized, in line 5 we search the HTML document for an h1 element with the class pv-top-card-section__name. Because the .find_all() method returns a list, we assign the first result in the list to the h1 variable. (There should only be one.)

Then the full name of Bill Gates is printed out in line 6.
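Indexing the result with [0] raises an IndexError when the node returns a login wall or a half-rendered page, so it can be useful to have an extraction step that returns None instead. The sketch below performs the same lookup, but it deliberately swaps BeautifulSoup for Python's built-in html.parser so that it has no dependencies. `extract_full_name` and `_NameFinder` are hypothetical names, and the response shape (a JSON object with the HTML under data) is taken from the description above.

```python
import json
from html.parser import HTMLParser


class _NameFinder(HTMLParser):
    """Collects the text of the first <h1> that carries the target class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.inside = False  # True while inside the matching <h1>
        self.done = False    # True once the first match has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done or self.inside or tag != "h1":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.target_class in classes:
            self.inside = True

    def handle_endtag(self, tag):
        if self.inside and tag == "h1":
            self.inside = False
            self.done = True

    def handle_data(self, data):
        if self.inside:
            self.parts.append(data)


def extract_full_name(response_text, target_class="pv-top-card-section__name"):
    """Return the full name, or None if the page does not contain it."""
    body = json.loads(response_text)
    html = body.get("data") or ""
    finder = _NameFinder(target_class)
    finder.feed(html)
    finder.close()
    if not finder.done:
        return None
    return "".join(finder.parts).strip()
```

With the response from step 1, this would be called as extract_full_name(response.text). A None result is a signal to retry the crawl rather than a reason to crash.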

3. Scaling it up (an exercise for the reader)

In steps 1 and 2, we built a prototype to extract a user's full name from their professional network profile. But there is a lot more to do before you have a full-fledged crawler that is scalable. Here are some suggestions:

  • Consider using asyncio to launch multiple requests.
  • Notice that each Enrich Layer request takes quite a bit of time, especially after you increase dom_read_delay_ms to 30 seconds, which means each request takes at least 30 seconds. You do not want to keep waiting for responses to return. Instead, you can have Enrich Layer call back to a web endpoint that you have set up, with the result. This is what the id in the payload is for. See the asynchronous browser crawl documentation page for more information.
  • Check for errors and retry. Possible errors include, but are not limited to:
      • The page isn't rendered completely or properly
      • The node is not logged into the platform
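As a starting point for the asyncio suggestion, here is a minimal sketch of running several blocking crawls at once. It assumes you already have a blocking function that crawls one URL (for example, one that calls requests.post() as in step 1); `crawl_many` and `crawl_one` are hypothetical names. Because requests is blocking, each call is handed to the default thread pool, and a semaphore caps how many run at a time.

```python
import asyncio


async def crawl_many(urls, crawl_one, max_concurrency=5):
    """Run the blocking crawl_one(url) for every URL, a few at a time.

    Returns a dict mapping each URL to its result, or to the exception
    it raised, so one failed crawl does not discard the others.
    """
    semaphore = asyncio.Semaphore(max_concurrency)
    loop = asyncio.get_running_loop()

    async def worker(url):
        async with semaphore:
            try:
                # Hand the blocking call to the default thread pool.
                return url, await loop.run_in_executor(None, crawl_one, url)
            except Exception as exc:
                return url, exc

    pairs = await asyncio.gather(*(worker(u) for u in urls))
    return dict(pairs)
```

You would run it with asyncio.run(crawl_many(profile_urls, my_crawl_function)). URLs whose value is an exception are the candidates for the error check and retry described in the list above. Keep max_concurrency within what your credential's rate limit allows.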
