Tutorial: How to crawl Professional Networks with Python and `requests` library with code examples (Part 2)
May 22, 2019
5 min read
This is the second part of a two-part series on crawling professional networks at scale. In the earlier article, we studied why professional networks are a hard target to crawl. In this follow-up, I will dive deep into a technical tutorial, with demo code, on how you can crawl professional networks at scale.
Update 17th June 2020: Enrich Layer has released an API for crawling user profiles for $0.01 per profile. I highly recommend you take a look at their API: platform
In this tutorial, I will walk you through code that extracts the full name from a person's professional networks profile. While this tutorial focuses on only one profile, the same method can be scaled out to as many asynchronous nodes as you want.
Setting up prerequisites
- Python 3
- `requests`
- An Enrich Layer credential (username and password)
How to get an Enrich Layer credential
You can request a free trial Enrich Layer credential at Enrich Layer's website. However, with the trial credential, you are rate limited to 1 request every minute.
If you require a credential with higher rate limits, please send an email to [email protected]. You will be required to pay a trial fee for a trial key with higher rate limits.
1. Start with a professional networks profile and make an Enrich Layer request
Let's start with a professional networks profile, say Bill Gates's profile: https://www.professionalsocialnetwork.com/in/williamhgates/
We will use Enrich Layer's browser crawl because the platform's page requires JavaScript to be rendered. Let's go into the Python code:
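`import requests
import json

API_ENDPOINT = '...'  # your Enrich Layer API endpoint

payload = {
    'id': 'bill-gates-crawl-id',
    'url': 'https://www.professionalsocialnetwork.com/in/williamhgates/',
    'type': 'browser',
    'headers': {'LANG': 'en'},  # a dict, not a set, so it can be serialized
}

# Authenticate with your Enrich Layer username and password
r = requests.post(API_ENDPOINT, data=json.dumps(payload), auth=('USERNAME', 'PASSWD'))
`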
Let's break this down
In the code snippet above, you are making an Enrich Layer request of type `browser`, which means it is a browser crawl request. A browser crawl request simulates opening the page in a real browser, with real user sessions.
The `headers` parameter in the `payload` dictionary ensures that the returned language is always English, because not all our nodes are located in English-speaking countries. It overwrites the `LANG` header in the request with the value `en`.
Now that we have crafted the `payload`, we send the request off by calling `requests.post()`. This makes an API request with the HTTP `POST` method to Enrich Layer's servers, and Enrich Layer forwards the request to a randomly selected node.
All you have to do now is wait for a response.
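The payload described above can also be assembled with a small helper. This is only a minimal sketch: `build_crawl_payload` is a hypothetical name, not part of any Enrich Layer client, and it assumes the API accepts a JSON body (which the `import json` in the snippet suggests). The field names are taken from the snippet above.

```python
import json


def build_crawl_payload(crawl_id, url, lang="en"):
    """Return the request body for a browser crawl as a plain dict."""
    return {
        "id": crawl_id,
        "url": url,
        "type": "browser",
        # Must be a dict (key: value). A set such as {"LANG", "en"} cannot
        # be serialized by json.dumps and raises TypeError.
        "headers": {"LANG": lang},
    }


payload = build_crawl_payload(
    "bill-gates-crawl-id",
    "https://www.professionalsocialnetwork.com/in/williamhgates/",
)
# The JSON string that would be passed to requests.post() as the body.
body = json.dumps(payload)
```

Keeping the payload in one function makes it easier to reuse the same shape for many profile URLs later on.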
I tried this, but the response is not a proper professional networks profile page
Not all nodes are logged into the platform. Please retry a few times until you get a positive result.
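The retry advice above can be sketched as a small loop. Everything here is an assumption made for illustration: `fetch_with_retry` and `looks_like_profile` are hypothetical helpers, `fetch` stands in for whatever function sends your Enrich Layer request and returns the HTML, and the validity check simply looks for the CSS class of the name heading that step 2 of this tutorial relies on.

```python
import time


def fetch_with_retry(fetch, is_valid, max_attempts=5, wait_s=0.0):
    """Call fetch() until is_valid(result) is true or attempts run out.

    Returns the first valid result, or None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        result = fetch()
        if is_valid(result):
            return result
        if attempt < max_attempts:
            # With the trial credential (1 request per minute), wait_s
            # would need to be at least 60.
            time.sleep(wait_s)
    return None


def looks_like_profile(html):
    """Heuristic: a rendered profile contains the name heading's CSS class."""
    return "pv-top-card-section__name" in html
```

Capping the number of attempts matters here, because each retry may land on yet another node that is not logged in.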
The page loads, but the page is not rendered
On slower computers or internet connections, the AJAX calls made by the page's JavaScript when the page loads take longer to complete. When the page has only 500 ms (half a second) to
- make AJAX requests to populate the page
- render the UI elements from those AJAX requests
you should expect that the results might be incomplete. To solve this problem, we have to increase the value of `dom_read_delay_ms` from its default of `500` (ms) to `30000` (ms). This asks the browser to wait 30 seconds after the page has loaded (like jQuery's `$(document).ready()`).
Modifying `payload` to include the `dom_read_delay_ms` parameter:
`payload = {
    'id': 'bill-gates-crawl-id',
    'url': 'https://www.professionalsocialnetwork.com/in/williamhgates/',
    'type': 'browser',
    'headers': {'LANG': 'en'},
    'dom_read_delay_ms': 30000,
}
`
(Adding `dom_read_delay_ms` to the payload)
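One consequence of a longer `dom_read_delay_ms` is that the request itself takes at least that long, so any client-side timeout you set has to exceed it. The sketch below derives such a timeout from the payload. `client_timeout_s` and the 30-second safety margin are assumptions chosen for illustration, not values documented by Enrich Layer.

```python
def client_timeout_s(payload, margin_s=30.0, default_delay_ms=500):
    """Return a client-side timeout (seconds) that outlasts the render delay."""
    delay_ms = payload.get("dom_read_delay_ms", default_delay_ms)
    return delay_ms / 1000.0 + margin_s


payload = {
    "id": "bill-gates-crawl-id",
    "url": "https://www.professionalsocialnetwork.com/in/williamhgates/",
    "type": "browser",
    "headers": {"LANG": "en"},
    "dom_read_delay_ms": 30000,
}
timeout = client_timeout_s(payload)  # 30 s render delay + 30 s margin
```

The result can be passed as the `timeout=` argument of `requests.post()`, so that a stuck request fails instead of hanging forever.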
2. Use BeautifulSoup to extract full name from the HTML
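`from bs4 import BeautifulSoup

response_dict = r.json()  # r is the response returned by requests.post() in step 1
soup = BeautifulSoup(response_dict['data'], 'html.parser')
h1 = soup.find_all("h1", class_="pv-top-card-section__name")[0]
print(h1.text)
`
Let's break down the code.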
In line 1, we import the `BeautifulSoup` class. We use `BeautifulSoup` to parse the HTML document retrieved from the Enrich Layer request, and to navigate the DOM elements to extract relevant data.
In line 3, we unpack the response from `requests` from a JSON string into a dictionary. The HTML document is contained in the `data` key of the dictionary, so we take that and initialize the `BeautifulSoup` object in line 4.
With the `BeautifulSoup` object initialized, in line 5 we search the HTML document for an `h1` element with the class `pv-top-card-section__name`. Because the `.find_all()` method returns a list, we assign the first result in the list to the `h1` variable. (There should only be one.)
Then the full name of Bill Gates is printed out in line 6.
3. Scaling it up (an exercise for the reader)
In steps 1 and 2, we built a prototype that extracts a user's full name from their professional networks profile. But there is a lot more to do before you have a full-fledged crawler that is scalable. Here are some suggestions:
- Consider using `asyncio` to launch multiple requests.
- Notice that each Enrich Layer request takes quite a bit of time, especially after you increase `dom_read_delay_ms` to 30 seconds, which means each request takes at least 30 seconds. You do not want to keep waiting for responses to return. Instead, you can have Enrich Layer call back to a web endpoint that you have set up with the result. This is what the `id` in the `payload` is for. See the asynchronous browser crawl documentation page for more information.
- Check for errors and retry. Possible errors include, but are not limited to:
  - The page isn't rendered completely or properly
  - The platform is not logged in
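As a starting point for the `asyncio` suggestion, here is a minimal sketch of running many blocking crawl calls concurrently. It makes several assumptions: `crawl_all` is a hypothetical helper, `fetch` stands in for your own function that wraps `requests.post()` and returns the HTML for one URL, and the concurrency cap of 5 is arbitrary. Because `requests` is blocking, each call is handed to the default thread pool.

```python
import asyncio


async def crawl_all(urls, fetch, max_concurrency=5):
    """Run the blocking fetch(url) for each URL, at most max_concurrency at a time.

    Returns a dict mapping url -> result, or the exception raised for that url.
    """
    semaphore = asyncio.Semaphore(max_concurrency)
    loop = asyncio.get_running_loop()

    async def crawl_one(url):
        async with semaphore:
            # Run the blocking call in the default thread pool executor.
            return await loop.run_in_executor(None, fetch, url)

    results = await asyncio.gather(
        *(crawl_one(url) for url in urls),
        return_exceptions=True,  # keep going when a single crawl fails
    )
    return dict(zip(urls, results))
```

You would run it with `asyncio.run(crawl_all(urls, fetch))`. Failed URLs come back as exception objects, which makes it straightforward to collect them and retry. Note that with the trial credential, which allows one request per minute, concurrency will not help; this only pays off with a higher rate limit.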