Tutorial: Generating Professional Networks Profile URLs for Scraping with Python
July 24, 2020
5 min read
Professional networks profiles have become a powerful source of data on individuals. Different parties have their own uses for these profiles, such as data mining, profile research, or lead generation.
There are many methods out there to procure the information we need. In this tutorial, we’ll use one of the simpler ones: using Python to simulate a Google search and collect the URLs returned in the search results. Further processing can be done, either manually or via automation, as we discuss in our other tutorial, How to Build the platform Automation Tools with Python With a Code Example.
The main reason for this approach is to get familiar with one of the most common methods of web scraping: browser-based scraping. For those unfamiliar, this is where we simulate human behaviour on the web by driving a real browser. We will use a tool called Selenium, an open-source testing framework for web applications that lets us control a browser from a script. Its usage goes far beyond testing, however, as we will soon demonstrate.
Setup:
Before we start, we need to set up the project and download the dependencies we need:
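A minimal setup sketch is shown below, assuming you are using Chrome. The `pip` package is `selenium`, and the ChromeDriver binary must match your installed Chrome version; the download location and file path here are assumptions you should adapt to your machine.

```python
# Install the dependency first (assumed setup, adapt as needed):
#   pip install selenium
# Then download the ChromeDriver binary that matches your Chrome version
# and place it in the project folder (e.g. ./chromedriver).
import time
from selenium import webdriver

# Quick smoke test: open a browser window and load Google.
driver = webdriver.Chrome(executable_path='./chromedriver')
driver.get('https://www.google.com')
time.sleep(3)
driver.quit()
```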
Step 1
Let’s simulate a Google search. We will be making use of one of Google’s lesser-known features: search operators, also referred to as advanced operators. These are special characters and commands that impose stricter criteria on our search term, narrowing down the search results. You can read here for a comprehensive list of operators.
```python
def create_search_url(title, location, *include):
    # Base query with search operators (reconstructed; %22 is a URL-encoded double quote)
    base_url = 'https://www.google.com/search?q=-intitle%3A%22profiles%22+-inurl%3A%22dir%2F%22+site%3Aprofessionalsocialnetwork.com%2Fin%2F+OR+site%3Aprofessionalsocialnetwork.com%2Fpub%2F+'
    quote = lambda x: '%22' + x + '%22'  # wrap a term in quotes for exact matching
    result = base_url + quote(title) + '+' + quote(location)
    for word in include:
        result += '+' + quote(word)
    return result
```
We first start by defining our base URL. As you can see, there are search operators already present. The `-intitle:"profiles"` operator tells the engine to exclude pages with the word 'profiles' in the title tag, which filters out directory listings rather than individual profiles. We also use the `-inurl` operator to drop directory URLs, ensuring we only get pages with professional networks profile URLs. Finally, we specify the default profile URLs that start with `professionalsocialnetwork.com/pub/` and the personalised URLs that start with `professionalsocialnetwork.com/in/`, combining the two patterns with the `OR` operator.
Then, we narrow our search even further by wrapping terms in double quotation marks for exact matching. Since we can’t type double quotation marks directly inside a Python string literal that is itself delimited by double quotes, we either use the escape character ("\") before each quote or use its URL-encoded equivalent, %22. We insert the desired title/position and the country passed as arguments into these quotes. To make our function less rigid, we also allow additional arguments for extra criteria: anything passed as `include` will be added for exact matching.
You can test this function yourself by printing the result and pasting it into your browser’s address bar. You can also generate more specific Google search URLs with free online tools like Recruit’em XRay Search.
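As a quick check (the arguments here are just sample values):

```python
url = create_search_url('software engineer', 'singapore', 'developer')
print(url)  # paste the printed URL into your browser to preview the results
```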
Step 2
Now we will make use of Selenium’s WebDriver to spin up our own automated browser. We can use any of the browsers supported by Selenium (Google Chrome, Mozilla Firefox, Opera, etc.), but for now we will use Chrome. Here we can also specify whether to run the browser in headless mode, that is, without the browser window popping up (no GUI), by adding `options.add_argument('headless')`.
```python
max_page = 10
all_urls = []
options = webdriver.ChromeOptions()
options.add_argument('headless')
driver = webdriver.Chrome(options=options, executable_path='./chromedriver')  # path to the chromedriver executable
page = 1  # always start from page 1
driver.get(create_search_url("software engineer", "singapore", "developer", "Enrich Layer"))
while True:
    time.sleep(3)
    # find the urls on the current results page
    urls = driver.find_elements_by_class_name('r')
    urls = [url.find_element_by_tag_name('a') for url in urls]
    urls = [url.get_attribute('href') for url in urls]
    all_urls = all_urls + urls
    # move to the next page
    page += 1
    if page > max_page:
        print('\n end at page: ' + str(page - 1))
        break
    try:
        next_page = driver.find_element_by_css_selector('#pnnext')  # Google's 'Next' button (reconstructed selector)
        next_page.click()
    except Exception:
        print('\n end at page: ' + str(page - 1) + ' (last page)')
        break
print(all_urls)
```
We then navigate to the URL returned by the previous function using `driver.get()`. This opens the page in the browser. Then it’s time to get our hands on the profile URLs on this page of Google results. We can use the browser’s 'Inspect' developer tool to see the HTML tags of the page links. For every link on the page, the class name of the enclosing HTML `div` will be `r`. As per convention, the hyperlink, which is the profile URL, will be embedded within the `href` attribute of the `a` tag.
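Equivalently, assuming the same `div.r > a` structure, the three extraction lines can be collapsed into a single CSS-selector query:

```python
# One-pass extraction via a CSS selector (assumes the 'div.r > a' structure above)
urls = [a.get_attribute('href') for a in driver.find_elements_by_css_selector('div.r > a')]
```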
Every time we finish scraping a Google results page for URLs, we navigate to the next page by finding the 'Next' button with `driver.find_element_by_css_selector()` and simulating a click action. We also keep a page counter and a maximum page count, which you can set to any reasonable number. If the next page doesn’t exist, we stop crawling and return our results.
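If you want to keep the collected URLs for the next stage, one simple option (not part of the original script, and the filename is just a placeholder) is to write them to a file:

```python
# Persist the collected URLs, one per line
with open('profile_urls.txt', 'w') as f:
    f.write('\n'.join(all_urls))
```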
Additional steps
You have now obtained the professional networks profile URLs that you need. Easy, right? All you have to do now is gather the data available on these profiles. Unfortunately, bulk scraping the platform itself won’t be this easy, because the platform has measures in place to prevent it. Read our other article, Why You Shouldn’t Use the platform Automation Tools using YOUR OWN Account, to understand why even using the platform automation tools might endanger your account on the platform.
Good news: you can now scrape user profiles WITHOUT risking your account, with Enrich Layer! You can read the introduction to the Enrich Layer API here. With the Enrich Layer API, you can turn your generated profile URLs into structured data. We also have another tutorial on scraping the platform with Python using Enrich Layer, How to Build the platform Automation Tools with Python With a Code Example, to complete the data mining process.
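As a rough illustration of that last step (the endpoint, parameter, and header names below are hypothetical placeholders, not Enrich Layer’s documented API; consult the linked introduction for the real interface):

```python
import requests

# Hypothetical sketch: endpoint and parameter names are placeholders,
# not Enrich Layer's documented API.
API_ENDPOINT = 'https://api.enrichlayer.example/v1/profile'
API_KEY = 'YOUR_API_KEY'

for url in all_urls:
    response = requests.get(
        API_ENDPOINT,
        params={'url': url},
        headers={'Authorization': 'Bearer ' + API_KEY},
    )
    print(response.json())  # structured profile data
```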