Web Scraping with Selenium: DIY or Buy?

There are several frameworks and libraries to learn and use as you pick up the basics of web scraping. A good grasp of HTTP methods such as GET and POST, combined with Selenium, will make your data extraction process much easier.

Selenium is a widely known tool for automating web browser interactions. Combining it with other libraries such as BeautifulSoup gives you even better results when you perform web scraping.

Table of Contents

1. Setting Up Selenium

2. Quick Starting Selenium

3. Data Extraction with Selenium by Locating Elements

4. Selenium vs Scraping Tools: Real-Time Crawler

Selenium works by automating browser interactions such as clicking and scrolling through a written script, so no human intervention is needed for the script to interact with the browser.

Even though Selenium is best known as a tool for testing web applications, its uses go well beyond that.

In this guide, we will cover Selenium web scraping using Python 3.x.


Setting Up Selenium

You first need to install the Selenium package. To do so, execute this pip command in your terminal:

pip install selenium

After this, you need to install a Selenium driver as well. The driver is what lets Python control and interact with the web browser at the operating-system level. If you install a driver manually, its executable must sit in a directory listed in your PATH variable. Drivers are available for the major browsers: geckodriver for Firefox, ChromeDriver for Chrome, and msedgedriver for Edge.
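
If you prefer not to rely on PATH, you can point Selenium at the driver binary directly. A minimal sketch, assuming geckodriver for Firefox (the path below is a placeholder; adjust it to your system):

from selenium import webdriver

# geckodriver found via the PATH variable:
browser = webdriver.Firefox()

# or point Selenium at the binary explicitly (placeholder path):
browser = webdriver.Firefox(executable_path='/usr/local/bin/geckodriver')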


Quick Starting Selenium

Let us begin by starting up your web browser:

1. Open a new browser window

2. Load a page of your choice. In this instance, we will load the Limeproxies homepage:

from selenium import webdriver
browser = webdriver.Firefox()
browser.get('http://limeproxies.com/')

Doing this launches the browser in headful mode. If you want to switch the browser to headless mode and run it on a server, the script looks like this:

from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.headless = True
options.add_argument("--window-size=1920,1200")

# DRIVER_PATH is the path to your geckodriver executable
driver = webdriver.Firefox(options=options, executable_path=DRIVER_PATH)
driver.get("https://www.oxylabs.io/")
print(driver.page_source)
driver.quit()


Data Extraction with Selenium by Locating Elements

find_element

There are several functions you can use to locate elements on a page with Selenium:

1. find_element_by_id

2. find_element_by_name

3. find_element_by_xpath

4. find_element_by_link_text (matches the full text of a link)

5. find_element_by_partial_link_text (matches part of a hyperlink's text)

6. find_element_by_tag_name

7. find_element_by_class_name

8. find_element_by_css_selector (uses a CSS selector, e.g. an id or class)

For example, let's locate the H1 tag on the Limeproxies homepage using Selenium:

<html>
<head>
… something
</head>
<body>
<h1 class="someclass" id="greatID">Partner Up With Proxy Experts</h1>
</body>
</html>

h1 = driver.find_element_by_tag_name('h1')
h1 = driver.find_element_by_class_name('someclass')
h1 = driver.find_element_by_xpath('//h1')
h1 = driver.find_element_by_id('greatID')

You can also use the find_elements functions to return a list of matching elements:

all_links = driver.find_elements_by_tag_name('a')

Doing this gives you every anchor on the page. Some elements, however, are not easy to reach through an ID or class name; in those cases you need XPath.
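
For instance, a minimal sketch that prints the destination URL of every anchor collected above (get_attribute is covered in the WebElement section below):

for link in all_links:
    # read the href attribute of each anchor
    print(link.get_attribute('href'))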

XPath

XPath is a query language that helps you find an object in the DOM. It locates a node starting from the root element, using either a relative path or an absolute path. Example:

1. / : selects a node from the root. /html/body/div[1] will find the first div

2. // : selects nodes anywhere in the document, irrespective of their location. //form[1] will find the first form element

3. [@attributename='value'] : a predicate. It finds a specific node, or a node with a specific attribute value

//input[@name='email'] will find the first input element whose name is "email".

<html>
<body>
<div class="content-login">
<form id="loginForm">
<div>
<input type="text" name="email" value="Email Address:">
<input type="password" name="password" value="Password:">
</div>
<button type="submit">Submit</button>
</form>
</div>
</body>
</html>
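
Putting these predicates to work against the sample form above, a minimal sketch:

# relative path with a predicate: the email input inside the login form
email_input = driver.find_element_by_xpath("//form[@id='loginForm']//input[@name='email']")

# an absolute path reaches the same node, but breaks more easily if the layout changes
email_input = driver.find_element_by_xpath("/html/body/div/form/div/input[1]")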

WebElement

In Selenium, a WebElement represents an HTML element. The following are some of the most common actions; a short sketch combining them follows the list:

1. element.text (access the element's text)

2. element.click() (click on the element)

3. element.get_attribute('class') (access an attribute)

4. element.send_keys('mypassword') (send text to an input)
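
As a minimal sketch, here is how these actions combine to fill in the sample login form from the XPath section (the credentials are placeholders):

email = driver.find_element_by_xpath("//input[@name='email']")
email.send_keys('user@example.com')  # type into the email field
password = driver.find_element_by_xpath("//input[@name='password']")
password.send_keys('mypassword')  # type into the password field
driver.find_element_by_xpath("//button[@type='submit']").click()  # submit the form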

Solutions for Slow-Rendering Websites

Some websites rely heavily on JavaScript to render their content, often through many AJAX calls, which can make them tricky to scrape. This issue can be solved in either of the following ways:

1. time.sleep(ARBITRARY_TIME)

2. WebDriverWait()

Example

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

try:
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "mySuperId"))
    )
finally:
    driver.quit()

This way, WebDriverWait waits up to 10 seconds for the element to appear, raising a TimeoutException if it does not.
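
For comparison, a minimal sketch of the time.sleep approach from option 1: it is simpler but cruder, since it always pauses for the full duration whether or not the content has finished loading:

import time

driver.get('http://limeproxies.com/')
time.sleep(10)  # always waits the full 10 seconds
print(driver.page_source)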


Selenium vs Scraping Tools: Real-Time Crawler

If you want to learn web scraping, Selenium is a great option. It is best used together with BeautifulSoup, while also learning HTTP protocols, how data is exchanged between server and browser, and how cookies and headers work.
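
A minimal sketch of that combination: Selenium renders the page, then BeautifulSoup parses the resulting HTML (the h1 lookup assumes the page structure shown earlier):

from bs4 import BeautifulSoup
from selenium import webdriver

browser = webdriver.Firefox()
browser.get('http://limeproxies.com/')
# hand the rendered page source to BeautifulSoup for parsing
soup = BeautifulSoup(browser.page_source, 'html.parser')
print(soup.find('h1').text)
browser.quit()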

If you are looking for an easier way to perform web scraping, a variety of tools can help. Depending on the amount of data you wish to collect and your targets, a dedicated web scraping tool will save you both time and resources.

A real-time crawler is a tool that makes the web scraping process easier. Its two main functionalities are:

1. Data API: mainly for e-commerce and search engine websites; it returns the data in structured JSON format

2. HTML Crawler API: this functionality allows you to scrape most websites and returns the results as raw HTML

A real-time crawler is easy to integrate; here is an example of the process in Python:

import requests
from pprint import pprint

# Structure payload.
payload = {
    'source': 'universal',
    'url': 'https://stackoverflow.com/questions/tagged/python',
    'user_agent_type': 'desktop',
}

# Get response.
response = requests.request(
    'POST',
    'https://realtime.oxylabs.io/v1/queries',
    auth=('user', 'pass1'),
    json=payload,
)

# This will return the JSON response with results.
pprint(response.json())

Using a real-time crawler together with Selenium brings a lot of advantages, including:

1. Automated web scraping processes

2. Easy scraping

3. No need for extra coding

4. A built-in tool for proxy rotation

5. A guaranteed 100% success rate on delivered requests


Conclusion

Using Selenium for web scraping makes the job easier, especially if you are new and learning the basics. But even though Selenium web scraping is efficient, large-scale scraping may call for a ready-built tool to facilitate your data extraction process.

A real-time crawler is an example of such a tool, and in combination with Selenium you can expect great results.

About the author

Rachael Chapman

A complete gamer and a tech geek. She brings out all her thoughts and love in writing blogs on IoT, software, technology, etc.
