Aug 22, 2020
Imagine what you could do if you automated all the repetitive and boring activities you perform on the internet, like checking the first Google results for a given keyword every day, or downloading a bunch of files from different websites.
In this post you’ll learn to use Selenium with Python, a web scraping tool that simulates a user surfing the Internet. For example, you can use it to automatically run Google queries and read the results, log in to your social accounts, simulate a user to test your web application, and automate anything you find in your daily life that is repetitive. The possibilities are infinite! 🙂
*All the code in this post has been tested with Python 2.7 and Python 3.4.
Install and use Selenium
Selenium is a python package that can be installed via pip. I recommend that you install it in a virtual environment (using virtualenv and virtualenvwrapper).
To install selenium, you just need to type pip install selenium.
In this post we are going to initialize a Firefox driver; you can install Firefox by visiting its website. However, if you want to work with Chrome or IE instead, you can find more information here.
Once you have Selenium and Firefox installed, create a python file, selenium_script.py. We are going to initialize a browser using Selenium:
```python
import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException


def init_driver():
    driver = webdriver.Firefox()
    driver.wait = WebDriverWait(driver, 5)
    return driver


def lookup(driver, query):
    driver.get('http://www.google.com')
    try:
        box = driver.wait.until(EC.presence_of_element_located(
            (By.NAME, 'q')))
        button = driver.wait.until(EC.element_to_be_clickable(
            (By.NAME, 'btnK')))
        box.send_keys(query)
        button.click()
    except TimeoutException:
        print('Box or Button not found in google.com')


if __name__ == '__main__':
    driver = init_driver()
    lookup(driver, 'Selenium')
    time.sleep(5)
    driver.quit()
```
In the previous code:

- the function init_driver initializes a driver instance.
- creates the driver instance
- adds the WebDriverWait function as an attribute to the driver, so it can be accessed more easily. This function is used to make the driver wait a certain amount of time (here 5 seconds) for an event to occur.
- the function lookup takes two arguments: a driver instance and a query lookup (a string).
- it loads the Google search page
- it waits for the query box element to be located and for the button to be clickable. Note that we are using the WebDriverWait function to wait for these elements to appear.
- Both elements are located by name. Other options would be to locate them by ID, XPATH, TAG_NAME, CLASS_NAME, CSS_SELECTOR, etc. (see table below). You can find more information here.
- Next, it sends the query into the box element and clicks the search button.
- If either the box or button are not located during the time established in the wait function (here, 5 seconds), the TimeoutException is raised.
- the next statement is a conditional that is true only when the script is run directly. This prevents the next statements from running when this file is imported.
- it initializes the driver and calls the lookup function to look for “Selenium”.
- it waits for 5 seconds to see the results and quits the driver
Finally, run your code with:
Did it work? If you got an ElementNotVisibleException, keep reading!
How to catch an ElementNotVisibleException
Google search has recently changed so that initially, Google shows this page:
and when you start writing your query, the search button moves into the upper part of the screen.
Well, actually it doesn’t move. The old button becomes invisible and the new one visible (and thus the exception when you click the old one: it’s not visible to click!).
We can update the lookup function in our code so that it catches this exception:
```python
from selenium.common.exceptions import ElementNotVisibleException


def lookup(driver, query):
    driver.get('http://www.google.com')
    try:
        box = driver.wait.until(EC.presence_of_element_located(
            (By.NAME, 'q')))
        button = driver.wait.until(EC.element_to_be_clickable(
            (By.NAME, 'btnK')))
        box.send_keys(query)
        try:
            button.click()
        except ElementNotVisibleException:
            button = driver.wait.until(EC.visibility_of_element_located(
                (By.NAME, 'btnG')))
            button.click()
    except TimeoutException:
        print('Box or Button not found in google.com')
```
- the element that raised the exception, button.click(), is inside a try statement.
- if the exception is raised, we look for the second button, using visibility_of_element_located to make sure the element is visible, and then click this button.
- if at any time, some element is not found within the 5 second period, the TimeoutException is raised and caught by the two end lines of code.
- Note that the initial button name is “btnK” and the new one is “btnG”.
Method list in Selenium
To sum up, I’ve created a table with the main methods used here.
Note: it’s not a python file — don’t try to run/import it 🙂
```
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

# INITIALIZE A DRIVER
driver = webdriver.Firefox()
driver.wait = WebDriverWait(driver, 5)

# WAIT FOR ELEMENTS
from selenium.webdriver.support import expected_conditions as EC
element = driver.wait.until(
    EC.presence_of_element_located(locator)
    EC.element_to_be_clickable(locator)
    EC.visibility_of_element_located(locator)
)

# LOCATORS (each locator is a tuple, e.g. (By.NAME, 'q'))
(By.ID, 'id')
(By.NAME, 'name')
(By.XPATH, 'xpath')
(By.LINK_TEXT, 'link text')
(By.PARTIAL_LINK_TEXT, 'partial link text')
(By.TAG_NAME, 'tag name')
(By.CLASS_NAME, 'class name')
(By.CSS_SELECTOR, 'css selector')

# CATCH EXCEPTIONS
from selenium.common.exceptions import TimeoutException
from selenium.common.exceptions import ElementNotVisibleException
```
That’s all! Hope it was useful! 🙂
Don’t forget to share it with your friends!
Beginner's guide to web scraping with python's selenium
In the first part of this series, we introduced ourselves to the concept of web scraping using two python libraries, requests and BeautifulSoup, to achieve this task. The results were then stored in a JSON file. In this walkthrough, we'll tackle web scraping with a slightly different approach using the selenium python library. We'll then store the results in a CSV file using the pandas library.
The code used in this example is on github.
Why use selenium
Selenium is a framework designed to automate tests for web applications. You can write a python script to control browser interactions automatically, such as link clicks and form submissions. In addition to all this, selenium comes in handy when we want to scrape data from JavaScript-generated content on a webpage, that is, when the data shows up only after a number of ajax requests. Nonetheless, both BeautifulSoup and scrapy are perfectly capable of extracting data from a webpage. The choice of library boils down to how the data in that particular webpage is rendered.
Another problem one might encounter while web scraping is the possibility of your IP address being blacklisted. I partnered with Scraper API, a startup specializing in strategies that'll ease the worry of your IP address being blocked while web scraping. They utilize IP rotation so you can avoid detection, boasting over 20 million IP addresses and unlimited bandwidth.
In addition to this, they provide CAPTCHA handling for you, as well as a headless browser, so that you'll appear to be a real user and not get detected as a web scraper. For more on its usage, check out my post on web scraping with scrapy; although you can use it with both BeautifulSoup and selenium.
If you want more info, as well as an intro to the scrapy library, check out my post on the topic.
Using this scraper api link and the code lewis10, you'll get a 10% discount off your first purchase!
For additional resources to understand the selenium library and best practices, check out these articles by towards datascience and accordbox.

Setting up
We'll be using two python libraries: selenium and pandas. To install them, simply run pip install selenium pandas.
In addition to this, you'll need a browser driver to simulate browser sessions. Since I am on chrome, we'll be using that for the walkthrough.
Driver downloads
- Chrome.
Getting started
For this example, we'll be extracting data from quotes to scrape, which is specifically made to practise web scraping on. We'll then extract all the quotes and their authors and store them in a CSV file.
The code above imports the chrome driver and pandas libraries. We then make an instance of chrome by using driver = Chrome(webdriver). Note that the webdriver variable points to the driver executable we downloaded previously for our browser of choice. If you happen to prefer firefox, import Firefox from selenium.webdriver instead.
Main script
On close inspection of the site's URL, we'll notice that the pagination URL is http://quotes.toscrape.com/js/page/{{current_page_number}}/
where the last part is the current page number. Armed with this information, we can proceed to make a page variable to store the exact number of web pages to scrape data from. In this instance, we'll be extracting data from just 10 web pages in an iterative manner.
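Under that assumption, building the list of page URLs is a one-liner (pages and urls are illustrative names):

```python
# Number of web pages to scrape, as described above.
pages = 10

# One URL per page, following the site's pagination pattern.
urls = ["http://quotes.toscrape.com/js/page/{}/".format(n)
        for n in range(1, pages + 1)]
```

Iterating over urls and calling driver.get on each entry visits the ten pages in turn.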
The driver.get(url) command makes an HTTP GET request to our desired webpage. From here, it's important to know the exact number of items to extract from the webpage. From our previous walkthrough, we defined web scraping as
This is the process of extracting information from a webpage by taking advantage of patterns in the web page's underlying code.
We can use web scraping to gather unstructured data from the internet, process it and store it in a structured format.
On inspecting each quote element, we observe that each quote is enclosed within a div with the class name of quote. By running driver.find_elements_by_class_name('quote') we get a list of all elements within the page exhibiting this pattern.
Final step
To begin extracting the information from the webpages, we'll take advantage of the aforementioned patterns in the web pages underlying code.
We'll start by iterating over the quote elements; this allows us to go over each quote and extract a specific record. From the picture above, we notice that the quote text is enclosed within a span of class text, and the author within a small tag with a class name of author.
Finally, we store the quote_text and author variables in a tuple, which we proceed to append to the python list named total.
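The extraction loop described in the last few paragraphs can be sketched as a small function. The function name is illustrative; it assumes the selenium 3 find_element(s)_by_class_name API and the div.quote / span.text / small.author markup noted above:

```python
def scrape_page(driver, total):
    # Each quote sits in a div with class "quote".
    for quote in driver.find_elements_by_class_name('quote'):
        # The quote text is in a span of class "text",
        # the author in a small tag of class "author".
        quote_text = quote.find_element_by_class_name('text').text
        author = quote.find_element_by_class_name('author').text
        # Store the pair as a tuple in the running list.
        total.append((quote_text, author))
    return total
```

Calling this once per page URL accumulates every (quote, author) pair in total.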
Using the pandas library, we'll initiate a dataframe to store all the records (the total list) and specify the column names as quote and author. Finally, we export the dataframe to a CSV file, which we named quoted.csv in this case.
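Put together with pandas, that last step looks like this; the records below are placeholders standing in for the scraped tuples:

```python
import pandas as pd

# Placeholder records standing in for the scraped (quote, author) tuples.
total = [("A sample quote.", "Author One"),
         ("Another sample quote.", "Author Two")]

# Store every record in a dataframe with named columns...
df = pd.DataFrame(total, columns=['quote', 'author'])

# ...and export it to the CSV file used in this walkthrough.
df.to_csv('quoted.csv', index=False)
```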
Don't forget to close the chrome driver using driver.close().
Additional resources

1. Finding elements
You'll notice that I used the find_elements_by_class_name method in this walkthrough. This is not the only way to find elements. This tutorial by Klaus explains in detail how to use other selectors.
2. Video
If you prefer to learn using videos, this series by Lucid Programming was very useful to me: https://www.youtube.com/watch?v=zjo9yFHoUl8
3. Best practices while using selenium
4. Toptal's guide to modern web scraping with selenium
And with that, hopefully, you too can make a simple web scraper using selenium 😎.
If you enjoyed this post subscribe to my newsletter to get notified whenever I write new posts.
open to collaboration
I recently made a collaborations page on my website. Have an interesting project in mind, or want to fill a part-time role? You can now book a session with me directly from my site.
Thanks.
