Programming has made it easy to deal with structured and unstructured textual data. Tools like regular expressions and external libraries make these tasks a lot easier.


You can use most languages, including Python and JavaScript, to validate URLs using a regular expression. This example regex isn’t perfect, but you can use it to check URLs for simple use cases.


The regex presented in this article is not perfect. Some valid URLs will fail this validation, including URLs that use IP addresses, non-ASCII characters, or protocols like FTP. The following regex only validates the most common URLs.

Regular Expression to Validate a URL


The regex will consider a URL valid if it satisfies the following conditions:

  1. The string should start with either http or https followed by ://.
  2. The combined length of the sub-domain and root domain must be between 2 and 256 characters. This part may contain only alphanumeric characters and a fixed set of special characters.
  3. The TLD (Top-Level Domain) must contain only lowercase alphabetic characters and be between two and six characters long.
  4. The rest of the URL string may contain alphanumeric characters and/or special characters, repeated zero or more times.

You can validate a URL in JavaScript using the following regular expression:

 ^(https?:\/\/)[-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%_\+.~#?&//=]*)$

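Similarly, you can use the following regex for URL validation in Python. It is shown here as the regex engine sees it; if you put it in a regular (non-raw) Python string literal, double each backslash:

 ^((http|https)://)[-a-zA-Z0-9@:%._\+~#?&//=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%._\+~#?&//=]*)$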

Where:

  • ((http|https)://) makes sure the string starts with either http or https followed by ://.
  • [-a-zA-Z0-9@:%._\+~#?&//=] is a set of alphanumeric characters and special characters. The first instance of this set defines the characters allowed in the sub-domain and root domain parts, while the second instance defines the characters allowed in the query string or subdirectory part.
  • {2,256} is a quantifier that repeats the preceding set between 2 and 256 times (both inclusive). This means the combined length of the subdomain and domain must be between two and 256 characters.
  • \. represents the dot character.
  • [a-z]{2,6} means any lowercase letters from a to z with a length between two and six. This represents the set of characters to allow in the top-level domain part.
  • \b represents the boundary of a word, i.e. the start of a word or the end of one.
  • * is a repetition operator that allows zero or more characters from the preceding set, so the path, query string, or parameters are optional.
  • ^ and $ indicate the start and end of the string respectively.
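
To make the breakdown above easier to follow, here is a minimal sketch that writes the same Python pattern with one commented piece per line. It assumes Python's re.VERBOSE flag, and the name URL_PATTERN is only illustrative:

```python
import re

# The Python pattern from this article, split into commented pieces.
# With re.VERBOSE, whitespace and "#" comments outside character classes
# are ignored; inside [...] a "#" is still a literal character.
URL_PATTERN = re.compile(
    r"""
    ^((http|https)://)                  # scheme: http:// or https://
    [-a-zA-Z0-9@:%._\+~#?&//=]{2,256}   # sub-domain and root domain, 2 to 256 characters
    \.                                  # literal dot before the TLD
    [a-z]{2,6}                          # TLD: two to six lowercase letters
    \b                                  # word boundary after the TLD
    ([-a-zA-Z0-9@:%._\+~#?&//=]*)       # optional path, query string, or parameters
    $                                   # end of the string
    """,
    re.VERBOSE,
)

for candidate in ["https://www.linkedin.com/", "http://apple"]:
    print(candidate, "->", "Valid" if URL_PATTERN.search(candidate) else "Not Valid")
```

The verbose form should behave the same as the single-line pattern; it only changes how the pattern is laid out in the source code.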

If you are uncomfortable with the above expression, check out a beginner’s guide to regular expressions first. Regular expressions take some time to get used to. Exploring some examples like validating user account details using regular expressions should help.

The above regex matches the following types of URLs:

  • https://www.something.com/
  • http://www.something.com/
  • https://www.something.edu.co.in
  • http://www.url-with-path.com/path
  • https://www.url-with-querystring.com/?url=has-querystring
  • http://url-without-www-subdomain.com/
  • https://mail.google.com
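
As a quick check, the sketch below runs the Python pattern against the URLs listed above and against a few URLs of the kinds this article says it does not cover. The helper name is_valid_url and the three rejected URLs are hypothetical examples chosen for illustration:

```python
import re

URL_REGEX = re.compile(
    r"^((http|https)://)[-a-zA-Z0-9@:%._\+~#?&//=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%._\+~#?&//=]*)$"
)

def is_valid_url(url):
    """Return True if the URL matches the article's pattern."""
    return URL_REGEX.search(url) is not None

# URLs from the list above
accepted = [
    "https://www.something.com/",
    "http://www.something.com/",
    "https://www.something.edu.co.in",
    "http://www.url-with-path.com/path",
    "https://www.url-with-querystring.com/?url=has-querystring",
    "http://url-without-www-subdomain.com/",
    "https://mail.google.com",
]

# Hypothetical URLs illustrating the known limitations
rejected = [
    "ftp://files.example.com/",   # protocol other than http or https
    "http://192.168.0.1/",        # IP address instead of a domain name
    "https://münchen.de/",        # non-ASCII characters in the domain
]

for url in accepted + rejected:
    print(f"{url}: {'Valid' if is_valid_url(url) else 'Not Valid'}")
```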

Using the Regular Expression in a Program

The code used in this project is available in a GitHub repository and is free for you to use under the MIT license.

This is a Python approach to validating a URL:

import re

def validateURL(url):
    # Raw string, so each backslash reaches the regex engine unchanged
    regex = r"^((http|https)://)[-a-zA-Z0-9@:%._\+~#?&//=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%._\+~#?&//=]*)$"
    r = re.compile(regex)

    # re.search() returns a match object, or None if there is no match
    if re.search(r, url):
        print("Valid")
    else:
        print("Not Valid")

url1 = "https://www.linkedin.com/"
validateURL(url1)
url2 = "http://apple"
validateURL(url2)
url3 = "iywegfuykegf"
validateURL(url3)
url4 = "https://w"
validateURL(url4)

This code uses Python’s re.compile() method to compile the regular expression pattern. This method accepts the regex pattern as a string parameter and returns a regex pattern object. This regex pattern object is further used to look for occurrences of the regex pattern inside the target string using the re.search() method.

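If it finds a match, the re.search() method returns a match object for the first one; otherwise it returns None. Because this pattern starts with ^ and ends with $, the whole string has to match. Note that if you want to search for all the matches to a pattern in the target string, you need to use the re.findall() method.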
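
The difference between re.search() and re.findall() can be sketched as follows. The unanchored variant of the pattern is an assumption made for this illustration: it drops ^ and $ and uses non-capturing groups so that findall() returns whole URLs. It is not part of the original regex.

```python
import re

# The anchored pattern from this article, as a raw string.
ANCHORED = re.compile(
    r"^((http|https)://)[-a-zA-Z0-9@:%._\+~#?&//=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%._\+~#?&//=]*)$"
)

# Assumption for illustration: same pattern without ^ and $, and with
# non-capturing groups so findall() returns full matches, not group tuples.
UNANCHORED = re.compile(
    r"(?:http|https)://[-a-zA-Z0-9@:%._\+~#?&//=]{2,256}\.[a-z]{2,6}\b[-a-zA-Z0-9@:%._\+~#?&//=]*"
)

# search() gives a match object for a valid URL, or None otherwise
match = ANCHORED.search("https://www.linkedin.com/")
print(match.group(0) if match else None)
print(ANCHORED.search("http://apple"))

# findall() collects every URL-like substring in a longer text
text = "See https://www.linkedin.com/ and http://www.something.com/path for details"
found = UNANCHORED.findall(text)
print(found)
```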

Running the above code will confirm that the first URL is valid but the rest of them are not.


Similarly, you can validate a URL in JavaScript using the following code:

function validateURL(url) {
    // Accept http:// or https://, then a domain, a TLD, and an optional path or query string
    if (/^(https?:\/\/)[-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%_\+.~#?&//=]*)$/.test(url)) {
        console.log('Valid');
    } else {
        console.log('Not Valid');
    }
}

validateURL("https://www.linkedin.com/");
validateURL("http://apple");
validateURL("iywegfuykegf");
validateURL("https://w");

Again, running this code will confirm that the first URL is valid and the rest of them are invalid. It uses the test() method of JavaScript's RegExp object to check the target string against the regular expression pattern.

Real-World Examples and Use Cases of URL Validation Using Regex

URL validation using regex can be crucial in several web development and data processing scenarios. Here are a few real-world examples and use cases:

  • Using regex, you can check that URLs submitted through a web form are in the expected format, which helps prevent errors and reduces the risk of security vulnerabilities. You can validate URLs in contact forms, user profile links, or input fields that require website URLs.
  • Regular expressions let you filter and extract specific URLs based on their patterns or domains. This ensures that you only collect relevant data from the desired sources and avoid processing irrelevant or potentially harmful URLs.
  • For apps dealing with large amounts of data containing URLs, such as social media platforms or content management systems, validating links is essential. Regular expressions can help identify broken or invalid URLs. This lets you perform link verification and take appropriate actions, such as removing or flagging invalid links.
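
As a rough sketch of the form-validation and link-flagging use cases above, the following code splits a batch of submitted links into those that pass the article's Python pattern and those to flag. The function name partition_links and the sample input are hypothetical. A format check like this does not confirm that a link actually resolves.

```python
import re

URL_REGEX = re.compile(
    r"^((http|https)://)[-a-zA-Z0-9@:%._\+~#?&//=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%._\+~#?&//=]*)$"
)

def partition_links(links):
    """Split links into (valid, invalid) lists using the article's regex."""
    valid, invalid = [], []
    for link in links:
        candidate = link.strip()  # ignore whitespace around form input
        if URL_REGEX.search(candidate):
            valid.append(candidate)
        else:
            invalid.append(candidate)
    return valid, invalid

# Hypothetical batch of user-submitted links
submitted = [
    "https://www.linkedin.com/",
    " https://mail.google.com ",
    "http://apple",
    "iywegfuykegf",
]

valid, invalid = partition_links(submitted)
print("Keep:", valid)
print("Flag:", invalid)
```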

Validate Important Data Using Regular Expressions

You can use regular expressions to search, match, or parse text. They are also used for natural language processing, pattern matching, and lexical analysis.

You can use this powerful tool to validate important types of data like credit card numbers, user account details, IP addresses, and more.
