urllib.parse

Parse (absolute and relative) URLs.

urlparse module is based upon the following RFC specifications.

RFC 3986 (STD 66): "Uniform Resource Identifier (URI): Generic Syntax" by
T. Berners-Lee, R. Fielding, and L. Masinter, January 2005.

RFC 2732: "Format for Literal IPv6 Addresses in URL's" by R. Hinden,
B. Carpenter, and L. Masinter, December 1999.

RFC 2396: "Uniform Resource Identifiers (URI): Generic Syntax" by
T. Berners-Lee, R. Fielding, and L. Masinter, August 1998.

RFC 2368: "The mailto URL scheme" by P. Hoffman, L. Masinter, and J. Zawinski,
July 1998.

RFC 1808: "Relative Uniform Resource Locators" by R. Fielding, UC Irvine, June
1995.

RFC 1738: "Uniform Resource Locators (URL)" by T. Berners-Lee, L. Masinter, and
M. McCahill, December 1994.

RFC 3986 is considered the current standard, and any future changes to
the urlparse module should conform to it.  The urlparse module is
currently not entirely compliant with this RFC because of de facto
parsing scenarios, and some parsing quirks from older RFCs are retained
for backward compatibility. The test cases in test_urlparse.py provide
a good indicator of parsing behavior.

Classes

DefragResult

count(self, value, /)

  Return number of occurrences of value.

encode(self, encoding='ascii', errors='strict')

geturl(self)

index(self, value, start=0, stop=9223372036854775807, /)

  Return first index of value.

  Raises ValueError if the value is not present.

fragment (tuple index 1)

  Fragment identifier separated from URL, that allows indirect identification of a
  secondary resource by reference to a primary resource and additional identifying
  information.

url (tuple index 0)

  The URL with no fragment identifier.

DefragResultBytes

count(self, value, /)

  Return number of occurrences of value.

decode(self, encoding='ascii', errors='strict')

geturl(self)

index(self, value, start=0, stop=9223372036854775807, /)

  Return first index of value.

  Raises ValueError if the value is not present.

fragment (tuple index 1)

  Fragment identifier separated from URL, that allows indirect identification of a
  secondary resource by reference to a primary resource and additional identifying
  information.

url (tuple index 0)

  The URL with no fragment identifier.

ParseResult

count(self, value, /)

  Return number of occurrences of value.

encode(self, encoding='ascii', errors='strict')

geturl(self)

index(self, value, start=0, stop=9223372036854775807, /)

  Return first index of value.

  Raises ValueError if the value is not present.

fragment (tuple index 5)

  Fragment identifier, that allows indirect identification of a secondary resource
  by reference to a primary resource and additional identifying information.

hostname (property)

netloc (tuple index 1)

  Network location where the request is made to.

params (tuple index 3)

  Parameters for last path element used to dereference the URI in order to provide
  access to perform some operation on the resource.

password (property)

path (tuple index 2)

  The hierarchical path, such as the path to a file to download.

port (property)

query (tuple index 4)

  The query component, that contains non-hierarchical data, that along with data
  in path component, identifies a resource in the scope of URI's scheme and
  network location.

scheme (tuple index 0)

  Specifies URL scheme for the request.

username (property)

ParseResultBytes

count(self, value, /)

  Return number of occurrences of value.

decode(self, encoding='ascii', errors='strict')

geturl(self)

index(self, value, start=0, stop=9223372036854775807, /)

  Return first index of value.

  Raises ValueError if the value is not present.

fragment (tuple index 5)

  Fragment identifier, that allows indirect identification of a secondary resource
  by reference to a primary resource and additional identifying information.

hostname (property)

netloc (tuple index 1)

  Network location where the request is made to.

params (tuple index 3)

  Parameters for last path element used to dereference the URI in order to provide
  access to perform some operation on the resource.

password (property)

path (tuple index 2)

  The hierarchical path, such as the path to a file to download.

port (property)

query (tuple index 4)

  The query component, that contains non-hierarchical data, that along with data
  in path component, identifies a resource in the scope of URI's scheme and
  network location.

scheme (tuple index 0)

  Specifies URL scheme for the request.

username (property)

Quoter

  A mapping from bytes (in range(0,256)) to strings.

  String values are percent-encoded byte values, unless the key < 128, and
  in the "safe" set (either the specified safe set, or default set).

clear(...)

  D.clear() -> None.  Remove all items from D.

copy(...)

  D.copy() -> a shallow copy of D.

fromkeys(iterable, value=None, /)

  Create a new dictionary with keys from iterable and values set to value.

get(self, key, default=None, /)

  Return the value for key if key is in the dictionary, else default.

items(...)

  D.items() -> a set-like object providing a view on D's items

keys(...)

  D.keys() -> a set-like object providing a view on D's keys

pop(...)

  D.pop(k[,d]) -> v, remove specified key and return the corresponding value.

  If key is not found, default is returned if given, otherwise KeyError is raised

popitem(self, /)

  Remove and return a (key, value) pair as a 2-tuple.

  Pairs are returned in LIFO (last-in, first-out) order.
  Raises KeyError if the dict is empty.

setdefault(self, key, default=None, /)

  Insert key with a value of default if key is not in the dictionary.

  Return the value for key if key is in the dictionary, else default.

update(...)

  D.update([E, ]**F) -> None.  Update D from dict/iterable E and F.
  If E is present and has a .keys() method, then does:  for k in E: D[k] = E[k]
  If E is present and lacks a .keys() method, then does:  for k, v in E: D[k] = v
  In either case, this is followed by: for k in F:  D[k] = F[k]

values(...)

  D.values() -> an object providing a view on D's values

default_factory (member of collections.defaultdict)

  Factory for default value called by __missing__().

_NetlocResultMixinStr

encode(self, encoding='ascii', errors='strict')

hostname (property)

password (property)

port (property)

username (property)

SplitResult

count(self, value, /)

  Return number of occurrences of value.

encode(self, encoding='ascii', errors='strict')

geturl(self)

index(self, value, start=0, stop=9223372036854775807, /)

  Return first index of value.

  Raises ValueError if the value is not present.

fragment (tuple index 4)

  Fragment identifier, that allows indirect identification of a secondary resource
  by reference to a primary resource and additional identifying information.

hostname (property)

netloc (tuple index 1)

  Network location where the request is made to.

password (property)

path (tuple index 2)

  The hierarchical path, such as the path to a file to download.

port (property)

query (tuple index 3)

  The query component, that contains non-hierarchical data, that along with data
  in path component, identifies a resource in the scope of URI's scheme and
  network location.

scheme (tuple index 0)

  Specifies URL scheme for the request.

username (property)

SplitResultBytes

count(self, value, /)

  Return number of occurrences of value.

decode(self, encoding='ascii', errors='strict')

geturl(self)

index(self, value, start=0, stop=9223372036854775807, /)

  Return first index of value.

  Raises ValueError if the value is not present.

fragment (tuple index 4)

  Fragment identifier, that allows indirect identification of a secondary resource
  by reference to a primary resource and additional identifying information.

hostname (property)

netloc (tuple index 1)

  Network location where the request is made to.

password (property)

path (tuple index 2)

  The hierarchical path, such as the path to a file to download.

port (property)

query (tuple index 3)

  The query component, that contains non-hierarchical data, that along with data
  in path component, identifies a resource in the scope of URI's scheme and
  network location.

scheme (tuple index 0)

  Specifies URL scheme for the request.

username (property)

Functions

clear_cache

clear_cache()

  Clear the parse cache and the quoters cache.

namedtuple

namedtuple(typename, field_names, *, rename=False, defaults=None, module=None)

  Returns a new subclass of tuple with named fields.

      >>> Point = namedtuple('Point', ['x', 'y'])
      >>> Point.__doc__                   # docstring for the new class
      'Point(x, y)'
      >>> p = Point(11, y=22)             # instantiate with positional args or keywords
      >>> p[0] + p[1]                     # indexable like a plain tuple
      33
      >>> x, y = p                        # unpack like a regular tuple
      >>> x, y
      (11, 22)
      >>> p.x + p.y                       # fields also accessible by name
      33
      >>> d = p._asdict()                 # convert to a dictionary
      >>> d['x']
      11
      >>> Point(**d)                      # convert from a dictionary
      Point(x=11, y=22)
      >>> p._replace(x=100)               # _replace() is like str.replace() but targets named fields
      Point(x=100, y=22)

parse_qs

parse_qs(qs, keep_blank_values=False, strict_parsing=False, encoding='utf-8', errors='replace', max_num_fields=None, separator='&')

  Parse a query given as a string argument.

  Arguments:

  qs: percent-encoded query string to be parsed.

  keep_blank_values: flag indicating whether blank values in
      percent-encoded queries should be treated as blank strings.
      A true value indicates that blanks should be retained as
      blank strings.  The default false value indicates that
      blank values are to be ignored and treated as if they were
      not included.

  strict_parsing: flag indicating what to do with parsing errors.
      If false (the default), errors are silently ignored.
      If true, errors raise a ValueError exception.

  encoding and errors: specify how to decode percent-encoded sequences
      into Unicode characters, as accepted by the bytes.decode() method.

  max_num_fields: int. If set, a ValueError is raised if more than
      max_num_fields fields are read by parse_qsl().

  separator: str. The symbol to use for separating the query arguments.
      Defaults to &.

  Returns a dictionary mapping each field name to a list of its values.
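
As an illustration of the keep_blank_values and max_num_fields arguments, here is a minimal sketch. The query string and variable names are made up for the example and are not part of the module.

```python
from urllib.parse import parse_qs

# Hypothetical query string with a repeated key and a blank value.
query = "name=Ada&lang=en&lang=fr&empty="

# By default blank values are dropped; repeated keys collect into one list.
default_result = parse_qs(query)

# keep_blank_values=True retains "empty" with a blank string value.
with_blanks = parse_qs(query, keep_blank_values=True)

# max_num_fields caps the number of fields; exceeding it raises ValueError.
try:
    parse_qs(query, max_num_fields=2)
    limit_exceeded = False
except ValueError:
    limit_exceeded = True

print(default_result)
print(with_blanks)
print(limit_exceeded)
```

Every value in the returned dictionary is a list, even for keys that appear only once.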

parse_qsl

parse_qsl(qs, keep_blank_values=False, strict_parsing=False, encoding='utf-8', errors='replace', max_num_fields=None, separator='&')

  Parse a query given as a string argument.

  Arguments:

  qs: percent-encoded query string to be parsed.

  keep_blank_values: flag indicating whether blank values in
      percent-encoded queries should be treated as blank strings.
      A true value indicates that blanks should be retained as blank
      strings.  The default false value indicates that blank values
      are to be ignored and treated as if they were not included.

  strict_parsing: flag indicating what to do with parsing errors. If
      false (the default), errors are silently ignored. If true,
      errors raise a ValueError exception.

  encoding and errors: specify how to decode percent-encoded sequences
      into Unicode characters, as accepted by the bytes.decode() method.

  max_num_fields: int. If set, a ValueError is raised if more than
      max_num_fields fields are read by parse_qsl().

  separator: str. The symbol to use for separating the query arguments.
      Defaults to &.

  Returns a list of (name, value) tuples.
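
A short sketch of how parse_qsl() preserves order and duplicates, and how the separator and strict_parsing arguments behave. The query strings are invented for the example.

```python
from urllib.parse import parse_qsl

# Order and repeated names are preserved; %20 is decoded to a space.
pairs = parse_qsl("a=1&b=hello%20world&a=2")

# A different separator can be requested explicitly.
semicolon_pairs = parse_qsl("a=1;b=2", separator=";")

# With strict_parsing=True a field without "=" raises ValueError.
try:
    parse_qsl("a=1&novalue", strict_parsing=True)
    strict_error = None
except ValueError as exc:
    strict_error = exc

print(pairs)
print(semicolon_pairs)
print(strict_error)
```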

quote

quote(string, safe='/', encoding=None, errors=None)

  quote('abc def') -> 'abc%20def'

  Each part of a URL, e.g. the path info, the query, etc., has a
  different set of reserved characters that must be quoted. The
  quote function offers a cautious (not minimal) way to quote a
  string for most of these parts.

  RFC 3986 Uniform Resource Identifier (URI): Generic Syntax lists
  the following (un)reserved characters.

      unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
      reserved      = gen-delims / sub-delims
      gen-delims    = ":" / "/" / "?" / "#" / "[" / "]" / "@"
      sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
                    / "*" / "+" / "," / ";" / "="

  Each of the reserved characters is reserved in some component of a URL,
  but not necessarily in all of them.

  The quote function %-escapes all characters that are neither in the
  unreserved chars ("always safe") nor the additional chars set via the
  safe arg.

  The default for the safe arg is '/'. The character is reserved, but in
  typical usage the quote function is being called on a path where the
  existing slash characters are to be preserved.

  Python 3.7 updates from using RFC 2396 to RFC 3986 to quote URL strings.
  Now, "~" is included in the set of unreserved characters.

  string and safe may be either str or bytes objects. encoding and errors
  must not be specified if string is a bytes object.

  The optional encoding and errors parameters specify how to deal with
  non-ASCII characters, as accepted by the str.encode method.
  By default, encoding='utf-8' (characters are encoded with UTF-8), and
  errors='strict' (unsupported characters raise a UnicodeEncodeError).
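
The effect of the safe and encoding arguments can be sketched as follows. The sample path and strings are illustrative only.

```python
from urllib.parse import quote

# Default safe='/' keeps slashes and escapes the space.
path = quote("/docs/my file.txt")

# With safe='' the slashes are escaped as well.
no_safe = quote("/docs/my file.txt", safe="")

# Non-ASCII characters are encoded with UTF-8 by default...
utf8 = quote("café")

# ...or with another codec when encoding is given.
latin1 = quote("café", encoding="latin-1")

print(path, no_safe, utf8, latin1)
```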

quote_from_bytes

quote_from_bytes(bs, safe='/')

  Like quote(), but accepts a bytes object rather than a str, and does
  not perform string-to-bytes encoding.  It always returns an ASCII string.

  quote_from_bytes(b'abc def?') -> 'abc%20def%3F'
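
A minimal sketch of quote_from_bytes() on raw bytes, including a byte that is not valid UTF-8. The input values are invented for the example.

```python
from urllib.parse import quote_from_bytes

# Escapes use uppercase hexadecimal digits.
escaped = quote_from_bytes(b"abc def?")

# Arbitrary bytes are escaped one by one; '/' stays because it is in safe.
raw = quote_from_bytes(b"\xff/path")

# Passing safe='' escapes the slash too.
no_slash = quote_from_bytes(b"a/b", safe="")

print(escaped, raw, no_slash)
```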

quote_plus

quote_plus(string, safe='', encoding=None, errors=None)

  Like quote(), but also replace ' ' with '+', as required for quoting
  HTML form values. Plus signs in the original string are escaped unless
  they are included in safe. It also does not have safe default to '/'.
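
To make the difference from quote() concrete, here is a small comparison on one made-up input string.

```python
from urllib.parse import quote, quote_plus

text = "a b+c/d"

# quote_plus: space becomes '+', the literal '+' and '/' are escaped.
form_value = quote_plus(text)

# quote: space becomes %20, '/' is kept because safe defaults to '/'.
path_value = quote(text)

print(form_value)
print(path_value)
```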

splitattr

splitattr(url)

splithost

splithost(url)

splitnport

splitnport(host, defport=-1)

splitpasswd

splitpasswd(user)

splitport

splitport(host)

splitquery

splitquery(url)

splittag

splittag(url)

splittype

splittype(url)

splituser

splituser(host)

splitvalue

splitvalue(attr)

to_bytes

to_bytes(url)

unquote

unquote(string, encoding='utf-8', errors='replace')

  Replace %xx escapes by their single-character equivalent. The optional
  encoding and errors parameters specify how to decode percent-encoded
  sequences into Unicode characters, as accepted by the bytes.decode()
  method.

  By default, percent-encoded sequences are decoded with UTF-8, and invalid
  sequences are replaced by a placeholder character.

  unquote('abc%20def') -> 'abc def'.
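
The default decoding and the placeholder behavior can be sketched like this. The inputs are illustrative.

```python
from urllib.parse import unquote

plain = unquote("abc%20def")

# A valid UTF-8 sequence decodes to the expected character.
utf8 = unquote("caf%C3%A9")

# %E9 alone is not valid UTF-8, so errors='replace' yields U+FFFD.
replaced = unquote("caf%E9")

# The same bytes decode cleanly when the right codec is named.
latin1 = unquote("caf%E9", encoding="latin-1")

print(plain, utf8, repr(replaced), latin1)
```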

unquote_plus

unquote_plus(string, encoding='utf-8', errors='replace')

  Like unquote(), but also replace plus signs by spaces, as required for
  unquoting HTML form values.

  unquote_plus('%7e/abc+def') -> '~/abc def'
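
A brief sketch contrasting unquote_plus() with unquote() on form-style input. The sample string is made up.

```python
from urllib.parse import unquote, unquote_plus

encoded = "first+name=Ada%2BLovelace"

# unquote_plus turns '+' into a space, then expands %2B into a literal '+'.
form_decoded = unquote_plus(encoded)

# unquote leaves '+' untouched.
plain_decoded = unquote(encoded)

print(form_decoded)
print(plain_decoded)
```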

unquote_to_bytes

unquote_to_bytes(string)

  unquote_to_bytes('abc%20def') -> b'abc def'.
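
As a minimal illustration, unquote_to_bytes() is useful when the escaped data is not text. The inputs here are invented.

```python
from urllib.parse import unquote_to_bytes

text_bytes = unquote_to_bytes("abc%20def")

# Escapes that do not form valid text are returned as raw bytes.
binary = unquote_to_bytes("%FF%00")

# A bytes argument is accepted as well.
from_bytes = unquote_to_bytes(b"a%2Fb")

print(text_bytes, binary, from_bytes)
```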

unwrap

unwrap(url)

  Transform a string like '<URL:scheme://host/path>' into 'scheme://host/path'.

  The string is returned unchanged if it's not a wrapped URL.
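
A small sketch of unwrap() on wrapped and unwrapped input. The URLs are placeholders.

```python
from urllib.parse import unwrap

# Angle brackets and the "URL:" prefix are removed.
wrapped = unwrap("<URL:http://example.com/path>")

# Angle brackets alone are also stripped.
bracketed = unwrap("<http://example.com/path>")

# A plain URL comes back as it was.
plain = unwrap("http://example.com/path")

print(wrapped, bracketed, plain)
```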

urldefrag

urldefrag(url)

  Remove any existing fragment from the URL.

  Returns a tuple of the defragmented URL and the fragment.  If
  the URL contains no fragment, the second element is the
  empty string.
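
The returned tuple can be unpacked directly or accessed by field name, as in this minimal sketch. The URLs are examples only.

```python
from urllib.parse import urldefrag

result = urldefrag("https://example.com/page?x=1#section-2")

# Named access and tuple unpacking both work on the DefragResult.
url, fragment = result

# geturl() reassembles the original string.
rebuilt = result.geturl()

# Without a fragment, the second element is the empty string.
no_fragment = urldefrag("https://example.com/page")

print(url, fragment, rebuilt, no_fragment)
```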

urlencode

urlencode(query, doseq=False, safe='', encoding=None, errors=None, quote_via=quote_plus)

  Encode a dict or sequence of two-element tuples into a URL query string.

  If any values in the query arg are sequences and doseq is true, each
  sequence element is converted to a separate parameter.

  If the query arg is a sequence of two-element tuples, the order of the
  parameters in the output will match the order of parameters in the
  input.

  The components of a query arg may each be either a string or a bytes type.

  The safe, encoding, and errors parameters are passed down to the function
  specified by quote_via (encoding and errors only if a component is a str).
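
The doseq and quote_via arguments can be illustrated with a short sketch. The parameter names and values are invented for the example.

```python
from urllib.parse import quote, urlencode

# A mapping is encoded in insertion order; spaces become '+' by default.
basic = urlencode({"q": "python url", "page": 2})

# doseq=True expands each sequence element into its own parameter.
multi = urlencode({"tag": ["a", "b"]}, doseq=True)

# A sequence of pairs keeps its order and allows repeated names.
ordered = urlencode([("b", "2"), ("a", "1"), ("b", "3")])

# quote_via=quote encodes spaces as %20 instead of '+'.
percent = urlencode({"q": "python url"}, quote_via=quote)

print(basic, multi, ordered, percent)
```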

urljoin

urljoin(base, url, allow_fragments=True)

  Join a base URL and a possibly relative URL to form an absolute
  interpretation of the latter.
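
A few typical resolution cases, as a minimal sketch. The base URL and references are placeholders.

```python
from urllib.parse import urljoin

base = "https://example.com/docs/guide/intro.html"

# A bare file name replaces the last path segment of the base.
sibling = urljoin(base, "setup.html")

# ".." climbs one directory before appending.
parent = urljoin(base, "../api/index.html")

# A leading slash replaces the whole path.
rooted = urljoin(base, "/about")

# A scheme-relative reference keeps only the base scheme.
other_host = urljoin(base, "//cdn.example.com/lib.js")

# An absolute URL wins outright.
absolute = urljoin(base, "https://other.example/x")

print(sibling, parent, rooted, other_host, absolute, sep="\n")
```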

urlparse

urlparse(url, scheme='', allow_fragments=True)

  Parse a URL into 6 components:

      <scheme>://<netloc>/<path>;<params>?<query>#<fragment>

  The result is a named 6-tuple with fields corresponding to the
  above. It is either a ParseResult or ParseResultBytes object,
  depending on the type of the url parameter.

  The username, password, hostname, and port sub-components of netloc
  can also be accessed as attributes of the returned object.

  The scheme argument provides the default value of the scheme
  component when no scheme is found in url.

  If allow_fragments is False, no attempt is made to separate the
  fragment component from the previous component, which can be either
  path or query.

  Note that % escapes are not expanded.
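
The six components and the netloc sub-components can be inspected as in this sketch. The URL, credentials, and host are made up.

```python
from urllib.parse import urlparse

parts = urlparse(
    "https://user:secret@www.example.com:8443/path/to/page;type=a?x=1&y=2#top"
)

# The six tuple fields, in order.
six = tuple(parts)

# Sub-components of netloc are exposed as attributes; port is an int.
login = (parts.username, parts.password, parts.hostname, parts.port)

# The scheme argument is only a default for URLs that lack a scheme.
defaulted = urlparse("//www.example.com/index.html", scheme="https")

# Percent escapes are left as they are.
escaped_path = urlparse("http://example.com/a%20b").path

print(six)
print(login)
print(defaulted.scheme, escaped_path)
```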

urlsplit

urlsplit(url, scheme='', allow_fragments=True)

  Parse a URL into 5 components:

      <scheme>://<netloc>/<path>?<query>#<fragment>

  The result is a named 5-tuple with fields corresponding to the
  above. It is either a SplitResult or SplitResultBytes object,
  depending on the type of the url parameter.

  The username, password, hostname, and port sub-components of netloc
  can also be accessed as attributes of the returned object.

  The scheme argument provides the default value of the scheme
  component when no scheme is found in url.

  If allow_fragments is False, no attempt is made to separate the
  fragment component from the previous component, which can be either
  path or query.

  Note that % escapes are not expanded.
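
A short sketch of how urlsplit() differs from urlparse(), plus the allow_fragments and bytes cases. The URLs are examples only.

```python
from urllib.parse import SplitResultBytes, urlsplit

# Unlike urlparse(), ';type=a' stays inside the path.
parts = urlsplit("http://example.com/path;type=a?x=1#frag")

# With allow_fragments=False the '#...' text remains in the query.
no_frag = urlsplit("http://example.com/path?x=1#frag", allow_fragments=False)

# A bytes URL produces a SplitResultBytes with bytes fields.
as_bytes = urlsplit(b"http://example.com/path")

print(parts)
print(no_frag)
print(as_bytes)
```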

urlunparse

urlunparse(components)

  Put a parsed URL back together again.  This may result in a
  slightly different, but equivalent URL, if the URL that was parsed
  originally had redundant delimiters, e.g. a ? with an empty query
  (the RFC states that these are equivalent).
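
A minimal sketch of rebuilding a URL from six components, including the redundant-delimiter case mentioned above. The URLs are placeholders.

```python
from urllib.parse import urlparse, urlunparse

# Build a URL from a plain 6-tuple (params left empty).
built = urlunparse(("https", "example.com", "/a", "", "q=1", "frag"))

# Change one component of a parsed URL and put it back together.
parts = urlparse("http://example.com/list?page=1")
next_page = urlunparse(parts._replace(query="page=2"))

# A trailing '?' with an empty query is dropped on the round trip.
normalized = urlunparse(urlparse("http://example.com/path?"))

print(built, next_page, normalized, sep="\n")
```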

urlunsplit

urlunsplit(components)

  Combine the elements of a tuple as returned by urlsplit() into a
  complete URL as a string. The components argument can be any five-item
  iterable. This may result in a slightly different, but equivalent URL,
  if the URL that was parsed originally had unnecessary delimiters (for
  example, a ? with an empty query; the RFC states that these are equivalent).
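
As a brief illustration, any five-item iterable works, and a split URL round-trips through urlunsplit(). The URLs are invented for the example.

```python
from urllib.parse import urlsplit, urlunsplit

# A list is accepted just like a tuple or a SplitResult.
from_list = urlunsplit(["https", "example.com", "/a", "q=1", "frag"])

# Splitting and recombining reproduces a well-formed URL.
round_trip = urlunsplit(urlsplit("http://example.com/a/b?x=1#top"))

# An empty query delimiter is not restored.
normalized = urlunsplit(urlsplit("http://example.com/path?"))

print(from_list, round_trip, normalized, sep="\n")
```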

Other members

MAX_CACHE_SIZE = 20

non_hierarchical = ['gopher', 'hdl', 'mailto', 'news', 'telnet', 'wais', 'imap', 'snews', 'sip', 'sips']

scheme_chars = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789+-.'

uses_fragment = ['', 'ftp', 'hdl', 'http', 'gopher', 'news', 'nntp', 'wais', 'https', 'shttp', 'snews', 'file', 'prospero']

uses_netloc = ['', 'ftp', 'http', 'gopher', 'nntp', 'telnet', 'imap', 'wais', 'file', 'mms', 'https', 'shttp', 'snews', 'prospero', 'rtsp', 'rtspu', 'rsync', 'svn', 'svn+ssh', 'sftp', 'nfs', 'git', 'git+ssh', 'ws', 'wss']

uses_params = ['', 'ftp', 'hdl', 'prospero', 'http', 'imap', 'https', 'shttp', 'rtsp', 'rtspu', 'sip', 'sips', 'mms', 'sftp', 'tel']

uses_query = ['', 'http', 'wais', 'imap', 'https', 'shttp', 'mms', 'gopher', 'rtsp', 'rtspu', 'sip', 'sips']

uses_relative = ['', 'ftp', 'http', 'gopher', 'nntp', 'imap', 'wais', 'file', 'https', 'shttp', 'mms', 'prospero', 'rtsp', 'rtspu', 'sftp', 'svn', 'svn+ssh', 'ws', 'wss']

Modules