A regular expression that checks whether a string is a URL
and captures its different parts.
It was originally posted by John Gruber, then
extended by Tom Winzig into a version that notably supports
international domains.
# Single-line version:
(?i)\b(https?:\/{1,3})?((?:(?:[\w.\-]+\.(?:[a-z]{2,13})|(?<=http:\/\/|https:\/\/)[\w.\-]+)\/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’])|(?:(?<!@)(?:\w+(?:[.\-]+\w+)*\.(?:[a-z]{2,13})|(?:(?:[0-9](?!\d)|[1-9][0-9](?!\d)|1[0-9]{2}(?!\d)|2[0-4][0-9](?!\d)|25[0-5](?!\d))[.]?){4})\b\/?(?!@)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))*(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’])?))
# Commented multi-line version:
(?xi)
\b
(https?:\/{1,3})? # Capture $1: (optional) URL scheme, colon, and slashes
( # Capture $2: Entire matched URL (other than optional protocol://)
(?:
(?:
[\w.\-]+\. # looks like domain name
(?:[a-z]{2,13}) # ending in a TLD of 2 to 13 letters
| #
(?<=http:\/\/|https:\/\/)[\w.\-]+ # hostname preceded by http:// or https://
)
\/ # followed by a slash
)
(?: # One or more:
[^\s()<>{}\[\]]+ # Run of non-space, non-()<>{}[]
| # or
\([^\s()]*?\([^\s()]+\)[^\s()]*?\) # balanced parens, one level deep: (…(…)…)
|
\([^\s]+?\) # balanced parens, non-recursive: (…)
)+
(?: # End with:
\([^\s()]*?\([^\s()]+\)[^\s()]*?\) # balanced parens, one level deep: (…(…)…)
|
\([^\s]+?\) # balanced parens, non-recursive: (…)
| # or
[^\s`!()\[\]{};:'\".,<>?«»“”‘’] # not a space or one of these punct chars
)
| # OR, the following to match naked domains:
(?:
(?<!@) # not preceded by a @, avoid matching foo@_gmail.com_(?<![@.])
(?:
\w+
(?:[.\-]+\w+)*
\. # avoid matching the last two parts of an email domain like co.uk in person@amazon.co.uk
(?:[a-z]{2,13}) # ending in a TLD of 2 to 13 letters
| # or
(?:(?:[0-9](?!\d)|[1-9][0-9](?!\d)|1[0-9]{2}(?!\d)|2[0-4][0-9](?!\d)|25[0-5](?!\d))[.]?){4} # IPv4 address, as seen in https://stackoverflow.com/a/13166657/650558
)
\b
\/?
(?!@) # not followed by an @, avoid matching "foo.na" in "foo.na@example.com"
(?: # Zero or more:
[^\s()<>{}\[\]]+ # Run of non-space, non-()<>{}[]
| # or
\([^\s()]*?\([^\s()]+\)[^\s()]*?\) # balanced parens, one level deep: (…(…)…)
|
\([^\s]+?\) # balanced parens, non-recursive: (…)
)*
(?: # Optionally end with:
\([^\s()]*?\([^\s()]+\)[^\s()]*?\) # balanced parens, one level deep: (…(…)…)
|
\([^\s]+?\) # balanced parens, non-recursive: (…)
| # or
[^\s`!()\[\]{};:'\".,<>?«»“”‘’] # not a space or one of these punct chars
)?
)
)
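
The pattern above can be tried from Python roughly as follows. This is only a sketch under a few assumptions: the names `URL_PATTERN` and `find_urls` are illustrative, and the lookbehind `(?<=http:\/\/|https:\/\/)` has been split into two fixed-width lookbehinds, because Python's `re` module rejects lookbehinds whose alternatives differ in length. Everything else is meant to mirror the commented version.

```python
import re

# Adaptation of the commented pattern for Python's `re` module.
# Assumption: the variable-width lookbehind is split into two fixed-width ones.
URL_PATTERN = re.compile(
    r"""
    \b
    (https?:\/{1,3})?                          # group 1: optional scheme, colon, slashes
    (                                          # group 2: the rest of the URL
      (?:
        (?:
          [\w.\-]+\.(?:[a-z]{2,13})            # looks like a domain name
          |
          (?:(?<=http:\/\/)|(?<=https:\/\/))   # split lookbehind (Python adaptation)
          [\w.\-]+                             # bare hostname right after the scheme
        )
        \/                                     # followed by a slash
      )
      (?:                                      # one or more path chunks
        [^\s()<>{}\[\]]+
        | \([^\s()]*?\([^\s()]+\)[^\s()]*?\)
        | \([^\s]+?\)
      )+
      (?:                                      # must end with one of these
        \([^\s()]*?\([^\s()]+\)[^\s()]*?\)
        | \([^\s]+?\)
        | [^\s`!()\[\]{};:'\".,<>?«»“”‘’]
      )
      |                                        # or a naked domain / IPv4 address
      (?:
        (?<!@)
        (?:
          \w+(?:[.\-]+\w+)*\.(?:[a-z]{2,13})
          |
          (?:(?:[0-9](?!\d)|[1-9][0-9](?!\d)|1[0-9]{2}(?!\d)|2[0-4][0-9](?!\d)|25[0-5](?!\d))[.]?){4}
        )
        \b
        \/?
        (?!@)
        (?:                                    # zero or more path chunks
          [^\s()<>{}\[\]]+
          | \([^\s()]*?\([^\s()]+\)[^\s()]*?\)
          | \([^\s]+?\)
        )*
        (?:                                    # optional ending
          \([^\s()]*?\([^\s()]+\)[^\s()]*?\)
          | \([^\s]+?\)
          | [^\s`!()\[\]{};:'\".,<>?«»“”‘’]
        )?
      )
    )
    """,
    re.IGNORECASE | re.VERBOSE,
)


def find_urls(text):
    """Return a list of (scheme, rest) tuples; scheme is None for naked domains."""
    return [(m.group(1), m.group(2)) for m in URL_PATTERN.finditer(text)]


if __name__ == "__main__":
    print(find_urls("Visit https://example.com/path/page.html today."))
    # [('https://', 'example.com/path/page.html')]
    print(find_urls("foo@example.com"))
    # []
```

Group 1 holds the scheme when present and group 2 the remainder, so a naked domain such as `example.com` comes back with `None` as its first element.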

import re

MD_IMAGE_PATTERN = re.compile(
    r"""
    !\[                          # regular markdown image starter
    (?P<alt>.*?)                 # alternative text to the image
    \]\(                         # close alt text and start url
    (?P<url>.*?)                 # url of the image
    (?:\s*\"(?P<title>.*?)?\")?  # optional text to set a title
    \)                           # close of url-part parenthesis
    """,
    re.VERBOSE | re.MULTILINE,
)
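
As a quick illustration of how the named groups of the Markdown image pattern might be used, here is a small sketch. The helper name `extract_images` and the sample text are assumptions made for this example; the pattern itself is repeated only so the block runs on its own.

```python
import re

MD_IMAGE_PATTERN = re.compile(
    r"""
    !\[                          # regular markdown image starter
    (?P<alt>.*?)                 # alternative text to the image
    \]\(                         # close alt text and start url
    (?P<url>.*?)                 # url of the image
    (?:\s*\"(?P<title>.*?)?\")?  # optional text to set a title
    \)                           # close of url-part parenthesis
    """,
    re.VERBOSE | re.MULTILINE,
)


def extract_images(text):
    """Return (alt, url, title) for each Markdown image; title is None if absent."""
    return [
        (m.group("alt"), m.group("url"), m.group("title"))
        for m in MD_IMAGE_PATTERN.finditer(text)
    ]


if __name__ == "__main__":
    sample = (
        'Intro ![Logo](https://example.com/logo.png "Company logo") '
        "and ![Alt text](images/pic.jpg)"
    )
    for alt, url, title in extract_images(sample):
        print(alt, url, title)
```

Because the `alt` and `url` groups are lazy, each match stops at the first closing bracket or parenthesis that lets the rest of the pattern succeed, so URLs that themselves contain `)` would likely be cut short.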