Text below is to help with search indexing and copy and pasting, but it is missing some items from the Power Point slides.
Character Assassination:
Fun and games with Unicode
Adrian Crenshaw
About Adrian
I run Irongeek.com
I have an interest in InfoSec education
I don’t know everything - I’m just a geek with time on my hands
Sr. Information Security Engineer at a Fortune 1000
Co-Founder of Derbycon
http://www.derbycon.com/
To be clear concerning what this talk is about
Why this subject?
Lots of research has been done, but not many people talk about it
Complexity is the damnable enemy of security, but human language is complex so what can you do?
Act as a setup for future research
To encourage others who are better at exploit development than me to look into it
Because I wanted to make an animation with cartoon letters stabbing each other
Why Unicode
There are more than just English speakers out there
ASCII: American Standard Code for Information Interchange
What about other languages? Cyrillic, Chinese, Hebrew, Arabic, Klingon… ( ok, sort of http://wazu.jp/gallery/Test_Klingon.html )
Unicode lets computer systems support more languages, allowing for world wide use
Unicode History
ASCII is 7-bit with just 95 printable characters (counting space), but an 8th bit was added to make other standards:
Extended ASCII
ISO/IEC 8859
ISO/IEC 8859 uses the 8th bit to add another 96 printable characters (0xA0 to 0xFF), plus room for more control characters (0x80 to 0x9F)
You have to specify a part/character set/language to specify those 96
This still was not enough, and did not allow for a lot of mixed languages
The need was to represent all of the characters as unique code points, and not get confused amongst languages
Unicode History
Joe Becker (Xerox), Lee Collins & Mark Davis (Apple) started working on Unicode in 1987 to do this; version 1.0.0 was released in Oct 1991
Unicode started as a 16-bit character model (0x0-0xFFFF), with the first 256 code points the same as ISO-8859-1
Each character has a code point associated with it:
A = U+0041, $ = U+0024, ♞ = U+265E
This has since been expanded, so Unicode has points from 0x0 to 0x10FFFF (1,114,112 points dec), though support varies
Most used points will be in Basic Multilingual Plane (BMP) represented as U+0000 to U+FFFF
Encodings
UTF-8 (UCS Transformation Format 8-bit), meant to be backward compatible with ASCII
UTF-16 (Unicode Transformation Format 16-bit) which superseded UCS-2
UTF-32 (Unicode Transformation Format 32-bit )
BOM (Byte Order Marks)
UTF-8 may prepend EF BB BF to data (the BOM is optional)
UTF-16 FEFF Unicode Big Endian, FFFE Little Endian
UTF-32 generally does not use one
Encoding Examples
Omega U+03A9
AΩB
UTF-8
41 CE A9 42
UTF-16
00 41 03 A9 00 42
UTF-32
00 00 00 41 00 00 03 A9 00 00 00 42
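A quick way to double check byte sequences like these is to let Python do the encoding. This is only a sketch: the helper name hex_bytes is made up for illustration, and the big-endian codecs are picked so that no BOM gets added.

```python
def hex_bytes(text: str, encoding: str) -> str:
    """Return the encoded bytes of text as space-separated hex pairs."""
    return " ".join(f"{b:02X}" for b in text.encode(encoding))

sample = "A\u03a9B"  # A, Greek capital letter omega, B

utf8 = hex_bytes(sample, "utf-8")
utf16 = hex_bytes(sample, "utf-16-be")  # big endian, no BOM
utf32 = hex_bytes(sample, "utf-32-be")  # big endian, no BOM

for label, value in (("UTF-8", utf8), ("UTF-16", utf16), ("UTF-32", utf32)):
    print(f"{label:7} {value}")
```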
I hate Smart Quotes!
“Smart” "Not so smart" �Smart when dumb� Why?
Microsoft extended ISO 8859-1, making some control characters in 80 to 9F printable for Windows-1252
“ ” ‚ ‘ ’ —
93 94 82 91 92 97
If Windows-1252 is confused for ISO 8859-1, you get � for these characters
Makes copying and pasting commands from tutorials a pain!
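The mix-up can be reproduced in a few lines. A small sketch, assuming Python's "cp1252" and "latin-1" codecs stand in for Windows-1252 and ISO 8859-1; the variable names are just for illustration.

```python
# Left/right double quote, low single quote, left/right single quote, em dash
smart = "\u201c\u201d\u201a\u2018\u2019\u2014"

cp1252_bytes = smart.encode("cp1252")
cp1252_hex = " ".join(f"{b:02X}" for b in cp1252_bytes)

# Read the same bytes as ISO 8859-1: every one lands in the 80-9F control range
misread = cp1252_bytes.decode("latin-1")
all_controls = all(0x80 <= ord(ch) <= 0x9F for ch in misread)

print(cp1252_hex)
print("all unprintable in ISO 8859-1:", all_controls)
```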
Related:
Some email “J” (a smiley typed in Outlook shows up as a plain J once the Wingdings font is lost)
UTF-8 Encoding
Lower ASCII is the same in UTF-8; higher code points use a lead byte followed by continuation bytes (table bogarted from Wikipedia)
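The lead byte / continuation byte layout can be sketched by hand. This is a minimal illustration, not a production encoder: utf8_encode is a made-up name, and it skips the range and surrogate checks a real encoder needs.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point as UTF-8 (no validation, illustration only)."""
    if cp < 0x80:        # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

for cp in (0x41, 0x3A9, 0x270C, 0x1F600):
    print(f"U+{cp:04X} ->", " ".join(f"{b:02X}" for b in utf8_encode(cp)))
```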
UTF-16 Encoding
In UTF-16, U+10000 to U+10FFFF use surrogate pairs in the range 0xD800 to 0xDFFF
Steps
based on:
http://en.wikipedia.org/wiki/UTF-16
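The steps can be sketched in a few lines: subtract 0x10000 to get a 20-bit value, put the top 10 bits on 0xD800 (high surrogate, 0xD800 to 0xDBFF) and the bottom 10 bits on 0xDC00 (low surrogate, 0xDC00 to 0xDFFF). The function name to_surrogates is made up for illustration.

```python
import struct

def to_surrogates(cp: int) -> tuple:
    """Split a code point above the BMP into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only U+10000 to U+10FFFF need surrogates")
    value = cp - 0x10000              # 20-bit value
    high = 0xD800 + (value >> 10)     # top 10 bits
    low = 0xDC00 + (value & 0x3FF)    # bottom 10 bits
    return high, low

high, low = to_surrogates(0x1F600)
print(f"U+1F600 -> {high:04X} {low:04X}")

# Cross-check against Python's own UTF-16 encoder
same_as_codec = struct.pack(">HH", high, low) == chr(0x1F600).encode("utf-16-be")
```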
Mojibake!
Mojibake = "character" + "transform"
AΩB✌C
Code Points:
U+0041 U+03a9 U+0042 U+270C U+0043
UTF-8 byte string:
EF BB BF 41 CE A9 42 E2 9C 8C 43
Mangled by reading as just ISO 8859-1 bytes:
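ï»¿AÎ©BâC (bytes 9C and 8C land on invisible control characters in ISO 8859-1; read as Windows-1252 they would show as œŒ)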
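The mangling is easy to reproduce. A small sketch, assuming Python's "utf-8-sig" codec (which prepends the BOM) and "latin-1" for ISO 8859-1:

```python
text = "A\u03a9B\u270cC"            # A, omega, B, victory hand, C

raw = text.encode("utf-8-sig")      # UTF-8 with the EF BB BF BOM in front
raw_hex = " ".join(f"{b:02X}" for b in raw)

# Wrong guess: treat every byte as one ISO 8859-1 character
mangled = raw.decode("latin-1")

# The damage is reversible as long as the bytes were not altered
recovered = mangled.encode("latin-1").decode("utf-8-sig")

print(raw_hex)
print(repr(mangled))
```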
Find Your Character
Wikipedia List
https://en.wikipedia.org/wiki/List_of_Unicode_characters
Unicode Table
http://unicode-table.com/
File Format
http://www.fileformat.info/info/unicode/
Unicode Code Converter v7.05
http://rishida.net/tools/conversion/
Typing Unicode
Windows:
Hold Alt, press the + key on the keypad, then type the hex number
May have to edit HKEY_CURRENT_USER\Control Panel\Input Method and set EnableHexNumpad to "1".
Help from
http://www.fileformat.info/tip/microsoft/enter_unicode.htm
OS X
Option+Command+t will let you select some
System Preferences ->Language & Text->Input Sources
Enable “Unicode Hex Input”
Select U+ from the menu bar
Hold Option Key, type in Hex code
Obligatory XKCD Slide
Homoglyph/Visual Attacks
Confusables and Look-a-likes
Classic Phishing Obfuscations
Would you follow a link in email to AdriansHouseOfPwnage.com?
Text says one thing, link says another:
<a href="http://irongeek.com">http://www.microsoft.com</a>
Confuse user with credentials section of a URL:
http://www.microsoft.com@irongeek.com
Firefox pops up a warning
IE just refuses to connect
Other ideas?
Homographs
Homographs = words that look the same
Homoglyphs = characters that look the same
Examples:
rnicrosoft.com vs. microsoft.com
paypa1.com vs. paypal.com
IR0NGEEK.COM vs. IRONGEEK.COM
Now, what about Unicode?
Problem: DNS is ASCII
DNS labels (the parts separated by dots) follow the LDH rule:
Letters
Digits
Hyphen
This would not allow for international characters in DNS labels
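A rough LDH check might look like the sketch below. The 63 character limit and the "no hyphen at the start or end" rule are extra assumptions taken from the usual hostname rules, not from the slide, and is_ldh_hostname is a made-up name.

```python
import re

# One label: letters, digits, hyphen; 1-63 chars; no leading/trailing hyphen
LDH_LABEL = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")

def is_ldh_hostname(name: str) -> bool:
    """True if every dot-separated label follows the LDH rule."""
    return all(LDH_LABEL.fullmatch(label) is not None
               for label in name.split("."))

for host in ("irongeek.com", "caf\u00e9.com", "xn--caf-dma.com"):
    print(host, is_ldh_hostname(host))
```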
Enter Punycode and IDNA
IDNA
Internationalized Domain Names in Applications (IDNA) allows non-ASCII characters in the host section of a URL to map to DNS host names
café.com = xn--caf-dma.com
北京大学.中國 = xn--1lq90ic7fzpc.xn--fiqz9s
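Python ships a "punycode" codec and an "idna" codec, so the mapping can be tried out directly. This is a simplified sketch: to_ascii_label is a made-up helper that only adds the xn-- prefix and skips the character mapping and validation steps that real IDNA processing performs.

```python
def to_ascii_label(label: str) -> str:
    """Punycode-encode one DNS label, leaving pure ASCII labels alone."""
    if all(ord(ch) < 0x80 for ch in label):
        return label
    return "xn--" + label.encode("punycode").decode("ascii")

def to_ascii_hostname(hostname: str) -> str:
    return ".".join(to_ascii_label(part) for part in hostname.split("."))

cafe = to_ascii_hostname("caf\u00e9.com")
print(cafe)
print(to_ascii_hostname("\u5317\u4eac\u5927\u5b66.\u4e2d\u570b"))

# The built-in idna codec does the same job for simple names
builtin = "caf\u00e9.com".encode("idna").decode("ascii")
```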
What about Homoglyphs in Unicode?
There are homoglyphs in Unicode that look the same as normal Latin characters, and these could be used for spoofing names, examples:
googlе.com = xn--googl-3we.com
(е is a Cyrillic small letter ie U+0435)
іucu.org = xn--ucu-ihd.org
(і is Cyrillic small letter Byelorussian-Ukrainian i, U+0456)
pаypal.com = xn--pypal-4ve.com
(the 2nd character, а, is Cyrillic small letter a, U+0430)
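One cheap way to spot these by machine is to list every non-ASCII character in a host name along with its official Unicode name. A minimal sketch; non_ascii_report is a made-up name, and a real checker would more likely follow the confusables data from Unicode TR39.

```python
import unicodedata

def non_ascii_report(hostname: str) -> list:
    """List (position, code point, Unicode name) for each non-ASCII char."""
    return [(i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
            for i, ch in enumerate(hostname)
            if ord(ch) > 0x7F]

for host in ("googl\u0435.com", "\u0456ucu.org", "p\u0430ypal.com", "paypal.com"):
    print(host.encode("unicode_escape").decode("ascii"), non_ascii_report(host))
```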
Likely Sources for Homoglyphs
Cyrillic script: a, c, e, o, p, x and y
Latin alphabet appears twice, U+0021-007E (Basic Latin) & U+FF01-FF5E (Full width Latin):
!"$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
Even some slashes
/ (U+002F), ̸ (U+0338), ⁄ (U+2044), ∕ (U+2215), ╱ (U+2571), ／ (U+FF0F), ﾉ (U+FF89)
Slashes?
Can other domains be used?
www.microsoft.com⁄index.html.irongeek.com
Slash is U+2044
Mouse over it
Homoglyph Attack Generator
Demo
http://www.irongeek.com/homoglyph-attack-generator.php
Combination of JavaScript and PHP libraries created by phlyLabs as part of phlyMail
Protections Implemented by Browsers
Firefox shows Punycode if
Not in TLD White List (about:config→network.IDN.whitelist)
.ac, .ar, .asia, .at, .biz, .br, .cat, .ch, .cl, .cn, .de, .dk, .ee, .es, .fi, .gr,
.hu, .il, .info, .io, .ir, .is, .jp, .kr, .li, .lt, .lu, .lv, .museum, .no, .nu,
.nz, .org, .pl, .pr, .se, .sh, .si, .tel, .th, .tm, .tw, .ua, .vn, .xn--0zwm56d,
.xn--11b5bs3a9aj6g, .xn--80akhbyknj4f, .xn--90a3ac, .xn--9t4b11yi5a, .xn--deba0ad,
.xn--fiqs8s, .xn--fiqz9s, .xn--fzc2c9e2c, .xn--g6w251d, .xn--hgbk6aj7f53bba, .xn--hlcj6aya9esc7a,
.xn--j6w193g, .xn--jxalpdlp, .xn--kgbechtv, .xn--kprw13d, .xn--kpry57d, .xn--mgba3a4f16a,
.xn--mgba3a4fra, .xn--mgbaam7a8h, .xn--mgbayh7gpa, .xn--mgberp4a5d4a87g, .xn--mgberp4a5d4ar,
.xn--mgbqly7c0a67fbc, .xn--mgbqly7cvafr, .xn--o3cw4h, .xn--ogbpf8fl, .xn--p1ai,
.xn--wgbh1c, .xn--wgbl6a, .xn--xkc2al3hye2a, .xn--zckzah
network.IDN_show_punycode set to true (default false)
Any of these blacklisted characters appear:
¼½¾ǃː̷̸։׃״؉؊٪۔܁܂܃܄ᅟᅠ᜵ ․‧
‹›⁁⁄⁒ ⅓⅔⅕⅖⅗⅘⅙⅚⅛⅜⅝⅞⅟∕∶⎮╱⧶⧸⫻⫽⿰⿱⿲⿳⿴⿵⿶⿷⿸⿹⿺⿻ 。〔〕〳ㅤ㈝㈞㎮㎯㏆㏟꞉︔︕︿﹝﹞./。ᅠ�
Updated at
http://kb.mozillazine.org/Network.IDN.blacklist_chars
Protections Implemented by Browsers
IE 9 (and, I assume, 10) shows Punycode if:
There is a mismatch between the characters used in the URL and the language expectation
A character is not used in any language
There is a mixed set of scripts that do not belong together
Info may be out of date, most material references IE 7
http://msdn.microsoft.com/en-us/library/bb250505%28v=vs.85%29.aspx
Protections Implemented by Browsers
Chrome shows Punycode if
Configured language of the browser (configured in the “Fonts and Languages” options) does not match
Incompatible set of scripts that do not belong
But there is a whitelist, so hard-to-confuse script combinations, like Latin with Chinese, can be used together
Characters in a black list
Defenses by Registrar
Registrars may not allow the character
For example, one registrar gave the following error when an attempt was made to register іucu.org (Cyrillic small letter Byelorussian-Ukrainian i U+0456):
“Error: You used an invalid international character! Please note that for some reason .org and .info only support Danish, German, Hungarian, Icelandic, Korean, Latvian, Lithuanian, Polish, Spanish, and Swedish international characters.”
May be gotten around by using a slash homoglyph in a subdomain of a domain you already own; ノ Katakana Letter No (U+30CE) seems to work best
Approach
Used domain we control, and Local Hosts file to map the DNS entries
IE 10.0.8
FireFox 23.0.1
Chrome 28.0.1500.95 mg
Some Results
Other odd balls
іucu.org [xn--ucu-ihd.org] (Cyrillic і U+0456) could not be registered
These seemed to pass the Registrar’s tests:
Íucu.org [xn--ucu-2ia.org] (Latin capital letter I with acute, Í U+00CD)
íucu.org [xn--ucu-qma.org] (Latin small letter i with acute, í U+00ED)
įucu.org [xn--ucu-9ta.org] (Latin small letter i with ogonek, į U+012F)
ノ Katakana Letter No (U+30ce) seems to work in Firefox for subdomain trick, but not in Chrome or IE
Display of IDNA in Web Apps
What does the webapp display?
How does it parse links?
Test Strings
Ω U+03A9
http://Ω.com
ɡ U+0261
http://ɡoogle.com
http://ɡoogle.org
і U+0456
іucu.org
http://іucu.org
⁄ U+2044
http://www.microsoft.com⁄index.html.irongeek.com
http://www.microsoft.com⁄index.html.irongeek.org
Outlook 2010
Sent from Gmail to campus mail
Pink phishing warning that must be clicked past to use links
4th, 7th and 8th link had parse errors
Gmail
Sent from Outlook mail to Gmail
2nd and 3rd links used to have problem with ɡ (Latin small letter script G U+0261) but now work
4th link had problems with Cyrillic і (U+0456) if no http:// in front
7th and 8th link had parse errors because of ⁄ (fraction slash U+2044) and were split in two
Seemed to render all but the fourth link as it was inputted; Punycode versions show
іucu.org without the preceding http:// gave issues. Cyrillic і (U+0456) seemed to confuse the parser
The ⁄ (fraction slash U+2044) in the last two links seems to also cause no oddities
Twitter had the effect of rendering all of the URLs as a truncated, URL shortened (using t.co), Punycode version
Except іucu.org without the preceding http://. Again, the soft-dotted Cyrillic і (U+0456) seemed to confuse the parser.
Twitter makes it pretty obvious that there is something funny about the URLs
Fonts Matter
Calibri:
@dave_rel1k
@dave_reI1k
AΑᎪAaаaɑα
BΒВᏴᛒBbbЬßʙβ
CϹСᏟⅭC𐒨сcϲⅽc
Courier New:
@dave_rel1k
@dave_reI1k
AΑᎪAaаaɑα
BΒВᏴᛒBbbЬßʙβ
CϹСᏟⅭC𐒨сcϲⅽc
Ok, besides Homoglyphs?
Steganography
“Covered Writing”
Hide Text in text
Easy to detect by looking at the bytes, but may fool the human eye
Some examples look better than others, since Unicode support varies.
Can be used in Botnets:
http://www.irongeek.com/i.php?page=security/steganographic-command-and-control
Play with it here:
http://www.irongeek.com/i.php?page=security/unicode-steganography-homoglyph-encoder
Stego Examples
Alternate between Latin and Full-width Latin, easy, just add/subtract 65248 decimal. Use U+205F as space
This
is my
cover
text to
use. Do you think it will work? I
hope
that it
will.
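The full-width trick can be sketched as below. The bit scheme is an assumption for illustration (normal form = 0, full-width form = 1, one bit per printable ASCII character, spaces left alone), so it is not necessarily what the encoder linked above does; hide_bits and reveal_bits are made-up names.

```python
OFFSET = 0xFEE0  # 65248 decimal: U+0021..U+007E -> U+FF01..U+FF5E

def hide_bits(cover: str, bits: str) -> str:
    """Swap carrier characters to full width wherever the next bit is 1."""
    out, pending = [], list(bits)
    for ch in cover:
        if pending and 0x21 <= ord(ch) <= 0x7E:
            if pending.pop(0) == "1":
                ch = chr(ord(ch) + OFFSET)
        out.append(ch)
    if pending:
        raise ValueError("cover text too short for the message")
    return "".join(out)

def reveal_bits(stego: str, count: int) -> str:
    """Read back the first count bits from the carrier characters."""
    bits = []
    for ch in stego:
        if len(bits) == count:
            break
        cp = ord(ch)
        if 0x21 <= cp <= 0x7E:
            bits.append("0")
        elif 0xFF01 <= cp <= 0xFF5E:
            bits.append("1")
    return "".join(bits)

secret = "0100100001101001"  # the ASCII bits of "Hi"
stego = hide_bits("This is my cover text to use.", secret)
print(stego)
print(reveal_bits(stego, len(secret)))
```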
Use very close homoglyphs to encode single bits, skip if there are no close homoglyphs, use 8 types of space-like characters (U+0020, U+2004, U+2005, U+2006, U+2008, U+2009, U+202F, U+205F) to encode 3 bits each (000, 001, 010, 011, 100, 101, 110, 111)
Τhiѕ іѕ my cover tехt tο usе. Dο yοu thіnk іt wіll wοrk? I
һοре that
it will.
Use non-printable Tags in U+E0000 to U+E007F, also easy, just add/subtract 0xE0000
This is my cover text to use. Do you think it will work? I
hope that it will.
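The Tags approach is about as simple as it gets. A minimal sketch, assuming the hidden message is plain ASCII and is just appended to the cover text; to_tags and from_tags are made-up names.

```python
TAG_BASE = 0xE0000

def to_tags(secret: str) -> str:
    """Shift each ASCII character into the invisible Tags block."""
    if any(ord(ch) > 0x7F for ch in secret):
        raise ValueError("only ASCII fits in U+E0000 to U+E007F")
    return "".join(chr(TAG_BASE + ord(ch)) for ch in secret)

def from_tags(text: str) -> str:
    """Pull out any Tags characters and shift them back to ASCII."""
    return "".join(chr(ord(ch) - TAG_BASE)
                   for ch in text
                   if TAG_BASE <= ord(ch) <= TAG_BASE + 0x7F)

cover = "This is my cover text to use."
stego = cover + to_tags("It worked?")
print(len(cover), len(stego))   # the lengths differ even if it looks the same
print(from_tags(stego))
```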
Examples:
“It worked?”
Name Spoofing
IP Boards let me spoof Darren from Hak5’s screen name:
Darren Κitchen (U+039A Greek Capital Letter Kappa)
vs
Darren Kitchen
(Post count and admin status will give it away)
Twitter returned the error
“Invalid username! Alphanumerics only.”
Gmail/Google returned the error
“Please use only letters (a-z), numbers, and periods.” when non-ASCII characters
were attempted.
More research needs to be done in these areas.
Right to left?
Josh Kelley mentioned this one to me
What about left to right mixed with right to left scripts?
Uses U+202E (Right-to-Left Override); U+202C (Pop Directional Formatting) stops it
http://irongeek.com/moc.tfosorcim//:ptth
More details at:
http://digitalpbk.blogspot.com/2006/11/fun-with-unicode-and-mirroring.html
&
http://dl.packetstormsecurity.net/papers/general/righttoleften-override.pdf
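Building and spotting these strings is straightforward. A sketch under a couple of assumptions: rtl_wrap and has_bidi_controls are made-up names, and the list of control characters is the usual set of explicit bidi controls rather than anything from the slides.

```python
RLO = "\u202e"  # Right-to-Left Override
PDF = "\u202c"  # Pop Directional Formatting

# Explicit bidi embedding, override and isolate controls
BIDI_CONTROLS = {chr(cp) for cp in (*range(0x202A, 0x202F), *range(0x2066, 0x206A))}

def rtl_wrap(display_text: str) -> str:
    """Store text reversed so an RLO-aware renderer shows it forwards."""
    return RLO + display_text[::-1] + PDF

def has_bidi_controls(text: str) -> bool:
    """Cheap filter: flag any string carrying explicit bidi controls."""
    return any(ch in BIDI_CONTROLS for ch in text)

link = "http://irongeek.com/" + rtl_wrap("http://microsoft.com")
print(link.encode("unicode_escape").decode("ascii"))
print(has_bidi_controls(link), has_bidi_controls("http://irongeek.com/"))
```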
What about file names?
Just how they are displayed
Non Visual
http://www.unicode.org/reports/tr36/
UTF-8 Exploits
Text Comparison
Buffer Overflows
Property and Character Stability
Deletion of Code Points
Secure Encoding Conversion
Enabling Lossless Conversion
Canonicalization Errors?
Remember when the full width Latin forms were turned to normal Latin in the URL bar?
< or > filtered?
What if it also tries to canonicalize similar characters like < (U+003C), > (U+003E), ‹ (U+2039), › (U+203A), ﹤ (U+FE64), ﹥ (U+FE65), ＜ (U+FF1C), ＞ (U+FF1E) afterwards?
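The filter-then-canonicalize ordering problem can be shown with Python's unicodedata module. This is a toy sketch: naive_filter is a made-up stand-in for whatever filter an app uses, and NFKC stands in for whatever canonicalization happens afterwards.

```python
import unicodedata

def naive_filter(text: str) -> str:
    """Toy filter: strip the ASCII angle brackets only."""
    return text.replace("<", "").replace(">", "")

payload = "\uff1cscript\uff1e"        # full-width < and >

filtered = naive_filter(payload)      # nothing to strip, passes untouched
rendered = unicodedata.normalize("NFKC", filtered)

print(rendered)

# Normalizing first and filtering second closes this particular hole
safer = naive_filter(unicodedata.normalize("NFKC", payload))
```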
Other Transforms
Case changes
ß (U+00DF) upper case becomes SS
İ (U+0130) to lower case becomes i (U+0069)
ſ (U+017F) to upper becomes S (U+0053)
ẞ (U+1E9E) to lower becomes ß (U+00DF)
ı (U+0131) to upper becomes I (U+0049)
Apparently, locale matters too, French upper case may drop diacritics, Turkish handles “iIıİ” differently
http://www.w3.org/International/wiki/Case_folding
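Most of these can be checked straight from a Python prompt. A small sketch using Python's locale-independent str.upper() and str.lower(); other languages and locales may answer differently. For İ (U+0130) in particular, Python may return i followed by a combining dot (U+0307) rather than a bare i, so that one is printed rather than assumed.

```python
results = {
    "sharp s upper": "\u00df".upper(),       # U+00DF
    "long s upper": "\u017f".upper(),        # U+017F
    "capital sharp s lower": "\u1e9e".lower(),  # U+1E9E
    "dotless i upper": "\u0131".upper(),     # U+0131
}

for label, value in results.items():
    print(f"{label:22} -> {value!r} (length {len(value)})")

# Shown without an expected value: depends on the implementation
dotted_capital_i_lower = "\u0130".lower()
print([f"U+{ord(ch):04X}" for ch in dotted_capital_i_lower])
```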
UTF-8 Exploits
Overly long encoding, will it bypass filters?
<
< = 3C = 00111100
11000000 10111100 = C0 BC
>
> = 3E = 00111110
11000000 10111110 = C0 BE
a1 13 a1 03 a1 12 a1 09 a1 10 a1 14
MS00-057 was this problem, but with ../
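To see why overlong forms matter, compare a sloppy decoder with a strict one. A minimal sketch: naive_decode_2byte is a made-up example of a decoder that just masks and shifts without checking for the shortest form, while Python's own UTF-8 codec refuses the same bytes.

```python
def naive_decode_2byte(data: bytes) -> str:
    """Decode a 2-byte sequence by masking bits, with no shortest-form check."""
    lead, cont = data
    return chr(((lead & 0x1F) << 6) | (cont & 0x3F))

def strict_rejects(data: bytes) -> bool:
    """True if Python's UTF-8 decoder refuses the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False

for overlong in (b"\xc0\xbc", b"\xc0\xbe", b"\xc0\xaf"):
    print(overlong.hex(), repr(naive_decode_2byte(overlong)),
          "strict decoder rejects:", strict_rejects(overlong))
```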
Text Comparison
(Normalization)
Various characters have both their own code point, and can be made with “Combining” characters
Diacritical marks, for example: A (U+0041) followed by U+0300 (combining grave accent) = À, but À is also the single code point U+00C0
We want text searches to be equivalent,
NFKC - Normalization Form Compatibility Composition
"Ⓓⓔⓛⓔⓣⓔ" turns into "Delete".
Combining diacritical marks live in U+0300 to U+036F (the International Phonetic Alphabet uses many of them). Even more in U+1DC0 to U+1DFF
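Python's unicodedata.normalize shows both ideas: composed versus combining forms, and compatibility characters folding down to plain letters.

```python
import unicodedata

combining = "A\u0300"     # A + combining grave accent
precomposed = "\u00c0"    # the single code point for the same letter

print(combining == precomposed)   # raw comparison says they differ
nfc_equal = unicodedata.normalize("NFC", combining) == precomposed
nfd_equal = unicodedata.normalize("NFD", precomposed) == combining

circled = "\u24b9\u24d4\u24db\u24d4\u24e3\u24d4"   # circled D e l e t e
folded = unicodedata.normalize("NFKC", circled)
print(folded, folded.casefold())
```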
Real-life Example: Spotify
The canonical_username function was not “idempotent”: applying it a second time could give a different result than applying it once. A function like “toLower” would be idempotent.
A user signs up with username IronGeek, normalized to irongeek
Another user signs up as ᴵᴿᴼᴺᴳᴱᴱᴷ (U+1D35 U+1D3F U+1D3C U+1D3A U+1D33 U+1D31 U+1D31 U+1D37 in the Phonetic Extensions block)
Which gets normalized to IRONGEEK the first time, but irongeek the next time
ᴵᴿᴼᴺᴳᴱᴱᴷ requests a password reset email, and can use it to reset IronGeek’s account
Full story here:
http://labs.spotify.com/2013/06/18/creative-usernames/
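The same kind of bug is easy to reproduce. To be clear, the sketch below is NOT Spotify's code: sloppy_canonical is a hypothetical function (lower-case first, NFKC second) picked only because it shows the same "different answer the second time" behavior, and fixed_canonical is one possible repair.

```python
import unicodedata

def sloppy_canonical(name: str) -> str:
    """Hypothetical: lower-case, then NFKC. Not idempotent."""
    return unicodedata.normalize("NFKC", name.lower())

def fixed_canonical(name: str) -> str:
    """Hypothetical repair: NFKC, case fold, then NFKC again."""
    folded = unicodedata.normalize("NFKC", name).casefold()
    return unicodedata.normalize("NFKC", folded)

attacker = "\u1d35\u1d3f\u1d3c\u1d3a\u1d33\u1d31\u1d31\u1d37"

first = sloppy_canonical(attacker)    # modifier letters have no lower case
second = sloppy_canonical(first)      # now they are plain capitals
print(first, second)
print(fixed_canonical(attacker), fixed_canonical(fixed_canonical(attacker)))
```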
Thwart Searches/Obscenity Filters
What if you want to be public, but hard to search for?
What if you want to search for filtered words?
Classic example, no Unicode needed: pr0n
Porn != Pοrn != Pоrn
o=U+006f, ο=U+03bf, о=U+043e
Latin Small o, Greek Small Omicron, Cyrillic Small Letter o
Searches for the above turn up different results in Google
Some items with mixed scripts just get flagged as spam
Just plain fun too
Buffer Overflows
Some characters expand out when transformed (ß upper-cased becomes SS, for example)
Complexities With Buffer Overflows
Try to overwrite EIP with 0x41414141, you get 0x00410041
Chris Anley came up with “Venetian Shellcode”
Links:
http://www.ngssoftware.com/papers/unicodebo.pdf
https://www.corelan.be/index.php/2009/11/06/exploit-writing-tutorial-part-7-unicode-from-0x00410041-to-calc/
FX of Phenoelit also did some work on this
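Where the 0x00410041 comes from: when an ASCII payload gets widened to UTF-16 little endian, every byte picks up a 00 after it. A small sketch of the arithmetic only (nothing here touches a real process):

```python
import struct

payload = "AAAA"

narrow = payload.encode("ascii")       # 41 41 41 41
wide = payload.encode("utf-16-le")     # 41 00 41 00 41 00 41 00

# What a little-endian 32-bit register would hold from the first 4 bytes
eip_narrow = struct.unpack("<I", narrow[:4])[0]
eip_wide = struct.unpack("<I", wide[:4])[0]

print(f"ANSI buffer:    0x{eip_narrow:08X}")
print(f"Unicode buffer: 0x{eip_wide:08X}")
```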
Fuzzing
Suggestions:
Combining Diacritics
Invisible Characters
Malformed UTF-8
Bad Surrogate Pairs
Multiple levels of RTL, LTR reversing
Chris Weber’s Blog:
http://web.lookout.net/2011/06/special-unicode-characters-for-error.html
In recent news, Apple's CoreText API Bug:
سمَـَّوُوُحخ ̷̴̐خ
̷̴̐خ
̷̴̐خ
امارتيخ ̷̴̐خ
http://arstechnica.com/apple/2013/08/rendering-bug-crashes-os-x-and-ios-apps-with-string-of-arabic-characters/
&
MS13-060 Vulnerability in Unicode Scripts Processor Could Allow Remote Code Execution (2850869)
Big Thanks
J. Abolins
@jabolins
Chris Weber
@w3be http://www.casaba.com
Michal Zalewski
@lcamtuf
http://nostarch.com/tangledweb
William Coppola
@SubINacls
Useful Sites
Unicode Security Considerations
http://unicode.org/reports/tr36/
Unicode Security Mechanisms
http://www.unicode.org/reports/tr39/
Unicode Converter
http://www.rishida.net/tools/conversion/
Unicode Character Info and List
http://www.fileformat.info/
Homoglyph Attack Generator
http://www.irongeek.com/homoglyph-attack-generator.php
Unicode-HAX
https://github.com/cweb/unicode-hax
OWASP XSS Filter Evasion Cheat Sheet
https://www.owasp.org/index.php/XSS_Filter_Evasion_Cheat_Sheet
Fun
Unicode “Fonts”
http://www.panix.com/~eli/unicode/convert.cgi
Other Fun
http://txtn.us
Art
Hands are based on
http://www.newthinktank.com/2010/10/cartoon-hands/
Events
Derbycon
Sept 25th-29th 2013
http://www.derbycon.com
Others
Questions?
42
Twitter: @Irongeek_ADC