whatwg:big5.git
Philip Jägenstedt [Wed, 11 Sep 2013 15:58:55 +0000 (17:58 +0200)]
Implement the error handling suggested in bug 16771

https://www.w3.org/Bugs/Public/show_bug.cgi?id=16771

This produces the desired output for the invalid-trail/ samples.

Philip Jägenstedt [Wed, 11 Sep 2013 15:47:20 +0000 (17:47 +0200)]
Add a browser test page and the script that generated it

Philip Jägenstedt [Tue, 10 Sep 2013 20:52:29 +0000 (22:52 +0200)]
Categorize invalid trail bytes

Look at the errors in context to determine if it is best to
rewind (per current spec) or to just skip the byte.

orig: rewinding works best

skip: skipping the byte works best

either: it doesn't matter, either because it wasn't possible to
tell from context which is better or because of misencoding.

nova: misencoded junk from forum.nova.com.tw
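
The difference between the "orig" (rewind) and "skip" categories can be sketched as below. This is a minimal illustration, not the repository's decoder: the function name `decode` and the `strategy` parameter are hypothetical, the lead and trail byte ranges are the usual Big5 ones (lead 0x81-0xFE, trail 0x40-0x7E or 0xA1-0xFE), and valid pairs are handed to Python's built-in `big5hkscs` codec rather than to index-big5.txt.

```python
REPLACEMENT = "\ufffd"

def is_lead(b: int) -> bool:
    return 0x81 <= b <= 0xFE

def is_trail(b: int) -> bool:
    return 0x40 <= b <= 0x7E or 0xA1 <= b <= 0xFE

def decode(data: bytes, strategy: str = "rewind") -> str:
    """Decode Big5-like bytes; `strategy` picks the invalid-trail handling.

    "rewind": emit U+FFFD for the lead byte and re-process the bad trail byte.
    "skip":   emit U+FFFD and drop the bad trail byte as well.
    """
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            out.append(chr(b))
            i += 1
        elif is_lead(b) and i + 1 < len(data):
            if is_trail(data[i + 1]):
                # Assumption: Python's codec stands in for the real index.
                out.append(data[i:i + 2].decode("big5hkscs", errors="replace"))
                i += 2
            else:
                out.append(REPLACEMENT)
                i += 1 if strategy == "rewind" else 2
        else:
            # 0x80, 0xFF, or a lead byte at the very end of the input.
            out.append(REPLACEMENT)
            i += 1
    return "".join(out)

# A lead byte followed by "<" (not a valid trail byte):
sample = b"\xa4<b>"
print(repr(decode(sample, "rewind")))  # the "<" survives
print(repr(decode(sample, "skip")))    # the "<" is swallowed
```

With markup like this, rewinding keeps the tag intact, while skipping eats its first byte; in other contexts skipping may give the more readable result, which is what the categories above record.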

Philip Jägenstedt [Mon, 9 Sep 2013 09:02:56 +0000 (11:02 +0200)]
Run invalid-trail.py on hk-data and tw-data

http://html5.org/temp/hk-data.tar.gz (199M)
SHA1: 26b5af227bd0c72280aeeba39b22d712fa8d6cae

http://html5.org/temp/tw-data.tar.gz (708M)
SHA1: 555c3a9dce5f93d00e9ae47e901091f6140bce52
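
To check a downloaded archive against the SHA1 values above, something like the following should work. It is a small sketch using only `hashlib`; the helper name `sha1_of` and the local file names are assumptions, not part of the repository.

```python
import hashlib

def sha1_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA1 digest of a file, read in chunks to bound memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Expected digests copied from the commit message above.
EXPECTED = {
    "hk-data.tar.gz": "26b5af227bd0c72280aeeba39b22d712fa8d6cae",
    "tw-data.tar.gz": "555c3a9dce5f93d00e9ae47e901091f6140bce52",
}

def verify(path: str, name: str) -> bool:
    """Compare a local file (hypothetical path) against the expected digest."""
    return sha1_of(path) == EXPECTED[name]
```

For example, `verify("hk-data.tar.gz", "hk-data.tar.gz")` would return True only if the local copy matches the published digest.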

Philip Jägenstedt [Mon, 9 Sep 2013 09:02:18 +0000 (11:02 +0200)]
Add analysis script for invalid trail bytes

Philip Jägenstedt [Mon, 23 Apr 2012 11:46:46 +0000 (13:46 +0200)]
Sync index-big5.txt with spec

https://www.w3.org/Bugs/Public/show_bug.cgi?id=16822

Philip Jägenstedt [Sun, 22 Apr 2012 10:25:54 +0000 (12:25 +0200)]
Big5-2003 mapping

http://moztw.org/docs/big5/table/big5_2003-b2u.txt

These say the same, modulo formatting:

http://www.csie.ntu.edu.tw/~r92030/project/big5/big5uni.txt
http://opensource.apple.com/source/libiconv/libiconv-30/libiconv/tests/BIG5-2003.TXT

Philip Jägenstedt [Sun, 22 Apr 2012 07:21:39 +0000 (09:21 +0200)]
Compare Firefox to UAO (10 differences)

Firefox uses "DINGBAT NEGATIVE CIRCLED SANS-SERIF DIGIT *" while UAO
uses "DINGBAT NEGATIVE CIRCLED DIGIT *".

9841 =>
firefox: U+278A ➊
uao: U+2776 ❶
9842 =>
firefox: U+278B ➋
uao: U+2777 ❷
9843 =>
firefox: U+278C ➌
uao: U+2778 ❸
9844 =>
firefox: U+278D ➍
uao: U+2779 ❹
9845 =>
firefox: U+278E ➎
uao: U+277A ❺
9846 =>
firefox: U+278F ➏
uao: U+277B ❻
9847 =>
firefox: U+2790 ➐
uao: U+277C ❼
9848 =>
firefox: U+2791 ➑
uao: U+277D ❽
9849 =>
firefox: U+2792 ➒
uao: U+277E ❾
984A =>
firefox: U+2793 ➓
uao: U+277F ❿
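
All ten differences above follow one pattern: the Firefox code point is the UAO code point plus 0x14. The sketch below rebuilds the list from that observation and checks the character names with `unicodedata`; the variable names are illustrative only.

```python
import unicodedata

FIRST_BYTES = 0x9841    # first differing Big5 byte pair
FIREFOX_START = 0x278A  # DINGBAT NEGATIVE CIRCLED SANS-SERIF DIGIT ONE
UAO_START = 0x2776      # DINGBAT NEGATIVE CIRCLED DIGIT ONE

rows = []
for n in range(10):
    firefox = FIREFOX_START + n
    uao = UAO_START + n
    rows.append((FIRST_BYTES + n, firefox, uao))
    print("%04X =>" % (FIRST_BYTES + n))
    print("firefox: U+%04X %s" % (firefox, chr(firefox)))
    print("uao: U+%04X %s" % (uao, chr(uao)))

# The names differ only by the "SANS-SERIF " qualifier.
for _, firefox, uao in rows:
    ff_name = unicodedata.name(chr(firefox))
    assert ff_name.replace("SANS-SERIF ", "") == unicodedata.name(chr(uao))
```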

Philip Jägenstedt [Sun, 22 Apr 2012 06:37:14 +0000 (08:37 +0200)]
Rename misencoded file names

7-Zip apparently used Windows-1252, but this is Big5.

The files still aren't useful, but at least pretty now.
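
The renaming amounts to undoing a wrong decode: turn the garbled name back into bytes with Windows-1252, then decode those bytes as Big5. A minimal sketch follows; the helper name `repair_name` is hypothetical, and Python's `cp1252` codec leaves a few bytes (such as 0x81 and 0x8D) undefined, so names containing those bytes would need different handling.

```python
def repair_name(garbled: str) -> str:
    """Recover a Big5 file name that was wrongly decoded as Windows-1252."""
    return garbled.encode("cp1252").decode("big5")

# Simulate the damage on a sample name, then undo it.
original = "\u4e2d\u6587"  # "Chinese" written in Chinese
garbled = original.encode("big5").decode("cp1252")
print(garbled)               # mojibake of Latin-1 range symbols
print(repair_name(garbled))  # the original name again
```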

Philip Jägenstedt [Sun, 22 Apr 2012 06:29:18 +0000 (08:29 +0200)]
Extract unicodeaton_250.exe (using 7-Zip on Windows 7)

Philip Jägenstedt [Sun, 22 Apr 2012 06:23:52 +0000 (08:23 +0200)]
Unicode 補完計畫 2.50 (unicodeaton_250.exe)

This appears to be the last release (2006-01-30) of Unicode 補完計畫.

Unfortunately, the download link in
http://www.cpatch.org/thread-6377-1-1.html is broken, so this is from
http://ftp.isu.edu.tw/pub/CPatch/patchutil/unicodeaton/unicodeaton_250.exe
via http://heartfullmoon.blogspot.se/2009/10/windows-7-unicode-250.html

The MD5/SHA1 match the blog post, so this is presumably the original.

Philip Jägenstedt [Wed, 18 Apr 2012 16:54:58 +0000 (18:54 +0200)]
Analysis of HKSCS vs UAO samples

Philip Jägenstedt [Wed, 18 Apr 2012 14:43:46 +0000 (16:43 +0200)]
Categorize Taiwan HKSCS vs UAO (with help from Yuan Chao)

http://lists.w3.org/Archives/Public/public-html-ig-zh/2012Apr/0044.html

Philip Jägenstedt [Wed, 18 Apr 2012 09:03:17 +0000 (11:03 +0200)]
Categorize Hong Kong HKSCS vs UAO (with help from Yuan Chao)

http://lists.w3.org/Archives/Public/public-html-ig-zh/2012Apr/0041.html

Philip Jägenstedt [Mon, 16 Apr 2012 21:23:37 +0000 (23:23 +0200)]
Categorize Taiwan HKSCS vs UAO

Philip Jägenstedt [Wed, 18 Apr 2012 06:51:33 +0000 (08:51 +0200)]
Taiwan HKSCS vs UAO files

117260 URLs generated by gen-url.py (tw-urls.txt)
114276 URLs successfully fetched by get-urls.py
34638 URLs identified as Big5 by make-json.py

In hkscs-vs-uao.py:
34638 URLs pass the "not labeled as HKSCS" test
539 URLs with ambiguous mappings found
345 URLs pass the "likely misencoding" test

Philip Jägenstedt [Mon, 16 Apr 2012 21:23:37 +0000 (23:23 +0200)]
Categorize Hong Kong HKSCS vs UAO

Philip Jägenstedt [Mon, 16 Apr 2012 19:59:45 +0000 (21:59 +0200)]
Hong Kong HKSCS vs UAO files

32242 URLs generated by gen-url.py (hk-urls.txt)
30911 URLs successfully fetched by get-urls.py
4627 URLs identified as Big5 by make-json.py

In hkscs-vs-uao.py:
4527 URLs pass the "not labeled as HKSCS" test
144 URLs with ambiguous mappings found
88 URLs pass the "likely misencoding" test

Philip Jägenstedt [Tue, 17 Apr 2012 09:04:06 +0000 (11:04 +0200)]
Increase "error tolerance" in HKSCS vs UAO analysis to 1%

This seems to include a few more legitimate examples.

Philip Jägenstedt [Mon, 16 Apr 2012 19:50:02 +0000 (21:50 +0200)]
Print HKSCS vs UAO summaries to separate files

This should make it slightly easier to categorize.

Philip Jägenstedt [Mon, 16 Apr 2012 19:46:49 +0000 (21:46 +0200)]
Delete analysis generated before recent script changes

Philip Jägenstedt [Mon, 16 Apr 2012 15:14:00 +0000 (17:14 +0200)]
Skip files labeled as HKSCS (only ambiguous Big5 is interesting)

Philip Jägenstedt [Mon, 16 Apr 2012 14:38:25 +0000 (16:38 +0200)]
Generalize tw-analyze.py to hkscs-vs-uao.py (process any srcdir)

Philip Jägenstedt [Mon, 16 Apr 2012 14:21:58 +0000 (16:21 +0200)]
Generalize tw-json.py to make-json.py (process any srcdir)

Philip Jägenstedt [Mon, 16 Apr 2012 11:38:34 +0000 (13:38 +0200)]
Generalize get-urls.py to allow separate output directories

Philip Jägenstedt [Mon, 16 Apr 2012 11:36:50 +0000 (13:36 +0200)]
Generalize the Alexa/Bing URL generator to make new .tw and .hk lists

Market=zh-TW was removed, which may affect the results.

Philip Jägenstedt [Mon, 16 Apr 2012 10:02:53 +0000 (12:02 +0200)]
Remove more misencoded nonsense

It appears that 's and linebreaks are particularly likely to result in
this nonsense, for reasons so far unknown.

Philip Jägenstedt [Mon, 16 Apr 2012 09:21:29 +0000 (11:21 +0200)]
Merge updated HKSCS vs UAO list with updated error counting

Philip Jägenstedt [Mon, 16 Apr 2012 09:12:20 +0000 (11:12 +0200)]
When analyzing, count decoder errors excluding the preserved points

Since Big5-UAO maps some bytes that Big5-HKSCS does not, UAO content
will produce more decoder errors using big5-index.txt. To avoid
excluding such content, count decoder errors after decoding with the
preserve list.

This produces a raw list of 427 pages; previously it was 294.

Philip Jägenstedt [Sun, 15 Apr 2012 17:11:40 +0000 (19:11 +0200)]
Manually remove 104 samples of obviously misencoded nonsense

Philip Jägenstedt [Sun, 15 Apr 2012 15:25:16 +0000 (17:25 +0200)]
294 URLs in need of analysis

Philip Jägenstedt [Sun, 15 Apr 2012 15:24:45 +0000 (17:24 +0200)]
Tweak analyze formatting to not (easily) exceed 80 columns

Philip Jägenstedt [Sun, 15 Apr 2012 14:29:57 +0000 (16:29 +0200)]
Correct constant, typo caused index mismatch

Philip Jägenstedt [Sun, 15 Apr 2012 10:26:37 +0000 (12:26 +0200)]
Improve decoder and analyze script to avoid false byte sequence matches

Philip Jägenstedt [Sun, 15 Apr 2012 06:23:49 +0000 (08:23 +0200)]
Generate list of ~300 pages with HKSCS vs UAO issues

Philip Jägenstedt [Sat, 14 Apr 2012 21:41:07 +0000 (23:41 +0200)]
Analysis script to print HKSCS vs UAO mappings in context

Philip Jägenstedt [Sat, 14 Apr 2012 17:23:48 +0000 (19:23 +0200)]
Call the index lookup count "indexed" for less confusion

Philip Jägenstedt [Sat, 14 Apr 2012 15:09:26 +0000 (17:09 +0200)]
Merge Decoder into big5.py (since it needs the index)

Philip Jägenstedt [Sat, 14 Apr 2012 14:57:52 +0000 (16:57 +0200)]
Escape non-ASCII URLs as %-encoded Big5-HKSCS before fetching
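
One way to do this escaping with the standard library is `urllib.parse.quote` with its `encoding` argument. The sketch below is an assumption about the approach, not the code in get-urls.py; the helper name `escape_url` and the choice of `safe` characters are illustrative, and non-ASCII host names (which would need IDNA rather than %-encoding) are not handled.

```python
from urllib.parse import quote

# Characters that keep their meaning as URL delimiters and must not be escaped.
# "%" is included so that already-escaped sequences are left alone.
SAFE = "/:?&=#%@+,;!$'()*[]~"

def escape_url(url: str) -> str:
    """Percent-encode non-ASCII characters in a URL as Big5-HKSCS bytes."""
    return quote(url, safe=SAFE, encoding="big5hkscs")

print(escape_url("http://example.tw/\u4e2d"))  # http://example.tw/%A4%A4
```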

Philip Jägenstedt [Sat, 14 Apr 2012 09:49:43 +0000 (11:49 +0200)]
Per-spec decoder and script to generate per-URL JSON metadata

Philip Jägenstedt [Sat, 14 Apr 2012 07:11:06 +0000 (09:11 +0200)]
Split URLs by lines, not whitespace
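
The distinction matters for URLs that contain a literal space: splitting on whitespace cuts such a URL in two, while splitting on lines keeps it whole. A small illustration, with a made-up helper name `read_urls` and made-up sample URLs:

```python
def read_urls(text: str) -> list:
    """Return one URL per non-empty line, keeping interior spaces."""
    return [line.strip() for line in text.splitlines() if line.strip()]

sample = "http://a.example.tw/x y\nhttp://b.example.tw/\n"

print(sample.split())     # three fragments: the first URL is broken at the space
print(read_urls(sample))  # two URLs, as intended
```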

Philip Jägenstedt [Fri, 13 Apr 2012 20:10:41 +0000 (22:10 +0200)]
print_uao_diff

Philip Jägenstedt [Fri, 13 Apr 2012 15:45:21 +0000 (17:45 +0200)]
Split make_ranges from print_undefined

Philip Jägenstedt [Fri, 13 Apr 2012 15:37:15 +0000 (17:37 +0200)]
Sync index-big5.txt with spec

http://dvcs.w3.org/hg/encoding/raw-file/tip/index-big5.txt

Matches big5-foolip except for 5 mappings.

Philip Jägenstedt [Fri, 13 Apr 2012 14:04:30 +0000 (16:04 +0200)]
Add script to get URLs from a list

Will be used to scrape tw-urls.txt

Philip Jägenstedt [Fri, 13 Apr 2012 09:50:46 +0000 (11:50 +0200)]
Generated ~120k .tw URLs with Alexa and Bing

Philip Jägenstedt [Thu, 12 Apr 2012 21:17:36 +0000 (23:17 +0200)]
Add script to generate a list of URLs with Alexa and Bing

Philip Jägenstedt [Thu, 12 Apr 2012 09:49:10 +0000 (11:49 +0200)]
Big5 and GB* URL lists extracted from dotnetdotcom.org

Thanks to annevk and zcorpan for extracting these for me!

Philip Jägenstedt [Mon, 9 Apr 2012 08:27:53 +0000 (10:27 +0200)]
Assert that Kangxi Radicals are not used (disabled)

Philip Jägenstedt [Sun, 8 Apr 2012 13:41:27 +0000 (15:41 +0200)]
Print reverse mappings

Philip Jägenstedt [Sun, 8 Apr 2012 12:10:02 +0000 (14:10 +0200)]
Don't contradict IE, except for F9FE => U+FFED

Philip Jägenstedt [Sun, 8 Apr 2012 11:03:43 +0000 (13:03 +0200)]
Print HKSCS-2008 normalized differences

Philip Jägenstedt [Sat, 7 Apr 2012 19:22:21 +0000 (21:22 +0200)]
Fix off-by-one error in get_bytes (trail 0xA1 became 0x7F)
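
The repository's get_bytes is not shown here, so the following is only a guess at the kind of bug involved. It uses the pointer-to-bytes arithmetic from the Encoding Standard's Big5 encoder (157 trail positions per lead byte, with a gap between 0x7E and 0xA1) and contrasts it with a hypothetical variant whose comparison is off by one. Both function names are made up for the illustration.

```python
def get_bytes(pointer: int) -> tuple:
    """Map an index pointer to (lead, trail) bytes."""
    lead = pointer // 157 + 0x81
    trail = pointer % 157
    # Trail offsets 0..62 map to 0x40..0x7E, 63..156 map to 0xA1..0xFE.
    offset = 0x40 if trail < 0x3F else 0x62
    return lead, trail + offset

def get_bytes_off_by_one(pointer: int) -> tuple:
    """Hypothetical buggy variant: `<=` instead of `<`."""
    lead = pointer // 157 + 0x81
    trail = pointer % 157
    offset = 0x40 if trail <= 0x3F else 0x62
    return lead, trail + offset

# Pointer 63 is the first trail position after the gap.
print("%02X %02X" % get_bytes(63))             # 81 A1
print("%02X %02X" % get_bytes_off_by_one(63))  # 81 7F
```

Under this assumption the boundary pointer is the only one affected per lead byte, which matches the symptom in the commit title: a trail byte that should be 0xA1 comes out as 0x7F.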

Philip Jägenstedt [Sat, 7 Apr 2012 07:53:20 +0000 (09:53 +0200)]
Update index-big5.txt to match Big5-foolip

http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2012-April/035375.html

Philip Jägenstedt [Sat, 7 Apr 2012 07:50:53 +0000 (09:50 +0200)]
Compare with annevk's spec

http://dvcs.w3.org/hg/encoding/raw-file/tip/index-big5.txt

Philip Jägenstedt [Sat, 7 Apr 2012 07:41:41 +0000 (09:41 +0200)]
Internal renaming (spec->big5_foolip)

Philip Jägenstedt [Fri, 6 Apr 2012 20:13:53 +0000 (22:13 +0200)]
Support printing a pretty table of the full mapping

Philip Jägenstedt [Fri, 6 Apr 2012 19:55:10 +0000 (21:55 +0200)]
Manually copy the multi-code point mappings from HKSCS-2008

Philip Jägenstedt [Fri, 6 Apr 2012 19:48:01 +0000 (21:48 +0200)]
Manual mappings where browsers disagree

Philip Jägenstedt [Fri, 6 Apr 2012 19:30:07 +0000 (21:30 +0200)]
Ignore weird IE mappings to "?"

Philip Jägenstedt [Fri, 6 Apr 2012 18:49:23 +0000 (20:49 +0200)]
Simplify the assertions by special-casing the only exception

Philip Jägenstedt [Fri, 6 Apr 2012 18:38:00 +0000 (20:38 +0200)]
Print more details about undefined mappings

Philip Jägenstedt [Fri, 6 Apr 2012 18:20:30 +0000 (20:20 +0200)]
Simplify code with is_valid for checking PUA and U+FFFD

Philip Jägenstedt [Fri, 6 Apr 2012 15:34:41 +0000 (17:34 +0200)]
Print using the format of the HKSCS-2008 spec, e.g. C8F1

Philip Jägenstedt [Fri, 6 Apr 2012 13:35:57 +0000 (15:35 +0200)]
Print a readable summary of missing mappings

Philip Jägenstedt [Fri, 6 Apr 2012 11:55:34 +0000 (13:55 +0200)]
Clarify the logic for multiple code points in HKSCS-2008

Philip Jägenstedt [Fri, 6 Apr 2012 11:39:38 +0000 (13:39 +0200)]
Move the Big5 sanity check to later

Philip Jägenstedt [Fri, 6 Apr 2012 11:39:04 +0000 (13:39 +0200)]
Check all PUA ranges, not just the first

Philip Jägenstedt [Fri, 6 Apr 2012 10:23:03 +0000 (12:23 +0200)]
big5.py can now generate a hopefully compatible spec

18584 of 19782 mappings are defined (94%)
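
The total of 19782 is consistent with a full two-byte table of 126 lead bytes (0x81-0xFE) times 157 trail bytes (0x40-0x7E and 0xA1-0xFE). That reading is an assumption about how big5.py counts, but the arithmetic can be checked directly:

```python
lead_bytes = 0xFE - 0x81 + 1                          # 126
trail_bytes = (0x7E - 0x40 + 1) + (0xFE - 0xA1 + 1)   # 63 + 94 = 157
total = lead_bytes * trail_bytes
defined = 18584                                       # from the commit message

coverage = round(100 * defined / total)
print("%d of %d mappings are defined (%d%%)" % (defined, total, coverage))
```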

Philip Jägenstedt [Fri, 6 Apr 2012 07:58:55 +0000 (09:58 +0200)]
HKSCS-2008 official mappings

http://www.ogcio.gov.hk/en/business/tech_promotion/ccli/download_area/mapping_table_2008.htm

Philip Jägenstedt [Thu, 5 Apr 2012 21:24:39 +0000 (23:24 +0200)]
Finished the analysis

Philip Jägenstedt [Thu, 5 Apr 2012 18:37:27 +0000 (20:37 +0200)]
version control all the things!