whatwg:big5.git
Philip Jägenstedt [Wed, 18 Apr 2012 14:43:46 +0000 (16:43 +0200)]
Categorize Taiwan HKSCS vs UAO (with help from Yuan Chao)

http://lists.w3.org/Archives/Public/public-html-ig-zh/2012Apr/0044.html

Philip Jägenstedt [Wed, 18 Apr 2012 09:03:17 +0000 (11:03 +0200)]
Categorize Hong Kong HKSCS vs UAO (with help from Yuan Chao)

http://lists.w3.org/Archives/Public/public-html-ig-zh/2012Apr/0041.html

Philip Jägenstedt [Mon, 16 Apr 2012 21:23:37 +0000 (23:23 +0200)]
Categorize Taiwan HKSCS vs UAO

Philip Jägenstedt [Wed, 18 Apr 2012 06:51:33 +0000 (08:51 +0200)]
Taiwan HKSCS vs UAO files

117260 URLs generated by gen-url.py (tw-urls.txt)
114276 URLs successfully fetched by get-urls.py
34638 URLs identified as Big5 by make-json.py

In hkscs-vs-uao.py:
34638 URLs pass the "not labeled as HKSCS" test
539 URLs with ambiguous mappings found
345 URLs pass the "likely misencoding" test

Philip Jägenstedt [Mon, 16 Apr 2012 21:23:37 +0000 (23:23 +0200)]
Categorize Hong Kong HKSCS vs UAO

Philip Jägenstedt [Mon, 16 Apr 2012 19:59:45 +0000 (21:59 +0200)]
Hong Kong HKSCS vs UAO files

32242 URLs generated by gen-url.py (hk-urls.txt)
30911 URLs successfully fetched by get-urls.py
4627 URLs identified as Big5 by make-json.py

In hkscs-vs-uao.py:
4527 URLs pass the "not labeled as HKSCS" test
144 URLs with ambiguous mappings found
88 URLs pass the "likely misencoding" test

Philip Jägenstedt [Tue, 17 Apr 2012 09:04:06 +0000 (11:04 +0200)]
Increase "error tolerance" in HKSCS vs UAO analysis to 1%

This seems to include a few more legitimate examples.

Philip Jägenstedt [Mon, 16 Apr 2012 19:50:02 +0000 (21:50 +0200)]
Print HKSCS vs UAO summaries to separate files

This should make it slightly easier to categorize.

Philip Jägenstedt [Mon, 16 Apr 2012 19:46:49 +0000 (21:46 +0200)]
Delete analysis generated before recent script changes

Philip Jägenstedt [Mon, 16 Apr 2012 15:14:00 +0000 (17:14 +0200)]
Skip files labeled as HKSCS (only ambiguous Big5 is interesting)

Philip Jägenstedt [Mon, 16 Apr 2012 14:38:25 +0000 (16:38 +0200)]
Generalize tw-analyze.py to hkscs-vs-uao.py (process any srcdir)

Philip Jägenstedt [Mon, 16 Apr 2012 14:21:58 +0000 (16:21 +0200)]
Generalize tw-json.py to make-json.py (process any srcdir)

Philip Jägenstedt [Mon, 16 Apr 2012 11:38:34 +0000 (13:38 +0200)]
Generalize get-urls.py to allow separate output directories

Philip Jägenstedt [Mon, 16 Apr 2012 11:36:50 +0000 (13:36 +0200)]
Generalize the Alexa/Bing URL generator to make new .tw and .hk lists

Market=zh-TW was removed, which may affect the results.

Philip Jägenstedt [Mon, 16 Apr 2012 10:02:53 +0000 (12:02 +0200)]
Remove more misencoded nonsense

It appears that 's and linebreaks are particularly likely to result in
this nonsense, for reasons so far unknown.

Philip Jägenstedt [Mon, 16 Apr 2012 09:21:29 +0000 (11:21 +0200)]
Merge updated HKSCS vs UAO list with updated error counting

Philip Jägenstedt [Mon, 16 Apr 2012 09:12:20 +0000 (11:12 +0200)]
When analyzing, count decoder errors excluding the preserved points

Since Big5-UAO maps some bytes that Big5-HKSCS does not, UAO content
will produce more decoder errors using big5-index.txt. To avoid
excluding such content, count decoder errors after decoding with the
preserve list.

This produces a raw list of 427 pages; previously it was 294.
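
The counting rule described above can be sketched roughly as follows. This is only an illustration: `count_errors`, the set-based `index` and `preserved` arguments, and the toy byte pairs are all hypothetical and are not taken from the repository's scripts.

```python
def count_errors(pairs, index, preserved):
    """Count byte pairs that neither the index nor the preserve list covers."""
    errors = 0
    for pair in pairs:
        if pair in preserved:
            continue  # deliberately preserved (e.g. UAO-only), not an error
        if pair not in index:
            errors += 1
    return errors

# Toy data, invented for illustration only.
index = {(0xA4, 0xA4)}        # pairs the index can decode
preserved = {(0x81, 0x40)}    # pairs kept via the preserve list
pairs = [(0xA4, 0xA4), (0x81, 0x40), (0xFF, 0xFF)]

errors_without_preserve = count_errors(pairs, index, set())
errors_with_preserve = count_errors(pairs, index, preserved)
```

With the preserve list applied, the preserved pair no longer counts against the page, so UAO content is less likely to be excluded by the error threshold.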

Philip Jägenstedt [Sun, 15 Apr 2012 17:11:40 +0000 (19:11 +0200)]
Manually remove 104 samples of obviously misencoded nonsense

Philip Jägenstedt [Sun, 15 Apr 2012 15:25:16 +0000 (17:25 +0200)]
294 URLs in need of analysis

Philip Jägenstedt [Sun, 15 Apr 2012 15:24:45 +0000 (17:24 +0200)]
Tweak analyze formatting to not (easily) exceed 80 columns

Philip Jägenstedt [Sun, 15 Apr 2012 14:29:57 +0000 (16:29 +0200)]
Correct constant, typo caused index mismatch

Philip Jägenstedt [Sun, 15 Apr 2012 10:26:37 +0000 (12:26 +0200)]
Improve decoder and analyze script to avoid false byte sequence matches

Philip Jägenstedt [Sun, 15 Apr 2012 06:23:49 +0000 (08:23 +0200)]
Generate list of ~300 pages with HKSCS vs UAO issues

Philip Jägenstedt [Sat, 14 Apr 2012 21:41:07 +0000 (23:41 +0200)]
Analysis script to print HKSCS vs UAO mappings in context

Philip Jägenstedt [Sat, 14 Apr 2012 17:23:48 +0000 (19:23 +0200)]
Call the index lookup count "indexed" for less confusion

Philip Jägenstedt [Sat, 14 Apr 2012 15:09:26 +0000 (17:09 +0200)]
Merge Decoder into big5.py (since it needs the index)

Philip Jägenstedt [Sat, 14 Apr 2012 14:57:52 +0000 (16:57 +0200)]
Escape non-ASCII URLs as %-encoded Big5-HKSCS before fetching
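
A minimal sketch of this kind of escaping, using Python 3's `urllib.parse.quote` with the standard `big5hkscs` codec. The helper name `escape_url`, the chosen set of safe characters, and the example URL are assumptions for illustration; the original script may have done this differently.

```python
from urllib.parse import quote

# ASCII URL delimiters (and existing %-escapes) are left untouched.
SAFE = ":/?#[]@!$&'()*+,;=%"

def escape_url(url):
    """Percent-encode non-ASCII characters as Big5-HKSCS bytes."""
    return quote(url, safe=SAFE, encoding="big5hkscs")

escaped = escape_url("http://example.tw/\u4e2d")  # U+4E2D is Big5 A4 A4
```

Here `escaped` is `http://example.tw/%A4%A4`, since U+4E2D is encoded as the bytes A4 A4.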

Philip Jägenstedt [Sat, 14 Apr 2012 09:49:43 +0000 (11:49 +0200)]
Per-spec decoder and script to generate per-URL JSON metadata

Philip Jägenstedt [Sat, 14 Apr 2012 07:11:06 +0000 (09:11 +0200)]
Split URLs by lines, not whitespace
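
To illustrate why this matters: a URL that contains a space is cut in two when splitting on whitespace, but survives when splitting on lines. The helper `read_urls` and the sample text are hypothetical.

```python
def read_urls(text):
    """Return one URL per non-empty line, keeping embedded spaces."""
    return [line.strip() for line in text.splitlines() if line.strip()]

sample = "http://a.example/x y\nhttp://b.example/\n"
by_lines = read_urls(sample)
by_whitespace = sample.split()  # breaks the first URL into two pieces
```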

Philip Jägenstedt [Fri, 13 Apr 2012 20:10:41 +0000 (22:10 +0200)]
print_uao_diff

Philip Jägenstedt [Fri, 13 Apr 2012 15:45:21 +0000 (17:45 +0200)]
Split make_ranges from print_undefined

Philip Jägenstedt [Fri, 13 Apr 2012 15:37:15 +0000 (17:37 +0200)]
Sync index-big5.txt with spec

http://dvcs.w3.org/hg/encoding/raw-file/tip/index-big5.txt

Matches big5-foolip except for 5 mappings.

Philip Jägenstedt [Fri, 13 Apr 2012 14:04:30 +0000 (16:04 +0200)]
Add script to get URLs from a list

Will be used to scrape tw-urls.txt

Philip Jägenstedt [Fri, 13 Apr 2012 09:50:46 +0000 (11:50 +0200)]
Generated ~120k .tw URLs with Alexa and Bing

Philip Jägenstedt [Thu, 12 Apr 2012 21:17:36 +0000 (23:17 +0200)]
Add script to generate a list of URLs with Alexa and Bing

Philip Jägenstedt [Thu, 12 Apr 2012 09:49:10 +0000 (11:49 +0200)]
Big5 and GB* URL lists extracted from dotnetdotcom.org

Thanks to annevk and zcorpan for extracting these for me!

Philip Jägenstedt [Mon, 9 Apr 2012 08:27:53 +0000 (10:27 +0200)]
Assert that Kangxi Radicals are not used (disabled)

Philip Jägenstedt [Sun, 8 Apr 2012 13:41:27 +0000 (15:41 +0200)]
Print reverse mappings

Philip Jägenstedt [Sun, 8 Apr 2012 12:10:02 +0000 (14:10 +0200)]
Don't contradict IE, except for F9FE => U+FFED

Philip Jägenstedt [Sun, 8 Apr 2012 11:03:43 +0000 (13:03 +0200)]
Print HKSCS-2008 normalized differences

Philip Jägenstedt [Sat, 7 Apr 2012 19:22:21 +0000 (21:22 +0200)]
Fix off-by-one error in get_bytes (trail 0xA1 became 0x7F)
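
For context, here is a sketch of the pointer-to-bytes arithmetic as the Encoding Standard defines it for Big5. Whether the repository's `get_bytes` has this exact shape is an assumption; the function body below is a reconstruction from the standard, not the original code.

```python
def get_bytes(pointer):
    """Map a Big5 index pointer to its (lead, trail) byte pair."""
    lead = pointer // 157 + 0x81
    trail = pointer % 157
    # Trail bytes skip the gap between 0x7E and 0xA1.
    offset = 0x40 if trail < 0x3F else 0x62
    return lead, trail + offset

first = get_bytes(0)         # (0x81, 0x40)
before_gap = get_bytes(0x3E) # (0x81, 0x7E)
after_gap = get_bytes(0x3F)  # (0x81, 0xA1)
```

Writing the comparison as `trail <= 0x3F` would yield 0x3F + 0x40 = 0x7F instead of 0xA1, which is consistent with the symptom named in the commit subject, though the actual cause in the original code is not shown here.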

Philip Jägenstedt [Sat, 7 Apr 2012 07:53:20 +0000 (09:53 +0200)]
Update index-big5.txt to match Big5-foolip

http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2012-April/035375.html

Philip Jägenstedt [Sat, 7 Apr 2012 07:50:53 +0000 (09:50 +0200)]
Compare with annevk's spec

http://dvcs.w3.org/hg/encoding/raw-file/tip/index-big5.txt

Philip Jägenstedt [Sat, 7 Apr 2012 07:41:41 +0000 (09:41 +0200)]
Internal renaming (spec->big5_foolip)

Philip Jägenstedt [Fri, 6 Apr 2012 20:13:53 +0000 (22:13 +0200)]
Support printing a pretty table of the full mapping

Philip Jägenstedt [Fri, 6 Apr 2012 19:55:10 +0000 (21:55 +0200)]
Manually copy the multi-code point mappings from HKSCS-2008

Philip Jägenstedt [Fri, 6 Apr 2012 19:48:01 +0000 (21:48 +0200)]
Manual mappings where browsers disagree

Philip Jägenstedt [Fri, 6 Apr 2012 19:30:07 +0000 (21:30 +0200)]
Ignore weird IE mappings to "?"

Philip Jägenstedt [Fri, 6 Apr 2012 18:49:23 +0000 (20:49 +0200)]
Simplify the assertions by special-casing the only exception

Philip Jägenstedt [Fri, 6 Apr 2012 18:38:00 +0000 (20:38 +0200)]
Print more details about undefined mappings

Philip Jägenstedt [Fri, 6 Apr 2012 18:20:30 +0000 (20:20 +0200)]
Simplify code with is_valid for checking PUA and U+FFFD
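
A check of this kind might look roughly like the sketch below. The name `is_valid` comes from the commit subject, but its signature and body here are assumptions; the three Private Use Area ranges are those defined by Unicode.

```python
# Unicode Private Use Areas: BMP PUA plus planes 15 and 16.
PUA_RANGES = (
    (0xE000, 0xF8FF),
    (0xF0000, 0xFFFFD),
    (0x100000, 0x10FFFD),
)

def is_valid(code_point):
    """Reject U+FFFD and any code point in a Private Use Area."""
    if code_point == 0xFFFD:
        return False
    return not any(lo <= code_point <= hi for lo, hi in PUA_RANGES)
```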

Philip Jägenstedt [Fri, 6 Apr 2012 15:34:41 +0000 (17:34 +0200)]
Print using the format of the HKSCS-2008 spec, e.g. C8F1

Philip Jägenstedt [Fri, 6 Apr 2012 13:35:57 +0000 (15:35 +0200)]
Print a readable summary of missing mappings

Philip Jägenstedt [Fri, 6 Apr 2012 11:55:34 +0000 (13:55 +0200)]
Clarify the logic for multiple code points in HKSCS-2008

Philip Jägenstedt [Fri, 6 Apr 2012 11:39:38 +0000 (13:39 +0200)]
Move the Big5 sanity check to later

Philip Jägenstedt [Fri, 6 Apr 2012 11:39:04 +0000 (13:39 +0200)]
Check all PUA ranges, not just the first

Philip Jägenstedt [Fri, 6 Apr 2012 10:23:03 +0000 (12:23 +0200)]
big5.py can now generate a hopefully compatible spec

18584 of 19782 mappings are defined (94%)
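
The total is consistent with a table of 126 lead bytes (0x81 to 0xFE) by 157 trail positions, the layout the Encoding Standard uses for Big5. Assuming that is where 19782 comes from, the figures can be checked as follows:

```python
LEADS = 0xFE - 0x81 + 1   # 126 possible lead bytes
TRAILS = 157              # trail positions per lead byte
total = LEADS * TRAILS    # 19782 pointers in the index
defined = 18584           # figure quoted in the commit message

summary = f"{defined} of {total} mappings are defined ({defined / total:.0%})"
```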

Philip Jägenstedt [Fri, 6 Apr 2012 07:58:55 +0000 (09:58 +0200)]
HKSCS-2008 official mappings

http://www.ogcio.gov.hk/en/business/tech_promotion/ccli/download_area/mapping_table_2008.htm

Philip Jägenstedt [Thu, 5 Apr 2012 21:24:39 +0000 (23:24 +0200)]
Finished the analysis

Philip Jägenstedt [Thu, 5 Apr 2012 18:37:27 +0000 (20:37 +0200)]
version control all the things!