
tokenization of html

 

HTML entity parsing

 

Scenario:

 

When rendering a textarea pre-filled with existing data, the usual approach is to run everything fetched from the database through escapeHtml:

 

<textarea>&lt;script&gt;if(a&amp;&amp;1)alert(1);&lt;/script&gt;</textarea>

 

This makes the page noticeably larger, especially for rich text (which contains many <, >, and & characters). The escaping is nonetheless required: otherwise an attacker could maliciously close the textarea tag and inject a script. Still, there is room to shrink the output further.
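
For reference, a minimal sketch of what the escapeHtml used above might look like (the function name comes from the post; the body is an assumption):

function escapeHtml(s) {
    // escape the three characters that matter inside a textarea
    return s.replace(/&/g, "&amp;")
            .replace(/</g, "&lt;")
            .replace(/>/g, "&gt;");
}

escapeHtml('<script>if(a&&1)alert(1);</script>');
// => '&lt;script&gt;if(a&amp;&amp;1)alert(1);&lt;/script&gt;'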

 

 

The spec:

 

Looking at how HTML parses the textarea tag and its content: textarea content is actually parsed as RCDATA (a series of tokenizer states), roughly described as follows (a small demo appears after the list):

 

1. On &, try as far as possible to resolve the entity reference to the character it represents

2. On <, if the next character is /, end the current tag

3. Otherwise, treat the character as textarea content
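
A quick way to see these rules in action (a hypothetical browser-console demo using the standard DOMParser API):

// entities are decoded (rule 1), while "<b>" stays literal text,
// because "<" is not followed by "/" (rules 2 and 3)
var doc = new DOMParser().parseFromString(
    "<textarea>&amp;gt; and <b> stay put</textarea>", "text/html");
console.log(doc.querySelector("textarea").value);
// => "&gt; and <b> stay put"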

 

Malicious tag closing happens in rule 2, so it is enough to break rule 2 by making sure < and / are never adjacent. When rendering the page on the server, do just this one small transformation beforehand and put the result inside the textarea:

 

"<a>x</>".replace(/<\//gi,"&lt;/")

 

And if users are allowed to enter strings such as &gt; that stand for entity references, & has to be replaced as well:

 

"<a>x</>".replace(/&/gi,"&amp;")

 

The final page weight shrinks considerably.
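
Putting the two replacements together (a sketch under the post's assumptions; note that the & replacement must run first, otherwise it would corrupt the &lt;/ inserted by the second step):

function escapeForTextarea(s) {
    return s.replace(/&/g, "&amp;")      // first: neutralize entity starts
            .replace(/<\//g, "&lt;/");   // then: keep "<" and "/" apart
}

escapeForTextarea('</textarea><script>alert(1);</script> a && b');
// => '&lt;/textarea><script>alert(1);&lt;/script> a &amp;&amp; b'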

 

Other types:

 

The tokenization part of the spec has parsing rules for other special types too, for example the ubiquitous script. In the XHTML era, the recommended style was:

 

<script type="text/javascript">
/* <![CDATA[ */
// TODO
/* ]]> */
</script> 

 

This is also supported in HTML. The CDATA parsing rule is simple: everything before ]]> counts as script content, and entity references are not decoded.

 

In HTML, though, the CDATA rule is not needed at all; script parsing is already a rule of its own:

 

1. Similar to RCDATA (so the code must not contain anything like "</script>"; write "<\/script>" instead, as shown in the sketch after the comment example below), except that entity references are not decoded

2. It also recognizes <!--, apparently a comment convention kept for compatibility with ancient browsers that did not support JS:

 

 

<script type="text/javascript">
<!-- // hide from really old browsers that no one uses anymore
// TODO
// -->
</script> 

 

Seen from today, this is completely unnecessary.
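
As a footnote to rule 1 above, here is the "<\/script>" escape in context (a hypothetical example): the HTML tokenizer never sees "</" in the source, while the JS string still contains "</script>".

<script type="text/javascript">
// a literal "</script>" inside the string would end the script element early
var markup = "<script>alert(1);<\/script>";
</script>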

 

 

 

 

Comments

#2 deng131 2011-09-24

#1 lifesinger 2011-08-30
+1, a very nice article.
