Python3 Encoding Decoding_壹加壹

http://blog.sina.com.cn/u/2600122173

Python3 Encoding Decoding

(2017-07-16 05:09:33)

Category: Python and Machine Learning


oath='2017'
print(type(oath))          # <class 'str'>

oath=oath.encode('utf-8')  # str -> bytes
print(type(oath))          # <class 'bytes'>
print(oath)                # b'2017'

oath=oath.decode()         # bytes -> str
print(oath)                # 2017


Output:

[Screenshot of the output in the original post]

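oath='三人行必有我师'
print(type(oath))          # <class 'str'>

oath=oath.encode('utf-8')  # str -> bytes, three bytes per Chinese character
print(type(oath))          # <class 'bytes'>
# prints b'\xe4\xb8\x89\xe4\xba\xba\xe8\xa1\x8c\xe5\xbf\x85\xe6\x9c\x89\xe6\x88\x91\xe5\xb8\x88'
print(oath)

oath=oath.decode()         # bytes -> str
print(oath)                # 三人行必有我师

Output:

[Screenshot of the output in the original post]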

oath='Pirates of the Caribbean'
print(type(oath))          # <class 'str'>

oath=oath.encode('utf-8')  # str -> bytes
print(type(oath))          # <class 'bytes'>
print(oath)                # b'Pirates of the Caribbean'

oath=oath.decode()         # bytes -> str
print(oath)                # Pirates of the Caribbean

Output:

[Screenshot of the output in the original post]

So encode() converts a string (str) into bytes, and decode() converts the bytes back into a str.
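As a small illustration of that round trip, the sketch below encodes the same string with two codecs and decodes each result with the codec that produced it. The variable names are only for illustration, and the choice of 'gbk' as the second codec is an assumption made to match the cp936 machine mentioned later in this post.

```python
# Round trip: str -> bytes -> str, naming the codec explicitly on both sides.
text = '中国'

utf8_bytes = text.encode('utf-8')   # 3 bytes per character for this text
gbk_bytes = text.encode('gbk')      # 2 bytes per character for this text

print(type(text), len(text))              # a str is counted in characters
print(type(utf8_bytes), len(utf8_bytes))  # bytes are counted in bytes
print(utf8_bytes)
print(gbk_bytes)

# Decoding with the codec that produced the bytes restores the original str.
print(utf8_bytes.decode('utf-8') == text)
print(gbk_bytes.decode('gbk') == text)
```

The same characters therefore have different byte representations depending on the codec, which is why the codec used for reading has to match the one used for writing.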

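oath='Pirates of the Caribbean'

f=open('stringtest.txt','r',encoding='utf8')  # stringtest.txt contains one line: Pirates of the Caribbean
s=f.readline().strip()
f.close()

if s==oath:
    print('equal')
else:
    print('not equal')

If stringtest.txt is saved as ANSI, the output is: equal

If stringtest.txt is saved as Unicode (in Windows Notepad this means UTF-16 LE with a BOM, hence the leading 0xff byte), an error is raised:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

If stringtest.txt is saved as UTF-8, the output is: not equal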
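One likely explanation for the last result is that the editor saved the file as UTF-8 with a byte order mark (BOM). The sketch below reproduces that situation under this assumption: Python's 'utf-8-sig' codec stands in for the editor, and the file is written to a temporary folder rather than the original stringtest.txt.

```python
import os
import tempfile

oath = 'Pirates of the Caribbean'

# Assumption: 'utf-8-sig' stands in for an editor that saves UTF-8 with a BOM.
path = os.path.join(tempfile.mkdtemp(), 'stringtest.txt')
with open(path, 'w', encoding='utf-8-sig') as f:
    f.write(oath + '\n')

# Reading with plain 'utf-8' keeps the BOM as an invisible first character;
# strip() does not remove it because U+FEFF is not whitespace.
with open(path, 'r', encoding='utf-8') as f:
    s_plain = f.readline().strip()

# Reading with 'utf-8-sig' drops a leading BOM if one is present.
with open(path, 'r', encoding='utf-8-sig') as f:
    s_sig = f.readline().strip()

print(repr(s_plain))     # the repr starts with '\ufeff'
print(s_plain == oath)   # False
print(s_sig == oath)     # True
```

If the file really does carry a BOM, opening it with encoding='utf-8-sig' is one way to make the comparison succeed; that codec also reads files without a BOM.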

Now change the string to Chinese text:

oath='中国'   # writing oath=u'中国' makes no difference
# Without the encoding argument, open() uses the encoding returned by
# locale.getpreferredencoding(); on my machine that is cp936, i.e. GBK.
# With f=open('stringtest.txt','r',encoding='utf-8') the behaviour changes
# again, see the next section.
f=open('stringtest.txt','r')
s=f.readline().strip()
f.close()

if s==oath:
    print('equal')
else:
    print('not equal')

If stringtest.txt is saved as ANSI, the output is: equal

If stringtest.txt is saved as Unicode, an error is raised:

UnicodeDecodeError: 'gbk' codec can't decode byte 0xff in position 0: illegal multibyte sequence

If stringtest.txt is saved as UTF-8, an error is raised:

UnicodeDecodeError: 'gbk' codec can't decode byte 0xbd in position 0: incomplete multibyte sequence
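To see where bytes such as 0xff and 0xbd come from, the sketch below builds the bytes each save format would plausibly contain for '中国' and tries to decode them as GBK. The byte layouts are assumptions about what the editor wrote (GBK for ANSI, UTF-16 LE with a BOM for Unicode, UTF-8 with a BOM for UTF-8); nothing here reads the original file.

```python
import codecs

text = '中国'

# Assumed byte layouts for the three save formats discussed above.
samples = {
    'ANSI (gbk)': text.encode('gbk'),
    'Unicode (UTF-16 LE + BOM)': codecs.BOM_UTF16_LE + text.encode('utf-16-le'),
    'UTF-8 + BOM': codecs.BOM_UTF8 + text.encode('utf-8'),
}

# Decode every sample as GBK and keep either the text or the error.
results = {}
for label, raw in samples.items():
    try:
        results[label] = raw.decode('gbk')
    except UnicodeDecodeError as exc:
        results[label] = exc
    # ascii() keeps the output printable on any console encoding
    print(label, raw[:3], '->', ascii(results[label]))
```

Under these assumptions only the GBK bytes decode cleanly. The UTF-16 sample starts with 0xff, which is not a valid GBK lead byte, and the UTF-8 sample has an odd number of high bytes, so its last byte (0xbd) is left without a partner.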

oath='中国'   # writing oath=u'中国' makes no difference
print('original value:',oath)
print('original type:',type(oath))

# oath=oath.encode('utf-8')
# print('value after encode:',oath)
# print('type after encode:',type(oath))
# oath=oath.decode()
# print('value after decode:',oath)
# print('type after decode:',type(oath))

f=open('stringtest.txt','r',encoding='utf-8')
s=f.readline().strip()
f.close()
print('value read from file:',s)
print('type read from file:',type(s))

if s==oath:
    print('equal')
else:
    print('not equal')

If stringtest.txt is saved as ANSI, an error is raised:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd6 in position 0: invalid continuation byte

If stringtest.txt is saved as Unicode, an error is raised:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

If stringtest.txt is saved as UTF-8, the output is:

[Screenshot of the output in the original post]

See? The two values look exactly the same, yet the program reports them as not equal.
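When two strings print identically but compare unequal, looking at their length and code points usually shows what differs. The sketch below does this for a string decoded from bytes that start with a UTF-8 BOM, which is an assumption about the file's content; describe() is a hypothetical helper written for this example.

```python
import codecs
import unicodedata

oath = '中国'
# Assumption: these are the bytes of a one-line UTF-8 file that starts with a BOM.
raw = codecs.BOM_UTF8 + oath.encode('utf-8')
s = raw.decode('utf-8').strip()

def describe(value):
    """Hypothetical helper: list each code point with its Unicode name."""
    return [(hex(ord(ch)), unicodedata.name(ch, '?')) for ch in value]

print(len(oath), len(s))   # 2 3
print(ascii(s))            # '\ufeff\u4e2d\u56fd'
print(describe(s)[0])      # ('0xfeff', 'ZERO WIDTH NO-BREAK SPACE')

# Removing the leading BOM character makes the comparison succeed.
print(s.lstrip('\ufeff') == oath)   # True
```

Printing with ascii() or repr() instead of print() alone is a quick way to expose invisible characters like this one.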

Looking further:

import locale

print(locale.getpreferredencoding())

oath='中国'   # writing oath=u'中国' makes no difference
print('original value:',oath)
print('original type:',type(oath))

oath1=oath.encode('utf-8')  # writing 'utf8' instead of 'utf-8' makes no difference
print('value after encode:',oath1)
print('type after encode:',type(oath1))

oath2=oath1.decode()
print('value after decode:',oath2)
print('type after decode:',type(oath2))

f=open('stringtest.txt','r',encoding='utf-8')
s=f.readline().strip()
f.close()
print('value read from file:',s)
print('type read from file:',type(s))

s1=s.encode('utf-8')
print('value after encode:',s1)
print('type after encode:',type(s1))

s2=s1.decode()
print('value after decode:',s2)
print('type after decode:',type(s2))

if oath2==s2:
    print('equal')
else:
    print('not equal')

When the text file is saved as UTF-8, the result is:

[Screenshot of the output in the original post]

Note: cp936 is GBK.

Another important finding: if the text file is created and written by Python itself, using open(filename,'w',encoding='utf-8') and write(), the final result is equal. Equal! Equal!
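A plausible reason is that Python's 'utf-8' codec writes no byte order mark, while some editors do. The sketch below compares the raw bytes of two files written by Python, one with 'utf-8' and one with 'utf-8-sig'; first_bytes() is a hypothetical helper and the files live in a temporary folder.

```python
import codecs
import os
import tempfile

oath = '中国'
folder = tempfile.mkdtemp()

def first_bytes(encoding):
    """Hypothetical helper: write oath with `encoding`, return the raw file bytes."""
    path = os.path.join(folder, encoding + '.txt')
    with open(path, 'w', encoding=encoding) as f:
        f.write(oath)
    with open(path, 'rb') as f:   # binary mode: no decoding at all
        return f.read()

plain = first_bytes('utf-8')
with_bom = first_bytes('utf-8-sig')

print(plain)
print(with_bom)
print(plain.startswith(codecs.BOM_UTF8))      # False
print(with_bom.startswith(codecs.BOM_UTF8))   # True
```

Opening a file in 'rb' mode like this is also a direct way to check what an editor actually wrote at the start of stringtest.txt.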

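Summary: what a mess.


(1) When the text file was created or edited with an operating-system tool, the string read from it compared equal to a string defined directly in the code only when the file was saved as ANSI. Otherwise the read failed, or the strings looked equal but were not. For the UTF-8 case the probable cause is a byte order mark (BOM) that the editor writes at the start of the file and that the 'utf-8' codec returns as an invisible U+FEFF character.


(2) A text file must be created, read, and written with one consistent encoding. With encoding='utf-8', a file that does not exist yet is created as UTF-8 whether Chinese or English text is written to it, and reading it back with encoding='utf8' displays the content correctly. Reading it with a different encoding can raise an error or return garbled text.


(3) To use the ANSI code page, omit the encoding argument of open(); it then falls back to locale.getpreferredencoding(), which is cp936 on my machine. This does not carry over to encode() and decode(): called without arguments, they use UTF-8 in Python 3, not ANSI.


(4) So Python and operating-system editors evidently do not write text files in exactly the same way. A file that has been modified with an operating-system editor can differ, for example by a BOM, from one that Python wrote directly.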
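Points (2) and (3) can be checked with a short sketch. It prints the default that open() would use on the current machine, shows that encode() without arguments matches UTF-8, and then writes a file with one codec and reads it back with the same and with a different one. The temporary path and the choice of 'gbk' for writing are assumptions made for the example.

```python
import locale
import os
import tempfile

text = '中国'
path = os.path.join(tempfile.mkdtemp(), 'stringtest.txt')

# open() without encoding= falls back to the locale's preferred encoding
# (cp936 on the author's machine; it may differ elsewhere).
print(locale.getpreferredencoding(False))

# str.encode() without arguments uses UTF-8, whatever the locale is.
print(text.encode() == text.encode('utf-8'))   # True

# Same codec for writing and reading: the text round-trips.
with open(path, 'w', encoding='gbk') as f:
    f.write(text)
with open(path, 'r', encoding='gbk') as f:
    same_codec = f.read()
print(same_codec == text)   # True

# Different codec for reading: decoding fails for this content.
try:
    with open(path, 'r', encoding='utf-8') as f:
        f.read()
    mismatch_error = None
except UnicodeDecodeError as exc:
    mismatch_error = exc
print(mismatch_error)
```

Passing encoding= explicitly on every open() call avoids depending on the locale default, which varies between machines.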

