Word Frequency Statistics: Preprocessing
Published: 2019-06-21

I. English word frequency statistics

1. Download the lyrics of an English song or an English article

We all know that environment is so important to ourselves and our future generations.

Natural resources have been depleted in an unprecedented scale.
The environment has been polluted in a way that never happened before.
It is certain that the world and all the living organism on it are going straight to hell.
But why those in power, no matter how loud they speak out environmental protection, very few of them really care. The reason is simple. Human beings are greedy in nature. In ancient times, technology is lacking, human beings did not have the right tool to exploit the nature on large scale. With industrial revolution and the development of science and technology, these things can be achieved with relative ease. It can be said that the development of science can be a gospel and a curse on human race at the same time. It is more than certain that the world is going straight to hell. Climate change comes at an unprecedented rate. We can see all the polar ice sheet melt in our own lifetime. Cities by the sea will be flooded. Summer will get unbearably hot. Almost all the natural resources will be depleted. It is not that world leaders are unaware of this , but because of their greed no one is able to put the interest of the general public and future generations over their own pride. Development sounds an untouchable truth. Anything that comes in its way will be neglected. One thing that we never ponder is that the space and resources on this planet is limited which means that the raw material and space for development is also limited. Now matter how great and intelligent human beings might be, we have our own weakness.
The more intelligent a creature is, the more physically vulnerable it is.
With the worsening of the living environment, one can rarely predict that how many of us will eventually survive this unprecedented change. It is time for us to think whether we should live in a more environmentally friendly manner so that our offsprings will also have space and resources to live with or we just pamper ourselves to the extreme and forget about our future generation and the human race at large.

2. Replace all separators such as , . ? ! ' : with spaces

sep = ''':.,?!'''
for i in sep:
    article = article.replace(i, ' ')
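As an alternative to calling replace once per character, the same substitution can be done in a single pass with str.translate. This is only a minimal sketch: the helper name strip_separators and the sample sentence are illustrative assumptions, not part of the original code.

```python
def strip_separators(text, sep=":.,?!"):
    """Replace every character listed in `sep` with a space in one pass."""
    # str.maketrans with two equal-length strings maps each character
    # of the first string to the character at the same position in the second.
    table = str.maketrans(sep, " " * len(sep))
    return text.translate(table)


# Hypothetical sample input, for illustration only.
sample = "Hello, world! Is this OK? Yes: it is."
print(strip_separators(sample).split())
```

Splitting afterwards collapses the runs of spaces, so the extra blanks left by the replacement do no harm.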

3. Convert all uppercase letters to lowercase

article = article.lower()

4. Generate the word list

article_list = article.split()
print(article_list)

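5. Generate the word frequency counts

# Method 1: count by iterating over the set of unique words (left commented out)
# article_dict = {}
# article_set = set(article_list) - exclude  # drop duplicates and excluded words
# for w in article_set:
#     article_dict[w] = article_list.count(w)
# # Iterate over the dictionary
# for w in article_dict:
#     print(w, article_dict[w])

# Method 2: iterate over the list
article_dict = {}
for w in article_list:
    article_dict[w] = article_dict.get(w, 0) + 1
# Remove unwanted words; `exclude` is the set defined in step 7
for w in exclude:
    article_dict.pop(w, None)  # pop with a default avoids KeyError if w is absent
for w in article_dict:
    print(w, article_dict[w])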
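The counting loop in method 2 can also be expressed with collections.Counter from the standard library. The sketch below uses a made-up word list (an assumption for illustration) and checks that both approaches give the same counts.

```python
from collections import Counter

# Hypothetical word list standing in for article_list.
words = "the cat and the dog and the bird".split()

# Manual counting, as in method 2.
manual = {}
for w in words:
    manual[w] = manual.get(w, 0) + 1

# Counter builds the same word -> count mapping in one call.
counts = Counter(words)

print(dict(counts) == manual)
```

Counter is a dict subclass, so the rest of the steps (deleting keys, calling items()) work on it unchanged.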

6. Sort

dictList = list(article_dict.items())
dictList.sort(key=lambda x: x[1], reverse=True)

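7. Exclude grammatical words: pronouns, articles, conjunctions

exclude = {'the', 'to', 'is', 'and'}
for w in exclude:
    article_dict.pop(w, None)  # pop with a default avoids KeyError if w is absent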
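Deleting keys only affects the dictionary, so a list that was sorted before this step still contains the excluded words. One way to avoid that is to filter first and sort afterwards. The following is a sketch under that assumption; remove_stopwords and the sample counts are hypothetical names and data, not from the original post.

```python
def remove_stopwords(freq, stopwords):
    """Return a new dict without the stop words; absent stop words are ignored."""
    return {word: count for word, count in freq.items() if word not in stopwords}


# Hypothetical counts, for illustration only.
freq = {"the": 5, "world": 3, "is": 4, "hell": 2}
stopwords = {"the", "to", "is", "and"}  # 'to' and 'and' do not occur in freq

filtered = remove_stopwords(freq, stopwords)
# Sort after filtering so the ranking never contains excluded words.
ranked = sorted(filtered.items(), key=lambda x: x[1], reverse=True)
print(ranked)
```

Building a new dictionary also leaves the original counts intact, which is convenient when trying out different stop-word sets.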

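8. Output the top 20 words by frequency

# Slicing avoids an IndexError when there are fewer than 20 distinct words
for item in dictList[:20]:
    print(item)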
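Sorting and taking the first 20 entries can be combined into one call with Counter.most_common, which returns fewer items when the vocabulary is smaller than the requested size. A minimal sketch, with top_n and the sample counts as illustrative assumptions:

```python
from collections import Counter


def top_n(freq, n=20):
    """Return up to n (word, count) pairs, highest count first."""
    # Counter accepts an existing word -> count mapping.
    return Counter(freq).most_common(n)


# Hypothetical counts, for illustration only.
freq = {"environment": 2, "world": 3, "human": 4}
for word, count in top_n(freq):
    print(word, count)
```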

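9. Save the text to be analyzed as a UTF-8 encoded file, and load the content for the word frequency analysis by reading that file.

# The with statement closes the file automatically, even if reading fails
with open("test.txt", "r", encoding="utf-8") as file:
    article = file.read()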
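Putting steps 2 to 9 together, the whole English pipeline might be sketched as a single function. The function name word_freq_from_file, its parameters, and the temporary demo file are assumptions made for this example; the separator and stop-word values are the ones used in the steps above.

```python
import os
import tempfile
from collections import Counter


def word_freq_from_file(path, sep=":.,?!",
                        stopwords=frozenset({"the", "to", "is", "and"}),
                        top_n=20):
    """Read a UTF-8 text file and return its top_n (word, count) pairs."""
    with open(path, "r", encoding="utf-8") as f:
        article = f.read()
    for ch in sep:                      # step 2: separators -> spaces
        article = article.replace(ch, " ")
    words = article.lower().split()     # steps 3 and 4: lowercase, split
    words = [w for w in words if w not in stopwords]  # step 7: exclude
    return Counter(words).most_common(top_n)          # steps 5, 6 and 8


# Demo on a throwaway file so the sketch runs without test.txt.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "test.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("The sea is rising. The sea is warm, and the ice is melting!")
    demo_result = word_freq_from_file(demo_path, top_n=3)

print(demo_result)
```

Filtering the stop words before counting keeps them out of the ranking regardless of the order in which the later steps run.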

II. Chinese word frequency statistics: download a long Chinese article.

Code:

import jieba

# Open the file and read the text
with open("gzccnews.txt", "r", encoding="utf-8") as file:
    notes = file.read()

# Replace punctuation with spaces
sep = ''':。,?!;∶ ...“”'''
for i in sep:
    notes = notes.replace(i, ' ')
notes_list = list(jieba.cut(notes))

# Words to exclude
exclude = [' ', '\n', '你', '我', '他', '和', '但', '了', '的', '来', '是', '去', '在', '上', '高']

# Method 2: iterate over the list
notes_dict = {}
for w in notes_list:
    notes_dict[w] = notes_dict.get(w, 0) + 1

# Remove unwanted words (pop with a default avoids KeyError if a word is absent)
for w in exclude:
    notes_dict.pop(w, None)
for w in notes_dict:
    print(w, notes_dict[w])

# Sort in descending order of frequency
dictList = list(notes_dict.items())
dictList.sort(key=lambda x: x[1], reverse=True)
print(dictList)

# Print the top 20 words by frequency
for item in dictList[:20]:
    print(item)

# Save the results to a file
with open("top20.txt", "a", encoding="utf-8") as outfile:
    for word, count in dictList[:20]:
        outfile.write(word + " " + str(count) + "\n")
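The Chinese pipeline differs from the English one mainly in how the text is split into words. One possible way to make that explicit is to pass the tokenizer in as a parameter, as in the sketch below. The names jieba_tokenize, top_words, and save_top are hypothetical; jieba.lcut is the jieba call that returns a list of segments, and it is imported lazily so the demo at the bottom runs without jieba installed by using a character-level stand-in tokenizer.

```python
from collections import Counter


def jieba_tokenize(text):
    """Segment Chinese text with jieba (third-party: pip install jieba)."""
    import jieba  # imported lazily so the rest of the sketch works without it
    return jieba.lcut(text)


def top_words(text, tokenize, stopwords, n=20):
    """Count tokens, skipping blanks and stop words; return the top n pairs."""
    tokens = (t.strip() for t in tokenize(text))
    counts = Counter(t for t in tokens if t and t not in stopwords)
    return counts.most_common(n)


def save_top(pairs, path):
    """Write one 'word count' line per pair, encoded as UTF-8."""
    with open(path, "w", encoding="utf-8") as out:
        for word, count in pairs:
            out.write(f"{word} {count}\n")


# Demo with a stand-in tokenizer: list() splits the text into characters.
demo = top_words("好好学习 天天向上", list, {"学"}, n=2)
print(demo)
```

With jieba installed, the call would be top_words(notes, jieba_tokenize, set(exclude)); punctuation marks can be added to the stop-word set instead of being replaced beforehand.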

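Screenshots:

[Screenshot: converting the article into a dictionary]

[Screenshot: sorting and printing the top 20]

[Screenshot: writing the results to a file]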

Reposted from: https://www.cnblogs.com/2439466501qq/p/8658600.html
