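Exercise

For each file in ../DATA02/txt/, compute the total number of sentences, words, and characters, store the results in a DataFrame, and compute the descriptive statistics. Then use this DataFrame to draw a histogram of the sentence counts, a box plot of the word counts, and a scatter plot of sentence counts against character counts.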

In [1]:
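import os

# Collect the file names in sorted order
F = os.listdir("../DATA02/txt/")
F.sort()

# Read the essay files
# Each element of T is the essay written by one learner
T = []

for f in F:
    # The with block closes each file once it has been read
    with open("../DATA02/txt/" + f, "r") as t:
        text = t.read()
    T.append(text)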
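The reading step above can also be written with pathlib. The following is a minimal sketch, not the notebook's own method: the helper name read_texts, the .txt filter, and the UTF-8 encoding are assumptions, and the demonstration uses made-up files in a temporary directory rather than ../DATA02/txt/.

```python
import tempfile
from pathlib import Path


def read_texts(directory):
    """Return (sorted file names, file contents) for the .txt files in directory.

    Hypothetical helper: the .txt filter and UTF-8 encoding are assumptions.
    """
    paths = sorted(Path(directory).glob("*.txt"))
    names = [p.name for p in paths]
    texts = [p.read_text(encoding="utf-8") for p in paths]
    return names, texts


# Demonstration on a throwaway directory with made-up files
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "S002.txt").write_text("Second essay.\n", encoding="utf-8")
    Path(tmp, "S001.txt").write_text("First essay.\n", encoding="utf-8")
    Path(tmp, "notes.md").write_text("not an essay\n", encoding="utf-8")
    demo_names, demo_texts = read_texts(tmp)

print(demo_names)
```

Filtering on the extension keeps stray files (hidden files, notes) out of the index, which os.listdir alone would include.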
In [2]:
T[0]
Out[2]:
'She enjoyed, ah Rena enjoyed taking violin lessons, but she was getting tired of it. So, her teacher said that she should, uh she should practice playing the violin every day. And her and her parents, ehhh the violin lesson is too expensive to continue if she was going to be lazy. So, he so she so she so she quit the violin lessons. But, several years later, she she went to she went to a violin concert and she was very impressed with she was very impressed by.\n'
In [3]:
# Import sent_tokenize and word_tokenize
from nltk import sent_tokenize, word_tokenize

# Count the sentences
SENT = []

for t in T:
    SENT.append(len(sent_tokenize(t)))

# Count the words (word_tokenize also returns punctuation marks as tokens)
WORD = []

for t in T:
    WORD.append(len(word_tokenize(t)))

# Count the characters (spaces, punctuation and newlines included)
LETTER = []

for t in T:
    LETTER.append(len(t))
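The character count above is the full string length, so it includes spaces, punctuation, and the trailing newline. If a narrower definition is wanted, the counts can be compared side by side. This is only a sketch: the function name char_counts and the sample sentence are made up for illustration.

```python
def char_counts(text):
    """Return three alternative character counts for one text."""
    return {
        # Full string length, the quantity stored in LETTER
        "all": len(text),
        # Everything except spaces, tabs and newlines
        "no_whitespace": sum(not c.isspace() for c in text),
        # Alphabetic characters only
        "letters_only": sum(c.isalpha() for c in text),
    }


sample = "So, she quit.\n"  # made-up sample text
counts = char_counts(sample)
print(counts)
```

Whichever definition is chosen, it should be applied to every file so that the column stays comparable across learners.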
In [4]:
# Import pandas
import pandas as pd

# Build the DataFrame
# Each "column name": list pair defines one column
# index sets the row labels; here it is F, the list of file names
data = pd.DataFrame({"num_of_sents": SENT,
                     "num_of_words": WORD,
                     "num_of_letter": LETTER},
                    index=F)
In [5]:
data.head()
Out[5]:
          num_of_sents  num_of_words  num_of_letter
S003.txt             5           102            465
S004.txt             1            33            207
S005.txt             5            76            370
S007.txt             7            93            492
S010.txt             8           139            655
In [6]:
# Descriptive statistics
data.describe()
Out[6]:
       num_of_sents  num_of_words  num_of_letter
count    118.000000    118.000000     118.000000
mean       3.661017     95.355932     477.110169
std        2.935839     28.414371     134.807605
min        1.000000     22.000000     123.000000
25%        1.000000     78.250000     401.250000
50%        3.000000     95.500000     484.000000
75%        6.000000    114.000000     569.750000
max       11.000000    170.000000     825.000000
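To see what the mean, std, and 50% rows represent, the same quantities can be computed by hand with the standard library. The sketch below uses only the num_of_sents values of the five files shown by data.head(), so its results differ from the 118-file table. The std row of describe() is the sample standard deviation (divisor n - 1), which corresponds to statistics.stdev rather than statistics.pstdev.

```python
import statistics

# num_of_sents for the five files displayed by data.head()
sents = [5, 1, 5, 7, 8]

mean = statistics.mean(sents)
sample_std = statistics.stdev(sents)   # divides by n - 1, like the std row
pop_std = statistics.pstdev(sents)     # divides by n, shown for comparison
median = statistics.median(sents)      # corresponds to the 50% row

print(mean, sample_std, pop_std, median)
```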
In [7]:
data["num_of_sents"].plot(kind="hist")
Out[7]:
<AxesSubplot: ylabel='Frequency'>
[Figure: histogram of num_of_sents; y-axis: Frequency]
In [8]:
data["num_of_words"].plot(kind="box")
Out[8]:
<AxesSubplot: >
[Figure: box plot of num_of_words]
In [9]:
data.plot(kind="scatter",x="num_of_sents",y="num_of_letter")
Out[9]:
<AxesSubplot: xlabel='num_of_sents', ylabel='num_of_letter'>
[Figure: scatter plot; x-axis: num_of_sents, y-axis: num_of_letter]