Kaggle 1st-Place Solution Analysis: "Mercari Price Suggestion" 2. Visualization

The previous post gave an overview of the data for the Mercari Price Suggestion Challenge, a past Kaggle competition hosted by Mercari. This post walks through visualizing that data.

2. Kaggle Mercari data
___2.1 Data overview
___2.2 Data visualization
___2.3 Data preprocessing

2. Kaggle Mercari data

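The competition runs in two stages. In a two-stage competition, the entire test set distributed to competitors is the evaluation data for the public leaderboard (stage 1), and a test set hidden from competitors is the evaluation data for the private leaderboard (stage 2).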

Now let's take a look at the data.

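import pandas as pd

# PATH is the directory that holds the competition files (set it for your environment).
train = pd.read_csv(f'{PATH}train.tsv', sep='\t')
test = pd.read_csv(f'{PATH}test.tsv', sep='\t')
print(train.shape)
print(test.shape)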

(1482535, 8)
(693359, 7)

train.head()

Target variable: price

train.price.describe()

count    1.482535e+06
mean     2.673752e+01
std      3.858607e+01
min      0.000000e+00
25%      1.000000e+01
50%      1.700000e+01
75%      2.900000e+01
max      2.009000e+03
Name: price, dtype: float64
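
The summary above shows a strongly right-skewed target: the mean (about 26.7) sits well above the median (17), and the maximum is 2009. One common way to handle a target like this is to compare prices on the log1p scale. The sketch below is only an illustration, assuming an RMSLE-style error is the quantity of interest (the metric is not stated in this post). The `rmsle` helper and the sample numbers are hypothetical and not part of the original notebook.

```python
import math

def rmsle(y_true, y_pred):
    """Root mean squared error computed on log1p-transformed values."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    squared = [(math.log1p(p) - math.log1p(t)) ** 2
               for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical prices, chosen only to show the effect of the log scale.
actual = [10.0, 17.0, 29.0, 2009.0]
baseline = [17.0] * len(actual)  # predict the median everywhere
print(rmsle(actual, baseline))
```

Because log1p(0) is 0, items with the minimum price of 0 do not cause an error on this scale.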

Shipping fee
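
One way to look at the shipping fee is to compare the typical price for each value of the shipping flag. The sketch below is dependency-free and makes two assumptions that this post does not state: that the data has a binary `shipping` column, and that it marks who pays the fee. The helper name and the rows are hypothetical.

```python
from collections import defaultdict
from statistics import median

def median_price_by_shipping(rows):
    """rows: iterable of (shipping_flag, price) pairs -> {flag: median price}."""
    groups = defaultdict(list)
    for flag, price in rows:
        groups[flag].append(price)
    return {flag: median(prices) for flag, prices in groups.items()}

# Hypothetical rows for illustration only; not taken from the competition data.
rows = [(0, 12.0), (0, 20.0), (0, 35.0), (1, 8.0), (1, 14.0)]
print(median_price_by_shipping(rows))
```

With the DataFrame loaded above, the same comparison would likely be a one-liner such as `train.groupby('shipping')['price'].median()`, again assuming the column is named `shipping`.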

Item category

train['category_name'].value_counts()[:5]

Women/Athletic Apparel/Pants, Tights, Leggings    60177
Women/Tops & Blouses/T-Shirts                     46380
Beauty/Makeup/Face                                34335
Beauty/Makeup/Lips                                29910
Electronics/Video Games & Consoles/Games          26557
Name: category_name, dtype: int64

def split_cat(text):
    # Split "main/sub1/sub2" into its levels; missing values (NaN) have no .split().
    try:
        return text.split("/")
    except AttributeError:
        return ("No Label", "No Label", "No Label")

train['general_cat'], train['subcat_1'], train['subcat_2'] = \
    zip(*train['category_name'].apply(split_cat))
train.head()

print("There are %d unique first sub-categories." % train['subcat_1'].nunique())

There are 114 unique first sub-categories.

print("There are %d unique second sub-categories." % train['subcat_2'].nunique())

There are 871 unique second sub-categories.
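
The category split used above can be checked on a few hand-written strings without loading the data. This is a minimal sketch: the sample strings are illustrative, and `split_cat` is repeated here only so the block runs on its own. One detail worth noticing is that `zip(*...)` stops at the shortest row, so a category with more than three levels keeps only its first three.

```python
def split_cat(text):
    """Split a 'main/sub1/sub2' string; non-strings (e.g. NaN) get a placeholder."""
    try:
        return text.split("/")
    except AttributeError:
        return ("No Label", "No Label", "No Label")

# Illustrative inputs: a normal category, a missing value, and a 5-level category.
samples = [
    "Women/Tops & Blouses/T-Shirts",
    float("nan"),
    "Electronics/Computers & Tablets/iPad/Tablet/eBook Readers",
]

# zip(*rows) transposes the rows and truncates to the shortest row (3 levels).
general, sub1, sub2 = zip(*[split_cat(s) for s in samples])
print(general)
print(sub1)
print(sub2)
```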

# Assumed imports: the post does not show them; offline plotly in a notebook is one common setup.
import plotly.graph_objs as go
import plotly.offline as py

counts = train['general_cat'].value_counts()
x = counts.index.values.astype('str')
y = counts.values
pct = [("%.2f" % (v * 100)) + "%" for v in (y / len(train))]

trace1 = go.Bar(x=x, y=y, text=pct)
layout = dict(title='Number of Items by Main Category',
              yaxis=dict(title='Count'),
              xaxis=dict(title='Category'))
fig = dict(data=[trace1], layout=layout)
py.iplot(fig)

# Top 15 first-level sub-categories and their share of all items
counts = train['subcat_1'].value_counts()[:15]
x = counts.index.values.astype('str')
y = counts.values
pct = [("%.2f" % (v * 100)) + "%" for v in (y / len(train))]
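
The count-and-percentage computation behind these bar charts can be sketched without pandas. The helper below is hypothetical (it is not in the original notebook) and uses made-up labels; it returns the top n labels with the same "%.2f%%" style of percentage label used for the bar text.

```python
from collections import Counter

def top_n_with_share(labels, n=15):
    """Return (label, count, percent-string) for the n most frequent labels."""
    total = len(labels)
    return [(label, count, "%.2f%%" % (count * 100 / total))
            for label, count in Counter(labels).most_common(n)]

# Made-up labels for illustration only.
labels = ["Makeup", "Makeup", "Tops & Blouses", "Makeup", "Shoes", "Tops & Blouses"]
for row in top_n_with_share(labels, n=2):
    print(row)
```

Note that the share is computed against all items, not only the top n, which matches dividing by `len(train)` above.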

NLP preprocessing:
Tokenization

import re
import string

from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

stop = set(stopwords.words('english'))
# Matches punctuation, digits, and control whitespace (\r, \t, \n).
regex = re.compile('[' + re.escape(string.punctuation) + '0-9\\r\\t\\n]')

def tokenize(text):
    """
    sent_tokenize(): segment text into sentences
    word_tokenize(): break sentences into words
    """
    try:
        text = regex.sub(" ", text)  # replace punctuation and digits with spaces

        tokens_ = [word_tokenize(s) for s in sent_tokenize(text)]
        tokens = []
        for token_by_sent in tokens_:
            tokens += token_by_sent
        tokens = list(filter(lambda t: t.lower() not in stop, tokens))
        filtered_tokens = [w for w in tokens if re.search('[a-zA-Z]', w)]
        filtered_tokens = [w.lower() for w in filtered_tokens if len(w) >= 3]

        return filtered_tokens

    except TypeError as e:
        # Non-string input such as NaN: report it and return no tokens.
        print(text, e)
        return []

# Apply the tokenizer
train['tokens'] = train['item_description'].map(tokenize)
test['tokens'] = test['item_description'].map(tokenize)

for description, tokens in zip(train['item_description'].head(),
                               train['tokens'].head()):
    print('description:', description)
    print('tokens:', tokens)
    print()

description: No description yet
tokens: ['description', 'yet']

description: This keyboard is in great condition and works like it came out of the box. All of the ports are tested and work perfectly. The lights are customizable via the Razer Synapse app on your PC.
tokens: ['keyboard', 'great', 'condition', 'works', 'like', 'came', 'box', 'ports', 'tested', 'work', 'perfectly', 'lights', 'customizable', 'via', 'razer', 'synapse', 'app']

description: Adorable top with a hint of lace and a key hole in the back! The pale pink is a 1X, and I also have a 3X available in white!
tokens: ['adorable', 'top', 'hint', 'lace', 'key', 'hole', 'back', 'pale', 'pink', 'also', 'available', 'white']

description: New with tags. Leather horses. Retail for [rm] each. Stand about a foot high. They are being sold as a pair. Any questions please ask. Free shipping. Just got out of storage
tokens: ['new', 'tags', 'leather', 'horses', 'retail', 'stand', 'foot', 'high', 'sold', 'pair', 'questions', 'please', 'ask', 'free', 'shipping', 'got', 'storage']

description: Complete with certificate of authenticity
tokens: ['complete', 'certificate', 'authenticity']
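
To see the same cleaning steps without installing NLTK or downloading its corpora, here is a simplified, dependency-free sketch. It is not the tokenizer used above: it splits on whitespace instead of calling `word_tokenize`, and `STOP` is a tiny hand-picked stop-word set assumed for illustration (NLTK's English list is much longer). The name `simple_tokenize` is hypothetical.

```python
import re
import string

# Tiny illustrative stop-word list (an assumption, far shorter than NLTK's).
STOP = {"a", "and", "of", "the", "with", "is", "in", "it", "this", "like", "out"}

# Same character class as the tokenizer above: punctuation, digits, \r, \t, \n.
_PUNCT_DIGITS = re.compile("[" + re.escape(string.punctuation) + r"0-9\r\t\n]")

def simple_tokenize(text):
    """Lowercase, strip punctuation/digits, drop stop words and short tokens."""
    if not isinstance(text, str):
        return []  # missing descriptions (e.g. NaN) yield no tokens
    cleaned = _PUNCT_DIGITS.sub(" ", text)
    tokens = [t.lower() for t in cleaned.split()]
    return [t for t in tokens
            if t not in STOP and len(t) >= 3 and re.search("[a-z]", t)]

print(simple_tokenize("Complete with certificate of authenticity"))
print(simple_tokenize("New with tags! Size 10."))
```

The three filters mirror the original: stop-word removal, a check that the token contains a letter, and a minimum length of three characters.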

Creating word clouds

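from collections import Counter
from wordcloud import WordCloud

# general_cats is not defined in this post; the unique main categories are assumed here.
general_cats = train['general_cat'].unique()

# Map each main category to the tokens of all its item descriptions
cat_desc = dict()
for cat in general_cats:
    # dropna() avoids a TypeError when joining missing descriptions
    text = " ".join(train.loc[train['general_cat'] == cat, 'item_description'].dropna().values)
    cat_desc[cat] = tokenize(text)

# find the most common words for the top 4 categories
women100 = Counter(cat_desc['Women']).most_common(100)
beauty100 = Counter(cat_desc['Beauty']).most_common(100)
kids100 = Counter(cat_desc['Kids']).most_common(100)
electronics100 = Counter(cat_desc['Electronics']).most_common(100)

def generate_wordcloud(tup):
    # tup is a list of (word, count) pairs; it is passed to WordCloud as text
    wordcloud = WordCloud(background_color='white',
                          max_words=50, max_font_size=40,
                          random_state=42
                          ).generate(str(tup))
    return wordcloud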
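
Since `generate_wordcloud` receives the (word, count) pairs as a single string, the counts themselves are, as far as I can tell, not used as weights. An alternative is to build a word-to-count mapping and hand it to the library's `WordCloud.generate_from_frequencies` method. The sketch below only builds that mapping with the standard library; the helper name and the token list are hypothetical.

```python
from collections import Counter

def top_word_frequencies(tokens, n=100):
    """Return {word: count} for the n most common tokens."""
    return dict(Counter(tokens).most_common(n))

# Made-up tokens for illustration only.
tokens = ["new", "tags", "new", "size", "new", "tags"]
freqs = top_word_frequencies(tokens, n=2)
print(freqs)
```

A dictionary like `freqs` could then be passed as `WordCloud(...).generate_from_frequencies(freqs)`, which sizes each word by its count.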

> Next: Kaggle 1st-Place Solution Analysis: "Mercari Price Suggestion" 3. The 1st-place approach