< Kaggle 1st-place solution: "Mercari Price Suggestion Challenge" — 1. Data Overview
> Kaggle 1st-place solution: "Mercari Price Suggestion Challenge" — 3. The 1st-Place Method
In the previous post we covered the data overview of the past Kaggle competition hosted by Mercari, the Mercari Price Suggestion Challenge. This time we walk through data visualization.
2. The Kaggle Mercari Data
___2.1 Data Overview
___2.2 Data Visualization
___2.3 Data Preprocessing
2. The Kaggle Mercari Data
This was a two-stage competition. In a two-stage competition, the full test set distributed to participants is the evaluation data for the public leaderboard (stage 1), while a test set withheld from participants is used for the private leaderboard (stage 2).
Now, let's take a look at the data.
import pandas as pd

train = pd.read_csv(f'{PATH}train.tsv', sep='\t')
test = pd.read_csv(f'{PATH}test.tsv', sep='\t')
print(train.shape)
print(test.shape)
(1482535, 8)
(693359, 7)
train.head()
Target variable: price
train.price.describe()
count 1.482535e+06
mean 2.673752e+01
std 3.858607e+01
min 0.000000e+00
25% 1.000000e+01
50% 1.700000e+01
75% 2.900000e+01
max 2.009000e+03
Name: price, dtype: float64
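The summary shows a heavily right-skewed distribution (median 17 vs. max 2009), and this competition was scored on RMSLE, so it is common to model log1p(price) rather than the raw price. A minimal sketch, using a small hypothetical sample in place of train['price']:

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for train['price']
price = pd.Series([0.0, 10.0, 17.0, 29.0, 2009.0])

# np.log1p (log(1 + x)) handles the zero prices present in the data (min is 0)
log_price = np.log1p(price)
print(log_price.round(4))
```

Training against log1p(price) and inverting predictions with np.expm1 makes ordinary squared-error losses line up with the RMSLE metric.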
Shipping cost
Item category
train['category_name'].value_counts()[:5]
Women/Athletic Apparel/Pants, Tights, Leggings 60177
Women/Tops & Blouses/T-Shirts 46380
Beauty/Makeup/Face 34335
Beauty/Makeup/Lips 29910
Electronics/Video Games & Consoles/Games 26557
Name: category_name, dtype: int64
def split_cat(text):
    try:
        return text.split("/")
    except AttributeError:  # missing category_name (NaN)
        return ("No Label", "No Label", "No Label")

train['general_cat'], train['subcat_1'], train['subcat_2'] = \
    zip(*train['category_name'].apply(lambda x: split_cat(x)))
train.head()
print("There are %d unique first sub-categories." % train['subcat_1'].nunique())
There are 114 unique first sub-categories.
print("There are %d unique second sub-categories." % train['subcat_2'].nunique())
There are 871 unique second sub-categories.
import plotly.offline as py
import plotly.graph_objs as go

x = train['general_cat'].value_counts().index.values.astype('str')
y = train['general_cat'].value_counts().values
pct = [("%.2f" % (v * 100)) + "%" for v in (y / len(train))]

trace1 = go.Bar(x=x, y=y, text=pct)
layout = dict(title='Number of Items by Main Category',
              yaxis=dict(title='Count'),
              xaxis=dict(title='Category'))
fig = dict(data=[trace1], layout=layout)
py.iplot(fig)
x = train['subcat_1'].value_counts().index.values.astype('str')[:15]
y = train['subcat_1'].value_counts().values[:15]
pct = [("%.2f" % (v * 100)) + "%" for v in (y / len(train))][:15]
NLP preprocessing:
Tokenization
import re
import string

from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

stop = set(stopwords.words('english'))

def tokenize(text):
    """
    sent_tokenize(): segment text into sentences
    word_tokenize(): break sentences into words
    """
    try:
        regex = re.compile('[' + re.escape(string.punctuation) + '0-9\\r\\t\\n]')
        text = regex.sub(" ", text)  # remove punctuation and digits
        tokens_ = [word_tokenize(s) for s in sent_tokenize(text)]
        tokens = []
        for token_by_sent in tokens_:
            tokens += token_by_sent
        tokens = list(filter(lambda t: t.lower() not in stop, tokens))
        filtered_tokens = [w for w in tokens if re.search('[a-zA-Z]', w)]
        filtered_tokens = [w.lower() for w in filtered_tokens if len(w) >= 3]
        return filtered_tokens
    except TypeError as e:
        print(text, e)

# apply the tokenizer
train['tokens'] = train['item_description'].map(tokenize)
test['tokens'] = test['item_description'].map(tokenize)

for description, tokens in zip(train['item_description'].head(),
                               train['tokens'].head()):
    print('description:', description)
    print('tokens:', tokens)
    print()
description: No description yet
tokens: ['description', 'yet']
description: This keyboard is in great condition and works like it came out of the box. All of the ports are tested and work perfectly. The lights are customizable via the Razer Synapse app on your PC.
tokens: ['keyboard', 'great', 'condition', 'works', 'like', 'came', 'box', 'ports', 'tested', 'work', 'perfectly', 'lights', 'customizable', 'via', 'razer', 'synapse', 'app']
description: Adorable top with a hint of lace and a key hole in the back! The pale pink is a 1X, and I also have a 3X available in white!
tokens: ['adorable', 'top', 'hint', 'lace', 'key', 'hole', 'back', 'pale', 'pink', 'also', 'available', 'white']
description: New with tags. Leather horses. Retail for [rm] each. Stand about a foot high. They are being sold as a pair. Any questions please ask. Free shipping. Just got out of storage
tokens: ['new', 'tags', 'leather', 'horses', 'retail', 'stand', 'foot', 'high', 'sold', 'pair', 'questions', 'please', 'ask', 'free', 'shipping', 'got', 'storage']
description: Complete with certificate of authenticity
tokens: ['complete', 'certificate', 'authenticity']
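With tokens in hand, per-category word frequencies can be aggregated with collections.Counter before drawing word clouds. A minimal sketch with a hypothetical token list:

```python
from collections import Counter

# hypothetical token list, e.g. one entry of train['tokens']
tokens = ['pink', 'dress', 'pink', 'size', 'dress', 'pink']

counts = Counter(tokens)
print(counts.most_common(2))  # → [('pink', 3), ('dress', 2)]
```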
Creating word clouds
from collections import Counter
from wordcloud import WordCloud

general_cats = train['general_cat'].unique()

cat_desc = dict()
for cat in general_cats:
    text = " ".join(train.loc[train['general_cat'] == cat, 'item_description'].values)
    cat_desc[cat] = tokenize(text)

# find the most common words for the top 4 categories
women100 = Counter(cat_desc['Women']).most_common(100)
beauty100 = Counter(cat_desc['Beauty']).most_common(100)
kids100 = Counter(cat_desc['Kids']).most_common(100)
electronics100 = Counter(cat_desc['Electronics']).most_common(100)

def generate_wordcloud(tup):
    wordcloud = WordCloud(background_color='white',
                          max_words=50,
                          max_font_size=40,
                          random_state=42
                          ).generate(str(tup))
    return wordcloud