[White Paper] Deep Consumer Insights from Social Data - PropheSee

Download the white paper here.

With the evolution of social and digital media over the last decade, the way brands & consumers communicate & share information has changed drastically. The passive activity of brands “speaking to” their customers has evolved into active brand-consumer engagement, one that is far more two-way than most imagined. Beyond the changing marketing landscape, the social media realm is rich with conversations, opinions & feedback loops that could offer brands insights leading to exponentially higher ROIs, reduced research time & increased capital efficiency. The questions that remain, however, are primarily about which data is authentic & relevant, & how one can access & efficiently analyze all of it.

Traditionally, many rely on surveys, polls or manual data clean-up & analysis to pull useful consumer insights. Using PropheSee’s Consumer Insights Engine, this white paper derives consumer & audience insights through a deep analysis of social data.

This paper aims to illustrate how brands & businesses can benefit from an automated consumer insights engine & from iterative research strategies rather than stop-review-go strategies. Adopting such an approach leads to exponentially increased time & capital efficiency as well as a significantly reduced margin of error.

This post is about the methodology used in an automated approach & why such an analysis is important. If you’re more interested in the paper itself, feel free to just download it here. If you’re interested in the technicalities, logic & methodologies of how an automated consumer insights engine works, please read on.


Consumer insights provide a deep grasp of a brand’s audience (customers, fans or consumers) and originate from a thorough analysis of consumer data. The data includes their buying behaviour, queries, sentiments, verbatim comments and social affinities.

First, the engine draws on over 20,000 brands and their audiences, tracked by PropheSee across several social media networks. The image below describes the overall architecture of the automated engine and its components.


The complete consumer insights engine is an end-to-end automated pipeline with several components.

The first component (A) is the data extraction module, composed of different data connectors for different data sources: Twitter, Facebook, Instagram, YouTube, websites, news, feeds, offline data etc. The module consists of several REST and streaming APIs and web mining packages. It runs continuously on cloud servers, streams the data in real time and stores it in the databases.
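The connector idea can be sketched as a small common interface that every data source implements. This is only an illustration, not PropheSee's actual code: `RawPost`, `Connector`, `InMemoryConnector` and `extract_all` are hypothetical names, and the in-memory connector stands in for a real REST or streaming client.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Protocol


@dataclass(frozen=True)
class RawPost:
    source: str    # e.g. "twitter", "facebook"
    post_id: str   # identifier assigned by the source network
    author: str
    text: str


class Connector(Protocol):
    """Hypothetical interface each data-source connector would satisfy."""

    source: str

    def fetch(self) -> Iterator[RawPost]: ...


class InMemoryConnector:
    """Stand-in for a real REST or streaming client; replays canned posts."""

    def __init__(self, source: str, rows: Iterable[tuple[str, str, str]]):
        self.source = source
        self._rows = list(rows)

    def fetch(self) -> Iterator[RawPost]:
        for post_id, author, text in self._rows:
            yield RawPost(self.source, post_id, author, text)


def extract_all(connectors: Iterable[Connector], store: list[RawPost]) -> int:
    """Pull from every connector and append to `store` (a stand-in database)."""
    count = 0
    for connector in connectors:
        for post in connector.fetch():
            store.append(post)
            count += 1
    return count
```

With this shape, supporting a new source means writing one more class with a `fetch` method; the rest of the pipeline does not need to change.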

The next component is the data preprocessing layer (B), where all of the extracted data is cleaned and structured. The cleaning involves text cleaning, quantitative cleaning and miscellaneous noise handling (removal of spam accounts, mentions etc.). This component also includes a de-duplication layer that ensures the uniqueness of data points, irrespective of the data source.
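A minimal sketch of text cleaning followed by cross-source de-duplication is shown below. The specific cleaning rules (dropping URLs and @mentions, lowercasing) and the choice of a SHA-256 hash of the cleaned text as the uniqueness key are assumptions made for illustration; the actual rules used by the engine are not described here.

```python
import hashlib
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
SPACE_RE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Strip URLs and @mentions, collapse whitespace, lowercase."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip().lower()


def deduplicate(posts: list[dict]) -> list[dict]:
    """Keep the first post for each distinct cleaned text, across all sources."""
    seen: set[str] = set()
    unique = []
    for post in posts:
        cleaned = clean_text(post["text"])
        if not cleaned:
            continue  # nothing left after cleaning, treat as noise
        key = hashlib.sha256(cleaned.encode("utf-8")).hexdigest()
        if key in seen:
            continue  # same content already seen, possibly from another source
        seen.add(key)
        unique.append({**post, "clean_text": cleaned})
    return unique
```

Hashing the cleaned text rather than the raw text means the same comment cross-posted to two networks, with different mentions or links attached, still collapses to a single data point.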

Component (C) is the data integration layer, where user conversation data is linked with user metadata (demographics, interests etc.). The complete data at this step is standardized, formatted and pushed to Elasticsearch.
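The linking and formatting step might look roughly like the sketch below. The field names (`age_group`, `interests`) and the index name passed in are hypothetical. The second function only builds the newline-delimited JSON body that Elasticsearch's bulk API accepts; it does not send anything to a cluster.

```python
import json


def integrate(conversations: list[dict], users: dict[str, dict]) -> list[dict]:
    """Attach user metadata to each conversation, keyed by author id."""
    documents = []
    for conv in conversations:
        meta = users.get(conv["author"], {})
        documents.append({
            "post_id": conv["post_id"],
            "source": conv["source"],
            "text": conv["text"],
            "author": conv["author"],
            # Missing metadata becomes None / [] so every document has the same shape.
            "age_group": meta.get("age_group"),
            "interests": sorted(meta.get("interests", [])),
        })
    return documents


def to_bulk_body(documents: list[dict], index: str) -> str:
    """Serialize documents as newline-delimited JSON for a bulk index request."""
    lines = []
    for doc in documents:
        # Each document is preceded by an action line naming the index and id.
        lines.append(json.dumps({"index": {"_index": index, "_id": doc["post_id"]}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"
```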

Components (D) and (E) are the most important components of the complete consumer insights engine. All the machine learning, natural language processing, and information retrieval modules run here. First, different industry-wide classifiers are trained on different types of data. Since the majority of the data is text, convolutional neural networks are used for most classifiers, along with others (XGBoost and linear SVM).

This component is further split into several ML stages (data cleansing, feature engineering, feature selection, training and tuning), all packed into a sequential pipeline. The classification covers noise classification, sentiment classification, and industry-wise theme categorization. All of this tagged data is then consumed by the natural language processing engine, where concepts and entities are identified. The NLP module consists of dependency grammar, part-of-speech tagging, regular expressions, topic modeling, and n-gram analysis. For example:


“I really like the blue color of this camera, can you please share the price”

Themes: { "theme1": "Color", "theme2": "product", "theme3": "prices" }

Sentiment: { "theme1": "positive", "theme2": "neutral", "theme3": "neutral" }

Conversation Type: { "isNoise": "No", "type1": "question", "type2": "opinion" }

Entities: { "entity1": "blue color", "entity2": "camera", "entity3": "price" }
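Of the NLP techniques listed above, n-gram analysis is the simplest to illustrate. The sketch below uses a deliberately naive regex tokenizer and hypothetical helper names; it shows how a candidate phrase such as “blue color” surfaces as a bigram in the example sentence, not how the engine itself is implemented.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on anything that is not a letter, digit or apostrophe."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """Return every run of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(texts: list[str], n: int, k: int) -> list[tuple[tuple[str, ...], int]]:
    """Count n-grams across many texts and return the k most frequent."""
    counts: Counter = Counter()
    for text in texts:
        counts.update(ngrams(tokenize(text), n))
    return counts.most_common(k)


sentence = "I really like the blue color of this camera, can you please share the price"
bigrams = ngrams(tokenize(sentence), 2)
# ("blue", "color") appears among the bigrams and is a candidate entity phrase.
```

Counted over thousands of conversations, frequent n-grams give a cheap first list of candidate concepts that the heavier modules (dependency parsing, topic modeling) can then refine.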

This data is used by the insights generation layer (components F, G, and H), which tries to make sense of the data and produce useful results. This layer consists of four sub-modules:

  • Correlation engine
  • Forecasting engine
  • Real-time trends engine
  • Data aggregation engine

The correlation engine tries to find out whether any significant correlations exist among the data points. Example: consumer queries about products increase by 15% on weekends and jump by 9% when the brand posts about a new feature.
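As a rough sketch of the kind of calculation involved, the code below computes a Pearson correlation between two series and the percentage uplift of one group over another. Pearson's coefficient is an assumption chosen for illustration (the post does not say which statistic the engine uses), and the function names are hypothetical.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


def uplift_percent(baseline: list[float], treated: list[float]) -> float:
    """Percentage change of the treated group's mean over the baseline mean."""
    base_mean = sum(baseline) / len(baseline)
    treated_mean = sum(treated) / len(treated)
    return 100.0 * (treated_mean - base_mean) / base_mean
```

For instance, passing daily query counts for weekdays as `baseline` and for weekends as `treated` yields a figure directly comparable to the “15% on weekends” statement, while `pearson` applied to a 0/1 weekend flag and the daily counts indicates how strong that relationship is.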

The forecasting engine is used to predict consumer behavior. For example, the likely increase in consumer queries before the holiday season will be about “places to visit” in the Travel industry, “deals and discounts” in the E-Commerce industry etc.
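The post does not say which forecasting model the engine uses. As a stand-in, the sketch below implements a seasonal naive forecast, a common baseline that simply repeats the values observed one season earlier; it is useful mainly to show how a recurring pre-holiday spike would carry forward. The function name and parameters are hypothetical.

```python
def seasonal_naive(history: list[float], season_length: int, horizon: int) -> list[float]:
    """Forecast each future step with the value from the same position in the last season."""
    if season_length <= 0 or horizon < 0:
        raise ValueError("season_length must be positive and horizon non-negative")
    if len(history) < season_length:
        raise ValueError("need at least one full season of history")
    last_season = history[-season_length:]
    return [last_season[h % season_length] for h in range(horizon)]
```

Usage: with weekly query counts and `season_length=52`, the forecast for the coming holiday weeks is whatever was observed in the same weeks last year. Any real model would be judged against a baseline like this one.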

The real-time trends engine links the information resulting from the analysis with the news, events and buzz happening in real time. The aggregation engine is used to figure out which entities, concepts and themes top the conversations. All the statistical measures of central tendency are calculated by the aggregation engine. This information is finally pushed to dashboards using REST APIs, and to quick auto-generated reports.
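The aggregation step can be sketched with the standard library alone. The code below assumes each conversation has already been tagged with a single theme and computes each theme's share of conversations (the “mindshare” figure quoted in the insights that follow), plus basic measures of central tendency. Function names are hypothetical.

```python
import statistics
from collections import Counter


def mindshare(theme_tags: list[str]) -> list[tuple[str, float]]:
    """Share of conversations per theme, in percent, most frequent first."""
    counts = Counter(theme_tags)
    total = sum(counts.values())
    return [(theme, round(100.0 * c / total, 1)) for theme, c in counts.most_common()]


def central_tendencies(values: list[float]) -> dict[str, float]:
    """Mean, median and mode of a numeric series (e.g. daily mention counts)."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
    }
```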

Some example insights for different industries:

Travel: “Popular Cities”, “Natural Beauty” and “Family Vacations” are the most talked-about travel themes among consumers, with 39.8%, 36%, and 16% mindshare respectively. They are also associated with the highest negative sentiment. Top consumer queries: “places to visit during night”, “hotel offers for families” etc.

Food Ordering: Payment gateways garnered the highest negative sentiment, particularly for mobile app-based ordering platforms. “payment failure”, “no returns”, “payment options” and “incomplete payments” are the top mentions. Conversations are 18% higher during holidays, and 65% of them are about offers.

Makeup Brands: “red” (1,078 mentions), “pink” (450) and “black” (306) are the most talked-about colors among consumers. “Red” is mostly used with the keywords “love”, “favorite”, “dream” and “best”, leading to the highest positive sentiment. Top consumer queries: “is the red color available in Pune stores”, “any offer for new users”, “is this a lip balm or a lipstick”.

Retail: Gifting appeared as one of the top themes; birthday gifts, Diwali gifts, and gifts for brothers and dads are the top consumer conversations. Perfumes are the top gift choice among consumers, followed by jeans and tops. Top consumer queries: “can i order online”, “alternate colors available”.

The consumer insights engine is scalable, robust and can be extended to any industry. While it is capable of much more than what has been discussed here, we wanted to keep the white paper clear & focussed. We analyzed the social data of the last 6 months for several brands and different industries; check out the white paper here. Feel free to share your feedback and thoughts in the comments or drop me a line via LinkedIn Inbox or email (ishaan@prophesee.in).
